SQLite: Using LIMIT for big update and is it worth dropping indexes
I need to update a large number of records in one table (around 300k to 500k) and delete several million in another table. As this takes some time, I'm using LIMIT to do the work in blocks so I can show a progress bar to the user.
My first, small question: why can I use the following statement in SQLiteSpy but not with the ADO.NET provider System.Data.SQLite?
UPDATE Table1
SET Status1 = newValue1, Status2 = value2
WHERE Key1 = key1Value
AND Status1 = value1
LIMIT 1000
I have to use the following statement to make it work:
UPDATE Table1
SET Status1 = newValue1, Status2 = value2
WHERE Key1 = key1Value
AND Key2 in (
SELECT Key2
FROM Table1
WHERE Key1 = key1Value
AND Status1 = value1
LIMIT 1000)
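For reference, the subselect workaround can be exercised against any SQLite build, for example with Python's sqlite3 module (the table and column names are the ones from the question; the data and batch size are made up for the demonstration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (Key1 INTEGER NOT NULL, Key2 INTEGER NOT NULL, "
            "Status1 CHAR(1) NOT NULL, Status2 VARCHAR NULL, "
            "CONSTRAINT PK_Table1 PRIMARY KEY (Key1, Key2))")
# 2500 rows for Key1 = 1, all still in status 'A'
con.executemany("INSERT INTO Table1 VALUES (1, ?, 'A', NULL)",
                [(k,) for k in range(2500)])

# LIMIT inside a subselect works on every build; a bare "UPDATE ... LIMIT"
# is only accepted when SQLite was compiled with that feature enabled.
cur = con.execute("""
    UPDATE Table1
    SET Status1 = 'B', Status2 = 'done'
    WHERE Key1 = 1
      AND Key2 IN (
          SELECT Key2
          FROM Table1
          WHERE Key1 = 1
            AND Status1 = 'A'
          LIMIT 1000)""")
print(cur.rowcount)   # 1000 rows updated in this batch
```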
I am using the latest version of SQLiteSpy (which uses SQLite 3.7.2) and System.Data.SQLite.
My other question is more complex. I am using 2 tables:
CREATE TABLE Table1 (
Key1 INTEGER NOT NULL,
Key2 INTEGER NOT NULL,
...
Some fixed varchar data fields
...
Status1 CHAR(1) NOT NULL,
Status2 VARCHAR NULL,
Status3 CHAR(1) NOT NULL,
UpdateDate DATETIME NOT NULL,
CONSTRAINT PK_Table1 PRIMARY KEY (Key1 ASC, Key2 ASC))
and
CREATE TABLE Table2 (
Key1 INTEGER NOT NULL,
Key2 INTEGER NOT NULL,
Key3 INTEGER NOT NULL,
...
Some fixed varchar data fields
...
CONSTRAINT PK_Table2 PRIMARY KEY (Key1 ASC, Key2 ASC, Key3 ASC))
with two indexes on Table1:
CREATE INDEX IDX_Tabel1_Status1 ON Table1 (Key1 ASC, Status1 ASC, Key2 ASC)
CREATE INDEX IDX_Tabel1_Status2 ON Table1 (Key1 ASC, Status2 ASC, Key2 ASC)
As you might have guessed, Key1 and Key2 link both tables.
What I want to do is delete in Table2 all records belonging to records in Table1 that have a certain status, then reset the three status fields to their original values and update the date in Table1. As the number of records involved can be large (Table1 contains up to 500k records and Table2 between 20M and 40M), and most of the time between 50% and 100% of Table1 is concerned, I perform the deletes and updates in "small" blocks (somewhere between 1000 and 10000 records of Table1). So I repeat the two following statements until all concerned records are deleted/updated (one delete and one update per transaction):
DELETE FROM Table2
WHERE Key1 = @Key1
AND Key2 in (
SELECT Key2
FROM Table1
WHERE Key1 = @Key1
AND Status1 = @Status1
LIMIT 1000)
UPDATE Table1
SET Status1 = @NewStatus1, Status2 = @Status2, Status3 = @Status3, UpdateDate = @Date
WHERE Key1 = @Key1
AND Key2 in (
SELECT Key2
FROM Table1
WHERE Key1 = @Key1
AND Status1 = @Status1
LIMIT 1000)
The deletes are pretty fast, but the updates take a lot of time (around 2 to 3 seconds per 1000 records). I guess that's because both indexes need to be updated. So I wonder whether it would improve performance to drop both indexes before the deletes/updates and recreate them afterwards, but then the subselects would become slower. At what point (percentage of the total records involved, or absolute number of records) should I consider dropping the indexes?
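The batched loop described above can be sketched as follows (Python's sqlite3 here in place of System.Data.SQLite; table names are from the question, but the data, status values, and batch size are invented):

```python
import sqlite3

BATCH = 1000
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Table1 (Key1 INTEGER NOT NULL, Key2 INTEGER NOT NULL,
        Status1 CHAR(1) NOT NULL, Status2 VARCHAR NULL, Status3 CHAR(1),
        UpdateDate DATETIME,
        CONSTRAINT PK_Table1 PRIMARY KEY (Key1, Key2));
    CREATE TABLE Table2 (Key1 INTEGER NOT NULL, Key2 INTEGER NOT NULL,
        Key3 INTEGER NOT NULL,
        CONSTRAINT PK_Table2 PRIMARY KEY (Key1, Key2, Key3));
""")
# 3500 Table1 records in status 'D', each with 3 child rows in Table2
con.executemany("INSERT INTO Table1 VALUES (1, ?, 'D', NULL, 'X', NULL)",
                [(k,) for k in range(3500)])
con.executemany("INSERT INTO Table2 VALUES (1, ?, ?)",
                [(k, j) for k in range(3500) for j in range(3)])

done = 0
while True:
    with con:                        # one transaction per delete/update pair
        con.execute("""
            DELETE FROM Table2
            WHERE Key1 = 1 AND Key2 IN (
                SELECT Key2 FROM Table1
                WHERE Key1 = 1 AND Status1 = 'D'
                LIMIT ?)""", (BATCH,))
        cur = con.execute("""
            UPDATE Table1
            SET Status1 = 'N', Status2 = NULL, Status3 = 'Y',
                UpdateDate = datetime('now')
            WHERE Key1 = 1 AND Key2 IN (
                SELECT Key2 FROM Table1
                WHERE Key1 = 1 AND Status1 = 'D'
                LIMIT ?)""", (BATCH,))
    done += cur.rowcount
    # report `done` to the progress bar here
    if cur.rowcount < BATCH:
        break
```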
Thanks, Marc
At what point (percentage of the total records involved, or absolute number of records), should i consider dropping the indexes ?
I cannot tell exactly, but with such large amounts of data it usually pays off to drop the indexes and recreate them once the transactions have finished.
The only way to get a good answer is to measure the performance empirically for your scenario.
IMHO, your main problem is the LIMIT clause used to display progress to the user. I would not do that; leaving it out could yield a significant performance gain. Furthermore, if I were a user, I'd rather have the job done as fast as possible than know the progress. You can use a marquee to show that an action is in progress.
You can also measure the average time per record (or derive a formula depending on the amount of data) and display an educated guess about the progress. Measure on every run, save it somewhere, and build an average from the last 10 runs or so to calculate the progress.
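A minimal sketch of that idea (the helper names and the "last 10 runs" window are assumptions, not anything from the question):

```python
from collections import deque

# per-record durations of the last 10 runs (hypothetical rolling history)
history = deque(maxlen=10)

def record_run(seconds, records):
    """Store the measured per-record time of one finished run."""
    history.append(seconds / records)

def estimate_seconds(records):
    """Estimate the duration of a run over `records` rows.

    Returns None when no run has been measured yet, so the caller
    can fall back to a marquee instead of a fake percentage.
    """
    if not history:
        return None
    return records * sum(history) / len(history)
```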
Or you can split the logic into different steps and display the number of the step that is currently executing.
If you really need to display the progress, try using an EXISTS clause instead of the IN clause. It could be faster.
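The EXISTS rewrite for the delete might look like this (an untested sketch with made-up data; whether it actually beats IN depends on the query planner, so measure both):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Table1 (Key1 INTEGER NOT NULL, Key2 INTEGER NOT NULL,
        Status1 CHAR(1) NOT NULL,
        CONSTRAINT PK_Table1 PRIMARY KEY (Key1, Key2));
    CREATE TABLE Table2 (Key1 INTEGER NOT NULL, Key2 INTEGER NOT NULL,
        Key3 INTEGER NOT NULL,
        CONSTRAINT PK_Table2 PRIMARY KEY (Key1, Key2, Key3));
""")
# even Key2 values get status 'D' (to delete), odd ones 'K' (to keep)
con.executemany("INSERT INTO Table1 VALUES (1, ?, ?)",
                [(k, 'D' if k % 2 == 0 else 'K') for k in range(10)])
con.executemany("INSERT INTO Table2 VALUES (1, ?, 0)",
                [(k,) for k in range(10)])

# correlated EXISTS in place of the IN subselect (whole job at once, no LIMIT)
cur = con.execute("""
    DELETE FROM Table2
    WHERE Key1 = 1
      AND EXISTS (SELECT 1 FROM Table1
                  WHERE Table1.Key1 = Table2.Key1
                    AND Table1.Key2 = Table2.Key2
                    AND Table1.Status1 = 'D')""")
print(cur.rowcount)   # 5 rows deleted
```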