How to select records from a huge table and insert into another table in chunks
Sybase problem: The selected values of column1 (id) of a huge_tbl table (60 million+ records) needs to be copied into small_tbl table which is comparatively small. The criteria to select rows from the huge_tbl depends on column2 values (mydate).
huge_tbl:
- id is primary key of huge_table
- mydate column is a datetime field.
- There is an index (mydate,id) on huge_tbl.
- huge_tbl has many more columns apart from id and mydate.
small_tbl:
- small_tbl has 2 columns: identity column is an identity column id2 is similar to id of huge_tbl
Since the number of records to be inserted is quite large and to avoid the filling up the log space, the insertion is divided into chunks using set rowcount.
Due to this, next iteration has to start from where the previous iteration completed. Suppose the last record inserted in 1st iteration had mydate value as "02/10/2010 09:00:00". There might be more records with the same mydate values in the huge_tbl which were not selected because of "set rowcount ###". The next iteration has to select records where mydate value starts from "02/10/2010 09:00:00" but should also filter out the records already selected in previous iteration.
I tried few approaches but it either ends with duplicate records in small_tbl or it takes too long to process.
Any pointers to this problem would be appreciated.
For example -The below approach is taking too long:
while (row_count &g开发者_高级运维t; 0)
begin
insert into small_tbl(id2)
select id from huge_tbl where mydate between <date1> and <date2>
and 1 = case
when ((mydate = @last_max_mydate)
and id > @last_max_id))
then 1
when (mydate > @last_max_mydate)
then 1
else 0
select @row_count = @@rowcount
select @last_max_identity = max(identity) from small_tbl
select @last_max_id2 = id2 from small_tbl where identity = @last_max_identity
select @last_max_mydate = mydate from huge_tbl where id = @last_max_id2
end
This problem is easy to solve, and has been done a thousand times.
First, if you do not mind me saying so, you are too deep into your solution, and your table, you do not see the problem for what it is. So please answer my questions without worrying about what they mean or intend.
1 Forget the IDENTITY column on huge_table, or what your PRIMARY KEY constraint is defined to be, what is the true relational primary key ? If you are not sure, post all the explicit and CREATE INDEX statements for huge_table.
2 What is the purpose of copying huge_table to small_table ? Whenever you duplicate data (even temporarily), there is a larger problem that is not being identified; if you resolve that, the need to copy will be eliminated.
Response to Your Answer
There must be some misunderstanding. You have provided that information is your original post. Sorry, but I specifically asked, what is the true logical key, NOT the PK, which is in your OP. In order to identify dupes, I need to know the data. The Id
PK means nothing. You are already using that. And getting dupes. If I use that, I will get dupes too. Therefore I will not use that. Therefore I need to know the table, the data, the columns, what the logical keys are, what (minus the Id
column) makes each row unique.
if you can provide the entire table DDL (CREATE TABLE plus all CREATE INDEX sttmts), that would be best. Just change the table name if you need to keep it confidential.
if not then I need a statement, a description, from you re what columns make each row unique (other than the
Id
column).or a sttmt to the effect that there are no unique identifiers, in which case it is not a Relational table, in which case, there is no way to identify dupes.
create table as select statement does not fill database log. Try this one method instead of rowcount split.
here are the answers to your questions:
- id is primary key of huge_tbl. Index on (mydate,id) exists.
Identity column exists on small_tbl and is a primary key.
- the purpose of copying partial data from huge_tbl to small_tbl is bit complex to explain. All I can say is the partial data from huge_tbl is stored in small_tbl for further processing.
Use sqsh. Apparently this answer is too short to allow me to post, so I'll just say again, use sqsh.
精彩评论