
Best practice for operating on large amounts of data

I need to do a lot of processing on a table that has 26+ million rows:

  1. Determine the correct size of each column based on its data
  2. Identify and remove duplicate rows.
  3. Create a primary key (auto incrementing id)
  4. Create a natural key (unique constraint)
  5. Add and remove columns

Please list your tips on how to speed this process up and the order in which you would do the list above.

Thanks so much.

UPDATE: No need to worry about concurrent users. Also, there are no indexes on this table yet; it was loaded from a source file. When all is said and done there will be indexes.

UPDATE: If you would use a different list from the one above, please feel free to mention it.

Based on comments so far and what I have found worked:

  1. Create a subset of the 26+ million rows to experiment on; I found that 500,000 rows works well (see the sketch after this list).
  2. Delete columns that won't be used (if any)
  3. Set appropriate datatype lengths for all columns in one scan using max(len())
  4. Create a (unique, if possible) clustered index on the column(s) that will eventually form the natural key.
  5. Repeat steps 2-4 on all the rows
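
A minimal sketch of steps 1 and 4, assuming a staging table named dbo.ImportStaging and a candidate natural key of (CustomerID, OrderDate); all of these names are placeholders, not from the original question:

    -- Pull an arbitrary 500,000-row working sample into its own table.
    SELECT TOP (500000) *
    INTO dbo.ImportSample
    FROM dbo.ImportStaging;

    -- Cluster the sample on the columns expected to become the natural key.
    -- Make it UNIQUE only if the sample is already free of duplicates on those columns.
    CREATE CLUSTERED INDEX CIX_ImportSample_NaturalKey
        ON dbo.ImportSample (CustomerID, OrderDate);

Once the sizing and de-duplication logic is proven on the sample, the same statements can be re-run against the full table (step 5).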


If you are going to remove some columns, you should probably do that first if possible. This will reduce the amount of data you have to read for the other operations.
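
For example, a couple of unused columns could be dropped up front (table and column names are purely illustrative):

    -- Drop columns that will never be used before doing any heavy scans.
    ALTER TABLE dbo.ImportStaging
        DROP COLUMN LegacyNotes, SourceFileName;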

Bear in mind that when you modify data, any indexes that include that data must be updated as well. It is therefore often a good idea to drop those indexes before making a large number of updates to the table, then add them back afterwards.
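
A rough sketch of that pattern, assuming a nonclustered index named IX_ImportStaging_Customer already exists (in this question the table starts without indexes, so this only applies once indexes have been added):

    -- Drop secondary indexes before the bulk changes...
    DROP INDEX IX_ImportStaging_Customer ON dbo.ImportStaging;

    -- ...run the large UPDATE / DELETE / ALTER work here...

    -- ...then recreate the index once the data is in its final shape.
    CREATE NONCLUSTERED INDEX IX_ImportStaging_Customer
        ON dbo.ImportStaging (CustomerID);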


Order: 5, 2, 1, 3, 4

1: No way around it: Select Max(Len(...)) From ...
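
For example, several character columns can be sized in a single pass (column names are placeholders):

    -- One scan returns the longest value stored in each character column.
    -- Note: LEN() ignores trailing spaces; use DATALENGTH() if they matter.
    SELECT
        MAX(LEN(FirstName)) AS MaxFirstName,
        MAX(LEN(LastName))  AS MaxLastName,
        MAX(LEN(EmailAddr)) AS MaxEmailAddr
    FROM dbo.ImportStaging;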

2: That all depends on what you consider a duplicate.
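
If a duplicate means the same values in every natural-key column, one common approach is ROW_NUMBER() partitioned over those columns (the key columns below are assumptions):

    -- Keep one row per natural-key value and delete the rest.
    WITH Ranked AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY CustomerID, OrderDate
                   ORDER BY (SELECT NULL)   -- no preference for which duplicate survives
               ) AS rn
        FROM dbo.ImportStaging
    )
    DELETE FROM Ranked
    WHERE rn > 1;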

3: ALTER TABLE in Books Online will tell you how. No way to speed this up, really.

4: See 3.

5: See 3.
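
For points 3 to 5, the relevant ALTER TABLE forms look roughly like this (constraint and column names are illustrative, not from the question):

    -- 3: add an auto-incrementing surrogate key and make it the primary key.
    --    NONCLUSTERED here, since the earlier sketch already clusters on the natural key.
    ALTER TABLE dbo.ImportStaging
        ADD Id INT IDENTITY(1,1) NOT NULL;
    ALTER TABLE dbo.ImportStaging
        ADD CONSTRAINT PK_ImportStaging PRIMARY KEY NONCLUSTERED (Id);

    -- 4: enforce the natural key with a unique constraint.
    ALTER TABLE dbo.ImportStaging
        ADD CONSTRAINT UQ_ImportStaging_NaturalKey UNIQUE (CustomerID, OrderDate);

    -- 5: add or remove ordinary columns.
    ALTER TABLE dbo.ImportStaging ADD LoadDate DATETIME NULL;
    ALTER TABLE dbo.ImportStaging DROP COLUMN RawLineText;

Note that adding the IDENTITY column has to populate a value for every existing row, so on 26+ million rows that step alone will take some time.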

