
Best practice for operating on large amounts of data

I need to do a lot of processing on a table that has 26+ million rows:

  1. Determine the correct size of each column based on its data
  2. Identify and remove duplicate rows.
  3. Create a primary key (auto incrementing id)
  4. Create a natural key (unique constraint)
  5. Add and remove columns

Please list your tips on how to speed this process up and the order in which you would do the list above.

Thanks so much.

UPDATE: No need to worry about concurrent users. Also, there are no indexes on this table yet; it was loaded from a source file. When all is said and done there will be indexes.

UPDATE: If you would use a different list from the one above, please feel free to mention it.

Based on comments so far and what I have found worked:

  1. Create a subset of the 26+ million rows to experiment on; I found that 500,000 rows works well (see the sketch after this list).
  2. Delete columns that won't be used (if any)
  3. Set appropriate datatype lengths for all columns in one scan using max(len())
  4. Create a (unique, if possible) clustered index on the column(s) that will eventually form the natural key.
  5. Repeat steps 2-4 on all the rows
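
A minimal sketch of steps 1 and 4, assuming a staging table named dbo.ImportStaging and a candidate natural key of (CustomerID, OrderDate); all of these names are placeholders, not from the original question:

    -- Pull an arbitrary 500,000-row working sample into its own table.
    SELECT TOP (500000) *
    INTO dbo.ImportSample
    FROM dbo.ImportStaging;

    -- Cluster the sample on the columns expected to become the natural key.
    -- Make it UNIQUE only if the sample is already free of duplicates on those columns.
    CREATE CLUSTERED INDEX CIX_ImportSample_NaturalKey
        ON dbo.ImportSample (CustomerID, OrderDate);

Once the sizing and de-duplication logic is proven on the sample, the same statements can be re-run against the full table (step 5).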


If you are going to remove some columns, you should probably do that first if possible. This will reduce the amount of data you have to read for the other operations.
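
For example, a couple of unused columns could be dropped up front (table and column names are purely illustrative):

    -- Drop columns that will never be used before doing any heavy scans.
    ALTER TABLE dbo.ImportStaging
        DROP COLUMN LegacyNotes, SourceFileName;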

Bear in mind that when you modify data, any indexes that include that data must be updated as well. It is therefore often a good idea to drop those indexes before making a large number of updates to the table, then add them back afterwards.
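
A rough sketch of that pattern, assuming a nonclustered index named IX_ImportStaging_Customer already exists (in this question the table starts without indexes, so this only applies once indexes have been added):

    -- Drop secondary indexes before the bulk changes...
    DROP INDEX IX_ImportStaging_Customer ON dbo.ImportStaging;

    -- ...run the large UPDATE / DELETE / ALTER work here...

    -- ...then recreate the index once the data is in its final shape.
    CREATE NONCLUSTERED INDEX IX_ImportStaging_Customer
        ON dbo.ImportStaging (CustomerID);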


Order: 5, 2, 1, 3, 4

1: No way around it: Select Max(Len(...)) From ...
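
For example, several character columns can be sized in a single pass (column names are placeholders):

    -- One scan returns the longest value stored in each character column.
    -- Note: LEN() ignores trailing spaces; use DATALENGTH() if they matter.
    SELECT
        MAX(LEN(FirstName)) AS MaxFirstName,
        MAX(LEN(LastName))  AS MaxLastName,
        MAX(LEN(EmailAddr)) AS MaxEmailAddr
    FROM dbo.ImportStaging;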

2: That all depends on what you consider a duplicate.
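
If a duplicate means the same values in every natural-key column, one common approach is ROW_NUMBER() partitioned over those columns (the key columns below are assumptions):

    -- Keep one row per natural-key value and delete the rest.
    WITH Ranked AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY CustomerID, OrderDate
                   ORDER BY (SELECT NULL)   -- no preference for which duplicate survives
               ) AS rn
        FROM dbo.ImportStaging
    )
    DELETE FROM Ranked
    WHERE rn > 1;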

3: ALTER TABLE in Books Online will tell you how. No way to speed this up, really.

4: See 3.

5: See 3.
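
For points 3 to 5, the relevant ALTER TABLE forms look roughly like this (constraint and column names are illustrative, not from the question):

    -- 3: add an auto-incrementing surrogate key and make it the primary key.
    --    NONCLUSTERED here, since the earlier sketch already clusters on the natural key.
    ALTER TABLE dbo.ImportStaging
        ADD Id INT IDENTITY(1,1) NOT NULL;
    ALTER TABLE dbo.ImportStaging
        ADD CONSTRAINT PK_ImportStaging PRIMARY KEY NONCLUSTERED (Id);

    -- 4: enforce the natural key with a unique constraint.
    ALTER TABLE dbo.ImportStaging
        ADD CONSTRAINT UQ_ImportStaging_NaturalKey UNIQUE (CustomerID, OrderDate);

    -- 5: add or remove ordinary columns.
    ALTER TABLE dbo.ImportStaging ADD LoadDate DATETIME NULL;
    ALTER TABLE dbo.ImportStaging DROP COLUMN RawLineText;

Note that adding the IDENTITY column has to populate a value for every existing row, so on 26+ million rows that step alone will take some time.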

