Handling 100's of 1,000,000's of rows in T-SQL 2005
I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data from the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding up the row counts from the two older DBs indicates that the new table would hold about 200,000,000 rows.
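For illustration, the new table would look roughly like this (all names here are made up; only the column types are as described):

    -- Rough sketch of the single critical-data table described above.
    -- Table/column names are hypothetical; the types match the description
    -- (int PK, int FK, datetime, decimal).
    CREATE TABLE dbo.Reading
    (
        ReadingId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        SupplierId  int          NOT NULL,   -- FK to the supplier/source table
        ReadingDate datetime     NOT NULL,
        Amount      decimal(9,2) NOT NULL
    );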
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables per year, per supplier (of the source data), etc. - or even having one database per year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this).
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes); you might find that you can store everything in a single table, or that buying a few extra hard disks is enough to get sufficient performance.
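If you're not sure where the pain is, SQL Server 2005's missing-index DMVs are a cheap first pass (a sketch only; treat the suggestions as hints, not commands):

    -- Indexes the optimizer wished it had (SQL Server 2005+).
    -- The figures reset whenever the server restarts.
    SELECT  d.statement AS table_name,
            d.equality_columns,
            d.inequality_columns,
            d.included_columns,
            s.user_seeks,
            s.avg_user_impact
    FROM    sys.dm_db_missing_index_details d
    JOIN    sys.dm_db_missing_index_groups g
            ON g.index_handle = d.index_handle
    JOIN    sys.dm_db_missing_index_group_stats s
            ON s.group_handle = g.index_group_handle
    WHERE   d.database_id = DB_ID()
    ORDER BY s.user_seeks * s.avg_user_impact DESC;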
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition for the 1% of commonly accessed data and another for the remaining 99%.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
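As a rough sketch of what that looks like, partitioned by year (requires Enterprise Edition; all names here are invented):

    -- Sketch: range-partition the big table by year.
    -- In reality you would list one boundary value per year of data.
    CREATE PARTITION FUNCTION pfYear (datetime)
    AS RANGE RIGHT FOR VALUES ('20080101', '20090101', '20100101');

    CREATE PARTITION SCHEME psYear
    AS PARTITION pfYear ALL TO ([PRIMARY]);  -- or map each range to its own filegroup

    CREATE TABLE dbo.Reading
    (
        ReadingId   int          NOT NULL,
        SupplierId  int          NOT NULL,
        ReadingDate datetime     NOT NULL,
        Amount      decimal(9,2) NOT NULL,
        -- the partitioning column must be part of the unique clustered key
        CONSTRAINT PK_Reading PRIMARY KEY (ReadingId, ReadingDate)
    ) ON psYear (ReadingDate);

Queries that filter on ReadingDate then only touch the relevant partitions, and old partitions can be switched out or moved to cheaper storage without rewriting any queries.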
On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime, and providing functions to convert from datetime to int, giving queries like
SELECT * FROM megaTable WHERE datetime > dbo.dateTimeToEpoch('2010-01-23')
This size saving will probably come at a performance cost if you need to do complex datetime queries. On cubes, though, the standard technique is to store an int in YYYYMMDD format instead of an epoch.
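A possible implementation of such a helper (the function name comes from the example above; using the Unix epoch as the base is my assumption):

    -- Seconds since 1970-01-01 ('Unix epoch'); fits in an int until 2038.
    CREATE FUNCTION dbo.dateTimeToEpoch (@dt datetime)
    RETURNS int
    AS
    BEGIN
        RETURN DATEDIFF(SECOND, '19700101', @dt);
    END;
    GO

    -- The YYYYMMDD variant mentioned above: 2010-01-23 becomes 20100123.
    SELECT CAST(CONVERT(char(8), GETDATE(), 112) AS int);

Note that SQL Server insists on the schema prefix when calling scalar UDFs, hence dbo.dateTimeToEpoch in the query above.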
What's the problem with storing this data in a single table? An enterprise-level database server like Microsoft SQL Server 2005 can handle it without much pain.
By the way, do not create tables per year, tables per supplier or anything like that. If you have to store a similar set of items, you need one and only one table. Splitting the same type of data across multiple tables will cause problems, like:
Queries become extremely difficult to write, and performance suffers when you have to query across multiple tables (see the sketch after this list).
The database design will be very difficult to understand (especially since it's not natural to store the same type of items in different places).
You will not be able to modify your database easily (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require automating a bunch of tasks. Say you have a table per year: if a new record is inserted on 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert whether you must create a new table? How would that affect performance? Can you test it easily?
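To make the first point concrete: once the data is split per year, even a simple query that crosses a year boundary has to be written (and maintained) as a UNION ALL over every table involved. Table names here are hypothetical:

    -- Two months of data, Dec 2009 through Jan 2010, spread over two tables.
    SELECT SupplierId, ReadingDate, Amount
    FROM   dbo.Reading2009
    WHERE  ReadingDate >= '20091201'
    UNION ALL
    SELECT SupplierId, ReadingDate, Amount
    FROM   dbo.Reading2010
    WHERE  ReadingDate <  '20100201';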
If there is a real, visible separation between "recent" and "old" data (for example, you work daily with only the data saved in the last month, and you have to keep everything older but rarely use it), you can build a system with two SQL Servers (installed on different machines). The first, highly available server handles the recent data. The second, less available and optimized for writing, stores everything else. Then, on a schedule, a program moves old data from the first to the second.
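A sketch of that scheduled move, assuming a linked server called ArchiveServer and the same table layout on both machines (every name here is invented):

    -- Nightly archive step: copy rows older than the cutoff to the second
    -- server, then remove them locally. In practice you would do this in
    -- batches to keep the transaction and the log small; a cross-server
    -- transaction like this also requires MSDTC.
    DECLARE @cutoff datetime;
    SET @cutoff = DATEADD(MONTH, -1, GETDATE());

    BEGIN TRANSACTION;

    INSERT INTO [ArchiveServer].ArchiveDb.dbo.Reading
        (ReadingId, SupplierId, ReadingDate, Amount)
    SELECT ReadingId, SupplierId, ReadingDate, Amount
    FROM   dbo.Reading
    WHERE  ReadingDate < @cutoff;

    DELETE FROM dbo.Reading
    WHERE  ReadingDate < @cutoff;

    COMMIT TRANSACTION;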
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL Server 2005 does not limit the number of rows in a table.
If you go down this road and run in to performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes, which means that you can store the entire table in less than 4 GB of memory. If you have a decent server (8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up period while the cache is populated.
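Once the data is loaded you can check the real footprint rather than the back-of-the-envelope figure (sp_spaceused is a standard system procedure; the table name is hypothetical):

    -- Reports row count plus data and index size for the table.
    EXEC sp_spaceused 'dbo.Reading';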