
What would be the most efficient method for storing/updating interval-based data in SQL?

I have a database table with about 700 million rows (and growing exponentially) of time-based data.

Fields (a rough DDL sketch follows):

PK.ID,
PK.TimeStamp,
Value
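
Sketched as DDL, roughly (the name dbo.RawValues and the column types are assumptions; only the composite key and the value column are given above):

```sql
-- Base table as described: composite primary key (ID, TimeStamp) plus a value.
-- Types are guesses; substitute whatever the real schema uses.
CREATE TABLE dbo.RawValues
(
    ID          int            NOT NULL,
    [TimeStamp] datetime       NOT NULL,
    Value       decimal(18, 4) NOT NULL,
    CONSTRAINT PK_RawValues PRIMARY KEY CLUSTERED (ID, [TimeStamp])
);
```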

I also have 3 other tables grouping this data into days, months, and years; each contains the sum of the value for every ID in that time period. These tables are currently updated nightly by a SQL job. The situation has arisen where the tables will need to be updated on the fly as the data in the base table changes. This can be up to 2.5 million rows at a time (not very often; typically around 200-500k rows up to every 5 minutes). Is this possible without causing massive performance hits, and what would be the best method for achieving this?


N.B.

  1. The daily, monthly, and yearly tables can be changed if needed. They are used to speed up queries such as 'Get the monthly totals for these 5 IDs for the last 5 years'; in raw data this is about 13 million rows, while from the monthly table it's about 300 rows (see the query sketch after this list).

  2. I do have SSIS available to me.

  3. I can't afford to lock any tables during the process.
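
For illustration, here is a minimal sketch of the kind of query the monthly table answers; dbo.MonthlySummary, its columns, and the IDs are assumptions, not the real schema:

```sql
-- 'Monthly totals for these 5 IDs for the last 5 years', answered from a
-- hypothetical rollup table (one row per ID per month) instead of the
-- 700M-row base table.
SELECT  ID, PeriodStart, MonthlyTotal
FROM    dbo.MonthlySummary
WHERE   ID IN (101, 102, 103, 104, 105)              -- the 5 IDs of interest
  AND   PeriodStart >= DATEADD(YEAR, -5, GETDATE())
ORDER BY ID, PeriodStart;
```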


700M records in 5 months means 8.4B in 5 years (assuming the data inflow doesn't grow). Welcome to the world of big data. It's exciting here and we welcome more and more new residents every day :)

I'll describe three incremental steps that you can take. The first two are just temporary - at some point you'll have too much data and will have to move on. However, each one takes more work and/or more money so it makes sense to take it a step at a time.

Step 1: Better Hardware - Scale up

Faster disks, RAID, and much more RAM will take you some of the way. Scaling up, as this is called, breaks down eventually, but if your data is growing linearly rather than exponentially, it'll keep you afloat for a while.

You can also use SQL Server replication to create a copy of your database on another server. Replication works by reading the transaction log and sending the changes to your replica. You can then run the scripts that build your aggregate (daily, monthly, annual) tables on the secondary server, where they won't kill the performance of your primary one.
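
As a rough sketch of what that aggregate refresh could look like on the secondary server, assuming the hypothetical dbo.RawValues and dbo.MonthlySummary tables sketched earlier: a set-based MERGE that recomputes whole months, so a partial window never under-counts a month's total.

```sql
-- Recompute every month touched since the start of the current month and
-- upsert the results into the monthly rollup table.
DECLARE @FromMonth datetime =
    DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0);   -- start of current month

MERGE dbo.MonthlySummary AS tgt
USING
(
    SELECT  ID,
            DATEADD(MONTH, DATEDIFF(MONTH, 0, [TimeStamp]), 0) AS PeriodStart,
            SUM(Value) AS MonthlyTotal
    FROM    dbo.RawValues
    WHERE   [TimeStamp] >= @FromMonth
    GROUP BY ID, DATEADD(MONTH, DATEDIFF(MONTH, 0, [TimeStamp]), 0)
) AS src
    ON tgt.ID = src.ID AND tgt.PeriodStart = src.PeriodStart
WHEN MATCHED THEN
    UPDATE SET tgt.MonthlyTotal = src.MonthlyTotal
WHEN NOT MATCHED THEN
    INSERT (ID, PeriodStart, MonthlyTotal)
    VALUES (src.ID, src.PeriodStart, src.MonthlyTotal);
```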

Step 2: OLAP

Since you have SSIS at your disposal, start looking into multidimensional data. With good design, OLAP cubes will take you a long way. They may even be enough to manage billions of records, and you'll be able to stop there for several years (been there, done that; it carried us for two years or so).

Step 3: Scale Out

Handle more data by distributing the data and its processing over multiple machines. When done right this allows you to scale almost linearly: when you have more data, add more machines to keep processing time constant.

If you have the $$$, use solutions from Vertica or Greenplum (there may be other options, these are the ones that I'm familiar with).

If you prefer open source / build-your-own, use Hadoop: log event data to files, use MapReduce to process them, and store the results in HBase or Hypertable. There are many different configurations and solutions here; the whole field is still in its infancy.


Indexed views.

Indexed views will allow you to store and index aggregated data. One of the most useful aspects of them is that you don't even need to reference the view directly in any of your queries: if someone queries an aggregate that's in the view, the query engine can pull the data from the view instead of scanning the underlying table (note that this automatic matching requires Enterprise Edition; on other editions you'd reference the view yourself with the NOEXPAND hint).

You will pay some overhead to update the view as data changes, but from your scenario it sounds like this would be acceptable.
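
As a sketch of the idea, again using the assumed dbo.RawValues names rather than the real schema:

```sql
-- Daily totals per ID, maintained automatically by the engine as rows change.
-- SCHEMABINDING and COUNT_BIG(*) are required before a view can be indexed,
-- and Value is assumed NOT NULL (SUM over a nullable column isn't allowed here).
CREATE VIEW dbo.vDailyTotals
WITH SCHEMABINDING
AS
SELECT  ID,
        DATEADD(DAY, DATEDIFF(DAY, 0, [TimeStamp]), 0) AS [Day],
        SUM(Value)   AS DailyTotal,
        COUNT_BIG(*) AS RowCnt
FROM    dbo.RawValues
GROUP BY ID, DATEADD(DAY, DATEDIFF(DAY, 0, [TimeStamp]), 0);
GO

-- The unique clustered index is what actually materializes the view on disk.
CREATE UNIQUE CLUSTERED INDEX IX_vDailyTotals
    ON dbo.vDailyTotals (ID, [Day]);
```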


Why don't you create monthly tables, each holding just the info you need for that month? It'd be like simulating multidimensional tables. Or, if you have access to multidimensional systems (Oracle, DB2, or similar), just work with multidimensionality. That works well for time-period problems like yours. At the moment I don't have enough info to give you more, but you can learn a lot about it just by googling.
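
A minimal sketch of what one such per-month table could look like (all names, types, and dates here are illustrative, reusing the assumed dbo.RawValues base table):

```sql
-- One table per month, holding only the totals for that month.
CREATE TABLE dbo.MonthlyTotals_201101
(
    ID           int            NOT NULL,
    MonthlyTotal decimal(18, 4) NOT NULL,
    CONSTRAINT PK_MonthlyTotals_201101 PRIMARY KEY (ID)
);

-- Populated once from the raw data for that month.
INSERT INTO dbo.MonthlyTotals_201101 (ID, MonthlyTotal)
SELECT  ID, SUM(Value)
FROM    dbo.RawValues
WHERE   [TimeStamp] >= '20110101' AND [TimeStamp] < '20110201'
GROUP BY ID;
```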

Just as an idea.
