
Fast grouping and aggregation of a huge data set

I've got a huge amount of data (stored in a file, but that is irrelevant - the main point is that the data doesn't fit into memory) - let's say 10^9 lines of records.

Each record consists of a time, a set of keys, and a data value. The keys aren't unique.

For example:

keys:          data:
A | B | C |    
----------------------
1 | 2 | 3 |    10 
1 | 1 | 3 |    150
1 | 1 | 2 |    140
1 | 2 | 5 |    130
5 | 3 | 2 |    120
...

I need to go through all of the data, filter it using a user-defined filter (this isn't the problem), then aggregate it by computing sums, and return the rows with the highest aggregated values.

For example, in the given data, I want to sum the data values grouped by A and C.

expected result:

A | C | data
------------
1 | 3 | 160
1 | 2 | 140
1 | 5 | 130

------------ The following row (its value isn't among the 3 highest) doesn't concern me:
5 | 2 | 120

I implemented this using the naive solution: I have a Dictionary<tuple(A, C), long> and sum into it. But the problem is that there can be more unique combinations of A, C than I can fit into memory.
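
For illustration, a minimal sketch of what I mean by the naive accumulation (the file name, record format, and filter are hypothetical placeholders):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class NaiveGrouping
    {
        // Placeholder for the user-defined filter mentioned above.
        static bool PassesUserFilter(int a, int b, int c, long data) => true;

        static void Main()
        {
            // One entry per distinct (A, C) combination -- this is exactly what
            // blows up when there are more combinations than fit into memory.
            var sums = new Dictionary<(int A, int C), long>();

            foreach (var line in File.ReadLines("records.dat")) // hypothetical file
            {
                // Hypothetical format matching the example: A | B | C | data
                var p = line.Split('|');
                int a = int.Parse(p[0]), b = int.Parse(p[1]), c = int.Parse(p[2]);
                long data = long.Parse(p[3]);

                if (!PassesUserFilter(a, b, c, data)) continue;

                sums.TryGetValue((a, c), out long current);
                sums[(a, c)] = current + data;   // accumulate the sum per (A, C)
            }

            // Return the rows with the highest aggregated values.
            foreach (var kv in sums.OrderByDescending(kv => kv.Value).Take(3))
                Console.WriteLine($"{kv.Key.A} | {kv.Key.C} | {kv.Value}");
        }
    }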

I cannot pre-sum any of the data, since any filter may be applied, nor can I use SQL (a relational DB doesn't fit my case well).

Are there any memory-efficient algorithms for grouping data this way? How does SQL handle so much data? I'm able to do the grouping in SQL, but there are reasons why I don't want to use it.

Or, what should I Google? I haven't found any useful articles on this issue.

(I'm using C#; the question is more theoretical than "give me the following code".)


Well, the comments on the question might be considered an answer...
You can use MapReduce (Hadoop is the framework implementation in Java).
Your map stage will parse each line and extract the relevant key and value for that line.
Your reduce stage will sum all the data for a given key.
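
A rough single-machine illustration of those two stages in C# (in a real Hadoop job these would be Mapper/Reducer classes, and the framework would group the emitted pairs by key across machines; the file name and record format are assumed):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class MapReduceSketch
    {
        // Map stage: parse one line, emit the relevant key (A, C) and its value.
        static IEnumerable<KeyValuePair<(int A, int C), long>> Map(string line)
        {
            var p = line.Split('|');
            int a = int.Parse(p[0]), c = int.Parse(p[2]);
            long data = long.Parse(p[3]);
            yield return new KeyValuePair<(int, int), long>((a, c), data);
        }

        // Reduce stage: sum all values emitted for one key.
        static long Reduce((int A, int C) key, IEnumerable<long> values) => values.Sum();

        static void Main()
        {
            var result = File.ReadLines("records.dat")        // hypothetical file
                .SelectMany(Map)                               // map
                .GroupBy(kv => kv.Key, kv => kv.Value)         // shuffle: group by key
                .Select(g => (g.Key, Sum: Reduce(g.Key, g)))   // reduce
                .OrderByDescending(x => x.Sum)
                .Take(3);

            foreach (var (key, sum) in result)
                Console.WriteLine($"{key.A} | {key.C} | {sum}");
        }
    }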

