
Handling a LARGE dataset

What is the best solution for handling a LARGE dataset?

I have a dataset broken down into multiple txt files, which add up to about 100 GB. The files are nothing more than ID pairs:

uniqID1 uniqID2
etc.

I want to calculate things like (1) the number of unique uniqIDs, and (2) the list of other IDs that uniqID1 is linked to.

What is the best solution? How do I load these into a database?

thank you!


So if you had a table with the following columns:

           id1 varchar(10)   -- how long are your IDs? Are they numeric? Text?
           id2 varchar(10)

with about five billion rows in the table, and you wanted quick answers to questions such as:

        How many unique values are there in column id1?
        What is the set of distinct values of id1 where id2 = {some parameter}?

a relational database (that supports SQL) and a table with an index on id1 and another index on id2 would do what you need. SQLite would do the job.
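
For example, here is a minimal sketch in Python using the built-in sqlite3 module; the table name, column types, and database file name are assumptions, not anything from the question:

    import sqlite3

    conn = sqlite3.connect("pairs.db")   # hypothetical file name
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS pairs (
            id1 TEXT NOT NULL,
            id2 TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_id1 ON pairs (id1);
        CREATE INDEX IF NOT EXISTS idx_id2 ON pairs (id2);
    """)

    # 1: how many unique values are there in id1?
    (n_unique,) = conn.execute(
        "SELECT COUNT(DISTINCT id1) FROM pairs").fetchone()

    # 2: the distinct id1 values paired with a given id2
    linked = [row[0] for row in conn.execute(
        "SELECT DISTINCT id1 FROM pairs WHERE id2 = ?", ("uniqID2",))]

With the two indexes in place, both queries avoid a full scan of the five billion rows.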

EDIT: To import them, it would be best to separate the two values with some character that never occurs in the values, such as a comma, a pipe, or a tab, one pair per line:

         foo|bar
         moo|mar
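
A hedged sketch of the import step, again with Python's sqlite3, assuming pipe-separated pairs; the file pattern and table name are placeholders:

    import glob
    import sqlite3

    conn = sqlite3.connect("pairs.db")
    for path in glob.glob("part-*.txt"):   # hypothetical file pattern
        with conn:  # one transaction per file keeps the load fast
            with open(path) as f:
                # Skip malformed lines; split each remaining line
                # into exactly two values on the first pipe.
                rows = (line.rstrip("\n").split("|", 1)
                        for line in f if "|" in line)
                conn.executemany(
                    "INSERT INTO pairs (id1, id2) VALUES (?, ?)", rows)

For a 100 GB load it is usually faster to create the two indexes after the bulk insert rather than before, so the database isn't maintaining them on every row.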

EDIT2: You don't strictly need a relational database, but it doesn't hurt anything, and your data structure is more extensible if the DB is relational.
