Handling a LARGE dataset
What is the best solution for handling a LARGE dataset?
I have a dataset broken down into multiple txt files, which add up to about 100 GB. The files are nothing more than ID pairs (uniqID1 uniqID2, etc.). From these I want to calculate things like: 1) the number of unique uniqIDs, and 2) the list of other IDs that uniqID1 is linked to.
What is the best solution? How do I load these into a database?
thank you!
So if you had a table with the following columns:
id1 varchar(10) // how long are your ids? are they numeric? text?
id2 varchar(10)
with about five billion rows in the table, and you wanted quick answers to questions such as:
how many unique values in column id1 are there?
what is the set of distinct values from id1 where id2 = {some parameter}
a relational database (that supports SQL) and a table with an index on id1 and another index on id2 would do what you need. SQLite would do the job.
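As a minimal sketch of that idea in SQLite (table, column, and index names here are just placeholders; adjust the types if your ids are numeric):

-- two-column pair table with an index on each column
CREATE TABLE id_pairs (
    id1 VARCHAR(10),   -- swap in an integer type if the ids are numeric
    id2 VARCHAR(10)
);
CREATE INDEX idx_pairs_id1 ON id_pairs (id1);
CREATE INDEX idx_pairs_id2 ON id_pairs (id2);

-- 1: how many unique values are there in id1?
SELECT COUNT(DISTINCT id1) FROM id_pairs;

-- 2: which distinct id1 values are linked to a given id2?
SELECT DISTINCT id1 FROM id_pairs WHERE id2 = 'uniqID2';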
EDIT: to import them it would be best to separate the two values with some character that never occurs in the values, like a comma or a pipe character or a tab, one pair per line:
foo|bar
moo|mar
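With pipe-separated files like that, one way to load them is from the sqlite3 command-line shell (the file names and the id_pairs table below are just placeholders; repeat the .import for each of your files):

.separator "|"
.import pairs_part1.txt id_pairs
.import pairs_part2.txt id_pairs
-- ...and so on for the remaining files

For a load of this size it is usually faster to create the indexes after the import rather than before.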
EDIT2: You don't strictly need a relational database, but it doesn't hurt anything, and your data structure is more extensible if the db is relational.