
Handling a LARGE dataset

What is the best solution for handling a LARGE dataset?

I have a dataset broken down into multiple txt files, which add up to about 100 GB. The files are nothing more than ID pairs:

uniqID1 uniqID2
etc.

I want to calculate things like (1) the number of unique uniqIDs, and (2) the list of other IDs that uniqID1 is linked to.

What is the best solution? How do I load these into a database?

thank you!


So if you had a table with the following columns:

           id1 varchar(10)   -- how long are your IDs? Are they numeric? Text?
           id2 varchar(10)

with about five billion rows in the table, and you wanted quick answers to questions such as:

        How many unique values are there in column id1?
        What is the set of distinct values of id1 where id2 = {some parameter}?

a relational database (that supports SQL) and a table with an index on id1 and another index on id2 would do what you need. SQLite would do the job.
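
For example, here is a minimal sketch in Python using the built-in sqlite3 module; the table name, column types, and database file name are assumptions, not anything from the question:

    import sqlite3

    conn = sqlite3.connect("pairs.db")   # hypothetical file name
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS pairs (
            id1 TEXT NOT NULL,
            id2 TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_id1 ON pairs (id1);
        CREATE INDEX IF NOT EXISTS idx_id2 ON pairs (id2);
    """)

    # 1: how many unique values are there in id1?
    (n_unique,) = conn.execute(
        "SELECT COUNT(DISTINCT id1) FROM pairs").fetchone()

    # 2: the distinct id1 values paired with a given id2
    linked = [row[0] for row in conn.execute(
        "SELECT DISTINCT id1 FROM pairs WHERE id2 = ?", ("uniqID2",))]

With the two indexes in place, both queries avoid a full scan of the five billion rows.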

EDIT: To import them, it would be best to separate the two values with some character that never occurs in the values, such as a comma, a pipe, or a tab, one pair per line:

         foo|bar
         moo|mar
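
A hedged sketch of the import step, again with Python's sqlite3, assuming pipe-separated pairs; the file pattern and table name are placeholders:

    import glob
    import sqlite3

    conn = sqlite3.connect("pairs.db")
    for path in glob.glob("part-*.txt"):   # hypothetical file pattern
        with conn:  # one transaction per file keeps the load fast
            with open(path) as f:
                # Skip malformed lines; split each remaining line
                # into exactly two values on the first pipe.
                rows = (line.rstrip("\n").split("|", 1)
                        for line in f if "|" in line)
                conn.executemany(
                    "INSERT INTO pairs (id1, id2) VALUES (?, ?)", rows)

For a 100 GB load it is usually faster to create the two indexes after the bulk insert rather than before, so the database isn't maintaining them on every row.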

EDIT2: You don't strictly need a relational database, but it doesn't hurt anything, and your data structure is more extensible if the DB is relational.
