
Cleaning doubles out of a massive word list

I got a wordlist which is 56GB and I would like to remove doubles. I've tried to approach this in Java, but I run out of space on my laptop after 2.5M words. So I'm looking for an (online) program or algorithm which would allow me to remove all duplicates.

Thanks in advance, Sir Troll

edit: What I did in Java was put the words into a TreeSet so they would be ordered and the duplicates removed.
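
For reference, a minimal sketch of what that approach presumably looks like (the file names are placeholders); note that every unique word has to fit on the heap at the same time:

    // Minimal sketch of the TreeSet approach described above; file names are
    // placeholders. TreeSet keeps the words sorted and silently drops duplicates,
    // but the whole set of unique words has to fit on the heap at once.
    import java.io.*;
    import java.util.TreeSet;

    public class TreeSetDedup {
        public static void main(String[] args) throws IOException {
            TreeSet<String> words = new TreeSet<>();
            try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"))) {
                String word;
                while ((word = in.readLine()) != null) {
                    words.add(word); // duplicates are simply ignored
                }
            }
            try (PrintWriter out = new PrintWriter(new FileWriter("deduped.txt"))) {
                for (String w : words) {
                    out.println(w);
                }
            }
        }
    }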


I think the problem here is the huge amount of data. As a first step I would try to split the data into several files, e.g. make a file for every character: words whose first character is 'a' go into a.txt, words starting with 'b' into b.txt, and so on:

  • a.txt
  • b.txt
  • c.txt
  • ...

Afterwards I would try using the default sorting algorithms and check whether they work with the size of the files. After sorting, cleaning out the doubles should be easy; a rough sketch of this approach follows the lists below.

If the files remain too big, you can also split using more than one character, e.g.:

  • aa.txt
  • ab.txt
  • ac.txt
  • ...
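
Here is a rough sketch of that idea, assuming one word per line; the file names and the one-character bucket scheme are just an example:

    // Rough sketch of the split-then-sort-then-dedup approach; file names and the
    // one-character bucket scheme are assumptions for illustration only.
    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class SplitSortDedup {
        public static void main(String[] args) throws IOException {
            // Step 1: split the big list into one bucket file per first character.
            Map<Character, PrintWriter> buckets = new TreeMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("wordlist.txt"))) {
                String word;
                while ((word = in.readLine()) != null) {
                    if (word.isEmpty()) continue;
                    char first = word.charAt(0);
                    PrintWriter bucket = buckets.get(first);
                    if (bucket == null) {
                        bucket = new PrintWriter(new BufferedWriter(new FileWriter(first + ".txt")));
                        buckets.put(first, bucket);
                    }
                    bucket.println(word);
                }
            }
            for (PrintWriter bucket : buckets.values()) {
                bucket.close();
            }

            // Step 2: each bucket should now be small enough to handle in memory;
            // a TreeSet sorts it and drops the duplicates in one go.
            try (PrintWriter out = new PrintWriter(new FileWriter("deduped.txt"))) {
                for (Character first : buckets.keySet()) {
                    TreeSet<String> unique =
                            new TreeSet<>(Files.readAllLines(Paths.get(first + ".txt")));
                    for (String w : unique) {
                        out.println(w);
                    }
                }
            }
        }
    }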


Frameworks like MapReduce or Hadoop are perfect for such tasks. You'll need to write your own map and reduce functions, although I'm sure this must have been done before. A quick search on Stack Overflow gave this
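
As an illustration, a minimal sketch of what the two functions could look like with Hadoop's Java API; the class names are assumptions and the job/driver configuration is omitted:

    // Sketch of map and reduce functions for dedup with Hadoop's Java API.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DedupJob {
        // Map: emit each word as the key; the value carries no information.
        public static class DedupMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text word, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(word, NullWritable.get());
            }
        }

        // Reduce: every copy of a word ends up at the same reducer, so writing
        // the key exactly once removes the doubles.
        public static class DedupReducer
                extends Reducer<Text, NullWritable, Text, NullWritable> {
            @Override
            protected void reduce(Text word, Iterable<NullWritable> ignored, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(word, NullWritable.get());
            }
        }
    }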


I suggest you use a Bloom Filter for this.

For each word, check if it's already present in the filter; otherwise insert it (or rather, a good hash value of it).

It should be fairly efficient, and you shouldn't need to provide it with more than a gigabyte or two for it to have practically no false positives (a Bloom filter never gives false negatives, but a false positive would wrongly discard a unique word). I leave it to you to work out the math.
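
A minimal sketch, using Guava's BloomFilter as one possible implementation; the sizing figures (expected insertions and false-positive rate) and file names are assumptions, not from the answer:

    // Sketch of the Bloom-filter approach using Guava's BloomFilter; the sizing
    // figures are assumptions. A false positive here means a unique word is
    // (rarely) dropped as if it were a duplicate.
    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class BloomDedup {
        public static void main(String[] args) throws IOException {
            // Assume roughly one billion distinct words and accept a 1% false-positive
            // rate; that costs on the order of a gigabyte of memory.
            BloomFilter<CharSequence> seen = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000_000, 0.01);

            try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"));
                 PrintWriter out = new PrintWriter(new FileWriter("deduped.txt"))) {
                String word;
                while ((word = in.readLine()) != null) {
                    if (!seen.mightContain(word)) {
                        seen.put(word);
                        out.println(word);
                    }
                }
            }
        }
    }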


I do like the divide-and-conquer comments here, but I have to admit: if you're running into trouble with 2.5M words, something's going wrong with your original approach. Even if we assume each word is unique within those 2.5M (which basically rules out that what we're talking about is a text in a natural language), and assuming each word is on average 100 Unicode characters long, we're at 500MB for storing the unique strings (2.5M × 100 chars × 2 bytes per char), plus some overhead for the set structure. Meaning: you should be doing really fine, since those numbers are totally overestimated already. Maybe before installing Hadoop, you could try increasing your heap size?
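
For example (the 4 GB figure and the class name are placeholders), the maximum heap is raised with the standard -Xmx flag:

    # -Xmx sets the JVM's maximum heap size; 4g and YourDedupClass are placeholders
    java -Xmx4g YourDedupClass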
