Removing duplicates from a massive word list
I have a 56GB wordlist and I would like to remove the duplicates. I've tried to approach this in Java, but I run out of memory on my laptop after 2.5M words. So I'm looking for an (online) program or algorithm which would allow me to remove all duplicates.
Thanks in advance, Sir Troll
edit: What I did in Java was read the words into a TreeSet so they would be ordered and the duplicates removed.
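For reference, the approach described in the edit looks roughly like this (a minimal sketch; the file name and I/O handling are assumptions). It keeps every unique word in memory at once, which is what blows up on a 56GB input:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.TreeSet;

public class DedupeInMemory {
    public static void main(String[] args) throws IOException {
        // Every unique word ends up on the heap at the same time.
        TreeSet<String> words = new TreeSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                words.add(line);
            }
        }
        // Writing the sorted, de-duplicated set back out is omitted here.
    }
}
```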
I think the problem here is the huge amount of data. As a first step I would try to split the data into several files: e.g. make a file for every character, putting words whose first character is 'a' into a.txt, words starting with 'b' into b.txt, and so on:
- a.txt
- b.txt
- c.txt
- ...
Afterwards I would try using the default sorting algorithms and check whether they cope with the size of the files. After sorting, cleaning out the duplicates should be easy; a sketch of the whole approach follows below.
If the files are still too big you can also split on more than one character, e.g.:
- aa.txt
- ab.txt
- ac.txt
- ...
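A minimal sketch of this two-pass idea in Java. Splitting on exactly the first character, the file names, and sorting each bucket in memory with a TreeSet are assumptions, and it presumes every individual bucket fits in RAM:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class SplitSortDedupe {
    public static void main(String[] args) throws IOException {
        // Pass 1: split the big list into one bucket file per first character.
        Map<Character, BufferedWriter> buckets = new TreeMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("wordlist.txt"), StandardCharsets.UTF_8)) {
            String word;
            while ((word = in.readLine()) != null) {
                if (word.isEmpty()) continue;
                char c = word.charAt(0);
                BufferedWriter out = buckets.get(c);
                if (out == null) {
                    out = Files.newBufferedWriter(Paths.get(c + ".txt"), StandardCharsets.UTF_8);
                    buckets.put(c, out);
                }
                out.write(word);
                out.newLine();
            }
        }
        for (BufferedWriter w : buckets.values()) w.close();

        // Pass 2: each bucket should now be small enough to sort and
        // de-duplicate in memory; a TreeSet does both at once.
        try (BufferedWriter result = Files.newBufferedWriter(Paths.get("deduped.txt"), StandardCharsets.UTF_8)) {
            for (Character c : buckets.keySet()) {
                TreeSet<String> unique = new TreeSet<>(Files.readAllLines(Paths.get(c + ".txt"), StandardCharsets.UTF_8));
                for (String word : unique) {
                    result.write(word);
                    result.newLine();
                }
            }
        }
    }
}
```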
Frameworks like MapReduce or Hadoop are perfect for such tasks. You'll need to write your own map and reduce functions, although I'm sure this must have been done before. A quick search on Stack Overflow gave this.
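If you do go that route, a de-duplication job is about as small as MapReduce jobs get: the mapper emits each word as a key, and the reducer writes each distinct key once. A rough sketch against the Hadoop MapReduce API (the job wiring and paths are assumptions, not from the original answer):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupeJob {
    // Emit every word as a key; the shuffle groups identical words together.
    public static class DedupeMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    // Each distinct word reaches the reducer exactly once, so just write the key.
    public static class DedupeReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dedupe");
        job.setJarByClass(DedupeJob.class);
        job.setMapperClass(DedupeMapper.class);
        // The reducer also works as a combiner, cutting shuffle traffic.
        job.setCombinerClass(DedupeReducer.class);
        job.setReducerClass(DedupeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```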
I suggest you use a Bloom Filter for this.
For each word, check whether it's already present in the filter; if it isn't, insert it (or rather, a good hash value of it) and keep the word.
It should be fairly efficient, and you shouldn't need to give it more than a gigabyte or two to keep the false-positive rate negligible (a Bloom filter never produces false negatives; a false positive here means a genuinely new word gets discarded). I leave it to you to work out the math.
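As an illustration (not necessarily what the answerer had in mind), here is a minimal sketch using Guava's BloomFilter; the Guava dependency, the expected-insertions count, the false-positive rate, and the file names are all assumptions:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomDedupe {
    public static void main(String[] args) throws IOException {
        // Sized for roughly a billion words at a 0.1% false-positive rate;
        // this needs on the order of a couple of gigabytes of heap.
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000_000, 0.001);

        try (BufferedReader in = Files.newBufferedReader(Paths.get("wordlist.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("deduped.txt"), StandardCharsets.UTF_8)) {
            String word;
            while ((word = in.readLine()) != null) {
                // mightContain can return a false positive, so a few unique
                // words may be lost; it never returns a false negative.
                if (!seen.mightContain(word)) {
                    seen.put(word);
                    out.write(word);
                    out.newLine();
                }
            }
        }
    }
}
```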
I do like the divide-and-conquer comments here, but I have to admit: if you're running into trouble with 2.5 million words, something is going wrong with your original approach. Even if we assume every word is unique within those 2.5 million (which basically rules out that we're talking about text in a natural language) and that each word is on average 100 Unicode characters long, we're at 500MB for storing the unique strings, plus some overhead for the set structure. Meaning: you should be doing fine, since those numbers are already heavily overestimated. Maybe before installing Hadoop, you could try increasing your heap size?
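For what it's worth, the maximum heap is just a JVM flag; something like the following (the 4g figure and class name are placeholders, not from the original post) would be the first thing to try:

```
java -Xmx4g YourDedupeClass wordlist.txt
```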