
Remove duplicates from two large text files using unordered_map

I am new to a lot of these C++ libraries, so please forgive me if my question comes across as naive.

I have two large text files, about 160 MB each (about 700,000 lines each). I need to remove from file2 all of the duplicate lines that appear in file1. To achieve this, I decided to use an unordered_map with a 32-character string as my key. The 32-character string is the first 32 characters of each line (this is enough to uniquely identify the line).

Anyway, so I basically just go through the first file and push the 32-char substring of each line into the unordered_map. Then I go through the second file and check whether the 32-char substring of each of its lines exists in my unordered_map. If it doesn't exist, then I write the full line to a new text file.
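
A minimal sketch of that approach (file names are placeholders and error handling is omitted):

#include <fstream>
#include <string>
#include <unordered_map>

int main() {
    // Placeholder file names for illustration.
    std::ifstream file1("file1.txt"), file2("file2.txt");
    std::ofstream out("file2_unique.txt");
    std::unordered_map<std::string, int> seen;
    std::string line;

    // Pass 1: record the 32-char prefix of every line in file1.
    while (std::getline(file1, line))
        ++seen[line.substr(0, 32)];

    // Pass 2: write out only the lines of file2 whose prefix was never seen.
    while (std::getline(file2, line))
        if (seen.find(line.substr(0, 32)) == seen.end())
            out << line << '\n';
}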

This works fine for smaller files (about 40 MB each), but for these 160 MB files it takes a very long time to insert into the hash table (before I even start looking at file2). At around 260,000 inserts it seems to have halted, or is going very slowly. Is it possible that I have reached my memory limit? If so, can anybody explain how to calculate this? If not, is there something else I could be doing to make it faster? Maybe choosing a custom hash function, or specifying some parameters that would help optimize it?

My key-value pair in the hash table is (string, int), where the string is always 32 chars long and the int is a count I use to handle duplicates. I am running 64-bit Windows 7 with 12 GB of RAM.

Any help would be greatly appreciated... thanks, guys!


You don't need a map, because you don't have any associative data: an unordered set will do the job. Also, I'd go with a memory-efficient hash set implementation such as Google's sparse_hash_set. It is very memory efficient and is able to store its contents on disk.
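
For illustration, a minimal sketch with std::unordered_set (sparse_hash_set exposes a similar insert/find interface); reserving space for the expected number of keys up front also avoids repeated rehashing as the table grows. File names and the size estimate are placeholders:

#include <fstream>
#include <string>
#include <unordered_set>

int main() {
    // Placeholder file names for illustration.
    std::ifstream file1("file1.txt"), file2("file2.txt");
    std::ofstream out("file2_unique.txt");

    std::unordered_set<std::string> seen;
    seen.reserve(800000);   // rough estimate of the line count, to limit rehashing

    std::string line;
    // Pass 1: record the 32-char prefix of every line in file1.
    while (std::getline(file1, line))
        seen.insert(line.substr(0, 32));

    // Pass 2: keep only the lines of file2 whose prefix was never seen.
    while (std::getline(file2, line))
        if (seen.count(line.substr(0, 32)) == 0)
            out << line << '\n';
}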

Aside from that, you can work on smaller chunks of data. For example, split your files into 10 blocks, remove duplicates from each, then combine them until you reach a single block with no duplicates. You get the idea.


I would not write a C++ program to do this, but would use existing utilities. On Linux, Unix, or Cygwin, do the following:

cat the two files into one large file:

# cat file1 file2 > file3

Use sort -u to extract the unique lines:

# sort -u file3 > file4

Prefer to use operating system utilities rather than (re)writing your own.
