Search for multiple words in very large text files (10 GB) using C++ the fastest way

I have a program where I have to search for specific values and their line numbers in a very large text file, and there may be multiple occurrences of the same value.

I've tried a simple C++ program which reads the text file line by line and searches for the value using strstr, but it's taking a very long time.

I also tried calling grep through a system command. That's not as slow as before, but it still takes too much time.

I was looking for a library I can use to speed up the search. Any help and suggestions? Thank you :)


There are two issues concerning the speed: the time it takes to actually read the data, and the time it takes to search.

Generally speaking, the fastest way to read a file is to mmap it (or the equivalent under Windows). This can get complicated if the entire file won't fit into the address space, but you mention 10GB in the header; if searching is all you do in the program, this shouldn't create any problems.
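For what it's worth, here is a minimal POSIX sketch of that approach (assuming a 64-bit build so the whole 10 GB fits in the address space; on Windows the equivalent calls would be CreateFileMapping/MapViewOfFile). It maps the file, walks it line by line with memchr, and prints the line number of every line containing the search string:

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 3) { std::fprintf(stderr, "usage: %s file needle\n", argv[0]); return 1; }
    int fd = ::open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }
    struct stat st{};
    if (::fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }
    // Map the whole file read-only; the kernel pages it in lazily as we scan.
    char* data = static_cast<char*>(::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { std::perror("mmap"); return 1; }

    const char* needle = argv[2];
    const std::size_t nlen = std::strlen(needle);
    const char* cur = data;
    const char* end = data + st.st_size;
    std::size_t line = 1;
    while (cur < end) {
        const char* nl = static_cast<const char*>(std::memchr(cur, '\n', end - cur));
        const char* line_end = nl ? nl : end;
        if (std::search(cur, line_end, needle, needle + nlen) != line_end)
            std::printf("%zu\n", line);          // needle occurs somewhere on this line
        if (!nl) break;                          // last line without a trailing newline
        cur = nl + 1;
        ++line;
    }
    ::munmap(data, st.st_size);
    ::close(fd);
}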

More generally, if speed is a problem, avoid using getline on a string. Reading large blocks, and picking the lines up (as char[]) out of them, without copying, is significantly faster. (As a simple compromise, you may want to copy when a line crosses a block boundary. If you're dealing with blocks of a megabyte or more, this shouldn't happen too often; I've used this technique on older, 16-bit machines, with blocks of 32KB, and still gotten a significant performance improvement.)
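A rough sketch of that block-reading idea (the 4 MB block size is an arbitrary choice): each block is scanned in place, and only the lines that straddle a block boundary get copied into a carry-over string.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Prints the line number of every line of `path` containing `needle`,
// reading the file in large blocks instead of line by line.
void scan_file(const char* path, const std::string& needle) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return;
    std::vector<char> buf(4 * 1024 * 1024);   // 4 MB blocks (arbitrary size)
    std::string carry;                        // partial line from the previous block
    std::size_t line = 1;
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        const char* cur = buf.data();
        const char* end = buf.data() + n;
        while (cur < end) {
            const char* nl = std::find(cur, end, '\n');
            if (nl == end) {                  // line continues in the next block: copy it
                carry.append(cur, end);
                break;
            }
            if (!carry.empty()) {             // finish a line that crossed the boundary
                carry.append(cur, nl);
                if (carry.find(needle) != std::string::npos)
                    std::printf("%zu\n", line);
                carry.clear();
            } else if (std::search(cur, nl, needle.begin(), needle.end()) != nl) {
                std::printf("%zu\n", line);   // searched in place, no copy
            }
            ++line;
            cur = nl + 1;
        }
    }
    if (!carry.empty() && carry.find(needle) != std::string::npos)
        std::printf("%zu\n", line);           // the file may not end with a newline
    std::fclose(f);
}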

With regards to searching, if you're searching for a single, fixed string (not a regular expression or other pattern matching), you might want to try a BM (Boyer–Moore) search. If the string you're searching for is reasonably long, this can make a significant difference over other search algorithms. (I think that some implementations of grep will use this if the search pattern is in fact a fixed string, and is sufficiently long for it to make a difference.)
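If you can use C++17, the standard library already ships a Boyer–Moore searcher, so there's no need to hand-roll one; the haystack here would be one of the blocks, or the mmap'ed region, from above:

#include <algorithm>
#include <functional>    // std::boyer_moore_searcher (C++17)
#include <string_view>

// Returns the offset of the first occurrence of `needle` in `haystack`,
// or std::string_view::npos if there is none.
std::size_t bm_find(std::string_view haystack, std::string_view needle) {
    auto searcher = std::boyer_moore_searcher(needle.begin(), needle.end());
    auto it = std::search(haystack.begin(), haystack.end(), searcher);
    return it == haystack.end() ? std::string_view::npos
                                : static_cast<std::size_t>(it - haystack.begin());
}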


Use multiple threads. Each thread can be responsible for searching through a portion of the file. For example, on a 4-core machine spawn 12 threads. The first thread looks through the first 8% of the file, the second thread the second 8% of the file, etc. You will want to tune the number of threads per core to keep the CPU maximally utilized. Since this is an I/O-bound operation, you may never reach 100% CPU utilization.

Feeding data to the threads will be a bottleneck using this design. Memory mapping the file might help somewhat, but at the end of the day the disk can only read one sector at a time. This will be a bottleneck that you will be hard pressed to resolve. You might consider starting one thread that does nothing but read all the data into memory, and kicking off search threads as the data loads up.
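A rough sketch of the partitioning idea, assuming the file has already been read (or mmap'ed) into one contiguous buffer. Each chunk overlaps the next by needle.size() - 1 bytes so matches straddling a boundary aren't missed, and a match is only reported by the thread whose chunk it starts in; the sketch reports byte offsets rather than line numbers to keep it short.

#include <algorithm>
#include <cstdio>
#include <string_view>
#include <thread>
#include <vector>

// Searches `data` with `nthreads` threads and prints the byte offset of every match.
void parallel_search(std::string_view data, std::string_view needle, unsigned nthreads) {
    std::vector<std::thread> threads;
    const std::size_t chunk = data.size() / nthreads + 1;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t begin = t * chunk;
        if (begin >= data.size()) break;
        // Extend each chunk so a match crossing into the next chunk is still seen.
        const std::size_t end = std::min(data.size(), begin + chunk + needle.size() - 1);
        threads.emplace_back([=] {
            std::string_view part = data.substr(begin, end - begin);
            for (std::size_t pos = part.find(needle); pos != std::string_view::npos;
                 pos = part.find(needle, pos + 1)) {
                if (pos < chunk)   // only report matches that start inside this chunk
                    std::printf("match at byte %zu\n", begin + pos);
            }
        });
    }
    for (auto& th : threads) th.join();
}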


Since files are sequential beasts, searching from start to end is something you may not be able to get around. However, there are a couple of things you could do.

If the data is static you could generate a smaller lookup file (alternatively, one with offsets into the main file); this works well if the same string is repeated multiple times, since it keeps the index file much smaller. If the file is dynamic you may need to regenerate the index file occasionally (offline).
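A rough sketch of such an indexing pass (the text format of the index file and the whitespace tokenization are assumptions, not part of the answer); for a 10 GB input the in-memory map may itself get large, in which case you would flush and merge partial indexes:

#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One-off pass over `input_path`: records the line numbers on which each word
// occurs and writes them to `index_path` as "word line1 line2 ..." per line.
void build_index(const std::string& input_path, const std::string& index_path) {
    std::ifstream in(input_path);
    std::map<std::string, std::vector<std::size_t>> index;
    std::string line, word;
    std::size_t line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        std::istringstream words(line);
        while (words >> word)
            index[word].push_back(line_no);
    }
    std::ofstream out(index_path);
    for (const auto& entry : index) {
        out << entry.first;
        for (std::size_t n : entry.second) out << ' ' << n;
        out << '\n';
    }
}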

Instead of reading line by line, read larger chunks from the file (several MB at a time) to speed up I/O.


If you'd like to use a library, you could use Xapian.

You may also want to try tokenizing your text before doing the search. I'd also suggest trying a regex, but that will take a long time if you don't have an index on the text, so I'd definitely suggest trying Xapian or some other search engine.


If your big text file does not change often then create a database (for example SQLite) with a table:

create table word_line_numbers
  (word varchar(100), line_number integer);

Read your file and insert a record into the database for every word, with something like this:

insert into word_line_numbers(word, line_number) values ('foo', 13452);
insert into word_line_numbers(word, line_number) values ('foo', 13421);
insert into word_line_numbers(word, line_number) values ('bar', 1421);

Create an index of words:

create index word_line_numbers_idx on word_line_numbers(word);

And then you can find line numbers for words fast using this index:

select line_number from word_line_numbers where word='foo';

For added speed (because of the smaller database size), at the cost of some complexity, you can use 2 tables: words(word_id integer primary key, word not null) and word_lines(word_id integer not null references words, line_number integer not null).
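A minimal sketch of building the single-table index from C++ with the SQLite C API (the file names and the whitespace tokenization are assumptions); wrapping all the inserts in one transaction is what keeps the bulk load fast:

#include <sqlite3.h>
#include <fstream>
#include <sstream>
#include <string>

// One-off indexing pass: tokenizes big.txt on whitespace and records
// every (word, line_number) pair in an SQLite database.
int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("word_index.db", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db,
        "create table if not exists word_line_numbers"
        " (word varchar(100), line_number integer);"
        "begin transaction;",
        nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "insert into word_line_numbers(word, line_number) values (?, ?);",
        -1, &stmt, nullptr);

    std::ifstream in("big.txt");
    std::string line, word;
    long long line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        std::istringstream words(line);
        while (words >> word) {
            sqlite3_bind_text(stmt, 1, word.c_str(), -1, SQLITE_TRANSIENT);
            sqlite3_bind_int64(stmt, 2, line_no);
            sqlite3_step(stmt);
            sqlite3_reset(stmt);
        }
    }
    sqlite3_finalize(stmt);
    sqlite3_exec(db,
        "commit;"
        "create index word_line_numbers_idx on word_line_numbers(word);",
        nullptr, nullptr, nullptr);
    sqlite3_close(db);
}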


I'd first try loading as much of the file into RAM as possible (memory mapping the file is a good option) and then searching concurrently in parts of it on multiple processors. You'll need to take special care near the buffer boundaries to make sure you aren't missing any words. Also, you may want to try something more efficient than the typical strstr(), see these:
Boyer–Moore string search algorithm
Knuth–Morris–Pratt algorithm
