开发者

Efficient memory storage and retrieval of categorized string literals in C++

Note: This is a follow up to this question.

I have a "legacy" program which does hundreds of string matches against big chunks of HTML. For example if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).

I'm taking a whack at refactoring this mess of code and trying to come up with a good approach to do all these matches.

The performance requirements of this code are rather strict. It needs to not wait on I/O when doing these matches so they need to be in memory. Also there can be 100+ copies of this process running at the same time so large I/O on startup could cause slow I/O for other copies.

With these requirements in mind it would be most efficient if only one copy of these strings are stored in RAM (see my previous question linked above).

This program currently runs on Windows with Microsoft compiler but I'd like to keep the solution as cross-platform as possible so I don't think I want to use PE resource files or something.

Mmapping an external file might work but then I have the issue of keeping program version and data version in sync, one does not normally change without the other. Also this requires some file "format" which adds a layer of complexity I'd rather not have.

So after all of this pre-amble it seems like the best solution is to have a bunch arrays of strings which I can then iterate over. This seems kind of messy as I'm mixing code and data heavily, but with the above requirements is there any better way to hand开发者_JAVA百科le this sort of situation?


I'm not sure just how slow the current implementation is. So it's hard to recommend optimizations without knowing what level of optimization is needed.

Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).

Then your process startup should consist of reading in the radix tree, and matching. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a 'roll your own regex system' idea. Rather similar to the suggestion to use a parser generator.

Edit: I've used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (obviously it can't be too big, but it's another option)

The idea is to keep the list there (making maintenance reasonably easy), but having the pre-compilation step speed up the access during runtime.


If the strings that need to be matched can be locked down at compile time you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it lex takes a source file which has some regular expressions (including the simplest regular expressions -- string literals) and C action code to be executed when a match is found. It is used often in building compilers and similar programs, and there are several other similar programs that you could also use (flex and antlr come to mind). lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably not result in the duplication of strings (or other data) in memory among the different instances of your program that you fear. You could probably easily generate the regular expressions from the string literals in your existing code, but it may take a good bit of work to rework your program to use the code that lex generated.

If the strings you have to match change over time there are some regular expressions libraries that can compile regular expressions at run time, but these do use lots of RAM and depending on your program's architecture these might be duplicated across different instances of the program.

The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:

"string1"
"string2"
"string3"

and the input:

"string2"

The partial match for "string" would be done just once for a DFA (Deterministic Finite-state Automaton) regular expression system (like lex) which would probably speed up your system. Building these things does require a lot of work on lex 's behalf, but all of the hard work is done up front.


Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory mapped files to share copies of the file across the hundreds of instances of the program. Also, you may want to try and adjust the working set size to try and see if you can reduce the number of page faults, but given that you have so many instances, it might prove to be counterproductive (and besides your program needs to have quota privileges to adjust the working set size).

There are other tricks you can try to optimize IO performance like allocating large pages, but it depends on your file size and the privileges granted to your program.

The bottomline is that you need to experiment to see what works best and remember to measure after each change :)...

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜