Multiline pattern matching
Problem:
In a large file (plain text), there are some "interesting" lines which contain some specific words. The aim is to extract all those lines that contain such words. However, in some cases, even if a line contains such words, it may not be really "interesting", depending on its context (contents of lines above and below that line). Such lines should be excluded.
My algorithm:
I have one regex for each of the interesting words and apply it to every line of the file. If a match is found, I check whether the line should be excluded (depending on its context) by applying another set of regexes, which can potentially span several lines. If one of those matches as well, the line is not interesting and I move on to the remaining lines. If not, I register the line as an interesting line and move on to the next line.
To check if a line was excluded, I create a new string that looks like:
N number of lines above current line\n The current line\n N number of lines below current line
This takes an awful amount of time.
My question: Is there a better way of doing this?
Thanks for your time.
Regex is not necessarily fast. There are faster string search algorithms out there.
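As one cheap illustration of that point (not a specific algorithm, just a plain substring check used as a pre-filter), the sketch below only runs the regex on lines that contain one of the literal words. The class and method names are hypothetical, and it assumes the detailed regex can only match lines that contain at least one of the given literal words.

```java
import java.util.List;
import java.util.regex.Pattern;

class LiteralPrefilter {
    /**
     * Returns true if the line contains one of the literal words AND the
     * detailed regex matches. Assumption: the regex cannot match a line
     * that contains none of the literal words.
     */
    static boolean isCandidate(String line, List<String> words, Pattern detailed) {
        for (String w : words) {
            // Plain substring search; no regex engine involved.
            if (line.contains(w)) {
                // Only now pay for the regex match.
                return detailed.matcher(line).find();
            }
        }
        return false;
    }
}
```

Whether this actually helps depends on how many lines get past the substring check, so it is worth measuring on realistic data.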
How about a more heuristic-based approach?
Process the file once, from start to finish. For every occurrence of a word of interest, store the line number and the offset within that line in a lookup structure. Once the lookup structure is populated, process it using something like the following algorithm:
for elem in selected_word_items:
    check line + index of related search items in structure
    if within_desired_range:
        flag_for_further_processing()
The key here is that you process the file only once and then use the metadata structure to do the actual context checking. It should be quite a bit faster if you use the right data structures.
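The single-pass idea above could be sketched in Java roughly as follows. This is a simplification under stated assumptions: exclusion is decided by whether any line within `window` lines of a candidate matches a single-line `excluding` pattern (rather than a regex spanning lines), a line that matches both patterns excludes itself, and the names `ContextIndex` and `interestingLines` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

class ContextIndex {
    /**
     * Returns the 0-based numbers of lines that match `interesting` and have
     * no line matching `excluding` within `window` lines above or below.
     */
    static List<Integer> interestingLines(List<String> lines, Pattern interesting,
                                          Pattern excluding, int window) {
        List<Integer> candidates = new ArrayList<>();
        TreeSet<Integer> excluders = new TreeSet<>();

        // Single pass over the file: record line numbers only, so no
        // context string is ever rebuilt.
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (interesting.matcher(line).find()) candidates.add(i);
            if (excluding.matcher(line).find()) excluders.add(i);
        }

        List<Integer> result = new ArrayList<>();
        for (int c : candidates) {
            // First excluder at or after (c - window); excluded if it is
            // also no further than (c + window).
            Integer nearest = excluders.ceiling(c - window);
            boolean excluded = nearest != null && nearest <= c + window;
            if (!excluded) result.add(c);
        }
        return result;
    }
}
```

A sorted set keeps each range check logarithmic in the number of excluding lines. If the real exclusion rules depend on the order or exact position of the context, the stored metadata would need to include offsets as well.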
A lot depends on the form of your data.
How complex is your context? Do you backtrack on finding interesting matches? If so, try to avoid backtracking. Perhaps you can first identify the context which leads to interesting matches on the following lines.
Also, do you need Java for this? Using Unix/Linux CLI tools you can do quite powerful and quick manipulation of text files.
Please post your algorithm and what your data looks like. It doesn't need to be real data, just realistic data.
Use the DOTALL switch (?s) in your regex and include the pre and post lines in your query. With (?s), . also matches line terminators, so a pattern such as .*? can run across multiple lines. (The multiline switch (?m) only changes where ^ and $ match; it does not let . cross a line break.) Something like this:
String regex = "(?s)pre lines.*?interesting words.*?post lines";
And use that to match all your input as a single String.
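In java.util.regex it is the (?s) (DOTALL) flag that lets . match line terminators, while (?m) only affects ^ and $. A minimal sketch contrasting the two is below; the markers BEGIN, ERROR and END are made-up stand-ins for the pre lines, the interesting word and the post lines, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

class DotallDemo {
    // Hypothetical markers: "BEGIN" as preceding context, "ERROR" as the
    // interesting word, "END" as following context.
    static final String BODY = "BEGIN.*?ERROR.*?END";

    /** Compiles the inline flags plus BODY and searches the whole input. */
    static boolean matchesAcrossLines(String flags, String input) {
        return Pattern.compile(flags + BODY).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "BEGIN\nsome ERROR here\nEND\n";
        // With (?m) alone, '.' still stops at '\n'.
        System.out.println(matchesAcrossLines("(?m)", input)); // prints false
        // With (?s), '.' crosses '\n'.
        System.out.println(matchesAcrossLines("(?s)", input)); // prints true
    }
}
```

On a large input an unbounded .*? can run far past the intended context, so it may be worth bounding it to a fixed number of lines, for example with something like (?:[^\n]*\n){0,3} in place of .*?.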