Multiline pattern matching

2023-04-03 21:40

Problem:

In a large plain-text file, there are some "interesting" lines which contain some specific words. The aim is to extract all the lines that contain such words. However, in some cases a line that contains such words is not really "interesting", depending on its context (the contents of the lines above and below it). Such lines should be excluded.

My algorithm:

I have one regex for each of the interesting words, and I apply these regexes to each line of the file. If a match is found, I check whether the line should be excluded (depending on its context) by applying another set of regexes, which can potentially span multiple lines. If one of those matches, the line is not interesting and I move on to the remaining lines. If not, I register the line as an interesting line and move on to the next line.

To check if a line was excluded, I create a new string that looks like:

N number of lines above current line\n
The current line\n
N number of lines below current line

This takes an awful amount of time.

My question: Is there a better way of doing this?

Thanks for your time.

Regex is not necessarily fast. For fixed words, there are faster string search algorithms out there.
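As a rough sketch of that idea, a plain substring check can act as a cheap filter so that the regex only runs on lines that might match. The word list, the stricter pattern, and the class and method names below are made up for illustration, and whether this actually helps depends on your words and data, so measure it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LiteralPrefilter {
    // Hypothetical fixed words; substitute the real "interesting" words.
    static final List<String> WORDS = List.of("error", "timeout");

    // Hypothetical stricter pattern, run only on lines that pass the cheap check.
    static final Pattern STRICT = Pattern.compile("\\b(error|timeout)\\b");

    // Plain substring search, no regex engine involved.
    static boolean containsAnyWord(String line) {
        for (String word : WORDS) {
            if (line.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Returns the 0-based indexes of lines that pass both checks.
    static List<Integer> candidateLines(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (containsAnyWord(line) && STRICT.matcher(line).find()) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

The substring check rejects most lines outright; the regex then weeds out false positives such as a word embedded in a longer word.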

How about a more heuristic-based approach?

Process the file once from start to finish. For every word of interest you find, store its line number and its offset within the line in a lookup structure. Once the lookup structure is populated, process it using something like the following algorithm:

for elem in selected_word_items:
    look up the line + offset of the related search items in the structure
    if within_desired_range:
        flag_for_further_processing()

The key here is that you're processing the file only once, then using the metadata structure to do the actual context checking. It should be quite a bit faster if you use the right data structures.
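Here is a minimal Java sketch of that approach. It simplifies things by assuming each exclusion marker can be recognised on a single line, and it stores only line numbers rather than offsets within the line. The class and method names and the `n` window parameter are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

class ContextIndex {
    // Returns 0-based indexes of lines that match `interesting` and have
    // no line matching `excluder` within n lines above or below.
    static List<Integer> interestingLines(List<String> lines, Pattern interesting,
                                          Pattern excluder, int n) {
        List<Integer> candidates = new ArrayList<>();
        TreeSet<Integer> excluderLines = new TreeSet<>();

        // Single pass over the file: record where each kind of hit occurs.
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (interesting.matcher(line).find()) {
                candidates.add(i);
            }
            if (excluder.matcher(line).find()) {
                excluderLines.add(i);
            }
        }

        // Context check on the metadata only; no window strings are rebuilt.
        List<Integer> result = new ArrayList<>();
        for (int c : candidates) {
            // Smallest excluder line number that is >= c - n, if any.
            Integer nearest = excluderLines.ceiling(c - n);
            boolean excluded = nearest != null && nearest <= c + n;
            if (!excluded) {
                result.add(c);
            }
        }
        return result;
    }
}
```

The sorted set makes each context check a logarithmic lookup instead of a regex run over a freshly built 2N+1 line string.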

A lot depends on the form of your data.

How complex is your context? Do you backtrack after finding an interesting match? If so, try to avoid the backtracking. Perhaps you can first identify the context that decides whether matches on the following lines are interesting.
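If the deciding context always appears before the interesting line (an assumption that may not hold for your data), a single forward pass that remembers where the context was last seen is enough, with no looking back. A rough sketch, with invented names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PrecedingContext {
    // Returns 0-based indexes of lines that match `interesting` unless a line
    // matching `excluder` was seen at most n lines earlier.
    static List<Integer> interestingLines(List<String> lines, Pattern interesting,
                                          Pattern excluder, int n) {
        List<Integer> result = new ArrayList<>();
        int lastExcluder = -1; // -1 means no excluding context seen yet

        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (excluder.matcher(line).find()) {
                // Remember the context and move on; this line is never reported.
                lastExcluder = i;
                continue;
            }
            if (interesting.matcher(line).find()) {
                boolean suppressed = lastExcluder >= 0 && i - lastExcluder <= n;
                if (!suppressed) {
                    result.add(i);
                }
            }
        }
        return result;
    }
}
```

Each line is examined once and nothing is re-read, so the cost stays linear in the size of the file.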

Also, do you need Java for this? With Unix/Linux command-line tools you can do quite powerful and quick manipulation of text files.

Please post your algorithm and what your data looks like. It doesn't need to be real data, just realistic data.

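Use the dot-all switch (?s) in your regex and include the pre and post lines in your query. With (?s), the dot also matches line terminators, so .*? can run across multiple lines. (The multiline switch (?m) only changes ^ and $ so that they match at the start and end of every line; add it as well if you need those anchors.) Something like this:

String regex = "(?s)pre lines.*?interesting words.*?post lines";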

And use that to match all your input as a single String.
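A small sketch of what that could look like in Java. Note that in java.util.regex it is the DOTALL flag, (?s), that lets the dot cross line terminators. The BEGIN/ERROR/END markers and the class and method names are placeholders for the real pre lines, interesting words and post lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeInputMatch {
    // Hypothetical markers: BEGIN = pre lines, ERROR = interesting word, END = post lines.
    // (?s) is DOTALL: '.' also matches line terminators, so .*? can span lines.
    // Group 1 captures from the interesting word to the end of its line.
    static final Pattern BLOCK = Pattern.compile("(?s)BEGIN.*?(ERROR[^\\n]*).*?END");

    // Returns the first ERROR line of every BEGIN...END region in the text.
    static List<String> errorLinesInsideBlocks(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "BEGIN\nfoo\nERROR one\nbar\nEND\nERROR two\n";
        System.out.println(errorLinesInsideBlocks(text));
    }
}
```

One caveat: a lazy .*? will happily run past an END that has no ERROR before it, so a real pattern may need to be stricter about what it is allowed to skip over.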
