Findings string segments in a string

2023-02-16 14:31 问答作者：

I have a list of segments (15000+ segments), I want to find out the occurence of segments in a given string. The segment can be single word or multiword, I can not assume space as a delimeter in string.

e.g. String "How can I download codec from internet for facebook, Professional programmer support"

[the string above may not make any sense but I am using it for illustration purpose]

segment list

Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.

Ouptut :

Download codec from internet
facebook
Professional programmer

Bascially i am trying to do a query reduction.

I want to achieve it less than O(list length + string length) time. As my list is more than 15000 segments, it will be time consuming to search entire list in string. The segment开发者_C百科s are prepared manully and placed in a txt file.

Regards

~Paul

You basically want a string search algorithm like Aho-Corasik string matching. It constructs a state machine for processing bodies of text to detect matches, effectively making it so that it searches for all patterns at the same time. It's runtime is on the order of the length of the text and the total length of the patterns.

In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:

http://en.wikipedia.org/wiki/Key_Word_in_Context

http://www.cs.duke.edu/~ola/ipc/kwic.html

What your basically asking how to do is write a custom lexer/parser.

Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).

Take a look at this question:

Poor man's lexer for C#

Now of course, alot of people are going to say "just use regular expressions". Perhaps. The deal with using regex in this situation is that your execution time will grow linearly as a function of the number of tokens you are matching against. So, if you end up needing to "segment" more phrases, your execution time will get longer and longer.

What you need to do is have a single pass, popping words on to a stack and checking if they are valid tokens after adding each one. If they aren't, then you need to continue (disregard the token like a compiler disregards comments).

Hope this helps.

Findings string segments in a string

segment list

Ouptut :

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

segment list

Ouptut :

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？