Smallest text fragment with k keywords

2023-03-24 09:40 问答作者：

My friend was asked this question in a interview. Given a text document ( say an news article ) and a set of keywords (for ex google search terms), fin开发者_如何学Cd the smallest segment in the text document which contains these keywords (in any order).

The method I could think of is to use a map containing the keyword and the position of the keyword in the text. We then use this map to select the text fragment. But I was looking for better methods.

I would probably have proposed something like that:

1. create a new map of <String, Integer> of size k to associate each keyword 
     with its number of occurences in the text
2. go through the text and each time you encounter a keyword, increment its 
     number of occurences in the map
3. check that each keyword has been encountered at least once, otherwise 
     return a proper statement
4. start from the beginning of the text and each time you encounter a keyword, 
     decrement its number of occurences:
    4.1. if the number reaches 0, stop iterating and set 
           int startIndex = currentIndex
    4.2. otherwise keep iterating
5. start from the end of the text and each time you encounter a keyword, 
     decrement its number of occurences:
    5.1. if the number reaches 0, stop iterating and set 
           int endIndex = currentIndex
    5.2 otherwise keep reverse iterating
6. return [startIndex, endIndex]

The overall complexity is O(n).

It seems to me you could actually complete this task with only one pass through the text file. The solution would be in O(n + $\Sigma k_i$), where $k_i$ are the lengths of the keywords, making heavily use of the KMP algorithm. I should add that I am assuming the number of keywords is small with respect to the sum of the lengths of the text and the keywords.

First construct the failure functions of each keyword.
Keep an array of the start position of each keyword. This array is going to be updated each time the keyword is seen, so that it contains the start position of the last occurrence of the keyword at a given time. Initialize it with minus infinity.
Initialize the “text portion” to be returned to “empty”, and its length to infinity.
Scan through the text using the failure functions, keeping score of the last occurrence of each keyword.
Whenever a keyword is found, compute the length of the portion of text by end_position – min(start_positions). If this text portion is smaller than the one previously found, store it and continue, else keep the previous one. In any case update the start_position of the keyword.

At the end of the first pass, you should have found the smallest text portion containing all key words.

I would look for all suffixes of the news article and compare these with the key words. A suffix tree can be constructed in linear time. Complexity is O(n log n).

You need a data structure with two pieces of info: a counter associated with each keyword (something like map<string,int>) and a queue of keywords. You go through the text and for each keyword encountered do this:

Set state = incomplete. Increment its counter. If all counters >0, set state = complete (you have all the keywords). Insert it into the back of the queue together with its position in the text. While the counter of the keyword in front of the queue is >1, remove that keyword from the front and decrement the counter. If state == complete, peek at the position of the keyword at the front of the queue. It's the start of the interval that has all the keywords.

First observe that any min length fragment will start with one of the keywords.

Maintain 2 hash tables. The first hash table will contain the keywords and their counts. The second hash table will contain the position of those keywords. Store the number of keywords found in a variable so that we can quickly identify if we have found all the keywords. Maintain the beg and end of the text fragment in variable. Finally maintain a min counter to maintain the minimum length text fragment.

Start reading the input word by word, until you reach the first occurrence of a keyword.

Store the position of this keyword in the second hash table.
Check in the first hash table is this word has been seen for this fragment or not.
- If no then increment the count of this word in the first hash table and increment the count of the number of keywords seen.
- Else check if the existing position of this word from the hash table and compare it with the word at the beginning of the fragment. If they are the same, then move the beginning pointer forward and keep moving if forward until you find another keyword. Then do the same check of this keyword. If this is not latest position of this keyword, then keep moving forward. Stop when you find a keyword whose latest position is the beginning pointer.
After every keyword check if the count of keywords seen is equal to the number of keywords. If yes calculate the difference of end pointer and beginning pointer and update the minimum length accordingly
When you are done reading all words, the variable holding the minimum length is the ans.

I guess the trick is to move the beginning pointer in the right manner. That is why we need 2 hash tables. This is linear since the beginning pointer always moves forward at any point.

继续阅读：algorithm

Smallest text fragment with k keywords

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？