Smallest text fragment with k keywords
My friend was asked this question in a interview. Given a text document ( say an news article ) and a set of keywords (for ex google search terms), fin开发者_如何学Cd the smallest segment in the text document which contains these keywords (in any order).
The method I could think of is to use a map containing the keyword and the position of the keyword in the text. We then use this map to select the text fragment. But I was looking for better methods.
I would probably have proposed something like that:
1. create a new map of <String, Integer> of size k to associate each keyword
with its number of occurences in the text
2. go through the text and each time you encounter a keyword, increment its
number of occurences in the map
3. check that each keyword has been encountered at least once, otherwise
return a proper statement
4. start from the beginning of the text and each time you encounter a keyword,
decrement its number of occurences:
4.1. if the number reaches 0, stop iterating and set
int startIndex = currentIndex
4.2. otherwise keep iterating
5. start from the end of the text and each time you encounter a keyword,
decrement its number of occurences:
5.1. if the number reaches 0, stop iterating and set
int endIndex = currentIndex
5.2 otherwise keep reverse iterating
6. return [startIndex, endIndex]
The overall complexity is O(n).
It seems to me you could actually complete this task with only one pass through the text file. The solution would be in O(n + $\Sigma k_i$), where $k_i$ are the lengths of the keywords, making heavily use of the KMP algorithm. I should add that I am assuming the number of keywords is small with respect to the sum of the lengths of the text and the keywords.
- First construct the failure functions of each keyword.
- Keep an array of the start position of each keyword. This array is going to be updated each time the keyword is seen, so that it contains the start position of the last occurrence of the keyword at a given time. Initialize it with minus infinity.
- Initialize the “text portion” to be returned to “empty”, and its length to infinity.
- Scan through the text using the failure functions, keeping score of the last occurrence of each keyword.
- Whenever a keyword is found, compute the length of the portion of text by end_position – min(start_positions). If this text portion is smaller than the one previously found, store it and continue, else keep the previous one. In any case update the start_position of the keyword.
At the end of the first pass, you should have found the smallest text portion containing all key words.
I would look for all suffixes of the news article and compare these with the key words. A suffix tree can be constructed in linear time. Complexity is O(n log n).
You need a data structure with two pieces of info: a counter associated with each keyword (something like map<string,int>
) and a queue of keywords. You go through the text and for each keyword encountered do this:
Set state = incomplete. Increment its counter. If all counters >0, set state = complete (you have all the keywords). Insert it into the back of the queue together with its position in the text. While the counter of the keyword in front of the queue is >1, remove that keyword from the front and decrement the counter. If state == complete, peek at the position of the keyword at the front of the queue. It's the start of the interval that has all the keywords.
First observe that any min length fragment will start with one of the keywords.
Maintain 2 hash tables. The first hash table will contain the keywords and their counts. The second hash table will contain the position of those keywords. Store the number of keywords found in a variable so that we can quickly identify if we have found all the keywords. Maintain the beg and end of the text fragment in variable. Finally maintain a min counter to maintain the minimum length text fragment.
Start reading the input word by word, until you reach the first occurrence of a keyword.
Store the position of this keyword in the second hash table.
Check in the first hash table is this word has been seen for this fragment or not.
If no then increment the count of this word in the first hash table and increment the count of the number of keywords seen.
Else check if the existing position of this word from the hash table and compare it with the word at the beginning of the fragment. If they are the same, then move the beginning pointer forward and keep moving if forward until you find another keyword. Then do the same check of this keyword. If this is not latest position of this keyword, then keep moving forward. Stop when you find a keyword whose latest position is the beginning pointer.
After every keyword check if the count of keywords seen is equal to the number of keywords. If yes calculate the difference of end pointer and beginning pointer and update the minimum length accordingly
When you are done reading all words, the variable holding the minimum length is the ans.
I guess the trick is to move the beginning pointer in the right manner. That is why we need 2 hash tables. This is linear since the beginning pointer always moves forward at any point.
精彩评论