
Optimizing scanning large text and matching against list of words or phrases

I'm working on an app that takes an article (simple HTML page), and a list of vocabulary terms (each may be a word, a phrase, or even a sentence), and creates a link for each term it finds. The problem is that for larger texts with more terms it takes a long time. Currently we are dealing with this by initially displaying the unmarked text, processing the links in the background, and finally reloading the web view when processing finishes. Still, it can take a while and some of our users are not happy with it.

Right now the app uses a simple loop on the terms, doing a replacement in the HTML. Basically:

for (int i=0; i<terms.count; i++){
    NSString *term = [terms objectAtIndex:i];
    NSString *replaceString = [NSString stringWithFormat:@"<a href=\"myUrl:%d\">%@</a>", i, term];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:term 
                                                       withString:replaceString 
                                                          options:NSCaseInsensitiveSearch 
                                                            range:NSMakeRange(0, [htmlString length] )];
}

However, we are dealing with multiple languages, so there is not just one replacement per term, but twenty! That's because we have to deal with punctuation at the beginning (upside-down question marks in Spanish) and end of each term. We have to replace "term", "term.", and "term?" with an appropriate hyperlink.

Is there a more efficient method I could use to get this HTML marked up?

I need to keep the index of the original term so that it can be retrieved later when the user clicks the link.


You could process the text as follows:

  1. Instead of looping over the vocabulary, split the text into words and look up each word in the vocabulary.

  2. Create some index, hash table or dictionary to make the lookup efficient.

  3. Don't use stringByReplacingOccurrencesOfString:. Each time it's called it makes a copy of the whole text and won't release the memory until the autorelease pool is drained. (Interestingly, you haven't run into memory problems yet.) Instead, use an NSMutableString instance where you append each word (and the characters between them), either as it appeared in the original text or decorated as a link.
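A rough sketch of that single-pass idea for single-word terms, keeping the original term index for the link. All names here (termIndex, output, and the myUrl: scheme borrowed from your question) are illustrative, not a definitive implementation:

```objc
// Build the lookup table once: lowercased term -> original index.
NSMutableDictionary *termIndex = [NSMutableDictionary dictionary];
for (NSUInteger i = 0; i < [terms count]; i++) {
    [termIndex setObject:[NSNumber numberWithUnsignedInteger:i]
                  forKey:[[terms objectAtIndex:i] lowercaseString]];
}

NSMutableString *output = [NSMutableString stringWithCapacity:[htmlString length]];
NSScanner *scanner = [NSScanner scannerWithString:htmlString];
[scanner setCharactersToBeSkipped:nil];   // we copy the inter-word text ourselves
NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];

while (![scanner isAtEnd]) {
    NSString *gap = nil;
    NSString *word = nil;
    // Everything between words (spaces, punctuation, markup) is copied as-is,
    // so leading/trailing punctuation never needs its own replacement rule.
    if ([scanner scanUpToCharactersFromSet:letters intoString:&gap])
        [output appendString:gap];
    if ([scanner scanCharactersFromSet:letters intoString:&word]) {
        NSNumber *index = [termIndex objectForKey:[word lowercaseString]];
        if (index != nil)
            [output appendFormat:@"<a href=\"myUrl:%@\">%@</a>", index, word];
        else
            [output appendString:word];
    }
}
// NB: this sketch treats HTML tag names as words too; real code should
// skip markup, and multi-word terms would need lookahead over the next words.
```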


What you're doing right now is this:

for each vocabulary word 'term'
    search the HTML text for instances of term
        replace each instance of term with an appropriate hyperlink

If you have a large text, then each search takes that much longer. Further, every time you do a replacement, you have to create a new string containing a copy of the text to do the replacement on, since stringByReplacingOccurrencesOfString:withString:options:range: returns a new string rather than modifying the existing string. Multiply that by N replacements.

A better option would be to make a single pass through the string, searching for all terms at once, and building up the resulting output string in a mutable string to avoid a Shlemiel the Painter-like runtime.

For example, you could use regular expressions like so:

// Create a regular expression that is an alternation of all of the vocabulary
// words.  You only need to create this once at startup.
NSMutableString *pattern = [[[NSMutableString alloc] init] autorelease];
[pattern appendString:@"\\b("];

BOOL isFirstTerm = YES;
for (NSString *term in vocabularyList)
{
    if (!isFirstTerm)
    {
        [pattern appendString:@"|"];
    }
    isFirstTerm = NO;
    // Escape regex metacharacters so each term matches literally.
    [pattern appendString:[NSRegularExpression escapedPatternForString:term]];
}
[pattern appendString:@")\\b"];

// Create regular expression object
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:&error];

// Replace vocabulary matches with a hyperlink
NSMutableString *htmlCopy = [[htmlString mutableCopy] autorelease];
[regex replaceMatchesInString:htmlCopy
                      options:0
                        range:NSMakeRange(0, [htmlCopy length])
                 withTemplate:@"<a href=\"vocab.html?word=$1\">$1</a>"];
// Now use htmlCopy


Since the string replace method you're calling is order n (it scans and replaces across the whole text of n words), and you're doing it for m vocabulary terms, you have an O(n × m) algorithm.

If you could do it in one pass, that would be optimal (order n, for n words of HTML). The idea of presenting the un-replaced text first is still a good one, unless the delay is already unnoticeable even for large docs.

How about a hash set of vocabulary words: scan through the HTML word by word (skipping HTML markup), and if the current scanned word is in the hash set, append the hyperlink to the target buffer instead of the bare word. That keeps at most 2× the HTML content plus one hash of vocabulary words in memory.
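That scan could look roughly like this, using NSScanner to copy tags through untouched (a sketch, not production code; it links by word rather than by index, and the myUrl: scheme is taken from the question):

```objc
NSSet *vocabulary = [NSSet setWithArray:lowercasedTerms]; // assumed pre-lowercased
NSMutableString *out = [NSMutableString stringWithCapacity:[htmlString length]];
NSScanner *scanner = [NSScanner scannerWithString:htmlString];
[scanner setCharactersToBeSkipped:nil];
NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
NSMutableCharacterSet *interesting = [[letters mutableCopy] autorelease];
[interesting addCharactersInString:@"<"];

while (![scanner isAtEnd]) {
    NSString *chunk = nil;
    if ([scanner scanString:@"<" intoString:NULL]) {
        // Copy whole tags through untouched so markup is never linked.
        [scanner scanUpToString:@">" intoString:&chunk];
        [scanner scanString:@">" intoString:NULL];
        [out appendFormat:@"<%@>", chunk ? chunk : @""];
    } else if ([scanner scanCharactersFromSet:letters intoString:&chunk]) {
        if ([vocabulary containsObject:[chunk lowercaseString]])
            [out appendFormat:@"<a href=\"myUrl:%@\">%@</a>", chunk, chunk];
        else
            [out appendString:chunk];
    } else if ([scanner scanUpToCharactersFromSet:interesting intoString:&chunk]) {
        [out appendString:chunk];  // whitespace/punctuation between words
    }
}
```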


There are two approaches.

  1. Hash maps - if the maximal length of your phrases is limited, for example to two words, you can iterate over all words and bigrams (2-word sequences) and check them in a hash map. Complexity is linear, since a hash lookup is constant time in the ideal case.

  2. Automaton theory - you can combine simple automatons that each match one string into a single one that evaluates faster (i.e. dynamic programming). For example, if we have "John Smith"|"John Stuard" and merge them, we get John S(mith|tuard); this is so-called prefix optimisation (http://code.google.com/p/graph-expression/wiki/RegexpOptimization).

A more advanced algorithm can be found here: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

I like this approach more because there is no limitation on phrase length and it allows combining complex regexps.
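The hash-map idea for phrases of up to two words can be sketched like this (plainText and phraseSet are assumed inputs: the markup-stripped article and a set of lowercased unigrams and bigrams built from the vocabulary):

```objc
NSArray *words = [plainText componentsSeparatedByCharactersInSet:
                     [[NSCharacterSet letterCharacterSet] invertedSet]];
for (NSUInteger i = 0; i < [words count]; i++) {
    NSString *w = [[words objectAtIndex:i] lowercaseString];
    if ([w length] == 0) continue;  // adjacent separators produce empty strings
    if ([phraseSet containsObject:w]) {
        // single-word match at position i
    }
    if (i + 1 < [words count]) {
        NSString *bigram = [NSString stringWithFormat:@"%@ %@", w,
                            [[words objectAtIndex:i + 1] lowercaseString]];
        if ([phraseSet containsObject:bigram]) {
            // two-word match at positions i and i+1
        }
    }
}
```

Longer phrases just mean checking longer n-grams the same way, at the cost of a bigger set; the Aho-Corasick automaton above removes that length limit entirely.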

