Optimizing a lot of Scanner.findWithinHorizon(pattern, 0) calls

2023-01-02 00:05 Q&A author:

I'm building a process which extracts data from 6 csv-style files and two poorly laid out .txt reports and builds output CSVs, and I'm fully aware that there's going to be some overhead searching through all that whitespace thousands of times, but I never anticipated converting about 50,000 records would take 12 hours.

Excerpt of my manual matching code (I know it's horrible that I use lists of tokens like that, but it was the best thing I could think of):

public static String lookup(Pattern tokenBefore,
                             List<String> tokensAfter)
{
    String result = null;

    while(_match(tokenBefore)) { // block until all input is read
        if(id.hasNext())
        {
            result = id.next(); // capture the next token that matches

            if(_matchImmediate(tokensAfter)) // try to match tokensAfter to this result
                return result;
        } else
            return null; // end of file; no match
    }

    return null; // no matches
}

private static boolean _match(List<String> tokens)
{
    return _match(tokens, true);
}

private static boolean _match(Pattern token)
{
    if(token != null)
    {
        return (id.findWithinHorizon(token, 0) != null);
    } else {
        return false;
    }
}

private static boolean _match(List<String> tokens, boolean block)
{
    if(tokens != null && !tokens.isEmpty()) {
        if(id.findWithinHorizon(tokens.get(0), 0) == null)
            return false;

        for(int i = 1; i <= tokens.size(); i++)
        {
            if (i == tokens.size()) { // matches all tokens
                return true;
            } else if(id.hasNext() && !id.next().matches(tokens.get(i))) {
                break; // break to blocking behaviour
            }
        }
    } else {
        return true; // empty list always matches
    }

    if(block)
        return _match(tokens); // loop until we find something or nothing
    else
        return false; // return after just one attempted match
}

private static boolean _matchImmediate(List<String> tokens)
{
    if(tokens != null) {

        for(int i = 0; i <= tokens.size(); i++)
        {
            if (i == tokens.size()) { // matches all tokens
                return true;
            } else if(!id.hasNext() || !id.next().matches(tokens.get(i))) {
                return false; // doesn't match, or end of file
            }
        }

        return false; // we have some serious problems if this ever gets called
    } else {
        return true; // empty list always matches
    }
}

Basically, I'm wondering how I could work in an efficient string search (Boyer-Moore or similar). My Scanner id is scanning a java.lang.String; I figured that buffering the file in memory would reduce I/O, since the search here is performed thousands of times on a relatively small file. The performance increase compared to scanning a BufferedReader(FileReader(File)) was probably less than 1%, and the process still looks to be taking a LONG time.

I've also traced execution, and the slowness of my overall conversion process is definitely between the first and last line of the lookup method. So much so, in fact, that I ran a shortcut process to count the number of occurrences of various identifiers in the .csv-style files (I use 2 lookup methods, this is just one of them), and it finished indexing approx 4 different identifiers for 50,000 records in less than a minute. Compared to 12 hours, that's instant.

Some notes (updated 6/6/2010):

I still need the pattern-matching behaviour for tokenBefore.
The ID numbers I need don't necessarily start at a fixed position in a line, but the ID token is guaranteed to be followed by the name of the corresponding object.
I would ideally want to return a String, not the start position of the result as an int or something.

Anything to help me out, even if it saves 1ms per search, will help, so all input is appreciated. Thank you!

Usage scenario 1: I have a list of objects in file A which, in the old-style system, have an ID number that is not in file A. It is, however, POSSIBLY in another csv-style file (file B) or possibly in a .txt report (file C); each of these also contains a lot of other information that is not useful here. So file B needs to be searched for the object's full name (1 token, since it would reside within the second column of any given line), and the first column of that line should then be the ID number. If that doesn't work, we then have to split the search token by whitespace into separate tokens before searching file C for those tokens as well.

Generalised code:

String field;
for (/* each record in file A */)
{
    /* construct the rest of this object from file A info */
    // now to find the ID, if we can
    List<String> objectName = new ArrayList<String>(1);
    objectName.add(Pattern.quote(thisObject.fullName));
    field = lookup(objectSearchToken, objectName); // search file B
    if(field == null) // not found in file B
    {
        lookupReset(false); // initialise scanner to check file C
        objectName.clear(); // not using the full name

        String[] tokens = thisObject.fullName.split(id.delimiter().pattern());
        for(String s : tokens)
            objectName.add(Pattern.quote(s));

        field = lookup(objectSearchToken, objectName); // search file C
        lookupReset(true); // back to file B
    } else {
        /* found it, file B specific processing here */
    }

    if(field != null) // found it in B or C
        thisObject.ID = field;
}

The objectName tokens are all uppercase words with possible hyphens or apostrophes in them, separated by spaces (a person's name).
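For the file B half of this scenario, one option is to read file B once and build a name-to-ID map, instead of rescanning it for every record in file A. The following is only a sketch under assumptions that are not confirmed above: that file B is comma-separated, and that the ID sits in column 1 with the full name in column 2. The class name NameIndex and the sample rows are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class NameIndex {
    // Reads "id,name,..." rows once and maps each full name to its ID.
    // Assumption: comma-separated, ID in column 1, name in column 2.
    static Map<String, String> build(String fileBText) {
        Map<String, String> idByName = new HashMap<>();
        for (String line : fileBText.split("\\R")) {   // \R = any line break
            String[] cols = line.split(",", 3);
            if (cols.length >= 2) {
                // putIfAbsent keeps the first ID seen for a repeated name
                idByName.putIfAbsent(cols[1].trim(), cols[0].trim());
            }
        }
        return idByName;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows standing in for file B
        String fileB = "1001,JOHN SMITH,x\r\n1002,MARY O'NEIL-JONES,y\r\n";
        Map<String, String> index = build(fileB);
        System.out.println(index.get("MARY O'NEIL-JONES")); // 1002
        System.out.println(index.get("NOBODY"));            // null
    }
}
```

Each record in file A then costs one hash probe rather than one scan of file B. The free-form file C search would still need the pattern-based lookup.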

As per aioobe's answer, I have pre-compiled the regex for my constant search tokens, which in this case is just \r\n. The speedup noticed was about 20x in another one of the processes, where I compiled [0-9]{1,3}\\.[0-9]%|\r\n|0|[A-Z'-]+, although it was not noticed in the above code with \r\n. Working along these lines, it has me wondering:

Would it be better for me to match \r\n[^ ] if the only usable matches will be on lines beginning with a non-space character anyway? It may reduce the number of _match executions.
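One thing to watch with \r\n[^ ]: findWithinHorizon advances the scanner past everything the pattern matched, so the first character of the line would be consumed and the following next() would return the token without it. A lookahead avoids that. This is a sketch only; the class name, method name and sample text are made up for illustration.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class LineStartDemo {
    // \r\n followed by a non-space character. The lookahead checks that
    // character without consuming it, so the next token is read whole.
    static final Pattern LINE_START = Pattern.compile("\r\n(?=[^ ])");

    // Returns the first token of the next line that does not start with
    // a space, or null if there is no such line.
    static String firstTokenOfNextUnindentedLine(Scanner in) {
        if (in.findWithinHorizon(LINE_START, 0) == null) {
            return null;
        }
        return in.hasNext() ? in.next() : null;
    }

    public static void main(String[] args) {
        // Hypothetical report excerpt: one indented line, one usable line
        String text = "HEADER\r\n   indented 1\r\n12345 JOHN SMITH\r\n";
        Scanner in = new Scanner(text);
        System.out.println(firstTokenOfNextUnindentedLine(in)); // 12345
        System.out.println(firstTokenOfNextUnindentedLine(in)); // null
    }
}
```

With the plain \r\n[^ ] pattern the same input would yield 2345, because the leading 1 is part of the match.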

Another possible optimisation is this: concatenate all tokensAfter, and put a (.*) beforehand. It would reduce the number of regexes (all of which are literal anyway) that would be compiled by about 2/3, and also hopefully allow me to pull out the text from that grouping instead of keeping a "potential token" from every line with an ID on it. Is that also worth doing?
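A rough sketch of that idea, run with a Matcher over the in-memory text rather than through the Scanner. It swaps (.*) for (\S+), on the assumption that the ID is a single whitespace-free token directly before the name; a greedy (.*) would capture everything on the line before the name, not just the ID. CombinedLookup and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombinedLookup {
    // One pattern per name: (token before the name), then the name tokens
    // in order, separated by whitespace. (?!\S) stops SMITH from matching
    // the start of SMITHSON.
    static Pattern compile(List<String> nameTokens) {
        List<String> quoted = new ArrayList<>();
        for (String t : nameTokens) {
            quoted.add(Pattern.quote(t)); // tokens are literals, not regexes
        }
        return Pattern.compile("(\\S+)\\s+" + String.join("\\s+", quoted) + "(?!\\S)");
    }

    // Returns the token immediately before the name, or null if absent.
    static String lookup(CharSequence text, List<String> nameTokens) {
        Matcher m = compile(nameTokens).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical report excerpt
        String report = "REPORT\r\n  77 JANE DOE\r\n  42   JOHN SMITH   extra\r\n";
        System.out.println(lookup(report, Arrays.asList("JOHN", "SMITH")));    // 42
        System.out.println(lookup(report, Arrays.asList("JOHN", "SMITHSON"))); // null
    }
}
```

This compiles one regex per name instead of one per token, and the capture group returns the ID directly, so no "potential token" has to be kept per line.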

The above situation could be resolved if I could get java.util.Scanner to return the token previous to the current one after a call to findWithinHorizon.
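Scanner does not expose the previous token, but Scanner.match() returns the MatchResult of the last successful scanning operation, including findWithinHorizon. Putting a capture group in the pattern gets close to the same effect. A minimal sketch, assuming the wanted token directly precedes a literal name; the class name, method name and sample data are invented for illustration.

```java
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

final class PreviousTokenDemo {
    // Finds the literal name and returns the whitespace-free token just
    // before it, or null if the name does not occur in the remaining input.
    static String tokenBefore(Scanner in, String literalName) {
        Pattern p = Pattern.compile("(\\S+)\\s+" + Pattern.quote(literalName));
        if (in.findWithinHorizon(p, 0) == null) {
            return null;
        }
        MatchResult r = in.match(); // result of the findWithinHorizon call
        return r.group(1);
    }

    public static void main(String[] args) {
        // Hypothetical input
        String text = "ID  NAME\r\n1001 JOHN SMITH\r\n1002 MARY O'NEIL-JONES\r\n";
        Scanner in = new Scanner(text);
        System.out.println(tokenBefore(in, "MARY O'NEIL-JONES")); // 1002
    }
}
```

Note that match() throws IllegalStateException if the last operation did not succeed, which is why the null check comes first.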

Something to start with: Every single time you run id.next().matches(tokens.get(i)) the following code is executed:

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
return m.matches();

Compiling a regular expression is non-trivial and you should consider compiling the patterns once and for all in your program:

pattern[i] = Pattern.compile(tokens.get(i));

And then simply invoke something like

pattern[i].matcher(str).matches()
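Put together, this could look roughly like the sketch below, which mirrors the contract of _matchImmediate from the question. The class PrecompiledTokens is hypothetical, and passing the Scanner in as a parameter instead of using the static id field is a choice made only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class PrecompiledTokens {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile every token regex exactly once, up front.
    PrecompiledTokens(List<String> regexes) {
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
    }

    // True only if the next tokens match all patterns, in order.
    // An empty pattern list always matches.
    boolean matchImmediate(Scanner in) {
        for (Pattern p : patterns) {
            if (!in.hasNext() || !p.matcher(in.next()).matches()) {
                return false; // mismatch, or end of input
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PrecompiledTokens tokens =
                new PrecompiledTokens(Arrays.asList("[A-Z'-]+", "[0-9]+"));
        System.out.println(tokens.matchImmediate(new Scanner("SMITH 42 rest"))); // true
        System.out.println(tokens.matchImmediate(new Scanner("SMITH x")));       // false
    }
}
```

When every token is a literal wrapped in Pattern.quote, as in the question's usage code, comparing with String.equals gives the same result without any regex at all.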
