Java string comparison

2023-01-15 10:18 问答作者：

I am comparing substrings in two large text files. Very simple, tokenizing into two token containers, comparing with 2 for loops. Performance is disastrous! Does anybody have an advice or idea how to improve performance?

for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    strLengthA = txtA.getNumToken(s);

    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1;

        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            strLengthB = txtB.getNumToken(t);

            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    subStrTemp = new SubStrTemp(
                        txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                        (tokenFileA + strLengthA - 1), 
                        (tokenFileB + strLengthB - 1));

                    if (subStrContainer.contains(subStrTemp) == false) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error");
                }
    开发者_运维问答        }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}

Generally my code reading two large Strings with Java Tokonizer into containers A and B. And then trying to compare substrings.Possision of Substrgs which are existing in both strings to store into a Vector. But performance is awful, also don't really know how to solve it with HashMap.

Your main problem is that you go through all txtB for each token in txtA.

You should store informations on token from txtA (in a HashMap for instance) and then in a second loop (but not a nested one) you compare the strings with the existing one in the Map.

On the same topic :

term frequency using java program
How to count words in java

You are doing a join with nested loops? Yes, that is O(n^2). What about doing a hash join instead? That is, create a map from (lowercased) strText to t and do lookups with this map rather than iterating over the token container?

Put the tokens of fileA into a trie data structure. Then when tokenising fileB you can check quite quickly if these tokens are in the trie. A few code comments would help.

A said, this is an issue of complexity and you're algorithm runs in O(n^2) instead of O(n) using hash.

For second order improvements try to call less to functions, for example you can get the size once

sizeB = txtB.TokenContainer.size();

Depeneds on the size, you may call the container once to get an array of strings to save the getStr....

Roni

继续阅读：comparison string

Java string comparison

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？