Java Counting # of occurrences of a word in a string

2022-12-26 14:49

I have a large text file I am reading from, and I need to find out how many times certain words come up, for example the word the. I'm doing this line by line; each line is a string.

I need to make sure that I only count legitimate the's: the the in other should not count. This means I know I need to use regular expressions in some way. What I have tried so far is this:

numSpace += line.split("[^a-z]the[^a-z]").length;

I realize the regular expression may not be correct at the moment, but I also tried without it, just looking for occurrences of the word the, and I still get wrong numbers. I was under the impression that this would split the string into an array, and that the number of pieces would tell me how many times the word is in the string. I would be grateful for any ideas.

Update: Given some ideas, I've come up with this:

numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;

Though still getting some strange numbers. I was able to get an accurate general count (without the regular expression), now my issue is with the regexp.

Using split to count isn't the most efficient, but if you insist on doing that, the proper way is this:

haystack.split(needle, -1).length - 1

If you don't set limit to -1, split uses the default limit of 0, which removes trailing empty strings and so messes up your count.

From the API:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. [...] If n is zero then [...] trailing empty strings will be discarded.

You also need to subtract 1 from the length of the array, because N occurrences of the delimiter splits the string into N+1 parts.
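As a minimal sketch of the two points above (the class and method names are made up for illustration), this compares split with and without the -1 limit on a line that ends in delimiters:

```java
final class SplitCount {
    // Hypothetical helper: counts matches of a regex by splitting on it.
    // The -1 limit keeps trailing empty strings; N matches give N + 1 parts.
    static int countBySplit(String haystack, String regex) {
        return haystack.split(regex, -1).length - 1;
    }

    public static void main(String[] args) {
        String line = "a,b,,";
        System.out.println(line.split(",").length);     // 2: trailing empty strings dropped
        System.out.println(line.split(",", -1).length); // 4: trailing empty strings kept
        System.out.println(countBySplit(line, ","));    // 3: one per comma
    }
}
```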

As for the regex itself (i.e. the needle), you can put \b, the word boundary anchor, on both sides of the word. If you allow the word to contain metacharacters (e.g. to count occurrences of "$US"), you may want to pass it through Pattern.quote.
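A rough sketch of that combination, assuming a case-sensitive, literal word (countWholeWord is a hypothetical helper name, not something from the question):

```java
import java.util.regex.Pattern;

final class WholeWordSplit {
    // Hypothetical helper: counts whole-word, case-sensitive occurrences of a literal word.
    static int countWholeWord(String line, String word) {
        // Pattern.quote makes metacharacters in the word (such as '.') match literally.
        String needle = "\\b" + Pattern.quote(word) + "\\b";
        return line.split(needle, -1).length - 1;
    }

    public static void main(String[] args) {
        // "The" differs in case; "other" and "then" only contain "the".
        System.out.println(countWholeWord("The other the, then the", "the")); // 2
        // Without quoting, the '.' in "a.b" would also match the 'x' in "axb".
        System.out.println(countWholeWord("a.b axb a.b", "a.b"));             // 2
    }
}
```

One caveat: \b only matches next to a letter, digit, or underscore, so for a word that starts or ends with a symbol (such as "$US") the \b on that side will not act as a plain "start of word" check, and a lookaround-based boundary may be a better fit.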

I've come up with this:
numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;
Though still getting some strange numbers. I was able to get an accurate general count (without the regular expression), now my issue is with the regexp.

Now the issue is that you're not counting [Tt]he that appears as the first or last word, because the regex says that it has to be preceded/followed by some character, something that matches [^a-zA-Z] (that is, your match must be of length 5!). You're not allowing the case where there isn't a character at all!

You can try something like this instead:

"(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)"

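This isn't the most concise solution, but it handles the first and last word. Note that each match also consumes the characters next to the word, so two occurrences separated by a single character (as in "the the") are counted only once.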

Something like this (using negative lookarounds) also works:

"(?<![a-zA-Z])[Tt]he(?![a-zA-Z])"

This has the benefit of matching just [Tt]he, without any extra characters around it like your previous solution did. This is relevant in case you actually want to process the tokens returned by split, because the delimiter in this case isn't "stealing" anything from the tokens.
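To illustrate the difference, here is a small sketch that counts matches of both patterns with a Matcher. The lookahead is written as (?![a-zA-Z]), meaning "not followed by a letter"; the class and constant names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    // Consumes one neighboring character on each side (or anchors at the ends).
    static final Pattern CONSUMING =
            Pattern.compile("(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)");
    // Only inspects the neighbors; the match itself is just the word.
    static final Pattern LOOKAROUND =
            Pattern.compile("(?<![a-zA-Z])[Tt]he(?![a-zA-Z])");

    static int count(Pattern pattern, String s) {
        Matcher matcher = pattern.matcher(s);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String line = "the the the";
        // The first match swallows the space that the second "the" would need.
        System.out.println(count(CONSUMING, line));  // 2
        System.out.println(count(LOOKAROUND, line)); // 3
    }
}
```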

Non-`split`

Though using split to count is rather convenient, it isn't the most efficient (e.g. it does all kinds of work to return strings that you then discard). And since, as you said, you're counting line by line, the pattern also has to be recompiled and thrown away for every line.

A more efficient way would be to use the same regex you did before and do the usual Pattern.compile and while (matcher.find()) count++;
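A minimal sketch of that approach, assuming the lines arrive as any Iterable<String> (for example collected from a BufferedReader) and using \b[Tt]he\b as the pattern. The pattern is compiled once and a single Matcher is reset for each line:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherCount {
    // Compiled once, not once per line.
    private static final Pattern THE = Pattern.compile("\\b[Tt]he\\b");

    static int countThe(Iterable<String> lines) {
        int count = 0;
        Matcher matcher = THE.matcher("");
        for (String line : lines) {
            matcher.reset(line); // reuse the same Matcher for the next line
            while (matcher.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countThe(Arrays.asList(
                "The cat and the other cat.",
                "Then the dog left the room."))); // 4
    }
}
```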

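To get the number of occurrences of a specific word, use the code below:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Put \\b on both sides of the word to count whole words only.
    Pattern pattern = Pattern.compile("Thewordyouwant");
    Matcher matcher = pattern.matcher(string);
    int count = 0;
    while (matcher.find())
        count++;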

Why not run your line through Java's StringTokenizer? Then you can get the words broken up not just by spaces but also by commas and other punctuation. Just run through your tokens and count the occurrences of "the" or any other word you like.

It would be very easy to expand this a bit and build a map that has each word as a key and keeps a count of how often each word is used. You may also want to run each word through a stemming function, so you can count something more useful than just the raw words.

Splitting the strings sounds like a lot of overhead just to find the number of occurrences in a file. You could use String.indexOf(String, int) to step through the whole line/file in a loop, like this:
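A sketch of the tokenizer-plus-map idea, under a few assumptions: the delimiter set below is just a guess at "spaces, commas and other punctuation", words are lower-cased before counting, and no stemming is done. Note that the Javadoc describes StringTokenizer as a legacy class, so String.split or a Matcher may be preferable in new code.

```java
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class TokenizerCount {
    // Assumed delimiter set; adjust it to the text being read.
    private static final String DELIMS = " \t\n\r\f.,;:!?()[]{}\"";

    // Adds the words of one line to the running counts and returns the same map.
    static Map<String, Integer> countWords(String line, Map<String, Integer> counts) {
        StringTokenizer tokens = new StringTokenizer(line, DELIMS);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken().toLowerCase(Locale.ROOT);
            counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the old count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        countWords("The cat, the other cat.", counts);
        System.out.println(counts); // {cat=2, other=1, the=2}
    }
}
```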

int occurrences = 0;
int index = 0;
// Counts every occurrence of "the", including those inside longer words such as "other".
while (index < s.length() && (index = s.indexOf("the", index)) >= 0) {
    occurrences++;
    index += 3; // length of 'the'
}
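If only whole words should count, the same indexOf loop can check the characters on either side of each hit. This is a sketch under the assumption that "word character" means Character.isLetter; countWholeWord is a hypothetical name.

```java
final class IndexOfCount {
    // Counts non-overlapping, case-sensitive occurrences of word that are not
    // directly preceded or followed by a letter.
    static int countWholeWord(String s, String word) {
        if (word.isEmpty()) {
            return 0; // avoid looping forever on an empty search string
        }
        int occurrences = 0;
        int index = 0;
        while ((index = s.indexOf(word, index)) >= 0) {
            int end = index + word.length();
            boolean startOk = index == 0 || !Character.isLetter(s.charAt(index - 1));
            boolean endOk = end == s.length() || !Character.isLetter(s.charAt(end));
            if (startOk && endOk) {
                occurrences++;
            }
            index = end; // continue searching after this hit
        }
        return occurrences;
    }

    public static void main(String[] args) {
        System.out.println(countWholeWord("the other the, then the", "the")); // 3
    }
}
```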

I think this is an area where unit tests can really help. I had a similar problem some time ago, where I wanted to break a string up in a number of complex ways. Creating a number of tests, each of which checked a different source string, helped me to isolate the regex and also to see quickly when I got it wrong.

Certainly if you gave us an example of a test string and the result it would help us to give you better answers.
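As a rough illustration of that kind of test table, here is a sketch in plain Java (no test framework, to keep it self-contained). The pattern under test and the sample strings are assumptions chosen for the example, not data from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TheRegexCheck {
    // The regex under test; swap in whichever candidate is being evaluated.
    static final Pattern THE = Pattern.compile("(?<![a-zA-Z])[Tt]he(?![a-zA-Z])");

    static int count(String s) {
        Matcher matcher = THE.matcher(s);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // Runs every sample string and returns the number of failing cases.
    static int runCases() {
        Map<String, Integer> expected = new LinkedHashMap<>();
        expected.put("", 0);
        expected.put("the", 1);           // whole line is the word
        expected.put("The other", 1);     // first word, plus a substring hit to reject
        expected.put("then there", 0);    // prefixes only
        expected.put("the the the", 3);   // back-to-back occurrences
        expected.put("bathe the.", 1);    // suffix, then last word before punctuation
        int failures = 0;
        for (Map.Entry<String, Integer> e : expected.entrySet()) {
            int actual = count(e.getKey());
            if (actual != e.getValue()) {
                failures++;
                System.out.println("FAIL \"" + e.getKey() + "\": expected "
                        + e.getValue() + ", got " + actual);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(runCases() + " failing case(s)");
    }
}
```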

You can try using the word boundary \b in the regex:

\bthe\b

Also the size of the array returned by the split will be 1 more than the actual number of occurrences of the word the in the string.

Search for " the " using Boyer-Moore [in the remainder of the string after a hit] and count the number of occurrences?

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.TreeMap;

public class OccurenceOfWords {
    public static void main(String[] args) {
        String file = "c:\\customer1.txt";
        // TreeMap keeps the words in sorted order.
        Map<String, Integer> index = new TreeMap<>();

        // try-with-resources (Java 7+) closes both readers automatically
        try (FileReader fr = new FileReader(file);
             BufferedReader br = new BufferedReader(fr)) {
            String line = br.readLine();
            while (line != null) {
                String[] list = line.split("[ \n\t\r:;',.(){}]");
                for (String token : list) {
                    String word = token.toLowerCase();
                    if (word.length() != 0) {
                        Integer occur = index.get(word);
                        index.put(word, occur == null ? 1 : occur + 1);
                    }
                }
                // Read the next line only after every word of this one is counted.
                line = br.readLine();
            }
        } catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
        for (Map.Entry<String, Integer> entry : index.entrySet()) {
            System.out.printf("\n%10s\t%d", entry.getKey(), entry.getValue());
        }
    }
}
