Word Count Algorithm in C#
I am looking for a good word count class or function. When I copy and paste something from the internet and compare the result of my custom word count algorithm with MS Word's count, it is always off by a little more than 10%. I think that is too much. Do you know of an accurate word count algorithm in C#?
As @astander suggests, you can do a String.Split as follows:
string[] a = s.Split(
new char[] { ' ', ',', ';', '.', '!', '"', '(', ')', '?' },
StringSplitOptions.RemoveEmptyEntries);
By passing in an array of chars, you can split on multiple word-break characters. Removing empty entries keeps the empty strings produced by consecutive separators from being counted as words.
Use String.Split with a predefined set of chars: punctuation, spaces (collapsing runs of multiple spaces), and any other chars that you determine to be "word splits".
What have you tried?
I did see that the previous user got nailed for links, but here are some examples of using regex or char matching. Hope it helps, and nobody gets hurt X-)
String.Split Method (Char[])
Word counter in C#
C# Word Count
Use a regular expression to find words (e.g. [\w]+) and just count the matches
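public static Regex regex = new Regex(
    "[\\w]+",
    RegexOptions.Multiline
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
);
// Matches (plural) returns every match; Match returns only the first one and has no Count.
int wordCount = regex.Matches(_someString).Count;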
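One thing worth checking with the regex approach is how the pattern treats apostrophes and hyphens, since [\w]+ breaks a word at both. The sketch below compares it with the whitespace-based pattern [\S]+ that another answer in this thread uses. The class and method names are made up for illustration, and whether either pattern matches Word's count for your text is something to verify yourself.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are assumptions, not part of any library.
public static class RegexWordCountDemo
{
    // Counts runs of word characters (letters, digits, underscore).
    // Apostrophes and hyphens end a run, so "Don't" yields two matches.
    public static int CountWordCharRuns(string text)
    {
        return Regex.Matches(text, @"\w+").Count;
    }

    // Counts runs of non-whitespace characters, so "Don't" and
    // "well-known" each yield a single match.
    public static int CountNonWhitespaceRuns(string text)
    {
        return Regex.Matches(text, @"\S+").Count;
    }
}
```

For the sample "Don't over-count well-known words." the first method returns 7 and the second returns 4, which is the kind of gap that can add up over a long text.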
You also need to check for newlines, tabs, and non-breaking spaces. I find it best to copy the source text into a StringBuilder and replace all newlines, tabs, and sentence-ending characters with spaces. Then split the string on spaces.
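A minimal sketch of that StringBuilder approach, under some assumptions: the class name is hypothetical, and the set of characters replaced here (CR, LF, tab, non-breaking space, and '.', '!', '?') is only one possible choice of "sentence-ending characters".

```csharp
using System;
using System.Text;

// Hypothetical helper class; the name and the replaced character set are assumptions.
public static class NormalizedWordCounter
{
    public static int Count(string text)
    {
        if (string.IsNullOrEmpty(text)) return 0;

        var sb = new StringBuilder(text);

        // Turn newlines, tabs, non-breaking spaces and sentence enders into plain spaces.
        foreach (char c in new[] { '\r', '\n', '\t', '\u00A0', '.', '!', '?' })
        {
            sb.Replace(c, ' ');
        }

        // Split on spaces; RemoveEmptyEntries drops the empty strings
        // that runs of consecutive spaces would otherwise produce.
        return sb.ToString()
                 .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                 .Length;
    }
}
```

For example, NormalizedWordCounter.Count("one\ttwo\nthree\u00A0four. five") returns 5.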
I've just had the same problem in ClipFlair, where I needed to calculate WPM (words per minute) for movie captions, so I came up with the following:
You can define this extension method in a static class and then add a using directive for that class's namespace in any file that needs it. The method is invoked as s.WordCount(), where s is a string (a variable, constant, or literal).
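public static int WordCount(this string s)
{
    int last = s.Length - 1;
    int count = 0;
    for (int i = 0; i <= last; i++)
    {
        // Count the last letter or digit of each word: one that ends the string
        // or is followed by white space or punctuation.
        if (char.IsLetterOrDigit(s[i]) &&
            ((i == last) || char.IsWhiteSpace(s[i + 1]) || char.IsPunctuation(s[i + 1])))
            count++;
    }
    return count;
}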
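To show how this could feed a words-per-minute figure for captions, here is a rough sketch. The WordCount logic is the same as above, wrapped in a static class so the snippet compiles on its own. The StringExtensions class name and the WordsPerMinute helper are hypothetical, not ClipFlair's actual code.

```csharp
using System;

// Hypothetical container class for the extension methods.
public static class StringExtensions
{
    // Same logic as the method above: counts each letter or digit that ends a word.
    public static int WordCount(this string s)
    {
        int last = s.Length - 1;
        int count = 0;
        for (int i = 0; i <= last; i++)
        {
            if (char.IsLetterOrDigit(s[i]) &&
                (i == last || char.IsWhiteSpace(s[i + 1]) || char.IsPunctuation(s[i + 1])))
                count++;
        }
        return count;
    }

    // Hypothetical helper: words per minute for a caption displayed for the given duration.
    public static double WordsPerMinute(this string caption, TimeSpan duration)
    {
        // Guard against zero or negative durations to avoid dividing by zero.
        if (duration <= TimeSpan.Zero) return 0;
        return caption.WordCount() / duration.TotalMinutes;
    }
}
```

With this in place, "Hello, world!".WordCount() returns 2, and "Hello, world!".WordsPerMinute(TimeSpan.FromSeconds(30)) returns 4.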
Here is a stripped-down version of the C# class I made for counting words, Asian words, characters, etc. Its results are almost the same as Microsoft Word's. I developed the original code for counting words in Microsoft Word documents.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace BL
{
    public class WordCount
    {
        public int NonAsianWordCount { get; set; }
        public int AsianWordCount { get; set; }
        public int TextLineCount { get; set; }
        public int TotalWordCount { get; set; }
        public int CharacterCount { get; set; }
        public int CharacterCountWithSpaces { get; set; }
        //public string Text { get; set; }

        public WordCount() { }

        public void GetCountWords(string s)
        {
            #region Regular Expression Collection
            // Every char from U+3001 to U+FFFF is treated as one Asian "word".
            string asianExpression = @"[\u3001-\uFFFF]";
            // Any run of non-whitespace chars is one non-Asian word.
            string englishExpression = @"[\S]+";
            // Lines are counted by runs of carriage returns only.
            string lineCountExpression = @"[\r]+";
            #endregion

            #region Asian Character
            MatchCollection asianCollection = Regex.Matches(s, asianExpression);
            AsianWordCount = asianCollection.Count; // Asian character count
            // Blank out the Asian chars so they are not counted again below.
            s = Regex.Replace(s, asianExpression, " ");
            #endregion

            #region English Characters Count
            MatchCollection collection = Regex.Matches(s, englishExpression);
            NonAsianWordCount = collection.Count;
            #endregion

            #region Text Lines Count
            MatchCollection lines = Regex.Matches(s, lineCountExpression);
            TextLineCount = lines.Count;
            #endregion

            #region Total Character Count
            CharacterCount = AsianWordCount;
            CharacterCountWithSpaces = CharacterCount;
            foreach (Match word in collection)
            {
                CharacterCount += word.Value.Length;
                // Assumes exactly one space after each word.
                CharacterCountWithSpaces += word.Value.Length + 1;
            }
            #endregion

            #region Total Word Count
            TotalWordCount = AsianWordCount + NonAsianWordCount;
            #endregion
        }
    }
}
using System;

public static class WordCount
{
    public static int Count(string text)
    {
        text = text.Trim(); // trim leading and trailing white space
        if (text == "") { return 0; } // empty text has no words

        // Split on ' ' (or any other chars that you consider word splitters).
        // RemoveEmptyEntries keeps runs of consecutive spaces from adding extra words.
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}