Count word occurrences in a text field with LINQ
How can I count the occurrences of a word in a database text field with LINQ?
Sample keyword: ASP.NET
EDIT 4 :
Database Records :
Record 1 : [TextField] = "Blah blah blah ASP.NET bli bli bli ASP.NET blu ASP.NET yop yop ASP.NET"
Record 2 : [TextField] = "Blah blah blah bli bli bli blu ASP.NET yop yop ASP.NET"
Record 3 : [TextField] = "Blah ASP.NET blah ASP.NET blah ASP.NET bli ASP.NET bli bli ASP.NET blu ASP.NET yop yop ASP.NET"
So
Record 1 contains 4 occurrences of the "ASP.NET" keyword
Record 2 contains 2 occurrences of the "ASP.NET" keyword
Record 3 contains 7 occurrences of the "ASP.NET" keyword
The expected result is an IList<RecordModel> ordered by word count, descending:
- Record 3
- Record 1
- Record 2
LINQ to SQL would be best, but LINQ to Objects is fine too :)
NB: Don't worry about the "." in the ASP.NET keyword (that is not the point of this question).
Edit 2: I see you updated the question, which changes things a bit. You want a count for each word, eh? Try this:
string input = "some random text: how many times does each word appear in some random text, or not so random in this case";
char[] separators = new char[] { ' ', ',', ':', ';', '?', '!', '\n', '\r', '\t' };
var query = from s in input.Split(separators)
            where s.Length > 0
            group s by s into g
            let count = g.Count()
            orderby count descending
            select new
            {
                Word = g.Key,
                Count = count
            };
Since you want words that might have a "." in them (e.g. "ASP.NET"), I've left "." out of the separator list. Unfortunately that pollutes some words: a sentence like "Blah blah blah. Blah blah." would yield "blah." as a separate word from "blah". You'll need to decide on a cleaning strategy here, e.g. if the "." has a letter on either side it counts as part of a word, otherwise it is treated as whitespace. That kind of logic is best done with a regex.
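The "letter on either side" rule could be sketched roughly like this. The token pattern and the case-insensitive grouping are my own assumptions, not something the question asks for, so adjust them to your data:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "Blah blah blah. Blah blah. ASP.NET is ASP.NET, right?";

// A token is a run of word characters, optionally followed by ".word" groups,
// so "ASP.NET" stays whole while a sentence-ending "." is dropped.
var tokenPattern = new Regex(@"\w+(?:\.\w+)*");

var wordCounts = tokenPattern.Matches(input)
    .Cast<Match>()
    // Assumption: "Blah" and "blah" should be counted as the same word.
    .GroupBy(m => m.Value, StringComparer.OrdinalIgnoreCase)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count)
    .ToList();

foreach (var entry in wordCounts)
{
    Console.WriteLine("{0}: {1}", entry.Word, entry.Count);
}
```

Matching the tokens directly avoids maintaining a separator list at all, at the cost of having to decide up front what counts as a word.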
A regex handles this nicely. You can use the \b metacharacter to anchor the word boundary, and escape the keyword to avoid unintended use of special regex characters. It also handles the cases of trailing periods, commas, etc.
string[] records =
{
    "foo ASP.NET bar",
    "foo bar",
    "foo ASP.NET? bar ASP.NET",
    "ASP.NET foo ASP.NET! bar ASP.NET",
    "ASP.NET, ASP.NET ASP.NET, ASP.NET"
};
string keyword = "ASP.NET";
string pattern = @"\b" + Regex.Escape(keyword) + @"\b";
var query = records.Select((t, i) => new
    {
        Index = i,
        Text = t,
        Count = Regex.Matches(t, pattern).Count
    })
    .OrderByDescending(item => item.Count);
foreach (var item in query)
{
    Console.WriteLine("Record {0}: {1} occurrences - {2}",
        item.Index, item.Count, item.Text);
}
Voila! :)
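For what it's worth, here is a rough sketch of the same idea applied to the three records from the question. The tuple array is only a stand-in for the database rows, and the Id and Text names are hypothetical. As far as I know LINQ to SQL has no translation for Regex.Matches, so with a real table the counting would have to happen in memory after a call such as AsEnumerable():

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Stand-in for rows loaded from the database (Id and Text are hypothetical names).
var records = new[]
{
    (Id: 1, Text: "Blah blah blah ASP.NET bli bli bli ASP.NET blu ASP.NET yop yop ASP.NET"),
    (Id: 2, Text: "Blah blah blah bli bli bli blu ASP.NET yop yop ASP.NET"),
    (Id: 3, Text: "Blah ASP.NET blah ASP.NET blah ASP.NET bli ASP.NET bli bli ASP.NET blu ASP.NET yop yop ASP.NET"),
};

string keyword = "ASP.NET";
string pattern = @"\b" + Regex.Escape(keyword) + @"\b";

var ranked = records
    .AsEnumerable() // with a LINQ to SQL table, this is where the query would move into memory
    .Select(r => new { r.Id, Count = Regex.Matches(r.Text, pattern).Count })
    .OrderByDescending(r => r.Count)
    .ToList();

foreach (var r in ranked)
{
    Console.WriteLine("Record {0}: {1} occurrences", r.Id, r.Count);
}
```

This lists record 3 first (7 occurrences), then record 1 (4), then record 2 (2), which is the order the question asks for. Pulling every row into memory may be expensive on a large table, so it is worth filtering in SQL first where possible.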
Use String.Split() to turn the string into an array of words, then use LINQ to count only the words that equal the one you want, like this:
myDbText.Split(' ').Count(token => token.Equals(word));
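Note that splitting on a single space misses words followed by punctuation ("ASP.NET?" is not equal to "ASP.NET"). A minimal sketch of one way around that, assuming the separator list below suits your text and that the comparison should ignore case (both are my assumptions):

```csharp
using System;
using System.Linq;

string myDbText = "foo ASP.NET? bar asp.net, baz ASP.NET";
string word = "ASP.NET";

// Assumed separator list; "." is left out so "ASP.NET" survives as one token.
char[] separators = { ' ', ',', ':', ';', '?', '!', '\n', '\r', '\t' };

int count = myDbText
    .Split(separators, StringSplitOptions.RemoveEmptyEntries)
    .Count(token => token.Equals(word, StringComparison.OrdinalIgnoreCase));

Console.WriteLine(count);
```

With the sample text above this counts 3 occurrences.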
You could use Regex.Matches(input, pattern).Count
or you could do the following:
int count = 0;
int startIndex = input.IndexOf(word);
while (startIndex != -1)
{
    ++count;
    // Resume one character after the last hit, so overlapping occurrences are counted too.
    startIndex = input.IndexOf(word, startIndex + 1);
}
Using LINQ here would be ugly.
I know this is more than the original question asked, but it still matches the subject and I'm including it for others who search on this question later. It does not require a whole-word match in the strings that are searched, but it can easily be modified to do so with the code from Ahmad's post.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Use this method to order objects and keep the existing type.
class Program
{
    static void Main(string[] args)
    {
        List<TwoFields> tfList = new List<TwoFields>();
        tfList.Add(new TwoFields { one = "foo ASP.NET barfoo bar", two = "bar" });
        tfList.Add(new TwoFields { one = "foo bar foo", two = "bar" });
        tfList.Add(new TwoFields { one = "", two = "barbarbarbarbar" });
        string keyword = "bar";
        string pattern = Regex.Escape(keyword);
        // Join the two fields with a space so a match cannot span the boundary between them.
        tfList = tfList.OrderByDescending(t => Regex.Matches(t.one + " " + t.two, pattern).Count).ToList();
        foreach (TwoFields tf in tfList)
        {
            Console.WriteLine("{0} : {1}", tf.one, tf.two);
        }
        Console.Read();
    }
}

// A class with two string fields to be searched on.
public class TwoFields
{
    public string one { get; set; }
    public string two { get; set; }
}
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Same as above, but uses multiple keywords.
class Program
{
    static void Main(string[] args)
    {
        List<TwoFields> tfList = new List<TwoFields>();
        tfList.Add(new TwoFields { one = "one one, two; three four five", two = "bar" });
        tfList.Add(new TwoFields { one = "one one two three", two = "bar" });
        tfList.Add(new TwoFields { one = "one two three four five five", two = "bar" });
        string keywords = "  five one    ";
        // Collapse runs of whitespace into a single space.
        string keywordsClean = Regex.Replace(keywords, @"\s+", " ").Trim();
        // Escape special characters (spaces become "\ "), then turn each escaped space into "or".
        string pattern = Regex.Escape(keywordsClean).Replace("\\ ", "|");
        // Join the two fields with a space so a match cannot span the boundary between them.
        tfList = tfList.OrderByDescending(t => Regex.Matches(t.one + " " + t.two, pattern).Count).ToList();
        foreach (TwoFields tf in tfList)
        {
            Console.WriteLine("{0} : {1}", tf.one, tf.two);
        }
        Console.Read();
    }
}

public class TwoFields
{
    public string one { get; set; }
    public string two { get; set; }
}
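If whole-word matching is wanted for several keywords, one possible sketch is to escape each keyword separately and wrap the alternatives in \b anchors, as in the single-keyword regex answer. The sample texts below are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string keywords = "  five   one ";

// Escape each keyword on its own, then join with "|" inside a non-capturing group.
string alternatives = string.Join("|",
    keywords.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(k => Regex.Escape(k)));
string pattern = @"\b(?:" + alternatives + @")\b";

// Made-up sample texts; "fives" must not count as a whole-word hit for "five".
string[] texts =
{
    "one one, two; three four five",
    "one one two three",
    "one two three four five fives",
};

var ranked = texts
    .Select(t => new { Text = t, Count = Regex.Matches(t, pattern).Count })
    .OrderByDescending(x => x.Count)
    .ToList();

foreach (var item in ranked)
{
    Console.WriteLine("{0} : {1}", item.Count, item.Text);
}
```

The non-capturing group matters: without it, the \b anchors would bind only to the first and last alternative.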