Parse a document into sentences

I have a question that should be simple enough for the experts but is painfully mysterious to me :) I'd like to parse a text (pre-processed, no special characters except for regular punctuation marks) into sentences and perform two tasks similar to:

  1. For each sentence, find the number of words (sentence length). Then, for the document, find the average sentence length. There is no need to report any sentence-level outputs. Note that the document contains a fair number of proper nouns, so a capital letter does not necessarily mean the start of a sentence. BUT the sentences in this document usually end in ".", "!", or "?".

  2. For each sentence, apply a regex pattern. If there's a match, give the sentence a value of, e.g., 1. For the whole document, report the number of matches. Again, only document-level outputs are needed.

I'm wondering if there's any way to do that, preferably in C#, or VB. Any help will be greatly appreciated.

======================

Example Paragraph:

This is an example of a paragraph! It contains three sentences? And the average sentence has many words. 

Example Pattern:

"three"

Output:

Number of sentences: 3.
Average sentence length: 6.
Number of matches: 1.


You can get a sentence (depending on your definition of a sentence) using:

(\A|[\.!\?:])[^\.!\?:]+

And a word using:

[a-zA-Z]+

The rest is easy - just look at the documentation for regular expressions on MSDN.
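
As a rough illustration of how these two patterns could be combined, here is a minimal sketch. The class name RegexSentenceStats and the method Analyze are made up for this example, and the sentence pattern is simplified to [^.!?:]+ so that the terminator itself is not part of the match. It assumes that a "word" is a run of ASCII letters and that a fragment without any word is not a sentence.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class RegexSentenceStats
{
    // Sentence: a run of characters that are not terminators (. ! ? :).
    private static readonly Regex SentencePattern = new Regex(@"[^.!?:]+");

    // Word: a run of ASCII letters, as suggested in the answer.
    private static readonly Regex WordPattern = new Regex(@"[a-zA-Z]+");

    // Returns (sentence count, average words per sentence, matching sentences).
    public static Tuple<int, double, int> Analyze(string text, string pattern)
    {
        var matcher = new Regex(pattern);

        // Keep only fragments that contain at least one word, so a trailing
        // space after the last terminator is not counted as a sentence.
        var sentences = SentencePattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(s => WordPattern.IsMatch(s))
            .ToList();

        if (sentences.Count == 0)
            return Tuple.Create(0, 0.0, 0);

        double average = sentences.Average(s => WordPattern.Matches(s).Count);
        int matches = sentences.Count(s => matcher.IsMatch(s));
        return Tuple.Create(sentences.Count, average, matches);
    }
}
```

For the example paragraph in the question and the pattern "three", Analyze should return 3 sentences, an average length of 6, and 1 match. Note that the colon is treated as a sentence terminator here, as in the pattern above; drop it from the character class if that is not what you want.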


This should work:

using System;
using System.Linq;
using System.Text.RegularExpressions;

string example =
    "This is an example of a paragraph! It contains three sentences? And the average sentence has many words.";

// Split into sentences on the terminating punctuation marks.
var splitExample = example.Split(new[] {'.', '!', '?'}, StringSplitOptions.RemoveEmptyEntries);

var matchExpression = new Regex("three");
// Average number of space-separated words per sentence.
double avgLength = splitExample.Average(x => x.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries).Length);
int sentences = splitExample.Length;
// Number of sentences that contain a match for the pattern.
int matches = splitExample.Count(x => matchExpression.IsMatch(x));


You could do a Split on the period (.), which would give you an array of sentences.

string[] sentences = document.Split('.');

Then you would do a Split on each sentence in that array, based on a space, to get the number of words.

And yes you would then use Regular Expressions to do your matching. Not much else I can add to that since you didn't specify what you're trying to match.
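
The steps in this answer could be sketched roughly as below. This is only an illustration under a few assumptions: "!" and "?" are added as separators because the question says sentences may end with them, fragments that contain no words (such as a trailing space after the last period) are skipped, and the names SplitSentenceStats and Summarize are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SplitSentenceStats
{
    // Returns (sentence count, average words per sentence, matching sentences).
    public static Tuple<int, double, int> Summarize(string document, string pattern)
    {
        char[] terminators = { '.', '!', '?' };
        string[] fragments = document.Split(terminators, StringSplitOptions.RemoveEmptyEntries);

        int sentenceCount = 0;
        int totalWords = 0;
        int matchCount = 0;

        foreach (string fragment in fragments)
        {
            // Split each sentence on spaces to count its words.
            string[] words = fragment.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length == 0)
                continue; // whitespace-only fragment, not a sentence

            sentenceCount++;
            totalWords += words.Length;

            // Document-level tally: one point per sentence that matches.
            if (Regex.IsMatch(fragment, pattern))
                matchCount++;
        }

        double average = sentenceCount == 0 ? 0.0 : (double)totalWords / sentenceCount;
        return Tuple.Create(sentenceCount, average, matchCount);
    }
}
```

Splitting on punctuation alone will also break on abbreviations and decimal numbers ("e.g.", "3.5"), so this is only reasonable for text that has been pre-processed the way the question describes.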
