.NET Regular Expression to find actual words in text
I am using VB .NET to write a program that reads the words from a supplied text file and counts how many times each word appears. I am using this regular expression:
Dim parser As New Regex("\w+")
It gives me almost 100% correct words, except when I have text like
"Ms Word App file name is word.exe." or "is this a c# statement If(a>b?1,0) ?"
In such cases I get [word & exe] and [If, a, b, 1 and 0] as separate words. It would be nice (for my purpose) if I received word.exe and If(a>b?1,0) as words.
I guess \w+ looks for white space, sentence-terminating punctuation marks and other punctuation marks to determine a word.
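For reference, \w+ does not look for delimiters at all: it matches runs of word characters (letters, digits, the underscore and a few other Unicode categories), so any other character, including a dot inside a file name, ends the match. A minimal sketch of that behaviour follows; the module name is hypothetical and the sample text is taken from the example above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordCharDemo
    Sub Main()
        ' \w+ matches maximal runs of word characters,
        ' so the "." splits "word.exe" into two separate matches.
        Dim parser As New Regex("\w+")
        For Each m As Match In parser.Matches("file name is word.exe.")
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

With this input the loop prints file, name, is, word and exe on separate lines.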
I want a similar regular expression that will not break a word at a punctuation mark if that mark is not at the end of the word. I think end-of-word can be defined by trailing whitespace or sentence-terminating punctuation (you may think of others). If you can suggest a regular expression (for VB .NET), that will be a great help.
Thanks.
If we assume that a . with a space after it is a full stop, then this regex should work:
[\w(?!\S)\.]+
Not a regular expression as such, but you could just do something like:
Dim words() As String = myString.Replace(". ", " ").Split(" "c)
(Code written from memory so probably won't compile exactly like that)
Edit: Realised that the code could be simplified.
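A slightly fuller sketch of the same split-based idea, under a few assumptions of mine: words are separated by spaces, tabs or line breaks, and only ".", "!" and "?" count as sentence-ending punctuation. The module name and sample string are hypothetical.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SplitDemo
    Sub Main()
        Dim myString As String = "Ms Word App file name is word.exe. It runs."
        ' Split on whitespace and drop the empty entries that repeated spaces would produce.
        Dim separators() As Char = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
        Dim words() As String = myString.Split(separators, StringSplitOptions.RemoveEmptyEntries)
        For Each w As String In words
            ' Strip sentence-ending punctuation from the end only,
            ' so "word.exe." becomes "word.exe" and the inner dot survives.
            Console.WriteLine(w.TrimEnd("."c, "!"c, "?"c))
        Next
    End Sub
End Module
```

Trimming each token after the split, instead of replacing ". " up front, also handles a full stop at the very end of the text.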
This expression has pretty good (although not perfect) results based on Expresso's default sample text:
((?:\w+[.\-!?#'])*\w+)(?=\s)
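As a rough sketch of how this pattern could feed the word count from the question, the snippet below tallies matches in a Dictionary. The module name, the sample text and the case-insensitive comparison are my assumptions, not part of the original answer.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LookaheadCountDemo
    Sub Main()
        ' Pattern from the answer: word pieces joined by inner punctuation,
        ' accepted only when whitespace follows.
        Dim parser As New Regex("((?:\w+[.\-!?#'])*\w+)(?=\s)")
        ' A trailing space is added so the last word also has whitespace after it.
        Dim sample As String = "open word.exe then close word.exe "
        Dim counts As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)
        For Each m As Match In parser.Matches(sample)
            ' TryGetValue leaves current at 0 for a word not seen before.
            Dim current As Integer = 0
            counts.TryGetValue(m.Value, current)
            counts(m.Value) = current + 1
        Next
        For Each pair As KeyValuePair(Of String, Integer) In counts
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value)
        Next
    End Sub
End Module
```

Because of the (?=\s) lookahead, a word at the very end of the text with no whitespace after it is not matched, and a word followed directly by punctuation and then a space (such as "word.exe. ") also appears to be skipped, which may be part of the "not perfect" caveat.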
I tried to post my code in the comment section, but it was too long for that. I am answering my own question, but the answer really came from Hun1Ahpu & Alan Moore.
Here is my code showing how I get rid of a trailing punctuation mark from a word.
Private mstrPunctuations As String = ",.'""`!@#$%^&*()_-+=?"

' Every run of non-whitespace characters is a candidate word.
Dim parser As New Regex("\S+")
Dim matches As MatchCollection = parser.Matches(CleanedSource)
Me.mintWordCount = matches.Count
For Each Word As Match In matches
    ' Check whether the word ends with one of the punctuation characters.
    Dim NeedChange As Boolean = False
    For Each aChar As Char In Me.mstrPunctuations.ToCharArray()
        If Word.Value.EndsWith(aChar) Then
            NeedChange = True
            Exit For
        End If
    Next
    If NeedChange Then
        ' Drop the single trailing punctuation character.
        SetStringStat(Word.Value.Substring(0, Word.Value.Length - 1))
    Else
        SetStringStat(Word.Value)
    End If
Next
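The loop above removes at most one trailing character per word. A possible variation, sketched here with a hypothetical module name and sample text, uses String.TrimEnd with the same punctuation set so that several trailing marks (for example "?!") are removed in one call.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TrimPunctuationDemo
    ' Same punctuation set as in the code above.
    Private Const Punctuations As String = ",.'""`!@#$%^&*()_-+=?"

    Sub Main()
        Dim trimChars() As Char = Punctuations.ToCharArray()
        Dim parser As New Regex("\S+")
        For Each token As Match In parser.Matches("Is it word.exe?! Yes, it is.")
            ' TrimEnd removes every trailing character found in trimChars,
            ' not just the last one, and leaves inner punctuation alone.
            Dim word As String = token.Value.TrimEnd(trimChars)
            ' Skip tokens that consisted of punctuation only.
            If word.Length > 0 Then Console.WriteLine(word)
        Next
    End Sub
End Module
```

One thing to keep in mind with either version: "#" and ")" are in the punctuation set, so tokens such as c# or If(a>b?1,0) lose their last character. Removing those characters from the set would keep them intact.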