.NET Regular Expression to find actual words in text

I am using VB .NET to write a program that reads the words from a supplied text file and counts how many times each word appears. I am using this regular expression:

Dim parser As New Regex("\w+")

It gives me almost 100% correct words, except when I have text like

"Ms Word App file name is word.exe." or "is this a c# statement If(a>b?1,0) ?"

In such cases I get [word and exe] and [If, a, b, 1 and 0] as separate words. It would be nice (for my purpose) if I received word.exe and If(a>b?1,0) as words.

I guess \w+ treats white space, sentence-terminating punctuation and any other punctuation mark as the end of a word.
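
For reference, a minimal sketch of what \w+ returns on the first sample sentence. The module name WordPatternDemo is only illustrative; the pattern and the sample text are the ones from this question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordPatternDemo
    Sub Main()
        Dim sample As String = "Ms Word App file name is word.exe."

        ' \w+ matches runs of word characters (letters, digits, underscore),
        ' so the "." inside word.exe ends one match and starts another.
        For Each m As Match In Regex.Matches(sample, "\w+")
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

This lists word and exe as two separate matches, which is the behaviour described above.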

I want a similar regular expression that will not break a word at a punctuation mark if the punctuation mark is not at the end of the word. I think end-of-word can be defined by trailing white space or sentence-terminating punctuation (you may think of others). If you can suggest a regular expression (for VB .NET), that will be a great help.

Thanks.


If we assume that a . followed by a space is a full stop, then this regex should work:

[\w(?!\S)\.]+
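
One thing worth checking before relying on this pattern: inside square brackets, (?!\S) is not a lookahead. The brackets hold \w, the literal characters ( ? ! ) and ., and the \S shorthand, so the class accepts any non-whitespace character. The sketch below (module name hypothetical) compares it with a plain \S+ on the sample sentence from the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassDemo
    Sub Main()
        Dim sample As String = "Ms Word App file name is word.exe."

        ' The pattern from the answer and a plain "non-whitespace run" pattern.
        Dim bracketed As New Regex("[\w(?!\S)\.]+")
        Dim nonSpace As New Regex("\S+")

        For Each m As Match In bracketed.Matches(sample)
            Console.WriteLine(m.Value)
        Next

        ' Compare how many matches each pattern finds on this sample.
        Console.WriteLine(bracketed.Matches(sample).Count = nonSpace.Matches(sample).Count)
    End Sub
End Module
```

On this sample the last match is word.exe. with the full stop still attached, so the trailing punctuation has to be removed in a separate step.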


Not a regular expression as such, but you could just do something like:

Dim words() As String = myString.Replace(". ", " ").Split(" "c)

(Code written from memory so probably won't compile exactly like that)

Edit: Realised that the code could be simplified.
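
A minimal, compilable sketch of the Replace-and-Split idea, assuming words are separated by plain spaces. The sample text and the use of StringSplitOptions.RemoveEmptyEntries are additions for illustration, not part of the one-liner above.

```vbnet
Imports System

Module SplitDemo
    Sub Main()
        Dim myString As String = "Ms Word App file name is word.exe. It  runs on Windows."

        ' Turn ". " into a space so a full stop followed by a space
        ' no longer sticks to the preceding word.
        Dim withoutStops As String = myString.Replace(". ", " ")

        ' Split on spaces and drop the empty entries left by repeated spaces.
        Dim words() As String = withoutStops.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)

        For Each word As String In words
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

Note that the final word comes back as Windows. because no space follows its full stop, and other punctuation such as commas or question marks is not handled by this approach.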


This expression has pretty good (although not perfect) results based on Expresso's default sample text:

((?:\w+[.\-!?#'])*\w+)(?=\s)
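
A small sketch for trying this expression outside Expresso. The sample sentence and module name are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaheadDemo
    Sub Main()
        Dim sample As String = "Run word.exe and it isn't read-only today."

        ' Word pieces joined by . - ! ? # or ', and the whole word must be
        ' followed by whitespace because of the (?=\s) lookahead.
        Dim parser As New Regex("((?:\w+[.\-!?#'])*\w+)(?=\s)")

        For Each m As Match In parser.Matches(sample)
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

Here word.exe, isn't and read-only are each kept as one word. The word today is not returned at all, because it is followed by a full stop rather than whitespace, which is one of the ways the results are not perfect.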


I tried to post my code in the comment section, but it was too long for that. I am answering my own question, but the answer really came from Hun1Ahpu and Alan Moore.

I am pasting my code to show how I remove a trailing punctuation mark from a word.

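' Characters treated as trailing punctuation (class-level field).
Private mstrPunctuations As String = ",.'""`!@#$%^&*()_-+=?"

' Every run of non-whitespace characters is a candidate word.
Dim parser As New Regex("\S+")
' Run the match once and reuse the result for the count and the loop.
Dim matches As MatchCollection = parser.Matches(CleanedSource)
Me.mintWordCount = matches.Count
For Each Word As Match In matches
    ' Check whether the word ends with one of the punctuation characters.
    Dim NeedChange As Boolean = False
    For Each aChar As Char In Me.mstrPunctuations.ToCharArray()
        If Word.Value.EndsWith(aChar) Then
            NeedChange = True
            Exit For
        End If
    Next
    If NeedChange Then
        ' Drop the single trailing punctuation character before counting.
        SetStringStat(Word.Value.Substring(0, Word.Value.Length - 1))
    Else
        SetStringStat(Word.Value)
    End If
Next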
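
The loop above removes at most one trailing punctuation character per word. A possible variation, sketched here under stated assumptions, uses String.TrimEnd to remove all trailing punctuation and a Dictionary to hold the counts. CountWords, the case-insensitive comparer and the sample text are hypothetical and stand in for SetStringStat and CleanedSource, whose definitions are not shown in this post.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module WordCountSketch
    ' Hypothetical helper, not from the original post.
    Function CountWords(ByVal source As String) As Dictionary(Of String, Integer)
        Dim punctuation() As Char = ",.'""`!@#$%^&*()_-+=?".ToCharArray()
        ' Assumption: "Word" and "word" should be counted as the same word.
        Dim counts As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)

        For Each m As Match In Regex.Matches(source, "\S+")
            ' TrimEnd removes every trailing punctuation character, not just one.
            Dim word As String = m.Value.TrimEnd(punctuation)
            ' Skip tokens that consisted only of punctuation.
            If word.Length = 0 Then Continue For

            Dim seen As Integer = 0
            counts.TryGetValue(word, seen)
            counts(word) = seen + 1
        Next
        Return counts
    End Function

    Sub Main()
        Dim counts As Dictionary(Of String, Integer) = CountWords("Ms Word App file name is word.exe. Is Word open?")
        For Each pair As KeyValuePair(Of String, Integer) In counts
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value)
        Next
    End Sub
End Module
```

As with the original loop, a closing parenthesis counts as punctuation, so a token such as If(a>b?1,0) loses its final ). Removing ( and ) from the punctuation list would avoid that, at the cost of keeping parentheses around ordinary words.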