N-gram function in vb.net -> create grams for words instead of characters
I recently found out about n-grams and the cool possibility to compare frequency of phrases in a text body with it. Now I'm trying to make an vb.net app that simply gets an text body and returns a list of the most frequently used phrases (where n >= 2).
I found an C# example of how to generate a n-gram from a text body so I started out with converting the code to VB. The problem is that this code does create one gram per cha开发者_运维百科racter instead of one per word. The delimiters I want to use for the words is: VbCrLf (new line), vbTab (tabs) and the following characters: !@#$%^&*()_+-={}|\:\"'?¿/.,<>’¡º×÷‘;«»[]
Does anyone have an idea how I can rewrite the following function for this purpose:
Friend Shared Function GenerateNGrams(ByVal text As String, ByVal gramLength As Integer) As String()
If text Is Nothing OrElse text.Length = 0 Then
Return Nothing
End If
Dim grams As New ArrayList()
Dim length As Integer = text.Length
If length < gramLength Then
Dim gram As String
For i As Integer = 1 To length
gram = text.Substring(0, (i) - (0))
If grams.IndexOf(gram) = -1 Then
grams.Add(gram)
End If
Next
gram = text.Substring(length - 1, (length) - (length - 1))
If grams.IndexOf(gram) = -1 Then
grams.Add(gram)
End If
Else
For i As Integer = 1 To gramLength - 1
Dim gram As String = text.Substring(0, (i) - (0))
If grams.IndexOf(gram) = -1 Then
grams.Add(gram)
End If
Next
For i As Integer = 0 To (length - gramLength)
Dim gram As String = text.Substring(i, (i + gramLength) - (i))
If grams.IndexOf(gram) = -1 Then
grams.Add(gram)
End If
Next
For i As Integer = (length - gramLength) + 1 To length - 1
Dim gram As String = text.Substring(i, (length) - (i))
If grams.IndexOf(gram) = -1 Then
grams.Add(gram)
End If
Next
End If
Return Tokeniser.ArrayListToArray(grams)
End Function
An n-gram for words is just a list of length n that stores these words. A list of n-grams is then simply a list of list of words. If you want to store frequency then you need a dictionary that is indexed by these n-grams. For your special case of 2-grams, you can imagine something like this:
Dim frequencies As New Dictionary(Of String(), Integer)(New ArrayComparer(Of String)())
Const separators as String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
ControlChars.CrLf & ControlChars.Tab
Dim words = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
For i As Integer = 0 To words.Length - 2
Dim ngram = New String() { words(i), words(i + 1) }
Dim oldValue As Integer = 0
frequencies.TryGetValue(ngram, oldValue)
frequencies(ngram) = oldValue + 1
Next
frequencies
should now contain a dictionary with all two consecutive word pairs contained in the text, and the frequency with which they appear (as a consecutive pair).
This code requires the ArrayComparer
class:
Public Class ArrayComparer(Of T)
Implements IEqualityComparer(Of T())
Private ReadOnly comparer As IEqualityComparer(Of T)
Public Sub New()
Me.New(EqualityComparer(Of T).Default)
End Sub
Public Sub New(ByVal comparer As IEqualityComparer(Of T))
Me.comparer = comparer
End Sub
Public Overloads Function Equals(ByVal a As T(), ByVal b As T()) As Boolean _
Implements IEqualityComparer(Of T()).Equals
System.Diagnostics.Debug.Assert(a.Length = b.Length)
For i As Integer = 0 to a.Length - 1
If Not comparer.Equals(a(i), b(i)) Then Return False
Next
Return True
End Function
Public Overloads Function GetHashCode(ByVal arr As T()) As Integer _
Implements IEqualityComparer(Of T()).GetHashCode
Dim hashCode As Integer = 17
For Each obj As T In arr
hashCode = ((hashCode << 5) - 1) Xor comparer.GetHashCode(obj)
Next
Return hashCode
End Function
End Class
Unfortunately, this code doesn’t compile on Mono because the VB compiler has problems finding the generic EqualityComparer
class. I’m therefore unable to test whether the GetHashCode
implementationw works as expected but it should be fine.
Thank you Konrad very much for this beginning of an solution!
I tried your code and got the following result:
Text = "Hello I am a test Also I am a test"
(I also included whitespace as a separator)
frequencies now has 9 items:
---------------------
Keys: "Hello", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
Keys: "test", "Also"
Value: 1
---------------------
Keys: "Also", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
My first question: shouldn't the 3 last key pairs get a value of 2 as they're found two times in the text?
Second: The reason I got into the n-gram approach is that I don't want to limit the word count (n) to a specific length. Is there a way to make a dynamic approach that tries to find the longest phrase match first and then step down to the last wordcount of 2?
My goal result for the sample query above is:
---------------------
Match: "I am a test"
Frequency: 2
---------------------
Match: "I am a"
Frequency: 2
---------------------
Match: "am a test"
Frequency: 2
---------------------
Match: "I am"
Frequency: 2
---------------------
Match: "am a"
Frequency: 2
---------------------
Match: "a test"
Frequency: 2
---------------------
There is an similar C++ approach for this written by Hatem Mostafa over at codeproject.com: N-gram and Fast Pattern Extraction Algorithm
Sadly I'm no C++ expert and have no idea how to convert this bit of code as it includes a lot of memory handling that .Net doesn't. The only problem with this example is that you have to specify the minimum word pattern length and I want it to be dynamic from 2 to max found.
精彩评论