VB.NET - large generic list filtering against another list
I have two generic lists of type string, the first contains about 1,000,000 terms and the second contains about 100,000 keywords. The terms in the first list may or may not contain keywords from the second 开发者_C百科list. I need to isolate those terms in the first list that don't contain any keyword from the second list. Currently I'm doing like this (VB.NET with framework 3.5):
For Each keyword In keywordList
termList.RemoveAll(AddressOf ContainsKeyword)
Next
Private Shared Function ContainsKeyword(ByVal X As String) As Integer
If X.IndexOf(keyword) >= 0 Then
Return True
Else
Return False
End If
End Function
Needless to say, this takes forever. What is the fastest way to accomplish this? Perhaps using Dictionaries? Any hint would be of help
A straight Dictionary of the keywords won't work here because you are doing Contains checks and not just straight equality checks. One approach you might take is to combine the search terms into a tree. The amount a tree helps depends on how much overlap there is in the search terms. I put together a basic tree implementation (without much testing) as a starting point:
Public Class WordSearchTree
Private ReadOnly _branches As New Dictionary(Of Char, WordSearchTree)
Public Function WordContainsTerm(ByVal word As String) As Boolean
Return Not String.IsNullOrEmpty(word) AndAlso _
Enumerable.Range(0, word.Length - 1) _
.Any(Function(i) WordContainsInternal(word, i))
End Function
Private Function WordContainsInternal(ByVal word As String, ByVal charIndex As Integer) As Boolean
Return _branches.Count = 0 OrElse _
(_branches.ContainsKey(word(charIndex)) AndAlso _
charIndex < word.Length - 1 AndAlso _
_branches(word(charIndex)).WordContainsInternal(word, charIndex + 1))
End Function
Public Shared Function BuildTree(ByVal words As IEnumerable(Of String)) As WordSearchTree
If words Is Nothing Then Throw New ArgumentNullException("words")
Dim ret As New WordSearchTree()
For Each w In words
Dim curTree As WordSearchTree = ret
For Each c In w
If Not curTree._branches.ContainsKey(c) Then
curTree._branches.Add(c, New WordSearchTree())
End If
curTree = curTree._branches(c)
Next
Next
Return ret
End Function
End Class
and with that tree, you could do something like this:
Dim keys As WordSearchTree = WordSearchTree.Build(keywordList)
termList.RemoveAll(AddressOf keys.WordContainsTerm)
精彩评论