开发者

VB.NET - large generic list filtering against another list

I have two generic lists of type string, the first contains about 1,000,000 terms and the second contains about 100,000 keywords. The terms in the first list may or may not contain keywords from the second 开发者_C百科list. I need to isolate those terms in the first list that don't contain any keyword from the second list. Currently I'm doing like this (VB.NET with framework 3.5):

For Each keyword In keywordList
    termList.RemoveAll(AddressOf ContainsKeyword)
Next

Private Shared Function ContainsKeyword(ByVal X As String) As Integer
    If X.IndexOf(keyword) >= 0 Then
        Return True
    Else
        Return False
    End If
End Function

Needless to say, this takes forever. What is the fastest way to accomplish this? Perhaps using Dictionaries? Any hint would be of help


A straight Dictionary of the keywords won't work here because you are doing Contains checks and not just straight equality checks. One approach you might take is to combine the search terms into a tree. The amount a tree helps depends on how much overlap there is in the search terms. I put together a basic tree implementation (without much testing) as a starting point:

Public Class WordSearchTree

    Private ReadOnly _branches As New Dictionary(Of Char, WordSearchTree)

    Public Function WordContainsTerm(ByVal word As String) As Boolean
        Return Not String.IsNullOrEmpty(word) AndAlso _
               Enumerable.Range(0, word.Length - 1) _
                         .Any(Function(i) WordContainsInternal(word, i))
    End Function

    Private Function WordContainsInternal(ByVal word As String, ByVal charIndex As Integer) As Boolean
        Return _branches.Count = 0 OrElse _
               (_branches.ContainsKey(word(charIndex)) AndAlso _
                charIndex < word.Length - 1 AndAlso _
                _branches(word(charIndex)).WordContainsInternal(word, charIndex + 1))
    End Function

    Public Shared Function BuildTree(ByVal words As IEnumerable(Of String)) As WordSearchTree
        If words Is Nothing Then Throw New ArgumentNullException("words")
        Dim ret As New WordSearchTree()
        For Each w In words
            Dim curTree As WordSearchTree = ret
            For Each c In w
                If Not curTree._branches.ContainsKey(c) Then
                    curTree._branches.Add(c, New WordSearchTree())
                End If
                curTree = curTree._branches(c)
            Next
        Next
        Return ret
    End Function

End Class

and with that tree, you could do something like this:

Dim keys As WordSearchTree = WordSearchTree.Build(keywordList)
termList.RemoveAll(AddressOf keys.WordContainsTerm)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜