Reading a large file is very slow, please help
This code takes about 30 minutes and uses a lot of CPU. What is the problem?
Do
    strLine = objReader.ReadLine()
    If strLine Is Nothing Then
        Exit Do
    End If
    'check valid proxy
    m = Regex.Match(strLine.Trim, strProxyParttern)
    strMatch = m.Value.Trim
    If String.IsNullOrEmpty(strMatch) = True OrElse _
        strMatch.Contains("..") = True Then
        Continue Do
    End If
    ' create proxy
    With tmpProxy
        .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
        .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
        .Status = "new"
    End With
    ' check
    If lstProxys.Contains(tmpProxy) = True Then
        Continue Do
    End If
    lstProxys.Add(tmpProxy)
    Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
    Exit Sub
End If
Is the slowness from the comparison, from reading the file, or from the regex?
EDIT
Profiling the code like this:
Dim myTimer As New System.Diagnostics.Stopwatch()
Dim t1 As Integer = 0
Dim t2 As Integer = 0
Dim t3 As Integer = 0
'read the file line by line, collecting valid proxy
Do
    'Read a line from the file
    myTimer.Reset()
    myTimer.Start()
    strLine = objReader.ReadLine()
    If strLine Is Nothing Then
        Exit Do
    End If
    myTimer.Stop()
    t1 = myTimer.Elapsed.Milliseconds
    'check valid proxy
    myTimer.Reset()
    myTimer.Start()
    m = Regex.Match(strLine.Trim, strProxyParttern)
    strMatch = m.Value.Trim
    If String.IsNullOrEmpty(strMatch) = True OrElse _
        strMatch.Contains("..") = True Then
        Continue Do
    End If
    myTimer.Stop()
    t2 = myTimer.Elapsed.Milliseconds
    ' create proxy
    myTimer.Reset()
    myTimer.Start()
    tmpProxy.IP = strMatch.Substring(0, strMatch.IndexOf(":"))
    tmpProxy.Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
    tmpProxy.Status = "new"
    ' check
    If lstProxys.Contains(tmpProxy) = True Then
        Continue Do
    End If
    lstProxys.Add(tmpProxy)
    myTimer.Stop()
    ' store the add/compare time in t3 so that "Add" is reported separately
    t3 = myTimer.Elapsed.Milliseconds
    Debug.Print(String.Format("Read={0}, Match={1}, Add={2}", t1, t2, t3))
Loop Until strLine Is Nothing
gave these results:
Read=0, Match=0, Add=1
Read=0, Match=0, Add=1
Read=0, Match=0, Add=2
...
Read=0, Match=0, Add=9
Read=0, Match=0, Add=9
Read=0, Match=0, Add=10
...
...
Read=0, Match=0, Add=39
Read=0, Match=0, Add=39
Read=0, Match=0, Add=40
etc
So it looks like the code is OK, except for the add to the list, right?
The speed issue is because you are using a List(Of Structure). The List.Contains method is a linear search (it goes through each item of the list to see if it matches) so it takes increasingly longer the more unique items you add to the list.
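Because you're dealing with a large number of items, change lstProxys into a HashSet(Of T), which looks items up by hash code instead of scanning the whole collection. You should see a huge performance boost. All you should need to do is change the definition of lstProxys, putting the name of your structure in place of YourStructure:
Dim lstProxys As New HashSet(Of YourStructure)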
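As a rough sketch of that change: the Proxy structure below is a hypothetical stand-in (the real definition isn't shown in the question), and treating two proxies as equal when IP and Port match while ignoring Status is an assumption. HashSet(Of T).Add returns False when the item is already present, so the separate Contains check is no longer needed.

```vb
Imports System
Imports System.Collections.Generic

Module HashSetSketch
    ' Hypothetical stand-in for the proxy structure; the real one is not shown.
    Structure Proxy
        Implements IEquatable(Of Proxy)

        Public IP As String
        Public Port As Integer
        Public Status As String

        ' Assumption: two proxies are equal when IP and Port match (Status ignored).
        Public Overloads Function Equals(other As Proxy) As Boolean _
            Implements IEquatable(Of Proxy).Equals
            Return String.Equals(IP, other.IP) AndAlso Port = other.Port
        End Function

        Public Overloads Overrides Function Equals(obj As Object) As Boolean
            Return TypeOf obj Is Proxy AndAlso Equals(DirectCast(obj, Proxy))
        End Function

        ' Must agree with Equals: equal proxies produce equal hash codes.
        Public Overrides Function GetHashCode() As Integer
            Dim ipHash As Integer = If(IP Is Nothing, 0, IP.GetHashCode())
            Return ipHash Xor Port
        End Function
    End Structure

    ' Parses "ip:port" strings and returns the distinct proxies.
    Function CollectProxies(entries As IEnumerable(Of String)) As HashSet(Of Proxy)
        Dim proxies As New HashSet(Of Proxy)()
        For Each entry As String In entries
            Dim sep As Integer = entry.IndexOf(":"c)
            Dim p As New Proxy()
            p.IP = entry.Substring(0, sep)
            p.Port = CInt(entry.Substring(sep + 1))
            p.Status = "new"
            ' Add returns False for a duplicate, so no Contains call is needed.
            proxies.Add(p)
        Next
        Return proxies
    End Function

    Sub Main()
        Dim entries As String() = {"10.0.0.1:8080", "10.0.0.2:3128", "10.0.0.1:8080"}
        Console.WriteLine(CollectProxies(entries).Count) ' prints 2
    End Sub
End Module
```

Implementing IEquatable(Of Proxy) and overriding GetHashCode is optional, but it keeps the set from falling back on the default structure comparison, which compares every field.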
Disk I/O is usually the limiting factor for something like this. Depending on the disk speed, you could expect a throughput of about 5-20 MB per second.
Regular expressions can be slow if they contain expressions that cause a lot of backtracking, so that is a possibility, but the pattern would have to be pretty bad for it to be noticeable compared to the disk I/O.
If the proxy is a class, there will never be more than one item in the proxy list, so that comparison can't be the problem. You are not creating a new proxy object for each line but reusing the same one, which means that you change the properties of the object that you have already put in the list. As you are comparing the object with itself, the list will always contain the object after the first iteration, and it will never be added a second time.
Does the proxy class do anything when you assign values to its properties? If it does something like creating a connection, that might be what's taking so long.
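Whether the list really stays at one item depends on how the proxy type is declared, which the question doesn't show. The sketch below uses two hypothetical types, ProxyClass and ProxyStruct, to show the difference: a reused class instance is found by Contains on every later pass, while a reused structure variable is copied into the list on each Add.

```vb
Imports System
Imports System.Collections.Generic

Module ReuseSketch
    ' Hypothetical types; the original post does not show the declaration.
    Class ProxyClass
        Public IP As String
    End Class

    Structure ProxyStruct
        Public IP As String
    End Structure

    ' Reuses ONE class instance for every address, as the loop in the question does.
    Function CountWithClass(addresses As String()) As Integer
        Dim list As New List(Of ProxyClass)()
        Dim sharedProxy As New ProxyClass()
        For Each address As String In addresses
            sharedProxy.IP = address
            ' The list holds a reference to sharedProxy, so Contains is True after the first Add.
            If Not list.Contains(sharedProxy) Then list.Add(sharedProxy)
        Next
        Return list.Count
    End Function

    ' Reuses one structure variable; Add stores a copy of its current value.
    Function CountWithStructure(addresses As String()) As Integer
        Dim list As New List(Of ProxyStruct)()
        Dim reusedValue As New ProxyStruct()
        For Each address As String In addresses
            reusedValue.IP = address
            ' Contains compares field values, so each new address is a new item.
            If Not list.Contains(reusedValue) Then list.Add(reusedValue)
        Next
        Return list.Count
    End Function

    Sub Main()
        Dim addresses As String() = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
        Console.WriteLine("Class list: {0}, structure list: {1}", _
                          CountWithClass(addresses), CountWithStructure(addresses))
        ' prints: Class list: 1, structure list: 3
    End Sub
End Module
```

With a structure the list does grow, and each Contains call has to scan it, which would match the steadily increasing "Add" times in the question's output.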
Is the slowness from the comparison, from reading the file, or from the regex?
We could take educated guesses, but why not measure it instead?
For example, run the following three tests separately, in release mode and without the debugger attached, and see how long each one takes:
'Test 1 Just IO
Do
    strLine = objReader.ReadLine()
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
    Exit Sub
End If
'Test 2 IO + Regex
Do
    strLine = objReader.ReadLine()
    If strLine Is Nothing Then
        Exit Do
    End If
    'check valid proxy
    m = Regex.Match(strLine.Trim, strProxyParttern)
    strMatch = m.Value.Trim
    If String.IsNullOrEmpty(strMatch) = True OrElse _
        strMatch.Contains("..") = True Then
        Continue Do
    End If
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
    Exit Sub
End If
'Test 3 IO + regex and Compare
Do
    strLine = objReader.ReadLine()
    If strLine Is Nothing Then
        Exit Do
    End If
    'check valid proxy
    m = Regex.Match(strLine.Trim, strProxyParttern)
    strMatch = m.Value.Trim
    If String.IsNullOrEmpty(strMatch) = True OrElse _
        strMatch.Contains("..") = True Then
        Continue Do
    End If
    ' create proxy
    With tmpProxy
        .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
        .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
        .Status = "new"
    End With
    ' check
    If lstProxys.Contains(tmpProxy) = True Then
        Continue Do
    End If
    lstProxys.Add(tmpProxy)
    Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
    Exit Sub
End If
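One possible way to time such tests is to wrap each one in a Stopwatch and report the total for the whole pass. This is only a sketch: TimeIt, ReadLinesOnly and ReadAndMatch are hypothetical helper names, ProxyPattern is an assumed pattern (the real strProxyParttern isn't shown), and a StringReader stands in for the file so the example runs on its own. It reads Elapsed.TotalMilliseconds, because Elapsed.Milliseconds is only the milliseconds component (0-999) of the elapsed time.

```vb
Imports System
Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions

Module TimingSketch
    ' Assumed pattern for illustration only; the original pattern is not shown.
    Const ProxyPattern As String = "\d{1,3}(\.\d{1,3}){3}:\d{1,5}"

    ' Runs one test over a fresh reader and returns the elapsed time.
    Function TimeIt(test As Action(Of TextReader), data As String) As TimeSpan
        Using reader As New StringReader(data)
            Dim watch As Stopwatch = Stopwatch.StartNew()
            test(reader)
            watch.Stop()
            Return watch.Elapsed
        End Using
    End Function

    ' Test 1: I/O only.
    Sub ReadLinesOnly(reader As TextReader)
        Dim current As String = reader.ReadLine()
        While current IsNot Nothing
            current = reader.ReadLine()
        End While
    End Sub

    ' Test 2: I/O plus the regex match.
    Sub ReadAndMatch(reader As TextReader)
        Dim current As String = reader.ReadLine()
        While current IsNot Nothing
            Regex.Match(current.Trim(), ProxyPattern)
            current = reader.ReadLine()
        End While
    End Sub

    Sub Main()
        ' Stand-in data; with a real file, pass a StreamReader to the tests instead.
        Dim data As String = String.Join(Environment.NewLine, _
            New String() {"10.0.0.1:8080", "not a proxy", "10.0.0.2:3128"})

        Console.WriteLine("Read only:    {0} ms", _
            TimeIt(AddressOf ReadLinesOnly, data).TotalMilliseconds)
        Console.WriteLine("Read + regex: {0} ms", _
            TimeIt(AddressOf ReadAndMatch, data).TotalMilliseconds)
    End Sub
End Module
```

Timing a whole pass avoids the per-line timer resolution problem, where each individual read or match rounds down to 0 ms.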