Implement a multithreaded crawler
I would like to implement a multithreaded crawler using the single-threaded crawler code I have now. Basically I read the URLs from a text file, then crawl and parse each one. I know the basics of creating a thread and assigning a procedure to it, but I am not sure how to implement the following:
I need at least 3 threads, and I need to assign a URL from a list of URLs to each thread. Each thread then needs to fetch its URL and parse it before adding the contents to a database.
    Dim gthread, tthread, ithread As Thread
    gthread = New Thread(AddressOf processUrl)
    gthread.Start(url)
    tthread = New Thread(AddressOf processUrl)
    tthread.Start(url)
    ithread = New Thread(AddressOf processUrl)
    ithread.Start(url)
    WaitUntilAllAreOver:
    If gthread.ThreadState = ThreadState.Running Then
        Thread.Sleep(5)
        GoTo WaitUntilAllAreOver
    End If
    'etc..
Now the code may not make sense, but what I need to do is give each thread a unique URL to process.
Any ideas appreciated.
The best way to wait for the Thread instances to finish is to call the Join method on each of them. Take the following example:
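    ' At the top of the file:
    Imports System
    Imports System.Collections.Generic
    Imports System.Threading

    Public Sub ParseAll(ByVal ParamArray urls As Uri())
        Dim list As New List(Of Thread)
        ' Start one thread per URL; ProcessUrl receives the URL as its Object argument.
        For Each url In urls
            Dim thread = New Thread(AddressOf ProcessUrl)
            thread.Start(url)
            list.Add(thread)
        Next
        ' Block until every started thread has finished.
        For Each thread In list
            thread.Join()
        Next
    End Sub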
Though you may want to consider using the ThreadPool here. The ThreadPool is designed for spawning off lots of small tasks very efficiently.
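A minimal sketch of the ThreadPool variant, in case it helps. It assumes .NET 4 or later (for CountdownEvent and multi-line lambdas); ParseAllPooled and the handler parameter are hypothetical names standing in for your own fetch/parse/store routine, not anything from the original code.

```vbnet
Imports System
Imports System.Threading

Module ThreadPoolSketch
    ' Queues one pool work item per URL and blocks until all of them have finished.
    ' handler is a hypothetical stand-in for the real fetch/parse/store routine.
    Public Sub ParseAllPooled(ByVal handler As Action(Of Uri), ByVal ParamArray urls As Uri())
        If urls.Length = 0 Then Return
        ' CountdownEvent becomes signalled after urls.Length calls to Signal.
        Dim remaining As New CountdownEvent(urls.Length)
        For Each url In urls
            ' The URL is passed as the state object, so the lambda does not
            ' capture the loop variable.
            ThreadPool.QueueUserWorkItem(
                Sub(state As Object)
                    Try
                        handler(DirectCast(state, Uri))
                    Finally
                        ' Signal even if the handler throws, so Wait cannot hang.
                        remaining.Signal()
                    End Try
                End Sub, url)
        Next
        remaining.Wait()
        remaining.Dispose()
    End Sub

    Sub Main()
        ParseAllPooled(
            Sub(u) Console.WriteLine("Processing {0}", u),
            New Uri("http://example.com/a"),
            New Uri("http://example.com/b"),
            New Uri("http://example.com/c"))
    End Sub
End Module
```

The pool decides how many URLs run at the same time, so this does not guarantee exactly 3 threads. Note that an exception thrown by the handler on a pool thread is not caught here and would still bring the process down, so real crawling code should catch its own errors.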
You could use a synchronized Queue that you push the URLs onto; every crawler thread then takes the next URL to visit from this Queue. When the crawlers detect new URLs, they push them onto the Queue too.
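Here is a rough sketch of the queue approach with a fixed number of worker threads. It assumes .NET 4 or later and uses ConcurrentQueue as the synchronized queue; CrawlAll, workerCount and handler are hypothetical names, and handler stands in for the real fetch/parse/store code.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.Collections.Generic
Imports System.Threading

Module QueueCrawlerSketch
    ' Starts workerCount threads that each keep taking the next URL from a
    ' shared queue until it is empty. Returns the number of URLs processed.
    Public Function CrawlAll(ByVal urls As IEnumerable(Of Uri),
                             ByVal workerCount As Integer,
                             ByVal handler As Action(Of Uri)) As Integer
        ' ConcurrentQueue is safe to dequeue from several threads at once.
        Dim pending As New ConcurrentQueue(Of Uri)(urls)
        Dim processed As Integer = 0
        Dim workers As New List(Of Thread)

        For i As Integer = 1 To workerCount
            Dim worker As New Thread(
                Sub()
                    Dim nextUrl As Uri = Nothing
                    ' TryDequeue returns False once no URL is left.
                    While pending.TryDequeue(nextUrl)
                        handler(nextUrl)
                        Interlocked.Increment(processed)
                    End While
                End Sub)
            worker.Start()
            workers.Add(worker)
        Next

        ' Wait for every worker to drain the queue and exit.
        For Each t In workers
            t.Join()
        Next
        Return processed
    End Function

    Sub Main()
        Dim urls = New Uri() {
            New Uri("http://example.com/a"),
            New Uri("http://example.com/b"),
            New Uri("http://example.com/c"),
            New Uri("http://example.com/d")}
        Dim total = CrawlAll(urls, 3, Sub(u) Console.WriteLine("Processing {0}", u))
        Console.WriteLine("Processed {0} URLs", total)
    End Sub
End Module
```

Each URL is handed to exactly one thread, which covers the "unique URL per thread" requirement. This simple loop suits a fixed list: a worker exits as soon as it finds the queue empty, so if crawlers also push newly discovered URLs, some extra coordination is needed to keep idle workers alive while others are still producing.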
I recommend using a BackgroundWorker to accomplish this.
Look into the Concurrency and Coordination Runtime (CCR). I have built a few crawlers based on that framework, and it makes things very easy once you understand how the CCR works.
It should take you a few hours to get up to speed with the CCR.