Asynchronous crawling in F#
When crawling web pages I need to be careful not to make too many requests to the same domain; for example, I want to leave 1 s between requests. From what I understand, it is the time between requests that matters. So to speed things up I want to use async workflows in F#, the idea being to make the requests at 1 s intervals but avoid blocking while waiting for each response.
let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer : int) =
    async {
        let req = (WebRequest.Create(uri)) :?> HttpWebRequest
        req.UserAgent <- "Mozilla"
        try
            Thread.Sleep(timer)
            let! resp = (req.AsyncGetResponse())
            Console.WriteLine(uri.AbsoluteUri + " got response")
            use stream = resp.GetResponseStream()
            use reader = new StreamReader(stream)
            let html = reader.ReadToEnd()
            return html
        with
        | _ as ex -> return "Bad Link"
    }
Then I do something like:
let uri1 = System.Uri "http://rue89.com"
let timer = 1000
let jobs = [| for i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer |]
jobs
|> Array.mapi (fun i job ->
    Console.WriteLine("Starting job " + string i)
    Async.StartAsTask(job).Result)
Is this all right? I am very unsure about two things:
- Does the Thread.Sleep call work for delaying the request?
- Is using StartAsTask a problem?
I am a beginner (as you may have noticed) in F# (and in coding in general, actually), and everything involving threads scares me :)
Thanks!
I think what you want to do is:
- create 10 jobs, numbered 'n', each starting 'n' seconds from now
- run those all in parallel
Approximately like
let makeAsync uri n = async {
    // create the request
    do! Async.Sleep(n * 1000)
    // AsyncGetResponse etc
}
let a = [| for i in 1..10 -> makeAsync uri i |]
let results = a |> Async.Parallel |> Async.RunSynchronously
Note that of course they won't all start at exactly the same instant. If, for example, you have a 4-core machine, 4 will start running very soon but quickly reach the Async.Sleep, at which point the next 4 will run up to their sleeps, and so forth. Then after one second the first async wakes up and posts its request, a second later the second async wakes up, and so on, so this should work. The 1 s is only approximate, since each timer starts a tiny bit staggered from the others. You may want to pad it a little, e.g. 1100 ms, if the cut-off you need really is exactly one second (network latencies and the like still leave some of this outside your program's control).
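To make the staggering concrete, here is a minimal filled-in sketch of the idea above. It is an illustration, not a finished crawler: fakeFetch is a hypothetical stand-in for the real HTTP call so the example runs offline, the clock and intervalMs parameters are names I added, and the interval is shortened to 100 ms only to keep the run short.

```fsharp
open System
open System.Diagnostics

// Hypothetical stand-in for the real HTTP call so the sketch runs offline;
// for real crawling, replace it with AsyncGetResponse plus a StreamReader.
let fakeFetch (uri : Uri) = async {
    do! Async.Sleep 20
    return "<html>" + uri.AbsoluteUri + "</html>" }

// Job n waits n * intervalMs without blocking a thread, then fetches.
// It returns the time at which the request was issued, plus the result.
let makeAsync (clock : Stopwatch) (intervalMs : int) (uri : Uri) (n : int) = async {
    do! Async.Sleep(n * intervalMs)
    let startedAt = clock.ElapsedMilliseconds
    try
        let! html = fakeFetch uri
        return (startedAt, html)
    with _ ->
        return (startedAt, "Bad Link") }

let clock = Stopwatch.StartNew()
let uri = Uri "http://rue89.com"

// 100 ms instead of 1000 ms, purely to keep the demo short.
let results =
    [| for i in 1..5 -> makeAsync clock 100 uri i |]
    |> Async.Parallel
    |> Async.RunSynchronously

for (startedAt, html) in results do
    printfn "request issued at ~%d ms, %d chars" startedAt html.Length
```

The printed times should be roughly 100 ms apart, though, as noted above, the spacing is only approximate.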
Thread.Sleep is suboptimal: it will work OK for a small number of requests, but you're burning a thread, and threads are expensive, so it won't scale to a large number.
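A small experiment, offered as a rough sketch, shows why Async.Sleep scales better: the waits overlap instead of each holding a thread. The count of 200 workflows and the 200 ms delay are arbitrary values chosen for the demo.

```fsharp
open System.Diagnostics

// 200 workflows each wait 200 ms. Async.Sleep registers a timer and releases
// the thread, so the waits overlap rather than each occupying a thread.
let waitAll () =
    let sw = Stopwatch.StartNew()
    [ for _ in 1..200 -> async { do! Async.Sleep 200 } ]
    |> Async.Parallel
    |> Async.RunSynchronously
    |> ignore
    sw.ElapsedMilliseconds

let elapsedMs = waitAll ()
printfn "200 non-blocking waits took %d ms in total" elapsedMs
```

The total should be far below 200 x 200 ms, since none of the workflows occupies a thread while it waits.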
You don't need StartAsTask unless you want to interoperate with .NET Tasks or later do a blocking rendezvous with the result via .Result. If you just want these to all run and then block to collect all the results in an array, Async.Parallel will do that fork-join parallelism for you just fine. If they're just going to print results, you can fire-and-forget via Async.Start, which will drop the results on the floor.
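The three ways of running a workflow mentioned here can be compared side by side. This is a toy sketch: job is a hypothetical workflow that just squares a number after a short non-blocking delay.

```fsharp
open System.Threading.Tasks

// Hypothetical toy workflow standing in for a download.
let job n = async {
    do! Async.Sleep 10
    return n * n }

// Fork-join: run all jobs, block once, get the results in input order.
let squares =
    [| for i in 1..5 -> job i |]
    |> Async.Parallel
    |> Async.RunSynchronously

// Task interop: only needed when other .NET code expects a Task<'T>.
let t : Task<int> = Async.StartAsTask (job 6)
let viaTask = t.Result   // blocking rendezvous with the result

// Fire-and-forget: Async.Start takes an Async<unit>, so the result is dropped.
job 7 |> Async.Ignore |> Async.Start

printfn "%A %d" squares viaTask
```

Calling .Result right after StartAsTask inside a loop, as in the question, runs the jobs one at a time, which is why Async.Parallel is the better fit there.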
(An alternative strategy is to use an agent as a throttle. Post all the http requests to a single agent, where the agent is logically single-threaded and sits in a loop, doing Async.Sleep for 1s, and then handling the next request. That's a nice way to make a general-purpose throttle... may be blog-worthy for me, come to think of it.)
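One possible shape for such a throttling agent, using F#'s MailboxProcessor, is sketched below. The names makeThrottle, ThrottleMsg and fakeFetch are my own, fakeFetch is a hypothetical stand-in for the real HTTP call, and starting each fetch as a separate async (so that slow responses do not stretch the interval) is a design choice of this sketch rather than something the description above prescribes.

```fsharp
open System

// Hypothetical stand-in for the HTTP call, so the sketch runs offline.
let fakeFetch (uri : Uri) = async { return "<html>" + uri.AbsoluteUri + "</html>" }

// Each message carries the URI plus a channel for sending the page back.
type ThrottleMsg = Uri * AsyncReplyChannel<string>

// One agent handles one message at a time and sleeps between messages, so
// requests leave at most once per intervalMs however many callers post.
let makeThrottle (intervalMs : int) (fetch : Uri -> Async<string>) =
    MailboxProcessor<ThrottleMsg>.Start(fun inbox ->
        let rec loop () = async {
            let! (uri, reply) = inbox.Receive()
            // Start the fetch without waiting for it, so a slow response
            // does not delay the next request beyond the interval.
            Async.Start (async {
                try
                    let! html = fetch uri
                    reply.Reply html
                with _ ->
                    reply.Reply "Bad Link" })
            do! Async.Sleep intervalMs
            return! loop () }
        loop ())

// Usage: 100 ms instead of 1 s, purely to keep the demo short.
let throttle = makeThrottle 100 fakeFetch
let uri = Uri "http://rue89.com"
let pages =
    [| for _ in 1..3 -> throttle.PostAndAsyncReply(fun ch -> (uri, ch)) |]
    |> Async.Parallel
    |> Async.RunSynchronously

printfn "fetched %d pages" pages.Length
```

Callers never sleep themselves; they post and await the reply, and the agent alone decides when each request goes out.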