开发者

Asynchronous crawling F#

When crawling on webpages I need to be careful as to not make too many requests to the same domain, for example I want to put 1 s between requests. From what I understand it is the time between requests that is important. So to speed things up I want to use async workflows in F#, the idea being make your requests with 1 sec interval but avoid blocking things while waiting for request response.

let getHtmlPrimitiveAsyncTimer (uri : System.Uri) (timer:int) =
    async{

            let req =  (WebRequest.Create(uri)) :?> HttpWebRequest
            req.UserAgent<-"Mozilla"
            try 

                Thread.Sleep(timer)
                let! resp =    (req.AsyncGetResponse())
                Console.WriteLine(uri.AbsoluteUri+" got response")
                use stream = resp.GetResponseStream()
                use reader = new StreamReader(stream)
                let html = reader.ReadToEnd()
                return html
            with 
            | _ as ex -> return "Bad Link"
                 }

Then I do something like:

let uri1 = System.Uri "http://rue89.com"
let timer = 1000
let jobs = [|for开发者_StackOverflow中文版 i in 1..10 -> getHtmlPrimitiveAsyncTimer uri1 timer|]

jobs
|> Array.mapi(fun i job -> Console.WriteLine("Starting job "+string i)
                               Async.StartAsTask(job).Result)

Is this alright ? I am very unsure about 2 things: -Does the Thread.Sleep thing work for delaying the request ? -Is using StartTask a problem ?

I am a beginner (as you may have noticed) in F# (coding in general actually ), and everything envolving Threads scares me :)

Thanks !!


I think what you want to do is - create 10 jobs, numbered 'n', each starting 'n' seconds from now - run those all in parallel

Approximately like

let makeAsync uri n = async {
    // create the request
    do! Async.Sleep(n * 1000)
    // AsyncGetResponse etc
    }

let a = [| for i in 1..10 -> makeAsync uri i |]
let results = a |> Async.Parallel |> Async.RunSynchronously

Note that of course they all won't start exactly now, if e.g. you have a 4-core machine, 4 will start running very soon, but then quickly execute up to the Async.Sleep, at which point the next 4 will run up until their sleeps, and so forth. And then in one second the first async wakes up and posts a request, and another second later the 2nd async wakes up, ... so this should work. The 1s is only approximate, since they're starting their timers each a very tiny bit staggered from one another... you may want to buffer it a little, e.g. 1100 ms or something if the cut-off you need is really exactly a second (network latencies and whatnot still leave a bit of this outside the possible control of your program probably).

Thread.Sleep is suboptimal, it will work ok for a small number of requests, but you're burning a thread, and threads are expensive and it won't scale to a large number.

You don't need StartAsTask unless you want to interoperate with .NET Tasks or later do a blocking rendezvous with the result via .Result. If you just want these to all run and then block to collect all the results in an array, Async.Parallel will do that fork-join parallelism for you just fine. If they're just going to print results, you can fire-and-forget via Async.Start which will drop the results on the floor.

(An alternative strategy is to use an agent as a throttle. Post all the http requests to a single agent, where the agent is logically single-threaded and sits in a loop, doing Async.Sleep for 1s, and then handling the next request. That's a nice way to make a general-purpose throttle... may be blog-worthy for me, come to think of it.)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜