Webcrawler - Fetch links
I'm trying to crawl a webpage, and get all the links, and add them to a list<string>
which will be returned in the end, from the function.
My code:
let getUrls s : seq<string> =
let doc = new HtmlDocument() in
doc.LoadHtml s
doc.DocumentNode.SelectNodes "//a[@href]"
|> Seq.map(fun z -> (string z.Attributes.["href"]))
let crawler uri : seq<string> =
开发者_StackOverflow社区 let rec crawl url =
let web = new WebClient()
let data = web.DownloadString url
getUrls data |> Seq.map crawl (* <-- ERROR HERE *)
crawl uri
The problem is that at the last line in the crawl function (the getUrls seq.map...), it simply throws an error:
Type mismatch. Expecting a string -> 'a but given a string -> seq<'a> The resulting type would be infinite when unifying ''a' and 'seq<'a>'
crawl
is returning unit
, but is expected to return seq<string>
. I think you want something like:
let crawler uri =
let rec crawl url =
seq {
let web = new WebClient()
let data = web.DownloadString url
for url in getUrls data do
yield url
yield! crawl url
}
crawl uri
Adding a type annotation to crawl
should point out the issue.
i think something like this:
let crawler (uri : seq<string>) =
let rec crawl url =
let data = Seq.empty
getUrls data
|> Seq.toList
|> function
| h :: t ->
crawl h
t |> List.iter crawl
| _-> ()
crawl uri
In order to fetch links:
open System.Net
open System.IO
open System.Text.RegularExpressions
type Url(x:string)=
member this.tostring = sprintf "%A" x
member this.request = System.Net.WebRequest.Create(x)
member this.response = this.request.GetResponse()
member this.stream = this.response.GetResponseStream()
member this.reader = new System.IO.StreamReader(this.stream)
member this.html = this.reader.ReadToEnd()
let linkex = "href=\s*\"[^\"h]*(http://[^&\"]*)\""
let getLinks (txt:string) = [
for m in Regex.Matches(txt,linkex)
-> m.Groups.Item(1).Value
]
let collectLinks (url:Url) = url.html
|> getLinks
精彩评论