Screen-scraping for PDF links to download
I'm learning C# by writing a small program, and I couldn't find a similar post (apologies if this has been answered somewhere else).
How might I go about screen-scraping a website for links to PDFs (which I can then download to a specified location)? Sometimes a page will have a link to another HTML page which contains the actual PDF link, so if the PDF can't be found on the first page I'd like the program to automatically look for a link that has "PDF" in its text, and then search that resulting HTML page for the real PDF link.
I know that I could probably achieve something similar via a filetype search through Google, but that seems like "cheating" to me :) I'd rather learn how to do it in code, but I'm not sure where to start. I'm a little familiar with XML parsing with XElement and such, but I'm not sure how to do that for getting links from an HTML page (or other format?).
Could anyone point me in the right direction? Thanks!
HtmlAgilityPack is great for this kind of stuff.
Example of implementation:
string pdfLinksUrl = "http://www.google.com/search?q=filetype%3Apdf";

// load the HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);

// select all <a> nodes from the document using XPath
// (unfortunately we can't select attribute nodes directly as
// it is not yet supported by HAP)
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// select all href attribute values ending with '.pdf' (case-insensitive)
var pdfUrls = from linkNode in linkNodes
              let href = linkNode.Attributes["href"].Value
              where href.ToLower().EndsWith(".pdf")
              select href;

// write all PDF links to a file
System.IO.File.WriteAllLines(@"c:\pdflinks.txt", pdfUrls.ToArray());
As a side note, I would not rely too much on XPath expressions in HAP. Some XPath functions are missing, and putting all of the extraction logic inside your XPath makes your code less maintainable. I would extract the bare minimum using an XPath expression, and then do all the required extraction by iterating through the node collection (LINQ methods help a lot).
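For example, a minimal sketch of that split (doc is an HtmlAgilityPack document loaded as above, the filter is just an illustration, and it assumes the usual using System; using System.Linq;):

// trying to do everything in XPath quickly hits its limits: ends-with(), for
// instance, is an XPath 2.0 function, so a "links ending in .pdf" filter can't
// be expressed cleanly here. Keep the XPath minimal instead...
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// ...and do the real filtering in LINQ, which is easier to read and maintain
var pdfUrls = (linkNodes ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>())
    .Select(node => node.GetAttributeValue("href", ""))
    .Where(href => href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
    .ToList();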
The real power of HAP is its ability to parse SGML-style documents, that is, markup that is invalid from the XHTML point of view (unclosed tags, missing quotes, etc.).
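For example, markup like this (made up, and deliberately broken) still parses into a usable DOM:

var broken = "<html><body><p>Unclosed paragraph <a href=pdf/report.pdf>Report (PDF)</a>";
var malformedDoc = new HtmlAgilityPack.HtmlDocument();
malformedDoc.LoadHtml(broken);

// despite the missing quotes and unclosed tags, the link is still reachable
var link = malformedDoc.DocumentNode.SelectSingleNode("//a");
Console.WriteLine(link.GetAttributeValue("href", "")); // prints "pdf/report.pdf"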
Your best bet is probably to use the Html Agility Pack to screen-scrape the page, then check each href attribute to see whether it looks like a PDF download. If not, you could then look at the text within the node for keywords such as "PDF" to decide whether or not to follow the link.
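A rough sketch of that approach, assuming HtmlAgilityPack plus a WebClient for the downloads (the URL, folder, and method names are mine, purely for illustration, not a finished crawler):

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class PdfScraperSketch
{
    static void Main()
    {
        // hypothetical start page and download folder
        ScrapePdfs(new Uri("http://example.com/reports"), @"c:\downloads", followLinks: true);
    }

    static void ScrapePdfs(Uri pageUri, string targetFolder, bool followLinks)
    {
        var doc = new HtmlWeb().Load(pageUri.AbsoluteUri);
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return;

        using (var client = new WebClient())
        {
            foreach (var anchor in anchors)
            {
                var href = anchor.GetAttributeValue("href", "");
                var target = new Uri(pageUri, href); // resolve relative links

                if (href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                {
                    // looks like a direct PDF link -> download it
                    var fileName = Path.GetFileName(target.LocalPath);
                    client.DownloadFile(target, Path.Combine(targetFolder, fileName));
                }
                else if (followLinks &&
                         anchor.InnerText.IndexOf("PDF", StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    // link text mentions PDF -> follow it one level deep and repeat
                    ScrapePdfs(target, targetFolder, followLinks: false);
                }
            }
        }
    }
}

The followLinks flag just stops the recursion after one hop, matching the "one extra page" case in your question; a real crawler would also want error handling and a set of already-visited URLs.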
For parsing any HTML page, use HtmlAgilityPack. It's the best around.
It turns any HTML page into an XML-like document that you can search through much more easily than raw HTML.
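Since you're already comfortable with XElement, the LINQ-style queries will feel familiar; a quick sketch (the URL is only a placeholder, and it assumes using System; using System.Linq;):

var downloadsPage = new HtmlAgilityPack.HtmlWeb().Load("http://example.com/downloads");

// Descendants("a") works much like XElement.Descendants("a")
var pdfLinks = downloadsPage.DocumentNode
    .Descendants("a")
    .Select(a => a.GetAttributeValue("href", ""))
    .Where(h => h.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase));

foreach (var link in pdfLinks)
    Console.WriteLine(link);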
If you need to crawl a site for information, have a look at NCrawler.