
Screen-scraping for PDF links to download

I'm learning C# by writing a small program, and I couldn't find a similar post (apologies if this has already been answered elsewhere).

How might I go about screen-scraping a website for links to PDFs (which I can then download to a specified location)? Sometimes a page links to another HTML page which has the actual PDF link, so if the PDF can't be found on the first page I'd like the program to automatically look for a link that has "PDF" in its link text, and then search the resulting HTML page for the real PDF link.

I know that I could probably achieve something similar via a filetype search through Google, but that seems like "cheating" to me :) I'd rather learn how to do it in code, but I'm not sure where to start. I'm a little familiar with XML parsing with XElement and such, but I'm not sure how to do it for getting links from an HTML page (or other format?).

Could anyone point me in the right direction? Thanks!


HtmlAgilityPack is great for this kind of stuff.

An example implementation:

// Requires the HtmlAgilityPack NuGet package, plus:
// using System;
// using System.Linq;

string pdfLinksUrl = "http://www.google.com/search?q=filetype%3Apdf";

// Load the HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);

// select all <a> nodes that carry an href attribute using XPath
// (unfortunately we can't select attribute nodes directly as
// that is not yet supported by HAP)
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// select all href attribute values ending with '.pdf' (case-insensitive)
var pdfUrls = from linkNode in linkNodes
              let href = linkNode.Attributes["href"].Value
              where href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase)
              select href;

// write all PDF links to a file
System.IO.File.WriteAllLines(@"c:\pdflinks.txt", pdfUrls.ToArray());

As a side note, I would not rely too much on XPath expressions in HAP. Some XPath functions are missing, and putting all of the extraction logic inside your XPath will make your code less maintainable. I would extract the bare minimum with an XPath expression, and then do the rest of the extraction by iterating through the node collection (LINQ methods help a lot).
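
For example, a rough sketch of that approach, keeping the XPath down to a bare "//a[@href]" and doing the filtering with LINQ methods instead (this assumes the doc variable loaded above and using System; using System.Linq;):

// Keep the XPath trivial: just grab the anchor nodes.
// SelectNodes returns null when nothing matches, hence the null-coalescing.
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");

// ...and do the actual filtering/shaping in C# with LINQ methods
var pdfUrls2 = (anchors ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>())
    .Select(a => a.GetAttributeValue("href", string.Empty))
    .Where(href => href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
    .Distinct()
    .ToList();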

The real power of HAP is its ability to parse SGML-style documents, i.e. markup that is invalid from the XHTML point of view (unclosed tags, missing quotes, etc.).
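
For instance, a quick sketch of feeding it deliberately broken markup (a made-up snippet, just to illustrate; assumes using System;):

// Unclosed <li> tags and an unquoted attribute value - invalid XHTML, but HAP copes
var messy = new HtmlAgilityPack.HtmlDocument();
messy.LoadHtml("<ul><li>First report<li><a href=annual.pdf>Annual report (PDF)</ul>");

var link = messy.DocumentNode.SelectSingleNode("//a[@href]");
Console.WriteLine(link.GetAttributeValue("href", ""));   // should print "annual.pdf"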


Your best bet is probably to use HTML Agility Pack to screen-scrape the page, then inspect each href attribute to see if it looks like a PDF download. If not, you could then look at the text within the node for keywords such as "PDF" to decide whether or not to follow the link.
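
A rough sketch of that decision logic, assuming the HtmlAgilityPack setup from the answer above (the list names here are just illustrative; assumes using System; using System.Collections.Generic;):

var pdfLinks = new List<string>();       // direct PDF downloads
var pagesToVisit = new List<string>();   // pages that probably contain the PDF link

var anchorNodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchorNodes != null)
{
    foreach (var a in anchorNodes)
    {
        string href = a.GetAttributeValue("href", "");
        if (href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            pdfLinks.Add(href);
        else if (a.InnerText.IndexOf("pdf", StringComparison.OrdinalIgnoreCase) >= 0)
            pagesToVisit.Add(href);       // follow this link and repeat the scrape there
    }
}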


For parsing any HTML page, use HtmlAgilityPack. It's the best around.

With it you can turn any HTML page into an XML-like document that is much easier to search than raw HTML.
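
Since you mentioned being familiar with XElement: one option (a sketch, not bulletproof, since stray character entities can still make the XML parser throw) is HAP's OptionOutputAsXml, which rewrites the parsed document as well-formed XML that LINQ to XML can load:

// assumes: using System.IO; using System.Linq; using System.Xml.Linq;
// someHtmlString is a placeholder for whatever HTML you fetched
var hapDoc = new HtmlAgilityPack.HtmlDocument();
hapDoc.OptionOutputAsXml = true;       // emit the cleaned-up document as XML
hapDoc.LoadHtml(someHtmlString);

string xmlText;
using (var writer = new StringWriter())
{
    hapDoc.Save(writer);
    xmlText = writer.ToString();
}

var xml = XDocument.Parse(xmlText);
var hrefs = xml.Descendants("a")
               .Select(a => (string)a.Attribute("href"))
               .Where(href => href != null);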

If you need to crawl a site for information, have a look at NCrawler.
