Is Google News an example of HTML scraping?

I need to build a web app similar to Google News. Do I need to learn HTML scraping for that, or some other techniques?


Most of what Google News shows is available through RSS/Atom feeds. Getting a website's content through its RSS feed is far easier than scraping it.

Other than that, if you can use Java, you can scrape the HTML yourself using the excellent Goose library. It is similar to what Flipboard/Instapaper use.


The easiest solution would be to get the RSS or ATOM feed of the website you are trying to get data from.

Those are well-known formats, and extracting information from such XML feeds is much easier than getting it from an HTML page: with RSS/Atom, you just have to parse the XML feed and extract the tags that contain the information you are interested in.

Not sure which language you're working with, but chances are you can find some library that would help you with that.
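To give an idea of how little work the feed route takes, here is a minimal sketch in Python (an assumption, since the question doesn't name a language) using only the standard library. The feed content, the `SAMPLE_RSS` name and the `parse_rss_items` helper are made up for illustration; a real app would download the feed from the site and would likely also need to handle Atom, which uses different tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
      <pubDate>Mon, 01 Jan 2024 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text):
    """Return a list of dicts, one per <item> element in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child tag is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry["title"], entry["link"])
```

The same function works for any site that publishes RSS 2.0, which is the main point: the tag names are the same everywhere, so there is nothing site-specific to write.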


If the website doesn't export an RSS/Atom feed... well, you'll probably have to fall back to HTML scraping. Good luck with that, as HTML is not nearly as well structured as RSS/Atom: you'll have to find out, for each website, where in the page the relevant information is.
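For comparison, here is a rough sketch of what the scraping fallback looks like, again in Python with only the standard library. The page markup and the rule "headlines are links inside `<h2 class="headline">`" are assumptions invented for this example; every real site needs its own rule, and the rule breaks whenever the site changes its layout.

```python
from html.parser import HTMLParser

# Hypothetical page markup; real sites will each use a different structure.
SAMPLE_HTML = """
<html><body>
  <div class="nav"><a href="/about">About</a></div>
  <h2 class="headline"><a href="/story-1">Story one</a></h2>
  <h2 class="headline"><a href="/story-2">Story two</a></h2>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collect (text, href) pairs for links inside <h2 class="headline">."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.current_href = None
        self.current_text = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and "headline" in (attrs.get("class") or "").split():
            self.in_headline = True
        elif tag == "a" and self.in_headline:
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        # Only keep text that sits inside a headline link
        if self.in_headline and self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_headline and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.headlines.append((text, self.current_href))
            self.current_href = None
        elif tag == "h2":
            self.in_headline = False


def extract_headlines(html_text):
    parser = HeadlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.headlines


if __name__ == "__main__":
    for text, href in extract_headlines(SAMPLE_HTML):
        print(text, href)
```

Note how the navigation link is skipped only because the rule was written for this particular page; that per-site knowledge is exactly what a feed saves you from.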
