Web scraping - how to identify main content on a webpage

2023-02-04 14:34 问答作者：

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, use开发者_如何学JAVAr comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining? (preferably python based)

There are a number of ways to do it, but, none will always work. Here are the two easiest:

if it's a known finite set of websites: in your scraper convert each url from the normal url to the print url for a given site (cannot really be generalized across sites)
Use the arc90 readability algorithm (reference implementation is in javascript) http://code.google.com/p/arc90labs-readability/ . The short version of this algorithm is it looks for divs with p tags within them. It will not work for some websites but is generally pretty good.

A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but works generally well for news sites, where the article is generally the biggest grouping of text, even if broken up into multiple div/p tags.

You'd use the script like: python webarticle2text.py <url>

There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.

Diffbot offers a free(10.000 urls) API to do that, don't know if that approach is what you are looking for, but it might help someone http://www.diffbot.com/

Check the following script. It is really amazing:

from newspaper import Article
URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)
article.download()
print(article.html)
article.parse()
print(article.authors)
print(article.publish_date)
#print(article.text)
print(article.top_image)
print(article.movies)
article.nlp()
print(article.keywords)
print(article.summary)

More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper you should install it using:

pip3 install newspaper3k

For a solution in Java have a look at https://github.com/kohlschutter/boilerpipe :

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

But there is also a python wrapper around this available here:

https://github.com/misja/python-boilerpipe

It might be more useful to extract the RSS feeds (<link type="application/rss+xml" href="..."/>) on that page and parse the data in the feed to get the main content.

Another possibility of separating "real" content from noise is by measuring HTML density of the parts of a HTML page.

You will need a bit of experimentation with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to specify the exact bounds of the HTML segment after having identified the interesting content.

Update: Just found out the URL above does not work right now; here is an alternative link to a cached version of archive.org.

There is a recent (early 2020) comparison of various methods of extracting article body, without and ads, menus, sidebars, user comments, etc. - see https://github.com/scrapinghub/article-extraction-benchmark. A report, data and evaluation scripts are available. It compares many options mentioned in the answers here, as well as some options which were not mentioned:

python-readability
boilerpipe
newspaper3k
dragnet
html-text
Diffbot
Scrapinghub AutoExtract

In short, "smart" open source libraries are adequate if you need to remove e.g. sidebar and menu, but they don't handle removal of unnecessary content inside articles, and are quite noisy overall; sometimes they remove an article itself and return nothing. Commercial services use Computer Vision and Machine Learning, which allows them to provide a much more precise output.

For some use cases simpler libraries like html-text are preferrable, both to commercial services and to "smart" open source libraries - they are fast, and ensure information is not missing (i.e. recall is high).

I would not recommend copy-pasting code snippets, as there are many edge cases even for a seemingly simple task of extracting text from HTML, and there are libraries available (like html-text or html2text) which should be handling these edge cases.

To use a commercial tool, in general one needs to get an API key, and then use a client library. For example, for AutoExtract by Scrapinghub (disclaimer: I work there) you would need to install pip install scrapinghub-autoextract. There is a Python API available - see https://github.com/scrapinghub/scrapinghub-autoextract README for details, but an easy way to get extractions is to create a .txt file with URLs to extract, and then run

python -m autoextract urls.txt --page-type article --api-key <API_KEY> --output res.jl

I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:

http://feeds.guardian.co.uk/theguardian/rss

I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...

继续阅读：html-parsing python web-scraping webpage

Web scraping - how to identify main content on a webpage

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？