Body text extraction from websites, e.g. extract only the article heading and text, not all text on the site

2023-02-27 09:43 Question author:

I am looking for algorithms that allow text extraction from websites. I do not mean "strip html", or any of the hundreds of libraries that allow this.

So, for example, for a news article I would like to identify the heading and all the text, but not the comments section and so on.

Are there any algorithms for that out there? Thank you!

In the computer science literature this problem is usually referred to as the page segmentation or boilerplate detection problem. See the report "Boilerplate Detection using Shallow Text Features" and its related blog post. I also have a few reports and software sites bookmarked that address the problem. Also, see this Stack Overflow question.
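To make the idea of shallow text features concrete, here is a minimal sketch that is only loosely in the spirit of that report, not its actual algorithm. It splits a page into blocks, counts the words in each block, and measures how many of those words sit inside links. The thresholds (`min_words`, `max_link_density`) and all class and function names are assumptions made up for this illustration, and short headings are dropped by the word-count rule, so they would need separate handling.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count words inside vs. outside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: (text, word_count, link_word_count)
        self._words = []      # words of the block currently being built
        self._link_words = 0  # how many of those words are inside <a>
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it has any text, and start a new one.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words)
            )
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content_blocks(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [
        text
        for text, n_words, n_link_words in parser.blocks
        if n_words >= min_words and n_link_words / n_words <= max_link_density
    ]
```

Navigation bars and footers tend to be short or link-heavy, so they fall out, while running article text survives. Real pages need more features and tuning than this.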

There are a few open source tools available that do similar article extraction tasks, for example https://github.com/jiminoc/goose, which was open-sourced by Gravity.com.

Its wiki has more information, and the source is available to read. There are dozens of unit tests that show the text extracted from various articles.

"Content extraction" is a very difficult topic. There are no common standards to identify the "main-article" content (there are several approaches to make HTML easier readably for crawlers, e.g. schema.org, but none of these is very popularly used).

So if you want good results, it is probably best to define your own XPath selectors for each (news) website you want to scrape. There are some APIs for HTML content extraction, but as I said, it is very hard to develop an algorithm that works for every site.
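A per-site selector table could look roughly like the sketch below. The site name `news.example.com`, the class names, and `extract_article` are all hypothetical. The example uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. Real-world HTML is rarely well-formed, so in practice you would probably swap in a lenient HTML parser with full XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: one XPath for the heading, one for the body paragraphs.
SITE_RULES = {
    "news.example.com": {
        "heading": ".//h1[@class='headline']",
        "paragraphs": ".//div[@class='article-body']/p",
    },
}


def extract_article(site, markup):
    """Apply the selectors registered for `site` to well-formed markup."""
    rules = SITE_RULES[site]
    root = ET.fromstring(markup)
    heading = root.find(rules["heading"])
    paragraphs = root.findall(rules["paragraphs"])
    return {
        # itertext() also picks up text nested in inline tags such as <b>.
        "heading": "".join(heading.itertext()).strip() if heading is not None else None,
        "text": "\n\n".join("".join(p.itertext()).strip() for p in paragraphs),
    }
```

Because the body selector only matches paragraphs under the article container, a comments section in a sibling element is never touched. The cost is that every site needs its own rules, and they break when the site changes its layout.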

Some APIs you could use:

alchemyapi.com
diffbot.com
boilerpipe-web.appspot.com
aylien.com
textracto.com

What you're trying to do is called "content extraction". It turns out to be a surprisingly hard problem to solve well, and many naive solutions do quite badly.

Instapaper and Readability both have to solve this, and you may learn something from looking at their solutions. They also both provide services that you may be able to take advantage of - perhaps you can outsource your problem to them and let their API take care of it. :)

Failing that, a search for "html content extraction" returns a great many useful results, including a number of papers on the subject.

I compared a few different libraries, and had really great luck with Mozilla's Readability library (Node), or its Python wrapper.

For example, take this CNN article: https://edition.cnn.com/2022/06/01/tech/elon-musk-tesla-ends-work-from-home/index.html

Readability successfully returns only the relevant data:

New York (CNN Business) Elon Musk is demanding that Tesla office workers return to in-person work or leave the company. The policy, disclosed in leaked emails Musk sent to Tesla's executive staff Tuesday, was first reported by electric vehicle news site Electrek. "Anyone who wishes to do remote work must be in the office for a minimum (and I mean *minimum*) of 40 hours per week or depart Tesla. This is less than we ask of factory workers," Musk wrote, adding that the office must be the employee's primary workplace where the other workers they regularly interact with are based — "not a remote branch office unrelated to the job duties." Musk said he would personally review any request for exemption from the policy, but that for the most part, "If you don't show up, we will assume you have resigned."

etc.

I think your best shot is to study what information you can get from the metadata and write a good HTML parser. oEmbed could be a good standard =)

https://oembed.com/#section7
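As a rough sketch of the metadata route, the code below collects `<meta>` name/property pairs (for example Open Graph's `og:title`) and any oEmbed discovery links, which the oEmbed specification describes as `<link>` elements of type `application/json+oembed` or `text/xml+oembed`. `MetadataCollector` and `read_metadata` are names invented here. Metadata like this usually yields a title and a summary, not the full body text, so it only covers part of the problem.

```python
from html.parser import HTMLParser

OEMBED_TYPES = {"application/json+oembed", "text/xml+oembed"}


class MetadataCollector(HTMLParser):
    """Collect <meta> key/content pairs and oEmbed discovery URLs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.oembed_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph uses property=..., classic metadata uses name=...
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta.setdefault(key, attrs["content"])  # first value wins
        elif tag == "link" and attrs.get("type") in OEMBED_TYPES and attrs.get("href"):
            self.oembed_urls.append(attrs["href"])


def read_metadata(html):
    """Return (meta dict, list of oEmbed endpoint URLs) for a page."""
    parser = MetadataCollector()
    parser.feed(html)
    parser.close()
    return parser.meta, parser.oembed_urls
```

A discovered oEmbed URL would still have to be fetched separately to get the provider's response.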
