Heuristic Approaches to Finding Main Content

2023-02-11 19:01 问答作者：

Wondering if anybody could point me in the direction of academic papers or related implementations of heuristic approaches to finding the real meat content of a particular webpage.

Obviously this is not a trivial task, since the problem description is so vague, but I think that we all have a general understanding about what is meant by the primary content of a page.

For example, it may include the story text for a news article, but might not include any navigational elements, legal disclaimers, related story teasers, comments, etc. Article tit开发者_运维百科les, dates, author names, and other metadata fall in the grey category.

I imagine that the application value of such an approach is large, and would expect Google to be using it in some way in their search algorithm, so it would appear to me that this subject has been treated by academics in the past.

Any references?

One way to look at this would be as an information extraction problem.

As such, one high-level algorithm would be to collect multiple examples of the same page type and deduce parsing (or extraction) rules for the parts of the page which are different (this is likely to be the main topic). The intuition is that common boilerplate (header, footer, etc) and ads will eventually appear on multiple examples of those web pages, so by training on a few of them, you can quickly start to reliably identify this boilerplate/additional code and subsequently ignore it. It's not foolproof, but this is also the basis of web scraping technologies, both commercial and academic, like RoadRunner:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8672&rep=rep1&type=pdf

The citation is:

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB 2001: 109-118

There's also a well-cited survey of extraction technologies:

Alberto H. F. Laender , Berthier A. Ribeiro-Neto , Altigran S. da Silva , Juliana S. Teixeira, A brief survey of web data extraction tools, ACM SIGMOD Record, v.31 n.2, June 2002 [doi>10.1145/565117.565137]

继续阅读：parsing web-crawler

Heuristic Approaches to Finding Main Content

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？