Heuristic Approaches to Finding Main Content
Wondering if anybody could point me in the direction of academic papers or related implementations of heuristic approaches to finding the real meat content of a particular webpage.
Obviously this is not a trivial task, since the problem description is so vague, but I think that we all have a general understanding about what is meant by the primary content of a page.
For example, it may include the story text for a news article, but might not include any navigational elements, legal disclaimers, related story teasers, comments, etc. Article tit开发者_运维百科les, dates, author names, and other metadata fall in the grey category.
I imagine that the application value of such an approach is large, and would expect Google to be using it in some way in their search algorithm, so it would appear to me that this subject has been treated by academics in the past.
Any references?
One way to look at this would be as an information extraction problem.
As such, one high-level algorithm would be to collect multiple examples of the same page type and deduce parsing (or extraction) rules for the parts of the page which are different (this is likely to be the main topic). The intuition is that common boilerplate (header, footer, etc) and ads will eventually appear on multiple examples of those web pages, so by training on a few of them, you can quickly start to reliably identify this boilerplate/additional code and subsequently ignore it. It's not foolproof, but this is also the basis of web scraping technologies, both commercial and academic, like RoadRunner:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8672&rep=rep1&type=pdf
The citation is:
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB 2001: 109-118
There's also a well-cited survey of extraction technologies:
Alberto H. F. Laender , Berthier A. Ribeiro-Neto , Altigran S. da Silva , Juliana S. Teixeira, A brief survey of web data extraction tools, ACM SIGMOD Record, v.31 n.2, June 2002 [doi>10.1145/565117.565137]
精彩评论