Any ideas about the jQuery equivalent of the READABILITY code? (Or: building the best heuristic to find the main text using jQuery)

Posted: 2022-12-15 02:18

http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal and blog pages in a very readable manner. It does this by applying some heuristics to find the relevant main text of a web page. Its source code is also available at http://lab.arc90.com/experiments/readability/js/readability.js

A colleague of mine drew my attention to this as I was struggling with jQuery to grab the "main text" of any newspaper | journal | blog | etc. website. My current heuristic (and implementation in jQuery) uses something like the following (this is done inside a Firefox Jetpack package):

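// MIN_TEXT_LENGTH, results and wholeText are defined elsewhere in the package.
$(doc).find("div > p").each(function (index) {
    var textStr = $(this).text();

    /*
     We need the pieces of text that are long and in natural language,
     and not some JS code snippets
    */
    // indexOf returns -1 when the substring is absent; the original "<= 0"
    // also let through text that starts with "<script".
    if (textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") === -1) {
        console.log(index);
        console.log(textStr.length);
        console.log(textStr);

        // Tag the paragraph so it can be found again later.
        $(this).attr("id", "clozefox_paragraph_" + index);
        results.push(index);

        wholeText = wholeText + " " + textStr;
    }
});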

So it is something like "go grab the paragraphs inside DIVs and check for irrelevant strings like 'script'". I have tried this, and most of the time it grabs the main text of web articles; however, I'd like to have a better heuristic or maybe a better (and even shorter?) jQuery selection mechanism.

Do you have better suggestions?

PS: Maybe "Find the innermost DIVs (that is, those without any child elements of DIV type) and go grab their <p>s only" would be a better heuristic for my current purpose, but I couldn't find out how to express this in jQuery.

This is generally done by analyzing the "link density" of elements on a page. The higher the link density, the more likely it is not content. Here is a great place to get started with thinking about content extraction techniques and algorithms: http://www.quora.com/Whats-the-best-method-to-extract-article-text-from-HTML-documents
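
To make the link-density idea concrete, here is a minimal sketch. The names linkDensity, looksLikeContent and MAX_LINK_DENSITY, as well as the 0.5 cutoff, are assumptions for illustration and are not taken from Readability's source. The functions work on plain strings so they can be tried outside a browser; the comment shows one way the inputs might be collected with jQuery.

```javascript
// Link density: share of an element's text that sits inside <a> descendants.
// The inputs could be gathered with jQuery, for example:
//   var text = $(el).text();
//   var linkTexts = $(el).find("a").map(function () {
//       return $(this).text();
//   }).get();
function linkDensity(text, linkTexts) {
    var total = text.length;
    if (total === 0) {
        return 1; // assumption: treat empty elements as non-content
    }
    var linked = 0;
    for (var i = 0; i < linkTexts.length; i++) {
        linked += linkTexts[i].length;
    }
    return linked / total;
}

// Hypothetical cutoff; it would need tuning against real pages.
var MAX_LINK_DENSITY = 0.5;

// True when less than MAX_LINK_DENSITY of the text is link text.
function looksLikeContent(text, linkTexts) {
    return linkDensity(text, linkTexts) < MAX_LINK_DENSITY;
}
```

A navigation bar, where almost every character belongs to a link, ends up close to 1 and is rejected, while an article paragraph with one or two inline links stays near 0.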

Most articles have a rectangular column of text. Try taking some combination of the dimensions of the element and the number of words it (including children) contains. You probably want to favor narrow and tall divs.

Something like probability of main text = (number of words) * (height / width) would be a good start.
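
A rough sketch of that formula follows. The value it computes is an unnormalized score rather than a true probability, and the names countWords, mainTextScore and pickBest are hypothetical helpers introduced only for this example. The candidate objects are assumed to carry the element's text and rendered dimensions, which could be read with jQuery's .text(), .width() and .height() as the comment shows.

```javascript
// Count whitespace-separated words; an empty or blank string has none.
function countWords(text) {
    var trimmed = text.trim();
    return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// score = (number of words) * (height / width)
function mainTextScore(text, width, height) {
    if (width <= 0) {
        return 0; // assumption: hidden or unrendered elements score zero
    }
    return countWords(text) * (height / width);
}

// candidates: array of { text, width, height } objects, e.g. built with
//   $(doc).find("div").map(function () {
//       var $d = $(this);
//       return { el: this, text: $d.text(),
//                width: $d.width(), height: $d.height() };
//   }).get();
// Returns the highest-scoring candidate, or null if none scores above zero.
function pickBest(candidates) {
    var best = null;
    var bestScore = 0;
    for (var i = 0; i < candidates.length; i++) {
        var c = candidates[i];
        var score = mainTextScore(c.text, c.width, c.height);
        if (score > bestScore) {
            best = c;
            bestScore = score;
        }
    }
    return best;
}
```

Because $(el).text() includes the text of all descendants, an outer wrapper DIV will tend to outscore the article column it contains, so in practice this would probably need to be combined with something like the link-density check or restricted to DIVs that hold paragraphs directly.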
