Any ideas about the jQuery equivalent of the READABILITY code? (Or: building the best heuristic to find the main text using jQuery)
http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal and blog pages in a readable form. It does this by using heuristics to find the relevant main text of a web page. Its source code is available at http://lab.arc90.com/experiments/readability/js/readability.js
A colleague of mine drew my attention to this as I was struggling with jQuery to grab the "main text" of any newspaper, journal, blog or similar website. My current heuristic (and its implementation in jQuery) looks something like this (it runs inside a Firefox Jetpack package):
$(doc).find("div > p").each(function (index) {
    var textStr = $(this).text();
    /*
       We need the pieces of text that are long and in natural language,
       and not some JS code snippets
    */
    // indexOf returns -1 when the substring is absent, so compare against -1
    // (the earlier "<= 0" also accepted text that starts with "<script")
    if (textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") === -1) {
        console.log(index);
        console.log(textStr.length);
        console.log(textStr);
        // tag the paragraph so it can be found again later
        $(this).attr("id", "clozefox_paragraph_" + index);
        results.push(index);
        wholeText = wholeText + " " + textStr;
    }
});
So it is something like "go grab the paragraphs inside DIVs and check for irrelevant strings like 'script'". I have tried this, and most of the time it can grab the main text of web articles. However, I'd like to have a better heuristic, or maybe a better (and even shorter?) jQuery selection mechanism.
Do you have better suggestions?
PS: Maybe "Find the innermost DIVs (that is, those without any child elements of DIV type) and go grab their <p>s only" would be a better heuristic for my current purpose, but I couldn't find out how to express this in jQuery.
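One way the "innermost DIVs" idea might be expressed is sketched below. This is only a sketch: it assumes "innermost" means "has no DIV descendant at any depth", and the helper names `isInnermostDiv` and `innermostParagraphTexts` are made up for illustration. It uses the standard DOM methods `querySelector` / `querySelectorAll` so it does not depend on jQuery.

```javascript
// Hypothetical helper: true when `el` is a DIV that contains no other DIV.
function isInnermostDiv(el) {
  return el.tagName.toUpperCase() === "DIV" && el.querySelector("div") === null;
}

// Collect the trimmed text of <p> elements that are direct children of
// innermost DIVs. `root` is assumed to be a Document or an Element.
function innermostParagraphTexts(root) {
  const texts = [];
  for (const div of root.querySelectorAll("div")) {
    if (!isInnermostDiv(div)) continue;
    for (const child of div.children) {
      if (child.tagName.toUpperCase() === "P") {
        texts.push(child.textContent.trim());
      }
    }
  }
  return texts;
}

// With jQuery, the same selection can probably be written in one line,
// using the :not() and :has() selectors:
//   $(doc).find("div:not(:has(div)) > p")
```

The jQuery one-liner in the last comment should select the same paragraphs, so it could replace the `"div > p"` selector in the loop above.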
This is generally done by analyzing the "link density" of elements on a page. The higher the link density, the more likely it is not content. Here is a great place to get started with thinking about content extraction techniques and algorithms: http://www.quora.com/Whats-the-best-method-to-extract-article-text-from-HTML-documents
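As a rough illustration of link density, here is a minimal sketch. It assumes link density is defined as "characters of text inside `<a>` tags divided by all characters of text in the element"; other definitions (for example, counting words or counting links) are possible. The function name `linkDensity` and the threshold in the usage note are my own choices, not taken from Readability.

```javascript
// Fraction of an element's text that sits inside <a> tags:
// 0 means no link text, 1 means the element is all link text.
// `el` is assumed to be a DOM Element.
function linkDensity(el) {
  const totalLength = el.textContent.trim().length;
  // Assumption: an element with no text is reported as density 0.
  if (totalLength === 0) return 0;

  let linkLength = 0;
  for (const a of el.querySelectorAll("a")) {
    linkLength += a.textContent.trim().length;
  }
  // Clamp to 1 in case trimming makes the link text look longer than the total.
  return Math.min(1, linkLength / totalLength);
}
```

You could then skip any paragraph or DIV where, say, `linkDensity(el) > 0.3`; the 0.3 cutoff is an arbitrary starting point that would need tuning on real pages.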
Most articles have a rectangular column of text. Try taking some combination of the dimensions of the element and the number of words it (including children) contains. You probably want to favor narrow and tall divs.
Something like probability of main text = (number of words) * (height / width) would be a good start.
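A minimal sketch of that formula follows. Note that the result is an unbounded score rather than a true probability, so it is only useful for comparing candidates with each other. The function names and the shape of the `blocks` objects (`text`, `width`, `height`) are assumptions made for the example.

```javascript
// Count whitespace-separated words in a string.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// score = (number of words) * (height / width); favors tall, narrow blocks.
function mainTextScore(wordCount, width, height) {
  if (width <= 0) return 0; // hidden or unrendered element: no score
  return wordCount * (height / width);
}

// Return the highest-scoring block and its score, or null for an empty list.
// Each block is assumed to look like { text, width, height }.
function pickMainBlock(blocks) {
  let best = null;
  for (const block of blocks) {
    const score = mainTextScore(countWords(block.text), block.width, block.height);
    if (best === null || score > best.score) {
      best = { block: block, score: score };
    }
  }
  return best;
}
```

With jQuery, the inputs for each candidate DIV could come from `$(el).text()`, `$(el).width()` and `$(el).height()`; since `.text()` includes the text of child elements, it matches the "including children" part of the suggestion.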