
HTML text analysis

I have a crawler that gathers articles from the web and stores the title and the body in a database. Until now, a programmer has had to come up with a set of rules per source (usually XPath and sometimes regular expressions) that point to the article title and body sections of the web page. Now I'm trying to go one step further and have the program auto-detect the title and the body of the article. My first approach adds a weight to each element based on some common criteria. For example:

//@x-weight = 1.0

//h1/@x-weight * 2.0

//h2/@x-weight * 1.8

There are many more rules, but you get the point. After assigning the weights based on the markup, I take some other aspects into account, such as similarity to /head/title and the number of keywords. This approach produces decent results for most web pages (thanks, SEO experts :P), but it fails catastrophically for some others. I'm considering using an artificial neural network, but I can't find enough evidence that I'd get significantly better results. Another option is to bring CSS into the game and adjust the weights by font size.
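For readers who want something concrete, the weighting idea described above might look roughly like the sketch below. It is only an illustration, not the asker's actual code: the multipliers mirror the two rules shown, while `TitleCandidateParser`, `score_candidates`, and the way similarity to /head/title is folded into the weight are assumptions made for this example. It uses only the Python standard library.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumed multipliers, taken from the two example rules in the question.
BASE_WEIGHT = 1.0
TAG_WEIGHTS = {"h1": 2.0, "h2": 1.8}


class TitleCandidateParser(HTMLParser):
    """Collects the <title> text and the text of every h1/h2 element."""

    def __init__(self):
        super().__init__()
        self.page_title = ""
        self.candidates = []   # list of (tag, text) pairs
        self._current = None   # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in TAG_WEIGHTS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # e.g. a closing </a> nested inside a heading
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.page_title = text
        elif text:
            self.candidates.append((tag, text))
        self._current = None


def score_candidates(html):
    """Return (weight, tag, text) tuples, best candidate first."""
    parser = TitleCandidateParser()
    parser.feed(html)
    scored = []
    for tag, text in parser.candidates:
        # Similarity to /head/title in the range 0.0 .. 1.0.
        similarity = SequenceMatcher(
            None, text.lower(), parser.page_title.lower()
        ).ratio()
        # Assumption: similarity boosts the markup weight by up to 2x.
        weight = BASE_WEIGHT * TAG_WEIGHTS[tag] * (1.0 + similarity)
        scored.append((weight, tag, text))
    return sorted(scored, reverse=True)
```

Calling `score_candidates(page_source)[0]` would give the highest-weighted heading. Pages that put the site name in an h1 and the headline in an h2 are one case where the similarity term can outvote the markup weight.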

The question(s):

  1. Which path should I choose?
  2. Am I missing something?
  3. Is there a better way to do this?

PS: I know that there isn't a perfect solution for a problem like this.


My suggestion would be to look at CSS rather than h1, h2, h3, as those aren't really used on most websites. A large font size is probably a clearer sign of a title than particular tags and keywords.

Likewise, smaller fonts with large paragraphs of text would most likely be the body.
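Resolving real font sizes needs a CSS engine, so the sketch below does not look at CSS at all. As a stand-in for the "large paragraphs of text" half of the idea, it simply counts how much paragraph text each container holds and picks the largest. The names `ParagraphTextCounter`, `guess_body_container`, and the `CONTAINER_TAGS` set are made up for this example, and the heuristic is a substitute technique, not what the answer literally proposes.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed set of elements that may wrap an article body.
CONTAINER_TAGS = {"div", "article", "section", "main", "td"}


class ParagraphTextCounter(HTMLParser):
    """Sums the length of <p> text per innermost enclosing container."""

    def __init__(self):
        super().__init__()
        self._open = []                      # stack of (tag, container_id)
        self._next_id = 0
        self._in_p = False
        self.labels = {}                     # container_id -> readable label
        self.text_length = defaultdict(int)  # container_id -> characters

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            attributes = dict(attrs)
            label = attributes.get("id") or attributes.get("class") or tag
            self.labels[self._next_id] = label
            self._open.append((tag, self._next_id))
            self._next_id += 1
            self._in_p = False  # a new container ends any unclosed <p>
        elif tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
        elif self._open and self._open[-1][0] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._in_p and self._open:
            self.text_length[self._open[-1][1]] += len(data.strip())


def guess_body_container(html):
    """Return the id/class/tag label of the container with the most <p> text."""
    counter = ParagraphTextCounter()
    counter.feed(html)
    if not counter.text_length:
        return None
    best = max(counter.text_length, key=counter.text_length.get)
    return counter.labels[best]
```

A rendered-page approach would replace the character count with the computed font size and the on-screen area of each block.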

I don't think there is really a good way to do this unless you act as if you're viewing the page in a web browser, rather than just looking at the source (because that's how it's intended to be read by people). However, taking pictures of a web page and using image processing to extract the content is completely unrealistic.

I hope this helps you.


It's tough to come up with weights/rules that work for more than a couple of sites. There are some pretty bad sites out there in terms of consistency or use of standard CSS. In the end I think the best option could be a combination:

  1. Use the font size
  2. Use common HTML tags used for titles, i.e. h1, h2 etc.
  3. Look for the title meta attribute.
  4. Look for CSS class attributes commonly used in articles / titles (e.g. *article*)
  5. Look at the position of text within the page (e.g. the title is commonly in the first third of the page)

Generate a score with a weighted combination of these criteria. The weight for each criterion could be part of the configuration, since it may differ from site to site.
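One way to picture the weighted combination with per-site configuration is the sketch below. The feature names, the default weights, and the `example.com` override are all invented for illustration, and each feature is assumed to be already normalized to the range 0.0 to 1.0 by whatever extracts it.

```python
# Hypothetical default weights, one per criterion in the list above.
DEFAULT_WEIGHTS = {
    "font_size": 0.30,
    "heading_tag": 0.25,
    "meta_title_similarity": 0.20,
    "css_class_hint": 0.15,
    "page_position": 0.10,
}

# Hypothetical per-site overrides, e.g. for a site that misuses headings.
SITE_OVERRIDES = {
    "example.com": {"heading_tag": 0.05, "font_size": 0.50},
}


def weights_for(site=None):
    """Merge the default weights with any overrides configured for a site."""
    weights = dict(DEFAULT_WEIGHTS)
    weights.update(SITE_OVERRIDES.get(site, {}))
    return weights


def title_score(features, site=None):
    """Weighted average of the features; missing features count as 0.0."""
    weights = weights_for(site)
    total = sum(weights.values())
    weighted = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return weighted / total


def best_title(candidates, site=None):
    """Pick the text of the candidate with the highest score."""
    return max(candidates, key=lambda c: title_score(c["features"], site))["text"]
```

Dividing by the sum of the weights keeps scores comparable between sites even when an override changes the total.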
