HTML text analysis

2023-02-11 06:19

I have a crawler that gathers articles from the web and stores the title and the body in a database. Until now, the programmer has had to come up with a set of rules per source (usually XPath and sometimes regular expressions) pointing to the article title and body sections of the web page. Now I'm trying to go one step further and have the program auto-detect the title and the body of the article. My first approach adds a weight to each element based on some common criteria. For example:

//@x-weight = 1.0

//h1/@x-weight * 2.0

//h2/@x-weight * 1.8

There are many more rules, but you get the point. After assigning the weights based on the markup, I take some other aspects into account, such as similarity to /head/title and the number of keywords. This approach produces decent results for most web pages (thanks, SEO experts :P), but it fails catastrophically for some others. I'm considering using an artificial neural network, but I can't find enough evidence that I'll get significantly better results. Another option is to bring CSS into the game and adjust the weights by font size.
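To make the weighting idea concrete, here is a minimal sketch of it using only the Python standard library. It is not the crawler's actual code: the `CandidateCollector` class, the `score_title_candidates` function, and the way the title similarity is folded into the weight (multiplying by `1 + similarity`) are all assumptions made for illustration; only the 1.0, 2.0 and 1.8 multipliers come from the rules above.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Multipliers mirroring the x-weight rules above (h1 * 2.0, h2 * 1.8).
BASE_WEIGHT = 1.0
TAG_WEIGHTS = {"h1": 2.0, "h2": 1.8}


class CandidateCollector(HTMLParser):
    """Collects the <title> text and the text of every weighted tag."""

    def __init__(self):
        super().__init__()
        self.page_title = ""
        self.candidates = []   # (tag, text) pairs in document order
        self._current = None   # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in TAG_WEIGHTS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.page_title = text
        elif text:
            self.candidates.append((tag, text))
        self._current = None


def score_title_candidates(html):
    """Return (weight, tag, text) tuples, highest weight first."""
    collector = CandidateCollector()
    collector.feed(html)
    page_title = collector.page_title.lower()
    scored = []
    for tag, text in collector.candidates:
        # Assumed combination rule: boost the tag weight by similarity to <title>.
        similarity = SequenceMatcher(None, text.lower(), page_title).ratio()
        scored.append((BASE_WEIGHT * TAG_WEIGHTS[tag] * (1.0 + similarity), tag, text))
    return sorted(scored, reverse=True)
```

With a page whose h1 holds the site name and whose h2 repeats the /head/title text, the h2 ends up ranked first even though h1 has the higher base multiplier, which is the kind of correction the similarity term is meant to provide.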

The question(s):

Which path should I choose?
Am I missing something?
Is there a better way to do this?

PS: I know that there isn't a perfect solution for a problem like this.

My suggestion would be to look at CSS rather than h1, h2, h3, as those aren't really used on most websites. A large font size would probably indicate a title more clearly than particular tags and keywords.

Likewise, smaller fonts with large paragraphs of text would most likely be the body.

I don't think there is really a good way to do this unless you act as if you're viewing the page in a web browser, rather than just looking at the source (because that's how it's intended for people to read). Taking pictures of a web page and then using image processing to extract the content, however, is completely unrealistic.
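As a rough illustration of the font-size heuristic, the sketch below assumes you already have the rendered font size of each text block (for example from a headless browser, which is not shown here). The `guess_title_and_body` function and the `(font_size_px, text)` input format are hypothetical, not part of any library.

```python
from collections import defaultdict


def guess_title_and_body(blocks):
    """Guess title and body from (font_size_px, text) pairs in page order.

    The pairs are assumed to come from a rendered page, so the sizes are
    the ones a reader would actually see.
    """
    # Title guess: the block with the largest font (first one wins on ties).
    title = max(blocks, key=lambda block: block[0])[1]

    # Body guess: the font size that carries the most characters overall.
    chars_by_size = defaultdict(int)
    for size, text in blocks:
        chars_by_size[size] += len(text)
    body_size = max(chars_by_size, key=chars_by_size.get)
    body = "\n".join(text for size, text in blocks if size == body_size)
    return title, body
```

This deliberately ignores tags altogether, so it will pick up a large banner or logo text as the title on pages that have one; in practice it would be one signal among several rather than a complete detector.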

I hope this helps you.

It's tough to come up with weights/rules that work for more than a couple of sites: there are some pretty bad sites out there in terms of consistency or use of standard CSS. In the end, I think the best option could be a combination:

Use the font size
Use common HTML tags used for titles, e.g. h1, h2, etc.
Look for the title meta attribute.
Look for CSS class attributes commonly used in articles / titles (e.g. *article)
Look at the position of the text within the page (e.g. the title is commonly in the first third of the page)

Generate a score from a weighted combination of these criteria. The weight for each criterion could be part of the configuration, since it may differ from site to site.
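One possible shape for that weighted combination is sketched below. Everything in it is an assumption for illustration: the feature names, the default weight values, the `example.com` override, and the convention that each feature has already been normalised to the range 0 to 1 by some earlier step.

```python
# Hypothetical default weights for the five criteria listed above.
DEFAULT_WEIGHTS = {
    "font_size": 0.30,
    "heading_tag": 0.25,
    "meta_title_similarity": 0.20,
    "css_class_hint": 0.15,
    "page_position": 0.10,
}

# Hypothetical per-site configuration: a site that misuses heading tags.
SITE_OVERRIDES = {
    "example.com": {"font_size": 0.50, "heading_tag": 0.05},
}


def weights_for(site=None):
    """Merge site overrides into the defaults and normalise to sum to 1."""
    weights = dict(DEFAULT_WEIGHTS)
    weights.update(SITE_OVERRIDES.get(site, {}))
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


def combined_score(features, site=None):
    """Weighted sum of feature values; missing features count as 0."""
    weights = weights_for(site)
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())


def best_candidate(candidates, site=None):
    """Pick the candidate dict whose 'features' give the highest score."""
    return max(candidates, key=lambda c: combined_score(c["features"], site))
```

Keeping the weights in plain dictionaries means the per-site tuning can live in configuration rather than code, and the same structure could later be filled in by a learned model instead of hand-picked numbers.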
