
What's the best way to get a description of the website, in Python?

Suppose I have downloaded the HTML and can parse it. How do I get the "best" description of that website if it does not have a meta description tag?


You could take the first few sentences returned by something like Readability.

Safari 5 uses it, so it must be alright :)


To follow up on the "Readability" suggestion above (which is itself inspired by the website Instapaper): its authors have released the JavaScript at http://code.google.com/p/arc90labs-readability/. What's more, someone took that and ported it to Python: http://github.com/gfxmonk/python-readability. Rejoice!


It's very hard to come up with a rule that works 100% of the time, obviously, but as a starting point I'd suggest looking for the first <h1> tag (or <h2>, <h3>, etc.; the highest-level one you can find) and using the text that follows it as the description. As long as the site is semantically marked up, that should give you a good description. (I guess you could also take the contents of the <h1> itself, but that's more like the "title".)
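Here is a rough sketch of that heading heuristic using only the standard library's `html.parser`. The names `HeadingTextParser` and `describe`, the 200-character cut-off, and the list of tags treated as block-level are my own assumptions for illustration, and real-world pages will likely need more care than this:

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")
SKIP = ("script", "style")          # their contents are never visible text
BLOCK = ("p", "div", "br", "li")    # assumed block-level tags that separate words


class HeadingTextParser(HTMLParser):
    """Collect the text that follows each heading, up to the next heading."""

    def __init__(self):
        super().__init__()
        self.sections = []      # list of (heading_level, [text chunks])
        self._in_heading = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in HEADINGS:
            self._in_heading = True
            self.sections.append((int(tag[1]), []))
        elif tag in BLOCK and self.sections:
            # Keep words from adjacent blocks from running together.
            self.sections[-1][1].append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        # Ignore script/style content, the heading's own text,
        # and anything that appears before the first heading.
        if self._skip_depth or self._in_heading or not self.sections:
            return
        self.sections[-1][1].append(data)


def describe(html, max_chars=200):
    """Return the text after the highest-level heading, or None if there is none."""
    parser = HeadingTextParser()
    parser.feed(html)
    parser.close()
    candidates = []
    for level, chunks in parser.sections:
        text = " ".join("".join(chunks).split())  # normalise whitespace
        if text:
            candidates.append((level, text))
    if not candidates:
        return None
    # min() keeps the first entry on ties, i.e. the earliest top-level heading.
    _, text = min(candidates, key=lambda c: c[0])
    return text[:max_chars]
```

Calling `describe(html)` on a page whose `<h1>` is followed by a paragraph returns the start of that paragraph. If the top-level heading has no text after it, the sketch falls back to the next-best heading that does.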

It's interesting to note that Google (for example) uses a keyword-specific extract of the page contents to display as the description, rather than a static description. Not sure if that'll work for your situation, though.
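To give an idea of what a keyword-specific extract could look like, here is a minimal sketch that returns a window of text around the first match of a search term. This is certainly not how Google does it. The function name `keyword_snippet`, the `width` parameter, and the "..." markers are assumptions made up for this example:

```python
import re


def keyword_snippet(text, keyword, width=60):
    """Return up to `width` characters on each side of the first match, or None."""
    # re.escape treats the keyword literally; IGNORECASE keeps offsets
    # valid for the original (un-lowercased) text.
    match = re.search(re.escape(keyword), text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    snippet = text[start:end].strip()
    # Mark the snippet as truncated wherever it does not reach the text's edge.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + snippet + suffix
```

You would feed it the plain text of the page (after stripping the tags) together with whatever term the user searched for, and fall back to a static description when it returns `None`.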
