Parsing Random Web Pages

I need to parse a bunch of random pages and add them to a DB. I am thinking of using regular expressions, but I was wondering if there are any 'special' techniques (other than looking for content between known text/tags). The content usually (but not always) looks like:

Some Title
Text related to Title

I guess I don't need to extract the complete text, just some way to know where the title and paragraph are so I can extract the content from there. The content itself may have images/links that I would like to retain.

Thanks!


Please see this answer: RegEx match open tags except XHTML self-contained tags


  1. Use Python. http://www.python.org/

  2. Use Beautiful Soup. http://www.crummy.com/software/BeautifulSoup/
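As a rough illustration of that suggestion, here is a minimal sketch that pairs each heading with the paragraph that follows it and keeps the paragraph's inner markup, so links and images survive. It assumes the `beautifulsoup4` package is installed, and it assumes titles are `h1`-`h3` elements followed by a sibling `p`; real pages will vary. `SAMPLE_HTML` and `extract_sections` are made-up names for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real pages will be messier.
SAMPLE_HTML = """
<html><body>
  <h2>Some Title</h2>
  <p>Text related to <a href="/more">Title</a> <img src="pic.png" alt="pic"></p>
  <h2>Another Title</h2>
  <p>Second paragraph.</p>
</body></html>
"""

def extract_sections(html):
    """Return (title, paragraph_html) pairs, keeping links/images in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for heading in soup.find_all(["h1", "h2", "h3"]):
        # Assumption: the related text is the next <p> sibling of the heading.
        paragraph = heading.find_next_sibling("p")
        if paragraph is None:
            continue
        title = heading.get_text(strip=True)
        # decode_contents() returns the inner markup, so <a> and <img> are kept.
        sections.append((title, paragraph.decode_contents()))
    return sections

sections = extract_sections(SAMPLE_HTML)
for title, body_html in sections:
    print(title, "->", body_html)
```

Each pair could then be written to the DB as a row, with the paragraph HTML stored as-is.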


You need to use a proper HTML parser, and extract the elements you’re interested in via the parser’s API (or via the DOM).

Since I don’t know which language you’re programming in, it’s hard to recommend a specific parser, but two well-known ones are Jericho for Java and Beautiful Soup for Python.
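To show what "via the parser's API" can look like without any third-party library, here is a small sketch using Python's standard `html.parser` module. It is event-based rather than DOM-based, and it is not the Jericho or Beautiful Soup API. The class name `SectionParser`, the choice of `h1`-`h3` as titles, and the sample markup are assumptions made for illustration only.

```python
from html.parser import HTMLParser

class SectionParser(HTMLParser):
    """Collect heading text, following paragraph text, and link/image URLs."""

    HEADINGS = {"h1", "h2", "h3"}  # assumption: titles use these tags

    def __init__(self):
        super().__init__()
        self.sections = []    # one dict per heading found
        self._context = None  # "title", "text", or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADINGS:
            # A new heading starts a new section.
            self.sections.append(
                {"title": "", "text": "", "links": [], "images": []}
            )
            self._context = "title"
        elif tag == "p" and self.sections:
            self._context = "text"
        elif tag == "a" and self.sections and attrs.get("href"):
            self.sections[-1]["links"].append(attrs["href"])
        elif tag == "img" and self.sections and attrs.get("src"):
            self.sections[-1]["images"].append(attrs["src"])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self._context = None

    def handle_data(self, data):
        # Text is only recorded while inside a heading or a paragraph.
        if self._context and self.sections:
            self.sections[-1][self._context] += data

parser = SectionParser()
parser.feed(
    '<h2>Some Title</h2>'
    '<p>Text related to <a href="/more">Title</a><img src="pic.png"></p>'
)
parser.close()
result = parser.sections
print(result)
```

A DOM-style parser is usually more convenient when you need the original markup back; the event-based style is enough when you only need the text and the URLs.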
