Extracting the introduction part of a Wikipedia article with Python

I want to extract the introduction part of a Wikipedia article (ignoring everything else, including tables, images and other sections). I looked at the HTML source of the articles, but I don't see any special tag that wraps this part.

Can anyone give me a quick solution to this? I'm writing Python scripts.

Thanks.


  1. You may want to check mwlib to parse the Wikipedia source.
  2. Alternatively, use the wikidump lib.
  3. Scrape the HTML with BeautifulSoup.
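As a rough illustration of the screen-scraping option, here is a minimal sketch. It swaps in the standard library's html.parser instead of BeautifulSoup so it runs without extra installs. The class name IntroParser and the function intro_from_html are made up for this example, and it assumes the introduction is every <p> outside a table that comes before the first <h2>; real Wikipedia markup may need more filtering.

```python
from html.parser import HTMLParser


class IntroParser(HTMLParser):
    """Collect the text of <p> elements before the first <h2>, skipping tables.

    Hypothetical helper for illustration; not part of any Wikipedia tooling.
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # finished paragraph strings
        self._buffer = None     # text pieces of the <p> being read, or None
        self._table_depth = 0   # > 0 while inside (possibly nested) tables
        self._done = False      # set once the first section heading is seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif tag == "h2":
            # Assumption: the first <h2> marks the end of the introduction.
            self._done = True
        elif tag == "p" and self._table_depth == 0:
            self._buffer = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # ignore empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        # Keep text only while inside a paragraph that is not in a table.
        if not self._done and self._buffer is not None and self._table_depth == 0:
            self._buffer.append(data)


def intro_from_html(html):
    """Return the introduction paragraphs as plain text, separated by blank lines."""
    parser = IntroParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Calling intro_from_html(page_source) on a downloaded page should give the lead paragraphs with inline tags such as <b> and <a> stripped, since only text data is kept.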

Ah, there are already questions on SO on this topic:

  1. Parsing a Wikipedia dump
  2. How to parse/extract data from a mediawiki marked-up article via python


I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last step would be this regex:

/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/

With the re.S option so that . matches newlines.
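In Python, this approach might look roughly like the sketch below. The function name extract_intro is invented for illustration, and it assumes the page really contains a <!-- bodytext --> comment and plain <p> tags without attributes, which may not hold for current Wikipedia markup. The repeated group is wrapped in an outer capture, because a repeated capturing group only keeps its last iteration. The table-stripping regex is naive and will not handle nested tables correctly.

```python
import re

# Naive table stripper: non-greedy, so nested tables are not handled properly.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.S | re.I)

# Same idea as the regex above; the outer group captures the whole run of
# <p>...</p> blocks instead of only the last one.
INTRO_RE = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)

PARAGRAPH_RE = re.compile(r"<p>(.*?)</p>", re.S)
TAG_RE = re.compile(r"<[^>]+>")


def extract_intro(html):
    """Return the first run of paragraphs after the marker as plain text.

    Returns None when the marker or the paragraphs are not found.
    """
    without_tables = TABLE_RE.sub("", html)
    match = INTRO_RE.search(without_tables)
    if match is None:
        return None
    paragraphs = PARAGRAPH_RE.findall(match.group(1))
    # Drop inline tags such as <b> and <a>, keeping only the text.
    return "\n\n".join(TAG_RE.sub("", p).strip() for p in paragraphs)
```

If extract_intro returns None, the marker assumption probably does not match the page you downloaded, and parsing the HTML properly is likely the safer route.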
