Extracting the introduction of a Wikipedia article with Python
I want to extract the introduction of a Wikipedia article (ignoring everything else, including tables, images and the other sections). I looked at the HTML source of several articles, but I don't see any special tag that wraps this part.
Can anyone give me a quick solution to this? I'm writing Python scripts.
Thanks.
- You may want to check mwlib to parse the Wikipedia source
- Alternatively, use the wikidump lib
- Scrape the HTML with BeautifulSoup
Ah, there are already questions on SO on this topic:
- Parsing a Wikipedia dump
- How to parse/extract data from a mediawiki marked-up article via python
I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first run of consecutive <p>...</p> blocks after the <!-- bodytext --> marker. That last step would be this regex:
/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/
Use the dot-matches-newline option (re.S, also spelled re.DOTALL, in Python) so that . also matches newlines.
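The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a robust parser: the names extract_intro, INTRO_RE, TABLE_RE and TAG_RE and the sample HTML are made up for illustration, it assumes the page actually contains the <!-- bodytext --> comment and plain <p> tags without attributes, and the non-greedy table stripping does not handle nested tables. The paragraph run is wrapped in an extra capturing group because a repeated group on its own would keep only its last repetition.

```python
import re
from html import unescape

# Same idea as the regex above, with the run of paragraphs captured as a whole.
INTRO_RE = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)
# Assumption: tables are not nested; a nested table would be cut off early.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def extract_intro(html):
    """Return the intro paragraphs as a list of plain-text strings.

    Returns None when the marker or the paragraphs are not found.
    """
    # Drop tables first, since they can sit before the first paragraph.
    without_tables = TABLE_RE.sub("", html)
    match = INTRO_RE.search(without_tables)
    if match is None:
        return None
    paragraphs = re.findall(r"<p>(.*?)</p>", match.group(1), re.S)
    # Strip the remaining inline tags and decode entities such as &amp;.
    return [unescape(TAG_RE.sub("", p)).strip() for p in paragraphs]


# Made-up sample page, only for illustration.
sample = """<html><body>
<!-- bodytext -->
<table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>
<p><b>Python</b> is a programming language.</p>
<p>It was created by Guido van Rossum &amp; first released in 1991.</p>
<h2>History</h2>
<p>Later section.</p>
</body></html>"""

print(extract_intro(sample))
```

The match stops at the first element that is not a <p> block, so the paragraph under the History heading is left out. If the marker comment is missing from the page you download, the function returns None and you would need a different anchor.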