How to parse a specific wiki page & automate that?
I am trying to make a web application that needs to parse one specific Wikipedia page and extract some information that is stored in a table on the page. The extracted data would then need to be stored in a database.
I haven't really done anything like this before. What scripting language should I use to do this? I have been reading a little, and it looks like Python (using urllib2 & BeautifulSoup) should do the job, but is it the best way of approaching the problem?
I know I could also use the MediaWiki API, but is using Python a good idea for general parsing problems?
Also, the tabular data on the Wikipedia page may change, so I need to parse it every day. How do I automate the script for this? And are there any ideas for version control without external tools like SVN, so that updates can easily be reverted if need be?
What scripting language should I use to do this?
Python will do, as you've tagged your question.
it looks like Python (using urllib2 & BeautifulSoup) should do the job, but is it the best way of approaching the problem?
It's workable. I'd use lxml.etree personally. An alternative is fetching the page in its raw wikitext format, but then you have a different parsing task.
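For example, here is a minimal sketch using urllib2 and lxml.etree's HTML parser; the URL and the "infobox" class test are assumptions about the page you are after:
import urllib2
from lxml import etree

# Fetch the page and parse the HTML into an element tree
html = urllib2.urlopen('http://en.wikipedia.org/wiki/Stack_Overflow').read()
doc = etree.fromstring(html, parser=etree.HTMLParser())

# Collect header/value pairs from the infobox table rows
data = {}
for row in doc.xpath('//table[contains(@class, "infobox")]//tr'):
    th = row.xpath('string(./th)').strip()
    td = row.xpath('string(./td)').strip()
    if th and td:
        data[th] = td

print data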
I know I could also use the MediaWiki API, but is using Python a good idea for general parsing problems?
This appears to be a statement and an unrelated argumentative question. Subjectively, if I were approaching the problem you're asking about, I'd use Python.
Also, the tabular data on the Wikipedia page may change, so I need to parse it every day. How do I automate the script for this?
Unix cron job.
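For instance, a crontab entry along these lines (the paths and the time of day are assumptions) would run the script once a day:
# run the scraper every day at 06:00
0 6 * * * /usr/bin/python /home/you/scrape_wiki.py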
And are there any ideas for version control without external tools like SVN, so that updates can easily be reverted if need be?
A Subversion repository can be run on the same machine as the script you've written. Alternatively you could use a distributed version control system, e.g. git.
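As a rough sketch, assuming the script writes its output to a file such as data.json inside a directory where git init has already been run, the script could commit each day's snapshot itself:
import subprocess

def commit_snapshot(path, message='daily Wikipedia table update'):
    # Stage the data file and record a commit; assumes git is on the PATH
    subprocess.check_call(['git', 'add', path])
    # git commit exits non-zero when nothing has changed, so don't treat that as fatal
    subprocess.call(['git', 'commit', '-m', message])

commit_snapshot('data.json')  # hypothetical output file of the scraper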
Curiously, you've not mentioned what you're planning on doing with this data.
Yes, Python is an excellent choice for web scraping.
Wikipedia updates the content often, but the structure rarely. If the table has something unique like an ID, then you can extract the data more confidently.
Here is a simple example that scrapes Wikipedia using the webscraping library:
from webscraping import common, download, xpath

# Download the page and iterate over the table rows
html = download.Download().fetch('http://en.wikipedia.org/wiki/Stackoverflow')
attributes = {}
for tr in xpath.search(html, '//table//tr'):
    th = xpath.get(tr, '/th')
    if th:
        # Pair each header cell with its value cell, cleaning tags and whitespace
        td = xpath.get(tr, '/td')
        attributes[common.clean(th)] = common.clean(td)
print attributes
And here is the output:
{'Commercial?': 'Yes', 'Available language(s)': 'English', 'URL': 'stackoverflow.com', 'Current status': 'Online', 'Created by': 'Joel Spolsky and Jeff Atwood', 'Registration': 'Optional; Uses OpenID', 'Owner': 'Stack Exchange, Inc.', 'Alexa rank': '160[1]', 'Type of site': 'Question & Answer', 'Launched': 'August 2008'}
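To get the extracted dictionary into a database, as the question mentions, a minimal sqlite3 sketch could follow; the database file name, table name and schema here are assumptions:
import sqlite3

# Persist the scraped attributes dictionary (file, table and schema are assumptions)
conn = sqlite3.connect('wiki_data.db')
conn.execute('CREATE TABLE IF NOT EXISTS infobox (key TEXT PRIMARY KEY, value TEXT)')
with conn:  # commits on success
    conn.executemany('INSERT OR REPLACE INTO infobox (key, value) VALUES (?, ?)',
                     attributes.items())
conn.close()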