How to parse a specific wiki page & automate that?
I am trying to make a web application that needs to parse one specific Wikipedia page and extract some information that is stored in a table on the page. The extracted data would then need to be stored in a database.
I haven't really done anything like this before. What scripting language should I use to do this? I have been reading a little, and it looks like Python (using urllib2 & BeautifulSoup) should do the job, but is it the best way of approaching the problem?
I know I could also use the MediaWiki API, but is using Python a good idea for general parsing problems?
Also, the tabular data on the Wikipedia page may change, so I need to parse it every day. How do I automate the script for this? And are there any ideas for version control without external tools like SVN, so that updates can easily be reverted if need be?
What scripting language should I use to do this?
Python will do, as you've tagged your question.
it looks like Python (using urllib2 & BeautifulSoup) should do the job, but is it the best way of approaching the problem?
It's workable. I'd use lxml.etree personally. An alternative is fetching the page in its raw wikitext format, but then you have a different parsing task.
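For example, here is a minimal sketch using urllib2 and lxml.etree's HTML parser; the URL and the "infobox" class test are assumptions about the page you are after:
import urllib2
from lxml import etree

# Fetch the page and parse the HTML into an element tree
html = urllib2.urlopen('http://en.wikipedia.org/wiki/Stack_Overflow').read()
doc = etree.fromstring(html, parser=etree.HTMLParser())

# Collect header/value pairs from the infobox table rows
data = {}
for row in doc.xpath('//table[contains(@class, "infobox")]//tr'):
    th = row.xpath('string(./th)').strip()
    td = row.xpath('string(./td)').strip()
    if th and td:
        data[th] = td

print data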
I know I could also use the MediaWiki API, but is using Python a good idea for general parsing problems?
This appears to be a statement and an unrelated argumentative question. Subjectively, if I were approaching the problem you're asking about, I'd use Python.
Also, the tabular data on the Wikipedia page may change, so I need to parse it every day. How do I automate the script for this?
Unix cron job.
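For instance, a crontab entry along these lines (the paths and the time of day are assumptions) would run the script once a day:
# run the scraper every day at 06:00
0 6 * * * /usr/bin/python /home/you/scrape_wiki.py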
And are there any ideas for version control without external tools like SVN, so that updates can easily be reverted if need be?
A Subversion repository can be run on the same machine as the script you've written. Alternatively you could use a distributed version control system, e.g. git.
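As a rough sketch, assuming the script writes its output to a file such as data.json inside a directory where git init has already been run, the script could commit each day's snapshot itself:
import subprocess

def commit_snapshot(path, message='daily Wikipedia table update'):
    # Stage the data file and record a commit; assumes git is on the PATH
    subprocess.check_call(['git', 'add', path])
    # git commit exits non-zero when nothing has changed, so don't treat that as fatal
    subprocess.call(['git', 'commit', '-m', message])

commit_snapshot('data.json')  # hypothetical output file of the scraper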
Curiously, you've not mentioned what you're planning on doing with this data.
Yes, Python is an excellent choice for web scraping.
Wikipedia updates the content often, but the structure rarely. If the table has something unique like an ID, then you can extract the data more confidently.
Here is a simple example that scrapes Wikipedia using the webscraping library:
from webscraping import common, download, xpath

# Download the page and iterate over the table rows
html = download.Download().fetch('http://en.wikipedia.org/wiki/Stackoverflow')
attributes = {}
for tr in xpath.search(html, '//table//tr'):
    th = xpath.get(tr, '/th')
    if th:
        # Pair each header cell with its value cell, cleaning tags and whitespace
        td = xpath.get(tr, '/td')
        attributes[common.clean(th)] = common.clean(td)
print attributes
And here is the output:
{'Commercial?': 'Yes', 'Available language(s)': 'English', 'URL': 'stackoverflow.com', 'Current status': 'Online', 'Created by': 'Joel Spolsky and Jeff Atwood', 'Registration': 'Optional; Uses OpenID', 'Owner': 'Stack Exchange, Inc.', 'Alexa rank': '160[1]', 'Type of site': 'Question & Answer', 'Launched': 'August 2008'}
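To get the extracted dictionary into a database, as the question mentions, a minimal sqlite3 sketch could follow; the database file name, table name and schema here are assumptions:
import sqlite3

# Persist the scraped attributes dictionary (file, table and schema are assumptions)
conn = sqlite3.connect('wiki_data.db')
conn.execute('CREATE TABLE IF NOT EXISTS infobox (key TEXT PRIMARY KEY, value TEXT)')
with conn:  # commits on success
    conn.executemany('INSERT OR REPLACE INTO infobox (key, value) VALUES (?, ?)',
                     attributes.items())
conn.close()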