
Extracting data from websites using Python

I'm pretty new to web development. I have an idea I'd like to explore, and I'd like some advice on which tools to use. I know Python and have been learning Django recently, so ideally I'd like to use both.

I think what I want to do involves some basic HTML parsing and regular expressions. Basically, I want to aggregate certain bits of useful information from several websites into one site. Suppose, for example, there are a dozen high schools whose graduation dates, times, and locations I'm interested in. The information on each high school's site is presented in roughly the same way, so I want to extract the text that follows "location" or "venue", "time", "date", etc., and have it posted automatically on my site. I would also like it to be updated if any of the info changes on any of the high school sites.

What would you use to accomplish this task? Also, if you know of any useful tutorials or other resources you could point me to, that would be much appreciated!


For the extraction part, I think your best bet is Beautiful Soup, mostly because it's easy to use and will try to parse almost anything, even broken XML/HTML.
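Here is a rough sketch of what that could look like, assuming each page shows its details as plain text such as "Venue: Main Gymnasium". The label list, the extract_fields helper, and the sample HTML are made up for illustration and are not taken from any real school site. It uses the bs4 package with Python's built-in "html.parser" backend.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical labels to look for; adjust to whatever the real pages use.
LABELS = ("location", "venue", "date", "time")

# Matches a whole text node such as "Venue: Main Gymnasium" (case-insensitive).
LABEL_RE = re.compile(
    r"^\s*(%s)\s*:\s*(.+?)\s*$" % "|".join(LABELS),
    re.IGNORECASE,
)


def extract_fields(html):
    """Return a dict mapping each label found (lowercased) to its value."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    # stripped_strings yields every text node with surrounding whitespace removed.
    for text in soup.stripped_strings:
        match = LABEL_RE.match(text)
        if match:
            fields[match.group(1).lower()] = match.group(2)
    return fields


# Made-up sample page; the <div> is deliberately left unclosed to show that
# the parser still copes with sloppy markup.
SAMPLE_HTML = """
<html><body>
  <h1>Graduation 2010</h1>
  <p>Date: June 12</p>
  <p>Time: 10:00 AM</p>
  <div>Venue: Main Gymnasium
</body></html>
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE_HTML))
```

This only handles the case where the label and its value sit in the same text node. If a site puts the label in its own tag (for example a bold "Date:" followed by the value), you would need to look at the neighbouring node instead. To pick up changes, one option is to run the extraction on a schedule and compare the result with what you stored last time.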


Check out BeautifulSoup.

Update:

If you also need to fill in and submit forms, you can use mechanize.
