Guidance on Python scraping packages

I'm still a newcomer to Python, so I hope this question isn't inane.

The more I google for web scraping solutions, the more confused I become (unable to see the forest, despite investigating many trees).

I've been reading documentation on a number of projects, including (but not limited to) scrapy, mechanize, and spynner,

but I can't really figure out which hammer I should be trying to use.

There is a specific site I'm trying to crawl (www.schooldigger.com). It uses ASP, and there's some JavaScript I need to be able to emulate.

I'm aware this sort of problem isn't easily dealt with, so I'd love any guidance.

In addition to some general discussion of the options available (and the relationships between different projects, if possible), I have a couple of specific questions:

  1. When using scrapy, is there any way to avoid defining the 'items' to be parsed, and just download the first couple hundred pages or so? I don't actually want to download entire websites, but I would like to be able to see which pages are being downloaded while developing the scraper.

  2. mechanize, ASP and JavaScript: please see a question I posted but haven't seen any answers to, https://stackoverflow.com/questions/4249513/emulating-js-in-mechanize

  3. Why not build some sort of utility (either a TurboGears application or a browser plug-in) that allows a user to select links to follow and items to parse graphically? All I'm suggesting is some sort of GUI to sit around a parsing API. I don't know if I have the technical knowledge to create such a project, but I don't see why it isn't possible; in fact, it seems rather feasible given what I know about Python. Maybe some feedback about what problems this sort of project would face?

  4. Most importantly, are all web crawlers built 'site specific'? It seems to me that I'm sort of reinventing the wheel in my code (but that's probably because I'm not very good at programming).

  5. Does anyone have any examples of fully-featured scrapers? There are lots of examples in the documentation (which I've been studying), but they all seem to focus on simplicity, just for the exposition of package usage. Maybe I'd benefit from a more detailed or complicated example.

Thanks for your thoughts.


For full browser interaction, your best bet is Selenium RC.

It has a Python driver, and you can script a browser to "test" just about any site on the internet.
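
To make that concrete, here is a rough sketch of the idea rather than a definitive recipe. Selenium RC has since been superseded by Selenium WebDriver, so this uses the WebDriver Python bindings instead of the RC client. It assumes the `selenium` package and Firefox (with its driver) are installed. The names `LinkCollector`, `rendered_links` and `open_firefox` are made up for illustration, and I have not run it against schooldigger.com.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def rendered_links(driver, url):
    """Load `url` in the browser and return the links found in its HTML.

    `driver` is any object exposing WebDriver's get() method and
    page_source attribute. The browser runs the page's JavaScript, but
    content that scripts add well after loading may need an explicit wait.
    """
    driver.get(url)
    collector = LinkCollector(url)
    collector.feed(driver.page_source)
    return collector.links


def open_firefox():
    """Start a real Firefox session (hypothetical helper)."""
    # Imported here so the helpers above can be used without selenium.
    from selenium import webdriver
    return webdriver.Firefox()


# Example usage (opens a real browser window):
#
#     driver = open_firefox()
#     try:
#         for link in rendered_links(driver, "https://www.schooldigger.com/"):
#             print(link)
#     finally:
#         driver.quit()
```

Because the browser does the rendering, you get the page as it looks after its scripts have run, which is the part mechanize cannot emulate. The cost is speed: driving a real browser is much slower than plain HTTP requests.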
