
How can I input data into a webpage to scrape the resulting output using Python?

I am familiar with BeautifulSoup and urllib2 to scrape data from a webpage. However, what if a parameter needs to be entered into the page before the result that I want to scrape is returned?

I'm trying to obtain the geographic distance between two addresses using this website: http://www.freemaptools.com/how-far-is-it-between.htm

I want to be able to go to the page, enter two addresses, click "Show", and then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary.

Is there any way to input data into a webpage using Python?


Take a look at tools like mechanize or scrape:

  • http://pypi.python.org/pypi/mechanize
  • http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
  • http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/

  • http://zesty.ca/scrape/

Packt Publishing has an article on that matter, too:

  • http://www.packtpub.com/article/web-scraping-with-python


Yes! Try mechanize for this kind of Web screen-scraping task.


I think you can also use PySide/PyQt, because they include a QtWebKit browser core. You can control the browser to open pages, simulate human actions (fill in fields, click, ...), and then scrape data from the pages. FMiner works this way; it is web scraping software I developed with PySide.

Or you can try PhantomJS. It makes it easy to control a browser, but it is scripted in JavaScript, not Python.


In addition to the answers already given, you could simply request the results directly. In your browser you can inspect the Network tab (under Tools/Web Developer tools) to see which requests are sent when you interact with the page. E.g. http://www.freemaptools.com/ajax/getaandb.php?a=Florida_Usa&b=New%20York_Usa&c=6052 is the request that returns the results you are expecting. Request that URL and scrape the fields you want. IMHO, direct page requests are much faster than screen scraping (though it depends on the case).
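As a minimal sketch of that direct-request idea, using only the Python 3 standard library (`urllib.request` replaces the `urllib2` mentioned in the question): the endpoint and the `a`, `b`, `c` parameter names are copied from the URL above. What `c` means is not stated in this thread, so the default of 6052 is simply the value from that example, and the function names are made up for illustration. The site may have changed since, so treat the endpoint itself as an assumption too.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the example URL in the answer; it may no longer exist.
ENDPOINT = "http://www.freemaptools.com/ajax/getaandb.php"


def build_query_url(address_a, address_b, c=6052):
    """Build the query URL for two addresses.

    The meaning of `c` is unknown; 6052 is the value seen in the example.
    """
    params = {"a": address_a, "b": address_b, "c": c}
    # quote_via=quote encodes spaces as %20 (the urlencode default is '+'),
    # which matches the URL observed in the browser.
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    return ENDPOINT + "?" + query


def fetch_raw_result(address_a, address_b, c=6052, timeout=10):
    """Request the URL and return the response body as text."""
    url = build_query_url(address_a, address_b, c)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_raw_result("Florida_Usa", "New York_Usa")` would return whatever the server sends back. The format of that response is not shown in this thread, so inspect it first and then decide how to pull the two distances out of it (BeautifulSoup, a regular expression, or plain string splitting).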

But of course, you could also do screen scraping/browser simulation (Mechanize, Splinter), or use a headless browser (PhantomJS, etc.) or the driver for the browser you want to use.


The query may have been resolved.

You can use Selenium WebDriver for this purpose. It lets you interact with a web page from a programming language, and all operations can be performed as if a human user were accessing the page.
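A rough sketch of what that could look like in Python follows. The function takes an already-created WebDriver (for example `selenium.webdriver.Firefox()`), so the Selenium import stays with the caller. The function name and the element ids passed in through `ids` are hypothetical; you would have to read the real ids from the page with the browser's inspector. The sketch also assumes the results show up in form fields whose `value` attribute can be read, and it uses a plain polling loop to wait for them; Selenium's own `WebDriverWait` is the more usual tool for that.

```python
import time


def scrape_distances(driver, url, address_a, address_b, ids, timeout=10.0):
    """Fill in two addresses, click the button, and return both distances.

    driver: a Selenium WebDriver instance, created by the caller.
    ids:    dict of element ids (hypothetical, look them up in the page):
            "from", "to", "button", "crow_flies", "land_transport".
    """
    driver.get(url)
    # "id" is the locator strategy string that Selenium's By.ID stands for.
    driver.find_element("id", ids["from"]).send_keys(address_a)
    driver.find_element("id", ids["to"]).send_keys(address_b)
    driver.find_element("id", ids["button"]).click()

    # The results may arrive asynchronously, so poll until the first result
    # field is non-empty or the timeout runs out.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.find_element("id", ids["crow_flies"]).get_attribute("value"):
            break
        time.sleep(0.25)

    def read(key):
        return driver.find_element("id", ids[key]).get_attribute("value")

    return {
        "Distance as the Crow Flies": read("crow_flies"),
        "Distance by Land Transport": read("land_transport"),
    }
```

The returned dictionary has the two keys asked for in the question. If the timeout expires first, the values are whatever the fields contained at that moment, which may be empty strings.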
