Extracting data from an html path with Scrapy for Python
Overview of my project:
I'm trying to write a simple script in Python 2.6 that gets travel-time data from Bing Maps. I'm using the Scrapy library (scrapy.org/) to crawl the site and extract the data.
The picture above shows what I want (the highlighted part for now, but ultimately the time below it will be needed too).
I first ran a test to see whether the start URL would go through, and used an output log to print the URL if the request succeeded. Once that worked, my next step was to extract the data I need from the web page.
I have been using the Firebug, XPather, and XPath Firefox add-ons to find the HTML path of the data I want to extract. This link has been pretty helpful in guiding me to code the paths correctly (doc.scrapy.org/topics/selectors.html). Looking at Firebug, this is what I want to extract:
<span class="time">22 min</span>
and XPather shows this as the path for this particular item:
/div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]
When I run the program in cmd with the path above, the extracted data prints out as [], and when I add /class='time' to the end of span, the output is [u'False']. Looking a bit closer in the DOM window of Firebug, I noticed that class="time" is false for isID and that the childNode holds the data I need. How do I extract the data from the childNode?
Below is my code so far
from scrapy import log # This module is useful for printing out debug information
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector, XPathSelectorList, XmlXPathSelector
import html5lib

class BingSpider(BaseSpider):
    name = 'bing.com/maps'
    allowed_domains = ["bing.com/maps"]
    start_urls = [
        "http://www.bing.com/maps/?FORM=Z9LH4#Y3A9NDAuNjM2MDAxNTg1OTk5OTh+LTc0LjkxMTAwMzExMiZsdmw9OCZzdHk9ciZydHA9cG9zLjQwLjcxNDU0OF8tNzQuMDA3MTI1X05ldyUyMFlvcmslMkMlMjBOWV9fX2VffnBvcy40MC43MzE5N18tNzQuMTc0MTg1MDAwMDAwMDRfTmV3YXJrJTJDJTIwTkpfX19lXyZtb2RlPUQmcnRvcD0wfjB+MH4="
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        x = HtmlXPathSelector(response)
        time = x.select("//div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]").extract()
        print time
CMD output
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Enabled item pipelines:
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-09-05 17:43:01-0400 [bing.com] INFO: Spider opened
2011-09-05 17:43:02-0400 [bing.com] DEBUG: Crawled (200) <GET http://www.bing.com/maps/#Y3A9NDAuNzIzMjYwOTYzMTUwMDl+LTc0LjA5MDY1NSZsdmw9MTImc3R5PXImcnRwPXBvcy40MC43MzE5N18tNzQuMTc0MTg1X05ld2FyayUyQyUyME5KX19fZV9+cG9zLjQwLjcxNDU0OF8tNzQuMDA3MTI0OTk5OTk5OTdfTmV3JTIwWW9yayUyQyUyME5ZX19fZV8mbW9kZT1EJnJ0b3A9MH4wfjB+> (referer: None)
2011-09-05 17:43:02-0400 [bing.com] DEBUG: A response from http://www.bing.com/maps/ just arrived!
[]
2011-09-05 17:43:02-0400 [bing.com] INFO: Closing spider (finished)
2011-09-05 17:43:02-0400 [bing.com] INFO: Spider closed (finished)
When a website relies heavily on JavaScript, you cannot trust the XPath that browser tools such as Firebug show you, because that XPath describes the page after the JavaScript code has run, and Scrapy does not run JavaScript code.
You should:
Open the Network tab of the developer tools of your web browser.
Perform on the website the steps to get to the desired data, while you watch the corresponding requests performed by the website on the Network tab.
Try to reproduce those steps (requests) with Scrapy.
See also Debugging Spiders.
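One detail in the log above fits this explanation: the response URL is just http://www.bing.com/maps/, without the long part after "#". That part is a URL fragment, which stays in the browser and is not sent in the HTTP request, so the server has no route information to render. Below is a minimal Python 3 sketch of the split, using only the standard library. The URL is a shortened stand-in for the one in the question, not the full original.

```python
from urllib.parse import urldefrag

# Shortened stand-in for the start URL in the question (an assumption made
# for readability); the part after "#" is the fragment.
start_url = "http://www.bing.com/maps/?FORM=Z9LH4#Y3A9NDAuNjM2"

# urldefrag separates the part that goes over HTTP from the fragment.
requested_url, fragment = urldefrag(start_url)

print(requested_url)  # the only part the server receives
print(fragment)       # stays on the client; only in-page JavaScript reads it
```

This is why watching the Network tab matters: the requests that actually carry the route parameters are the ones the page's JavaScript sends after loading, and those are the ones to reproduce.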
For all scraping purposes, use BeautifulSoup:
soup.find('span', attrs={'class': 'time'})
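As a rough illustration of selecting the span by its class and reading its text node, here is a Python 3 sketch that uses the standard library's ElementTree as a stand-in for BeautifulSoup or Scrapy selectors. The markup is hypothetical, modelled on the snippet in the question; on the real page it is built by JavaScript, so it would only be found if it is actually present in the downloaded response.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the snippet in the question.
# The placeholder spans are invented only so that the target is the third one.
markup = """
<div id="TaskHost_DrivingDirectionsSummaryContainer">
  <div>
    <span>placeholder one</span>
    <span>placeholder two</span>
    <span class="time">22 min</span>
  </div>
</div>
""".strip()

root = ET.fromstring(markup)

# Select the element by its class attribute, then read its text node.
span = root.find(".//span[@class='time']")
travel_time = span.text if span is not None else None

print(travel_time)  # 22 min
```

In full XPath, as used by Scrapy selectors, the equivalent expression would be //span[@class='time']/text(). Appending /class='time' to a path most likely produces a boolean comparison instead of a selection, which would explain the [u'False'] result in the question.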