Scraping from URLs that match a regular expression
I've been trying to scrape data from the website http://uk.ratemyteachers.com/. I want to get information on a certain number of teachers whose names I do not know in advance.
Every teacher has a page on the website that follows a regular pattern. For instance, the teacher Lois Banks is stored at http://uk.ratemyteachers.com/lois-banks/184618-t. So the pattern is the teacher's name, a slash, a number, and "-t".
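To make the pattern concrete, here is a small sketch of a regex for that URL shape (an illustration only; the real site may have other URL variants):

```python
import re

# Teacher page pattern described above: name slug, slash, number, "-t".
pattern = re.compile(r'^http://uk\.ratemyteachers\.com/[a-z0-9-]+/\d+-t$')

print(bool(pattern.match('http://uk.ratemyteachers.com/lois-banks/184618-t')))  # True
print(bool(pattern.match('http://uk.ratemyteachers.com/lois-banks')))           # False
```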
Earlier I tried to use CrawlSpider to crawl from the homepage using regular expressions, but it did not work because the pages I'm trying to access are not linked from the homepage; the only way to reach them is by typing the teacher's name into the search box.
I tried to write the following spider, but it did not work:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from scrapy.http import Request
from rmt.items import RmtItem_2

class RmtSpider(CrawlSpider):
    name = 'rmtspider_4'
    allowed_domains = ['uk.ratemyteachers.com']
    start_urls = ['http://uk.ratemyteachers.com/[-a-z0-9/]-t+$',]

    def parse_category(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//div[@class="main-c"]'
        sub_selectors = main_selector.select(xpath)
        for selector in sub_selectors:
            item = RmtItem_2()
            l = XPathItemLoader(item=item, selector=selector)
            l.add_value('url', response.url)
            l.add_xpath('name', '//div[@class="breadcrumb"]/a[5]/text()')
            l.add_xpath('school', '//div[@class="breadcrumb"]/a[3]/text()')
            l.add_xpath('department', '//div[@class="breadcrumb"]/a[4]/text()')
            l.add_xpath('total_ratings', '//div[@class="desc-details"]/span/text()')
            l.add_xpath('location', '//div[@class="breadcrumb"]/a[2]/text()')
            yield l.load_item()
I would appreciate it if somebody could help me with this issue. Thank you in advance.
There are a couple of ways to approach it:
(i) You can submit a POST request to simulate a search and then extract the URL for that particular teacher.
(ii) If all the teachers are from the same school, locate the school's directory on the site and crawl all the teachers from there.
Why don't you start crawling from the sitemap and work your way down to the teacher through those pages?
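If the site does publish a sitemap (an assumption; check for something like /sitemap.xml), picking the teacher URLs out of it needs nothing beyond the standard library. A rough sketch, with a made-up inline sample instead of a live fetch:

```python
import re
import xml.etree.ElementTree as ET

# Teacher pages end in "/<name-slug>/<number>-t" as described in the question.
TEACHER_URL_RE = re.compile(r'/[a-z0-9-]+/\d+-t$')

def teacher_urls_from_sitemap(xml_text):
    """Return only the sitemap <loc> entries that look like teacher pages."""
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    root = ET.fromstring(xml_text)
    locs = [loc.text for loc in root.findall('.//sm:loc', ns)]
    return [u for u in locs if TEACHER_URL_RE.search(u)]

# Hypothetical sitemap fragment for illustration.
sample = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://uk.ratemyteachers.com/lois-banks/184618-t</loc></url>
  <url><loc>http://uk.ratemyteachers.com/about</loc></url>
</urlset>"""

print(teacher_urls_from_sitemap(sample))
# ['http://uk.ratemyteachers.com/lois-banks/184618-t']
```

In a real crawl you would download the sitemap first (or let Scrapy do it) and feed the matching URLs in as start requests.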
As others have said, before applying the regex to filter for the teachers you need, you first have to obtain the links. Getting the links by brute force is impractical.
So you need to use the search form to get the teachers' links. Use something like this:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class MySpider(BaseSpider):
    name = 'teacher_search'

    def start_requests(self):
        # http://doc.scrapy.org/topics/spiders.html#scrapy.spider.BaseSpider.start_requests
        return [FormRequest("http://uk.ratemyteachers.com/SelectSchoolSearch.php",
                            # put your parameters here - use FireBug to see the POST data you need
                            formdata={'user': 'john', 'pass': 'secret'},
                            callback=self.parse_search)]

    def parse_search(self, response):
        ...
Or, as Philip Southam said, parse all the schools, get all the teachers' links, and filter out the ones you need.
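That filtering step can be done with plain regex once you have hrefs in hand. A minimal sketch (the hrefs and teacher names below are made up for illustration):

```python
import re

# Relative teacher links have the shape "/<name-slug>/<number>-t".
TEACHER_HREF_RE = re.compile(r'^/(?P<slug>[a-z0-9-]+)/(?P<id>\d+)-t$')

# Slugs of the teachers you are after (hypothetical).
wanted = {'lois-banks', 'john-smith'}

# Hrefs as they might be scraped from a school directory page (hypothetical).
hrefs = ['/lois-banks/184618-t', '/about-us', '/john-smith/99999-t', '/some-school/12-s']

teacher_links = []
for href in hrefs:
    m = TEACHER_HREF_RE.match(href)
    if m and m.group('slug') in wanted:
        teacher_links.append('http://uk.ratemyteachers.com' + href)

print(teacher_links)
# ['http://uk.ratemyteachers.com/lois-banks/184618-t',
#  'http://uk.ratemyteachers.com/john-smith/99999-t']
```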
I guess you need more examples, but you'll have to work that out yourself - read the documentation and the Scrapy sources.
I have heard many good things about the HTML Agility Pack (although I haven't used it):
http://html-agility-pack.net/?z=codeplex