Scraping from URLs that match a regular expression
I've been trying to scrape data from the website http://uk.ratemyteachers.com/. I want to get information on a certain number of teachers whose names I do not know in advance.
Every teacher has a page on the website that follows a regular pattern. For instance, the teacher Lois Banks is stored at http://uk.ratemyteachers.com/lois-banks/184618-t. So the pattern is the teacher's name, a slash, a number, and "-t".
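To make the pattern concrete, here is a small sketch of a regex for that URL shape (an illustration only; the real site may have other URL variants):

```python
import re

# Teacher page pattern described above: name slug, slash, number, "-t".
pattern = re.compile(r'^http://uk\.ratemyteachers\.com/[a-z0-9-]+/\d+-t$')

print(bool(pattern.match('http://uk.ratemyteachers.com/lois-banks/184618-t')))  # True
print(bool(pattern.match('http://uk.ratemyteachers.com/lois-banks')))           # False
```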
Earlier I tried to use CrawlSpider to crawl from the homepage using regular expressions, but it did not work because the pages I'm trying to access are not linked from the homepage; the only way to reach them is by typing the teacher's name into the search box.
I tried to write the following spider, but it did not work:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from scrapy.http import Request
from rmt.items import RmtItem_2

class RmtSpider(CrawlSpider):
    name = 'rmtspider_4'
    allowed_domains = ['uk.ratemyteachers.com']
    start_urls = ['http://uk.ratemyteachers.com/[-a-z0-9/]-t+$',]

    def parse_category(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//div[@class="main-c"]'
        sub_selectors = main_selector.select(xpath)
        for selector in sub_selectors:
            item = RmtItem_2()
            l = XPathItemLoader(item=item, selector=selector)
            l.add_value('url', response.url)
            l.add_xpath('name', '//div[@class="breadcrumb"]/a[5]/text()')
            l.add_xpath('school', '//div[@class="breadcrumb"]/a[3]/text()')
            l.add_xpath('department', '//div[@class="breadcrumb"]/a[4]/text()')
            l.add_xpath('total_ratings', '//div[@class="desc-details"]/span/text()')
            l.add_xpath('location', '//div[@class="breadcrumb"]/a[2]/text()')
            yield l.load_item()
I would appreciate it if somebody could help me with this issue. Thank you in advance.
There are a couple of ways to approach it:
(i) You can submit a POST request to simulate a search and then extract the URL for that particular teacher.
(ii) If all the teachers are from the same school, locate the school's directory on the site and crawl all the teachers from there.
Why don't you start crawling from the sitemap and work your way down to the teacher through those pages?
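If the site does publish a sitemap (an assumption; check for something like /sitemap.xml), picking the teacher URLs out of it needs nothing beyond the standard library. A rough sketch, with a made-up inline sample instead of a live fetch:

```python
import re
import xml.etree.ElementTree as ET

# Teacher pages end in "/<name-slug>/<number>-t" as described in the question.
TEACHER_URL_RE = re.compile(r'/[a-z0-9-]+/\d+-t$')

def teacher_urls_from_sitemap(xml_text):
    """Return only the sitemap <loc> entries that look like teacher pages."""
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    root = ET.fromstring(xml_text)
    locs = [loc.text for loc in root.findall('.//sm:loc', ns)]
    return [u for u in locs if TEACHER_URL_RE.search(u)]

# Hypothetical sitemap fragment for illustration.
sample = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://uk.ratemyteachers.com/lois-banks/184618-t</loc></url>
  <url><loc>http://uk.ratemyteachers.com/about</loc></url>
</urlset>"""

print(teacher_urls_from_sitemap(sample))
# ['http://uk.ratemyteachers.com/lois-banks/184618-t']
```

In a real crawl you would download the sitemap first (or let Scrapy do it) and feed the matching URLs in as start requests.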
As others have said, before applying the regex to filter for the teachers you need, you first have to obtain the links. Getting the links by brute force is impractical.
So you need to use the search form to get the teachers' links. Use something like this:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class MySpider(BaseSpider):
    name = 'teacher_search'

    def start_requests(self):
        # http://doc.scrapy.org/topics/spiders.html#scrapy.spider.BaseSpider.start_requests
        return [FormRequest("http://uk.ratemyteachers.com/SelectSchoolSearch.php",
                            # put your parameters here - use FireBug to see the POST data you need
                            formdata={'user': 'john', 'pass': 'secret'},
                            callback=self.parse_search)]

    def parse_search(self, response):
        ...
Or, as Philip Southam said, parse all the schools, get all the teachers' links, and filter out the ones you need.
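That filtering step can be done with plain regex once you have hrefs in hand. A minimal sketch (the hrefs and teacher names below are made up for illustration):

```python
import re

# Relative teacher links have the shape "/<name-slug>/<number>-t".
TEACHER_HREF_RE = re.compile(r'^/(?P<slug>[a-z0-9-]+)/(?P<id>\d+)-t$')

# Slugs of the teachers you are after (hypothetical).
wanted = {'lois-banks', 'john-smith'}

# Hrefs as they might be scraped from a school directory page (hypothetical).
hrefs = ['/lois-banks/184618-t', '/about-us', '/john-smith/99999-t', '/some-school/12-s']

teacher_links = []
for href in hrefs:
    m = TEACHER_HREF_RE.match(href)
    if m and m.group('slug') in wanted:
        teacher_links.append('http://uk.ratemyteachers.com' + href)

print(teacher_links)
# ['http://uk.ratemyteachers.com/lois-banks/184618-t',
#  'http://uk.ratemyteachers.com/john-smith/99999-t']
```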
I guess you need more examples, but you'll have to work that out yourself - read the documentation and the Scrapy sources.
I have heard many good things about the HTML Agility Pack (although I haven't used it):
http://html-agility-pack.net/?z=codeplex