Scrapy spider index error
This is the code for Spider1 that I've been trying to write within the Scrapy framework:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from firm.items import FirmItem

class Spider1(CrawlSpider):
    domain_name = 'wc2'
    start_urls = ['http://www.whitecase.com/Attorneys/List.aspx?LastName=A']
    rules = (
        Rule(SgmlLinkExtractor(allow=["hxs.select(
            '//td[@class='altRow'][1]/a/@href').re('/.a\w+')"]),
            callback='parse'),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        JD = FirmItem()
        JD['school'] = hxs.select(
            '//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return JD

SPIDER = Spider1()
The regex in the rules successfully pulls all the bio URLs that I want from the start URL:
>>> hxs.select(
...     '//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto',
 u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic',
 u'/kallchurch', u'/jalleyne', u'/lalonzo', u'/malthoff', u'/valvarez', u'/camon',
 u'/randerson', u'/eandreeva', u'/pangeli', u'/jangland', u'/mantczak', u'/daranyi',
 u'/carhold', u'/marora', u'/garrington', u'/jartzinger', u'/sasayama',
 u'/masschenfeldt', u'/dattanasio', u'/watterbury', u'/jaudrlicka', u'/caverch',
 u'/fayanruoh', u'/razar']
>>>
But when I run the code I get:
[wc2] ERROR: Error processing FirmItem(school=[]) -
[Failure instance: Traceback: <type 'exceptions.IndexError'>: list index out of range
This is the FirmItem in items.py:

from scrapy.item import Item, Field

class FirmItem(Item):
    school = Field()
Can you help me understand where the index error occurs?
It seems to me that it has something to do with SgmlLinkExtractor.
I've been trying to make this spider work for weeks with Scrapy. The tutorial is excellent, but I am new to Python and web programming, so I don't understand how, for instance, SgmlLinkExtractor works behind the scenes.
Would it be easier for me to write a spider with the same simple functionality using plain Python libraries? I would appreciate any comments and help.
Thanks
SgmlLinkExtractor doesn't support selectors in its "allow" argument.
So this is wrong:
SgmlLinkExtractor(allow=["hxs.select('//td[@class='altRow'] ...')"])
This is right:
SgmlLinkExtractor(allow=[r"product\.php"])
The parse function is called for each match of your SgmlLinkExtractor.
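For instance, here is a minimal sketch of what a corrected spider could look like (the /[a-z]+$ pattern and the parse_bio callback name are illustrative guesses, not taken from your code):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class Spider1(CrawlSpider):
    domain_name = 'wc2'
    start_urls = ['http://www.whitecase.com/Attorneys/List.aspx?LastName=A']

    # allow takes plain regex strings that are matched against each
    # candidate link URL; it cannot evaluate an hxs.select(...) expression.
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'/[a-z]+$']), callback='parse_bio'),
    )

    # Named parse_bio instead of parse: CrawlSpider implements its own
    # parse method to apply the rules, so overriding parse breaks crawling.
    def parse_bio(self, response):
        pass  # extract your item from the bio page here

SPIDER = Spider1()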
As Pablo mentioned, you want to simplify your SgmlLinkExtractor.
I also tried to put the names scraped from the initial URL into a list and then pass each name to parse in the form of an absolute URL, such as http://www.whitecase.com/aabbas (for /aabbas).
The following code loops over the list, but I don't know how to pass the result to parse. Do you think this is a better idea?
baseurl = 'http://www.whitecase.com'
names = ['aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']

def makeurl(baseurl, names):
    for x in names:
        url = baseurl + x
        baseurl = 'http://www.whitecase.com'
        x = ''
    return url
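One way to pass those URLs to parse is to yield Request objects rather than returning a string: Scrapy downloads each one and calls the callback with the response. A minimal sketch under that assumption, reusing your sample names normalized to all start with a slash (BioSpider is just an illustrative name):

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class BioSpider(BaseSpider):
    domain_name = 'wc2'
    baseurl = 'http://www.whitecase.com'
    names = ['/aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']

    def start_requests(self):
        # Yield one Request per bio page; Scrapy fetches each URL and
        # passes the response to self.parse.
        for name in self.names:
            yield Request(self.baseurl + name, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # ... build and return your FirmItem(s) here ...
        return []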