开发者

Formatting CSV Results of Scrapy

I am trying to scrape a website and save and format the results to a CSV file. I am able to save the file, however have three questions regarding the output and formatting:

  • All of the results reside in one cell rather than on multiple lines. Is there a command I am forgetting to use when listing items so they appear in a list?

  • How can I removed the ['u... that precedes each result? (I searched and saw how to do so for print, but not return)

  • Is there a way to add text to certain item results? (For example can I add "http://groupon.com" to the beginning of each deallink result?)

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from deals.items import DealsItem

class DealsSpider(BaseSpider):
    name = "groupon.com"
    allowed_do开发者_开发问答mains = ["groupon.com"]
    start_urls = [
        "http://www.groupon.com/chicago/all",
        "http://www.groupon.com/new-york/all"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="page_content clearfix"]')
        items = []
        for site in sites:
            item = DealsItem()
            item['deal1']       = site.select('//div[@class="c16_grid_8"]/a/@title').extract()
            item['deal1link']   = site.select('//div[@class="c16_grid_8"]/a/@href').extract()
            item['img1']        = site.select('//div[@class="c16_grid_8"]/a/img/@src').extract()
            item['deal2']       = site.select('//div[@class="c16_grid_8 last"]/a/@title').extract()
            item['deal2link']   = site.select('//div[@class="c16_grid_8 last"]/a/@href').extract()
            item['img2']        = site.select('//div[@class="c16_grid_8 last"]/a/img/@src').extract()
            items.append(item)
        return items


Edit: now that I understand the problem better. Should your parse() function look more like the following? That is, yield-ing one item at a time, instead of returning a list. I suspect the list you are returning is what is being stuffed ill-formatted into the one cell.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="page_content clearfix"]')
    for site in sites:
        item = DealsItem()
        item['deal1']       = site.select('//div[@class="c16_grid_8"]/a/@title').extract()
        item['deal1link']   = site.select('//div[@class="c16_grid_8"]/a/@href').extract()
        item['img1']        = site.select('//div[@class="c16_grid_8"]/a/img/@src').extract()
        item['deal2']       = site.select('//div[@class="c16_grid_8 last"]/a/@title').extract()
        item['deal2link']   = site.select('//div[@class="c16_grid_8 last"]/a/@href').extract()
        item['img2']        = site.select('//div[@class="c16_grid_8 last"]/a/img/@src').extract()
        yield item


Have a look at Item pipeline documentation: http://doc.scrapy.org/topics/item-pipeline.html

The u' represents unicode encoding. http://docs.python.org/howto/unicode.html

>>> s = 'foo'
>>> unicode(s)
u'foo'
>>> str(unicode(s))
'foo'
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜