scrapy follow big XML feed links [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist

Closed 9 years ago.

I'm using Scrapy's XMLFeedSpider with an itertag to loop over a 300 MB XML feed.

Besides saving each entry in that big feed as an Item, each entry also has some further links to be crawled, this time links to HTML pages.

I understand that HTML pages are crawled using a CrawlSpider, so I'm trying to find a way to follow the links from the big XML feed using such a spider.
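
A minimal sketch of such a setup (the feed URL, element names, and item fields below are placeholders, not the real feed schema):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.item import Item, Field
from scrapy.contrib.spiders import XMLFeedSpider

class EntryItem(Item):
    # placeholder fields; the real feed defines its own schema
    title = Field()
    url = Field()

class BigFeedSpider(XMLFeedSpider):
    name = "bigfeed"
    start_urls = ["http://example.com/feed.xml"]  # placeholder feed URL
    itertag = "entry"  # the tag iterated over in the big feed

    def parse_node(self, response, node):
        # Called once per <entry>; saves each entry as an Item
        item = EntryItem()
        item['title'] = node.select('title/text()').extract()
        item['url'] = node.select('url/text()').extract()
        return item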

Thanks, Guy


First of all, read: http://readthedocs.org/docs/scrapy/en/latest/intro/tutorial.html

I created a project in Scrapy. Here is the code to fetch all URLs from that specific XML feed; it goes in your project's spiders/ directory.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector

class TestSpider(BaseSpider):
    name = "test"
    start_urls = ["http://fgeek.kapsi.fi/test.xml"]

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        # Extract the text of every <url> element in the feed
        for url in xxs.select('entries/entry/url/text()').extract():
            print url
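
To actually follow those URLs instead of just printing them, one option is to switch to an XMLFeedSpider and yield a Request for every link, with a separate callback for the HTML pages. This is only a sketch: it assumes the entries/entry/url layout from the example above, and the default iternodes iterator streams the feed node by node, so the whole 300 MB file is never held in memory at once.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider
from scrapy.selector import HtmlXPathSelector

class FeedFollowSpider(XMLFeedSpider):
    name = "feed_follow"
    start_urls = ["http://fgeek.kapsi.fi/test.xml"]
    itertag = 'entry'       # iterate over each <entry> node
    iterator = 'iternodes'  # default streaming iterator; suits very large feeds

    def parse_node(self, response, node):
        # For every <url> inside the current <entry>, schedule an HTML request
        for url in node.select('url/text()').extract():
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Ordinary HTML parsing happens here, as in any other spider;
        # replace the title extraction with real Item population.
        hxs = HtmlXPathSelector(response)
        title = hxs.select('//title/text()').extract()
        self.log("visited %s title=%r" % (response.url, title))

Run it with scrapy crawl feed_follow. The point is that a single spider can mix feed parsing and HTML parsing; a CrawlSpider is not required just because the followed pages are HTML.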
