scrapy: follow links from a big XML feed
Closed 9 years ago.
I'm using scrapy's XMLFeedSpider with an itertag to loop over a 300 MB XML feed.
Besides saving each entry in that big feed as an Item, each entry also contains further links to be crawled, this time links to HTML pages.
I understand that HTML pages are crawled using a CrawlSpider, so I'm trying to find a way to follow the links from the big XML feed with such a spider.
Thanks, Guy
First of all, read: http://readthedocs.org/docs/scrapy/en/latest/intro/tutorial.html
I created a project in scrapy. Here is the code to fetch all URLs from that specific XML feed. Put it in the project's spiders directory.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector

class TestSpider(BaseSpider):
    name = "test"
    start_urls = ["http://fgeek.kapsi.fi/test.xml"]

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        # text() selects the URL strings themselves, not the <url> elements
        for url in xxs.select('entries/entry/url/text()').extract():
            print url
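In scrapy you would typically yield a Request(url, callback=...) for each extracted URL instead of printing it, so the HTML pages get crawled too. Independently of scrapy, here is a minimal sketch of the same extraction done in a streaming fashion with the standard library, which matters for a 300 MB feed. It assumes the same entries/entry/url layout as the test feed above; the sample data and function name are mine, not from the original answer.

```python
# Stream-parse a large feed: iterparse never holds the whole document
# in memory, and clearing each finished <entry> keeps usage flat.
import xml.etree.ElementTree as ET
from io import BytesIO

def iter_entry_urls(source):
    """Yield the text of every <url> element as it is parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "url":
            yield elem.text
        elif elem.tag == "entry":
            elem.clear()  # free the already-processed entry

# Small in-memory sample standing in for the 300 MB feed
feed = BytesIO(
    b"<entries>"
    b"<entry><url>http://example.com/a.html</url></entry>"
    b"<entry><url>http://example.com/b.html</url></entry>"
    b"</entries>"
)
for url in iter_entry_urls(feed):
    print(url)
```

Each URL produced this way is what you would hand to a Request (or to a second spider) to crawl the linked HTML page.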