
How to remove expired items from database with Scrapy

I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items.

Strategies to detect if an item is expired are:

  1. Spider the site's "delete.rss".
  2. Every few days, try reloading the contents page and making sure it still works.
  3. Spider every page of the site's content indexes, and remove the video if it's not found.

Please let me know how to remove expired items in Scrapy. I will be storing my Scrapy items in a MySQL DB via Django.

2010-01-18 Update

I have found a solution that works, but it may not be optimal. I maintain a "found_in_last_scan" flag on every video that I sync. When the spider starts, it sets all the flags to False; when it finishes, it deletes the videos whose flag is still False. I did this by attaching to the signals.spider_opened and signals.spider_closed signals. Please confirm that this is a valid strategy and that there are no problems with it.
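
Roughly, the signal hookup looks like this (Video, found_in_last_scan, and the myapp module are placeholders for my actual Django model; the dispatcher-style hookup matches the Scrapy version I'm on):

from scrapy import signals
from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher

from myapp.models import Video  # placeholder Django model with a found_in_last_scan BooleanField

class VideoSpider(BaseSpider):
    def __init__(self, *args, **kwargs):
        super(VideoSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        # mark everything "not seen yet" before the crawl begins
        Video.objects.update(found_in_last_scan=False)

    def spider_closed(self, spider):
        # anything still unflagged was not found in this crawl, so it has expired
        Video.objects.filter(found_in_last_scan=False).delete()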


I haven't tested this!
I have to confess that I haven't tried using the Django models in Scrapy, but here goes:

The simplest way I can imagine would be to create a new spider for the deleted.rss file by extending XMLFeedSpider (the example below is copied from the Scrapy documentation and then modified). I suggest creating a new spider because very little of the following logic is related to the logic used for scraping the site:

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import DeletedUrlItem

class MySpider(XMLFeedSpider):
    domain_name = 'example.com'
    start_urls = ['http://www.example.com/deleted.rss']
    iterator = 'iternodes' # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        item = DeletedUrlItem()
        # '#path/to/url' is a placeholder -- adjust the XPath to the feed's structure
        item['url'] = node.select('#path/to/url').extract()

        return item # return an Item

SPIDER = MySpider()

This is not a working spider for you to use as-is, but IIRC the RSS files are pure XML. I'm not sure what deleted.rss looks like, but I'm sure you can figure out how to extract the URLs from the XML. This example imports myproject.items.DeletedUrlItem, which is just an item holding a URL; you need to create the DeletedUrlItem yourself, using something like the code below:

from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()

Instead of saving, you delete the items using Django's Model API in a Scrapy ItemPipeline - I assume you're using a DjangoItem:

# we raise a DropItem exception so Scrapy
# doesn't try to process the item any further
from scrapy.core.exceptions import DropItem

# import your model (the app and model names here are placeholders)
from yourapp.models import YourModel

class DeleteUrlPipeline(object):

    def process_item(self, spider, item):
        if item['url']:
            delete_item = YourModel.objects.get(url=item['url'])
            delete_item.delete() # actually delete the item!
            raise DropItem("Deleted: %s" % item)
        return item

Notice the delete_item.delete().
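
The pipeline also has to be enabled in your project settings. Assuming it lives in myproject/pipelines.py (that path is an assumption about your layout), something like this should do it; note that newer Scrapy versions expect a dict mapping the class path to an order number rather than a plain list:

# settings.py
ITEM_PIPELINES = ['myproject.pipelines.DeleteUrlPipeline']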


I'm aware that this answer may contain errors, since it's written from memory :-) but I will definitely update it if you've got comments or cannot figure this out.


If you have an HTTP URL which you suspect might not be valid any more (because you found it in a "deleted" feed, or just because you haven't checked it in a while), the simplest, fastest way to check is to send an HTTP HEAD request for that URL. In Python, that's best done with the httplib module of the standard library: make a connection object c to the host of interest with HTTPConnection (if HTTP 1.1, it may be reusable to check multiple URLs with better performance and lower system load), then do one (or more, if feasible, i.e. if HTTP 1.1 is in use) calls of c's request method, with 'HEAD' as the first argument and the URL you're checking (without the host part, of course;-) as the second.

After each request you call c.getresponse() to get an HTTPResponse object, whose status attribute will tell you if the URL is still valid.
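
A minimal sketch of that check, assuming Python 2's httplib (the host and path are placeholders):

import httplib  # the equivalent module is http.client on Python 3

def url_still_valid(host, path):
    c = httplib.HTTPConnection(host)  # one connection can check several paths on the same host
    c.request('HEAD', path)           # HEAD returns status and headers only, no body
    status = c.getresponse().status
    c.close()
    return status < 400               # 2xx/3xx -> still there; 404/410/5xx -> likely gone

# usage: url_still_valid('www.example.com', '/videos/12345')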

Yes, it's a bit low-level, but exactly for this reason it lets you optimize your task a lot better, with just a little knowledge of HTTP;-).
