
Scrapy image download: how to use a custom filename

For my Scrapy project I'm currently using the ImagesPipeline. The downloaded images are stored with a SHA1 hash of their URLs as the file names.

How can I store the files using my own custom file names instead?

What if my custom file name needs to contain another scraped field from the same item? For example, use item['desc'] as the file name for the image downloaded from item['image_url']. If I understand correctly, that would involve somehow accessing the other item fields from the image pipeline.

Any help will be appreciated.


This is just an update of the answer for Scrapy 0.24 (edited), where image_key() is deprecated:

from scrapy.http import Request
# Scrapy 0.24 import path; from Scrapy 1.0 on it is scrapy.pipelines.images
from scrapy.contrib.pipeline.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    # Name of the full-size download
    def file_path(self, request, response=None, info=None):
        # item = request.meta['item']  # If the item is passed in meta, any field can be used, not just the URL.
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % (image_guid)

    # Name of the thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        # Note: response can be None when this is called
        image_guid = thumb_id + response.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        # To pass the item along: yield Request(image, meta={'item': item})
        for image in item['images']:
            yield Request(image)


In Scrapy 0.12 I solved it like this:

class MyImagesPipeline(ImagesPipeline):

    #Name download version
    def image_key(self, url):
        image_guid = url.split('/')[-1]
        return 'full/%s.jpg' % (image_guid)

    #Name thumbnail version
    def thumb_key(self, url, thumb_id):
        image_guid = thumb_id + url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        yield Request(item['images'])


I found a way in 2017, with Scrapy 1.1.3:

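from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):  # class name is illustrative
    def file_path(self, request, response=None, info=None):
        # Use the name stored in the request meta by get_media_requests()
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        img_url = item['img_url']
        meta = {'filename': item['name']}
        yield Request(url=img_url, meta=meta)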

As in the code above, you can add the name you want to the Request meta in get_media_requests() and read it back in file_path() with request.meta.get('yourname', '').
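
One thing this approach leaves open is that a scraped field such as item['name'] may contain characters that are awkward in a file name, and it usually has no extension. Here is a minimal sketch of a helper that could be called from file_path(), assuming Python 3. The name build_image_path, the 'full/' prefix and the '.jpg' fallback are my own choices for illustration, not part of Scrapy.

```python
import os
import re
from urllib.parse import urlparse


def build_image_path(name, url, default_ext=".jpg"):
    """Return a path like 'full/<safe name><ext>' (hypothetical helper)."""
    # Take the extension from the URL path, ignoring any query string.
    ext = os.path.splitext(urlparse(url).path)[1].lower() or default_ext
    # Collapse runs of anything other than letters, digits, '_', '.' or '-'.
    safe = re.sub(r"[^\w.-]+", "_", name).strip("._")
    # Fall back to a fixed name if nothing usable is left.
    return "full/%s%s" % (safe or "unnamed", ext)
```

In file_path() this would be used as: return build_image_path(request.meta.get('filename', ''), request.url). Two items with the same name would still map to the same file, so you may want to add something unique such as an ID.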


This is how I solved the problem in Scrapy 0.10. Check the persist_image method of FSImagesStoreChangeableDirectory. The file name of the downloaded image is key.

class FSImagesStoreChangeableDirectory(FSImagesStore):

    def persist_image(self, key, image, buf, info,append_path):

        absolute_path = self._get_filesystem_path(append_path+'/'+key)
        self._mkdir(os.path.dirname(absolute_path), info)
        image.save(absolute_path)

class ProjectPipeline(ImagesPipeline):

    def __init__(self):
        super(ImagesPipeline, self).__init__()
        store_uri = settings.IMAGES_STORE
        if not store_uri:
            raise NotConfigured
        self.store = FSImagesStoreChangeableDirectory(store_uri)


I did a nasty quick hack for that. In my case, I stored the title of the image in my feed, and I had only one entry in image_urls per item, so I wrote the following script. It renames the image files in the /images/full/ directory to the corresponding title from the item feed, which I had stored as JSON.

import os
import json

# Directory holding the downloaded images, and the JSON item feed.
img_dir = os.path.join(os.getcwd(), 'images', 'full')
item_file = os.path.join(os.getcwd(), 'data.json')

with open(item_file, 'r') as item_json:
    items = json.load(item_json)

for item in items:
    if len(item['images']) > 0:
        # Stored path looks like 'full/<hash>.jpg'; keep only the file name.
        cur_file = item['images'][0]['path'].split('/')[-1]
        cur_format = cur_file.split('.')[-1]
        new_title = item['title'] + '.%s' % cur_format
        file_path = os.path.join(img_dir, cur_file)
        os.rename(file_path, os.path.join(img_dir, new_title))

It's nasty and not recommended, but it is a naive alternative approach.
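
If you do go the rename route, it may help to work out the renames first and only then touch the disk, so that two items with the same title don't silently overwrite each other. Here is a minimal sketch, assuming the same feed layout as in the script (each item has a 'title' and an 'images' list whose entries have a 'path'). plan_renames is a hypothetical helper name.

```python
import os


def plan_renames(items, img_dir):
    """Return (old_path, new_path) pairs without touching the disk."""
    pairs = []
    seen = set()
    for item in items:
        images = item.get("images") or []
        if not images:
            continue
        # Keep only the file name from the stored relative path.
        cur_file = os.path.basename(images[0]["path"])
        ext = os.path.splitext(cur_file)[1]
        new_name = item["title"] + ext
        # Skip titles that would overwrite an earlier image.
        if new_name in seen:
            continue
        seen.add(new_name)
        pairs.append((os.path.join(img_dir, cur_file),
                      os.path.join(img_dir, new_name)))
    return pairs
```

The actual renaming is then a loop: for old, new in plan_renames(items, img_dir): os.rename(old, new).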


I rewrote the code, replacing "response." with "request." in thumb_path. Otherwise it won't work, because response is set to None.

class MyImagesPipeline(ImagesPipeline):

    # Name of the full-size download
    def file_path(self, request, response=None, info=None):
        # item = request.meta['item']  # If the item is passed in meta, any field can be used, not just the URL.
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % (image_guid)

    # Name of the thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        # To pass the item along: yield Request(image, meta={'item': item})
        for image in item['images']:
            yield Request(image)