开发者

Iterate within directory to zip files with python

I need to iterate through a folder and find every instance where the filenames are identical (except for extension) and then zip (preferably using tarfile) each of these into one file.

So I have 5 files named: "example1" each with different file extensions开发者_如何学C. I need to zip them up together and output them as "example1.tar" or something similar.

This would be easy enough with a simple for loop such as:

tar = tarfile.open('example1.tar',"w")

for output in glob ('example1*'):

tar.add(output)

tar.close()

however, there are 300 "example" files and I need to iterate through each one and their associated 5 files in order to make this work. This is way over my head. Any advice greatly appreciated.


The pattern you're describing generalizes to MapReduce. I found a simple implementation of MapReduce online, from which an even-simpler version is:

def map_reduce(data, mapper, reducer):
    d = {}
    for elem in data:
        key, value = mapper(elem)
        d.setdefault(key, []).append(value)
    for key, grp in d.items():
        d[key] = reducer(key, grp)
    return d

You want to group all files by their name without the extension, which you can get from os.path.splitext(fname)[0]. Then, you want to make a tarball out of each group by using the tarfile module. In code, that is:

import os
import tarfile

def make_tar(basename, files):
    tar = tarfile.open(basename + '.tar', 'w')
    for f in files:
        tar.add(f)
    tar.close()

map_reduce(os.listdir('.'),
           lambda x: (os.path.splitext(x)[0], x),
           make_tar)

Edit: If you want to group files in different ways, you just need to modify the second argument to map_reduce. The code above groups files that have the same value for the expression os.path.splitext(x)[0]. So to group by the base file name with all the extensions stripped off, you could replace that expression with strip_all_ext(x) and add:

def strip_all_ext(path):
    head, tail = os.path.split(path)
    basename = tail.split(os.extsep)[0]
    return os.path.join(head, basename)


You could do this:

  • list all files in the directory
  • create a dictionary where the basename is the key and all the extensions are values
  • then tar all the files by dictionary key

Something like this:

import os
import tarfile
from collections import defaultdict

myfiles = os.listdir(".")   # List of all files
totar = defaultdict(list)

# now fill the defaultdict with entries; basename as keys, extensions as values
for name in myfiles:
    base, ext = os.path.splitext(name)
    totar[base].append(ext)

# iterate through all the basenames
for base in totar:
    files = [base+ext for ext in totar[base]]
    # now tar all the files in the list "files"
    tar = tarfile.open(base+".tar", "w")
    for item in files:    
        tar.add(item)
    tar.close()


You have to problems. Solve the separately.

  1. Finding matching names. Use a collections.defaultict

  2. Creating tar files after you find the matching names. You've got that pretty well covered.

So. Solve problem 1 first.

Use glob to get all the names. Use os.path.basename to split the path and basename. Use os.path.splitext to split the name and extension.

A dictionary of lists can be used to save all files that have the same name.

Is that what you're doing in part 1?


Part 2 is putting the files into tar archives. For that, you've got most of the code you need.


Try using the glob module: http://docs.python.org/library/glob.html


#! /usr/bin/env python

import os
import tarfile

tarfiles = {}
for f in os.listdir ('files'):
    prefix = f [:f.rfind ('.') ]
    if prefix in tarfiles: tarfiles [prefix] += [f]
    else: tarfiles [prefix] = [f]

for k, v in tarfiles.items ():
    tf = tarfile.open ('%s.tar.gz' % k, 'w:gz')
    for f in v: tf.addfile (tarfile.TarInfo (f), file ('files/%s' % f) )
    tf.close ()


import os
import tarfile

allfiles = {}

for filename in os.listdir("."):
    basename = '.'.join (filename.split(".")[:-1] )
    if not basename in all_files:
        allfiles[basename] = [filename]
    else:
        allfiles[basename].append(filename)

for basename, filenames in allfiles.items():
    if len(filenames) < 2:
        continue
    tardata = tarfile.open(basename+".tar", "w")
    for filename in filenames:
        tardata.add(filename)
    tardata.close()
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜