
More efficient way to find & tar millions of files

I've got a job running on my server at the command line prompt for two days now:

find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} \;

It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well hashed directory structure.) But just running...

find data/ -name filepattern-*2009* -print > filesOfInterest.txt

...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks. That seems unreasonable. Is there a more efficient way to do this? Maybe with a more complicated bash script?

A secondary question is "why is my current approach so slow?"


One option is to use cpio to generate a tar-format archive:

$ find data/ -name "filepattern-*2009*" | cpio -ov --format=ustar > 2009.tar

cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.


If you already ran the second command that created the file list, just use the -T option to tell tar to read the file names from that saved list. Running 1 tar command vs. N tar commands will be a lot better.
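
For example, a minimal sketch of that approach (assuming the list was saved as filesOfInterest.txt, exactly as in the question):

# single tar invocation that reads every file name from the saved list
tar -uf 2009.tar -T filesOfInterest.txt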


Here's a find-tar combination that can do what you want without the use of xargs or exec (which should result in a noticeable speed-up):

tar --version    # tar (GNU tar) 1.14 

# FreeBSD find (on Mac OS X)
find -x data -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -

# for GNU find use -xdev instead of -x
gfind data -xdev -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -

# added: set permissions via tar
find -x data -name "filepattern-*2009*" -print0 | \
    tar --null --no-recursion --owner=... --group=... --mode=... -uf 2009.tar --files-from -


There is xargs for this:

find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar uf 2009.tar

Guessing why it is slow is hard, as there is not much information: what is the structure of the directory, what filesystem do you use, and how was it configured when it was created? Having millions of files in a single directory is a hard situation for most filesystems.
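
For example (a rough sketch, assuming GNU coreutils and findutils), two quick checks can show the filesystem type and how wide the top of the tree is:

# which filesystem data/ lives on
df -T data/

# number of entries directly under data/
find data/ -mindepth 1 -maxdepth 1 | wc -l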


To correctly handle file names with weird (but legal) characters (such as newlines, ...) you should write your file list to filesOfInterest.txt using find's -print0:

find -x data -name "filepattern-*2009*" -print0 > filesOfInterest.txt
tar --null --no-recursion -uf 2009.tar --files-from filesOfInterest.txt 


The way you currently have things, you are invoking the tar command every single time find locates a file, which is unsurprisingly slow. Instead of paying the two hours of search time plus the time it takes to open the tar archive, check whether the files are out of date, and add them to the archive, you are effectively multiplying those times together. You might have better success invoking the tar command once, after you have batched together all the names, possibly using xargs to achieve the invocation. By the way, I hope you are using 'filepattern-*2009*' and not filepattern-*2009*, as the stars will be expanded by the shell without the quotes.
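
For illustration, the batched, quoted form might look like this (essentially the xargs variant shown above; xargs may still split a very long list into a few tar runs, but the -u update mode keeps appending to the same archive):

find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar uf 2009.tar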


I was struggling with Linux for a long time before I found a much easier and potentially faster solution using Python's tarfile library.

  1. Use glob.glob to search for the desired filepaths
  2. Create a new archive in append mode
  3. Add each filepath to this archive
  4. Close the archive

Here is my code sample:

import tarfile
import glob
from tqdm import tqdm

# collect the paths to archive
filepaths = glob.glob("Images/7 *.jpeg")
n = len(filepaths)
print("{} files found.".format(n))
print("Creating Archive...")

# note: tarfile's append mode ("a") always writes an uncompressed tar,
# even though the file name here ends in .gz
out = tarfile.open("Images.tar.gz", mode="a")
for filepath in tqdm(filepaths, "Appending files to the archive..."):
  try:
    out.add(filepath)
  except Exception:
    print("Failed to add: {}".format(filepath))

print("Closing the archive...")
out.close()

This took a total of about 12 seconds to find 16222 filepaths and create the archive; however, this was predominantly taken up by simply searching for the filepaths. It took just 7 seconds to create the tar archive with 16000 filepaths. With some multithreading this could be much faster.

If you're looking for a multithreaded implementation, I've made one and placed it here:

import tarfile
import glob
from tqdm import tqdm
import threading

filepaths = glob.glob("Images/7 *.jpeg")
n = len(filepaths)
print("{} files found.".format(n))
print("Creating Archive...")
out = tarfile.open("Images.tar.gz", mode="a")

def add(filepath):
  try:
    out.add(filepath)
  except Exception:
    print("Failed to add: {}".format(filepath))

def add_multiple(filepaths):
  # each worker thread appends its own chunk of paths
  for filepath in filepaths:
    add(filepath)

max_threads = 16
filepaths_per_thread = 16

# each outer iteration covers one batch of max_threads * filepaths_per_thread paths
interval = max_threads * filepaths_per_thread

for i in tqdm(range(0, n, interval), "Appending files to the archive..."):
  # one thread per chunk of filepaths_per_thread paths in this batch
  threads = [threading.Thread(target=add_multiple, args=(filepaths[j:j + filepaths_per_thread],))
             for j in range(i, min(n, i + interval), filepaths_per_thread)]
  for thread in threads:
    thread.start()
  for thread in threads:
    thread.join()

print("Closing the archive...")
out.close()

Of course, you need to make sure that the values of max_threads and filepaths_per_thread are optimized; it takes time to create threads, so the time may actually increase for certain values. A final thing to note is that since we are using append mode, we are automatically creating a new archive with the designated name if one does not already exist. However, if one does already exist, it will simply add to the preexisting archive, not reset it or make a new one.


There is a utility for this called tarsplitter.

tarsplitter -m archive -i folder/*.json -o archive.tar -p 8

will use 8 threads to archive the files matching "folder/*.json" into an output archive, "archive.tar".

https://github.com/AQUAOSOTech/tarsplitter


Simplest (this also removes the files after they are archived):

find *.1  -exec tar czf '{}.tgz' '{}' --remove-files \;
