
More efficient way to find & tar millions of files

I've got a job running on my server at the command line prompt for two days now:

find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} \;

It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well hashed directory structure.) But just running...

find data/ -name filepattern-*2009* -print > filesOfInterest.txt

...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks. That seems unreasonable. Is there a more efficient way to do this? Maybe with a more complicated bash script?

A secondary question is "why is my current approach so slow?"


One option is to use cpio to generate a tar-format archive:

$ find data/ -name "filepattern-*2009*" | cpio -ov --format=ustar > 2009.tar

cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.


If you already ran the second command that created the file list, just use the -T option to tell tar to read the file names from that saved list. Running 1 tar command vs. N tar commands will be a lot better.
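
For example, a minimal sketch of that approach (assuming the list was saved as filesOfInterest.txt, exactly as in the question):

# single tar invocation that reads every file name from the saved list
tar -uf 2009.tar -T filesOfInterest.txt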


Here's a find-tar combination that can do what you want without the use of xargs or exec (which should result in a noticeable speed-up):

tar --version    # tar (GNU tar) 1.14 

# FreeBSD find (on Mac OS X)
find -x data -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -

# for GNU find use -xdev instead of -x
gfind data -xdev -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -

# added: set permissions via tar
find -x data -name "filepattern-*2009*" -print0 | \
    tar --null --no-recursion --owner=... --group=... --mode=... -uf 2009.tar --files-from -


There is xargs for this:

find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar uf 2009.tar

Guessing why it is slow is hard, as there is not much information: what is the structure of the directory, what filesystem do you use, and how was it configured when it was created? Having millions of files in a single directory is a hard situation for most filesystems.
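
For example (a rough sketch, assuming GNU coreutils and findutils), two quick checks can show the filesystem type and how wide the top of the tree is:

# which filesystem data/ lives on
df -T data/

# number of entries directly under data/
find data/ -mindepth 1 -maxdepth 1 | wc -l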


To correctly handle file names with weird (but legal) characters (such as newlines, ...) you should write your file list to filesOfInterest.txt using find's -print0:

find -x data -name "filepattern-*2009*" -print0 > filesOfInterest.txt
tar --null --no-recursion -uf 2009.tar --files-from filesOfInterest.txt 


The way you currently have things, you are invoking the tar command every single time find locates a file, which is unsurprisingly slow. Instead of paying the two hours of search time plus the time it takes to open the tar archive, check whether the files are out of date, and add them to the archive, you are effectively multiplying those times together. You might have better success invoking the tar command once, after you have batched together all the names, possibly using xargs to achieve the invocation. By the way, I hope you are using 'filepattern-*2009*' and not filepattern-*2009*, as the stars will be expanded by the shell without the quotes.
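
For illustration, the batched, quoted form might look like this (essentially the xargs variant shown above; xargs may still split a very long list into a few tar runs, but the -u update mode keeps appending to the same archive):

find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar uf 2009.tar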


I was struggling with Linux for a long time before I found a much easier and potentially faster solution using Python's tarfile library.

  1. Use glob.glob to search for the desired filepaths
  2. Create a new archive in append mode
  3. Add each filepath to this archive
  4. Close the archive

Here is my code sample:

import tarfile
import glob
from tqdm import tqdm

# collect the paths to archive
filepaths = glob.glob("Images/7 *.jpeg")
n = len(filepaths)
print("{} files found.".format(n))
print("Creating Archive...")

# note: tarfile's append mode ("a") always writes an uncompressed tar,
# even though the file name here ends in .gz
out = tarfile.open("Images.tar.gz", mode="a")
for filepath in tqdm(filepaths, "Appending files to the archive..."):
  try:
    out.add(filepath)
  except Exception:
    print("Failed to add: {}".format(filepath))

print("Closing the archive...")
out.close()

This took a total of about 12 seconds to find 16222 filepaths and create the archive; however, this was predominantly taken up by simply searching for the filepaths. It took just 7 seconds to create the tar archive with 16000 filepaths. With some multithreading this could be much faster.

If you're looking for a multithreaded implementation, I've made one and placed it here:

import tarfile
import glob
from tqdm import tqdm
import threading

filepaths = glob.glob("Images/7 *.jpeg")
n = len(filepaths)
print("{} files found.".format(n))
print("Creating Archive...")
out = tarfile.open("Images.tar.gz", mode="a")

def add(filepath):
  try:
    out.add(filepath)
  except Exception:
    print("Failed to add: {}".format(filepath))

def add_multiple(filepaths):
  # each worker thread appends its own chunk of paths
  for filepath in filepaths:
    add(filepath)

max_threads = 16
filepaths_per_thread = 16

# each outer iteration covers one batch of max_threads * filepaths_per_thread paths
interval = max_threads * filepaths_per_thread

for i in tqdm(range(0, n, interval), "Appending files to the archive..."):
  # one thread per chunk of filepaths_per_thread paths in this batch
  threads = [threading.Thread(target=add_multiple, args=(filepaths[j:j + filepaths_per_thread],))
             for j in range(i, min(n, i + interval), filepaths_per_thread)]
  for thread in threads:
    thread.start()
  for thread in threads:
    thread.join()

print("Closing the archive...")
out.close()

Of course, you need to make sure that the values of max_threads and filepaths_per_thread are optimized; it takes time to create threads, so the time may actually increase for certain values. A final thing to note is that since we are using append mode, we are automatically creating a new archive with the designated name if one does not already exist. However, if one does already exist, it will simply add to the preexisting archive, not reset it or make a new one.


There is a utility for this called tarsplitter.

tarsplitter -m archive -i folder/*.json -o archive.tar -p 8

will use 8 threads to archive the files matching "folder/*.json" into an output archive, "archive.tar".

https://github.com/AQUAOSOTech/tarsplitter


Simplest (this also removes the files after they are archived):

find *.1  -exec tar czf '{}.tgz' '{}' --remove-files \;
