
Bash: Symlink Following

I've got a file tree containing a bunch of data I've generated. I've decided that at several stages of the data generation, I'd like to try some different configuration options in the programs that get used.

My solution was duplicating the data tree, and symlinking to all of the original data (multiplied by the number of new tests that I'm running). Then I'd let the programs clobber away the symlinks as needed. The result would be symlinks to the original tree for data that didn't get affected by my new configurations, and real data for anything new.

The problem is that the -clobber option on most of the programs I use follows symlinks, so it in fact clobbered my original data. Is there anything I could try (maybe something like bash environment settings?) that might make all of these programs clobber the actual symlink, rather than the data it points at?


This is probably not possible. The option most likely works by simply opening the file in truncate mode; to replace the symlink itself, a program would have to delete it first with a separate call (unlink) before creating a new file. You could try removing write permission (chmod -w) from the original data files so they can't be written through the symlink, but that might simply cause the programs to fail.
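A quick way to test that workaround before trusting it: strip the write bits from a file and confirm the mode actually changed. This is only a sketch; `original.dat` is a made-up filename, and note that root bypasses permission checks entirely.

```shell
#!/bin/sh
# Sketch of the chmod workaround: remove all write bits from the
# original file, so a program opening it in truncate mode gets EACCES
# instead of clobbering the data. Root is not stopped by this.
set -e

printf 'precious\n' > original.dat   # stand-in for one original data file
chmod a-w original.dat               # same effect as chmod -w for all classes

stat -c '%A' original.dat            # mode now has no 'w' bits, e.g. -r--r--r--
```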


Do the programs check whether the new output is the same as the old before writing? If not, then it doesn't seem like the symlinks will offer any advantage, since every symlink will be replaced by a real file in the course of running the analysis anyway.

Are you able to change the way that the programs write their output? The following sequence will avoid following the symlink:

  1. write the new output to out.tmp
  2. if the old output out and out.tmp are identical, delete out.tmp
  3. otherwise, move out.tmp on top of out
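The sequence above can be sketched in shell; `out` and `out.tmp` are the hypothetical paths from the steps. The key point is that mv uses rename(2), which replaces the destination name itself, so a symlink at `out` is overwritten rather than followed:

```shell
#!/bin/sh
# Sketch of the write/compare/move sequence from the steps above.
set -e

out="out"
tmp="out.tmp"

printf 'old data\n' > "$out"    # pretend this is the previous output
printf 'new data\n' > "$tmp"    # step 1: write the new output to out.tmp

if cmp -s "$out" "$tmp"; then   # step 2: identical? keep the old file
    rm "$tmp"
else                            # step 3: otherwise replace out; mv swaps
    mv "$tmp" "$out"            #         the name, never following a symlink
fi

cat "$out"
```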

How big is the data? Is it big enough that it's worth the effort to optimize the storage requirement? You can always save the full output, and run some analysis (in the simplest case, diff) after the fact to see if the data is the same.

If it is many GBs of data, you may want to look in to a filesystem that will optimize the duplicate data for you (the feature is known as "de-duplication"). Or, you can use LVM's snapshot support, which allows cheap copy-on-write snapshots of a filesystem.
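For the LVM route, one snapshot per test run looks roughly like the following. This is a sketch only: the volume group `vg0`, logical volume `data`, and snapshot name `run1` are made up, and the snapshot size depends on how many blocks each run actually rewrites.

```shell
# Create a 1 GB copy-on-write snapshot of /dev/vg0/data for one test run.
lvcreate --snapshot --name run1 --size 1G /dev/vg0/data

# Mount it and point the pipeline at the snapshot instead of the original.
mkdir -p /mnt/run1
mount /dev/vg0/run1 /mnt/run1

# ... run the analysis against /mnt/run1 ...

# Tear the snapshot down once the run's results have been harvested.
umount /mnt/run1
lvremove /dev/vg0/run1
```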

As a poor-man's de-duplicating file system, you can do something like this in bash:

mkdir -p db                              # content-addressed store
for file in $output_files; do
    md5=$(md5sum "$file" | awk '{print $1}')
    if [ ! -f "db/$md5" ]; then
        mv "$file" "db/$md5"             # first copy of this content wins
    fi
    # Note: the link target is resolved relative to the symlink's own
    # directory, so this assumes $file lives in the current directory.
    ln -sf "db/$md5" "$file"
done