Bash: Symlink Following
I've got a file tree containing a bunch of data I've generated. I've decided that at several stages of the data generation, I'd like to try some different configuration options in the programs that get used.
My solution was duplicating the data tree, and symlinking to all of the original data (multiplied by the number of new tests that I'm running). Then I'd let the programs clobber the symlinks as needed. The result would be symlinks to the original tree for data that didn't get affected by my new configurations, and real data for anything new.
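For concreteness, a parallel tree of symlinks like that can be built with GNU cp's symlink mode (original/ and run1/ are placeholder names here; cp -s needs an absolute source path to produce valid links):

cp -rs "$PWD/original" run1    # every file under run1/ becomes a symlink into original/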
The problem is that the -clobber option on most of the programs I use follows symlinks, so it in fact clobbered my original data. Is there anything I could try (maybe something like bash environment settings?) that might make all of these programs clobber the actual symlink, rather than the data it points at?
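The behaviour is easy to reproduce in a shell (data.txt and link.txt are just illustrative names):

echo original > data.txt
ln -s data.txt link.txt
echo new > link.txt    # the truncating open follows the symlink...
cat data.txt           # ...so this now prints "new"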
This is probably not possible: the option likely works by simply opening the file in 'truncate' mode; to replace a symlink it would actually have to delete the file (the symlink) beforehand as a separate step. You could try setting the permissions on the original data files to disallow writing (chmod -w), but that might simply stop the programs from working.
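To illustrate with the data.txt/link.txt example from the question, removing write permission makes the truncating open fail outright (for non-root users) rather than fall back to replacing the link:

chmod -w data.txt
echo new > link.txt    # bash: link.txt: Permission denied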
Do the programs check whether the new output is the same as the old before writing? If not, the symlinks don't seem to offer any advantage, since the data store will end up fully duplicated in the course of running the analysis anyway.
Are you able to change the way that the programs write their output? The following sequence will avoid following the symlink (see the sketch after this list):
- write the new output to out.tmp
- if the old output out and out.tmp are identical, delete out.tmp
- otherwise, move out.tmp on top of out
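A minimal sketch of that sequence in bash, assuming the tool can be pointed at a temporary output path (run_analysis here is a placeholder for whatever actually produces the file):

run_analysis > out.tmp
if cmp -s out out.tmp; then
    rm out.tmp       # identical output: the symlink to the original survives
else
    mv out.tmp out   # rename replaces the symlink itself, not its target
fi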
How big is the data? Is it big enough that it's worth the effort to optimize the storage requirement? You can always save the full output, and run some analysis (in the simplest case, diff) after the fact to see if the data is the same.
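For example, with two complete trees on disk, something like this lists only the files that actually differ (original/ and run1/ are placeholder names):

diff -rq original/ run1/    # -r recurses, -q just names differing files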
If it is many GBs of data, you may want to look into a filesystem that will optimize the duplicate data for you (the feature is known as "de-duplication"). Or, you can use LVM's snapshot support, which allows cheap copy-on-write snapshots of a filesystem.
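For the LVM route, a snapshot looks roughly like this (the volume names and size are placeholders; the snapshot consumes space only for blocks that later diverge):

lvcreate --snapshot --size 5G --name run1-snap /dev/vg0/data
mkdir -p /mnt/run1 && mount /dev/vg0/run1-snap /mnt/run1    # run the new configuration here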
As a poor-man's de-duplicating file system, you can do something like this in bash:
for file in $output_files; do          # word splitting on the list is intentional
    # Key each unique file by its MD5 hash under db/.
    md5=$(md5sum "$file" | awk '{print $1}')
    if [ ! -f "db/$md5" ]; then
        mv "$file" "db/$md5"
    fi
    # Use an absolute link target so it resolves no matter where
    # $file sits relative to db/ (assumes db/ lives under $PWD).
    ln -sf "$PWD/db/$md5" "$file"
done
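A variant is ln without -s (hard links into db/), which avoids dangling links if db/ later moves, but note that hard links share the same truncate-through-the-link hazard the question started with: a program that truncates one name clobbers the shared data.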