Bash: Symlink Following
I've got a file tree containing a bunch of data I've generated. I've decided that at several stages of the data generation, I'd like to try some different configuration options in the programs that get used.
My solution was duplicating the data tree, and symlinking to all of the original data (multiplied by the number of new tests that I'm running). Then I'd let the programs clobber the symlinks as needed. The result would be symlinks to the original tree for data that didn't get affected by my new configurations, and real data for anything new.
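For concreteness, a parallel tree of symlinks like that can be built with GNU cp's symlink mode (original/ and run1/ are placeholder names here; cp -s needs an absolute source path to produce valid links):

cp -rs "$PWD/original" run1    # every file under run1/ becomes a symlink into original/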
The problem is that the -clobber option on most of the programs I use follows symlinks, so it in fact clobbered my original data. Is there anything I could try (maybe something like bash environment settings?) that might make all of these programs clobber the actual symlink, rather than the data it points at?
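The behaviour is easy to reproduce in a shell (data.txt and link.txt are just illustrative names):

echo original > data.txt
ln -s data.txt link.txt
echo new > link.txt    # the truncating open follows the symlink...
cat data.txt           # ...so this now prints "new"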
This is probably not possible: the option likely works by simply opening the file in 'truncate' mode; to replace a symlink it would actually have to delete the file (the symlink) beforehand as a separate step. You could try setting the permissions on the original data files to disallow writing (chmod -w), but that might simply stop the programs from working.
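To illustrate with the data.txt/link.txt example from the question, removing write permission makes the truncating open fail outright (for non-root users) rather than fall back to replacing the link:

chmod -w data.txt
echo new > link.txt    # bash: link.txt: Permission denied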
Do the programs check whether the new output is the same as the old before writing? If not, the symlinks don't seem to offer any advantage, since the data store will end up fully duplicated in the course of running the analysis anyway.
Are you able to change the way that the programs write their output? The following sequence will avoid following the symlink (see the sketch after this list):
- write the new output to out.tmp
- if the old output out and out.tmp are identical, delete out.tmp
- otherwise, move out.tmp on top of out
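A minimal sketch of that sequence in bash, assuming the tool can be pointed at a temporary output path (run_analysis here is a placeholder for whatever actually produces the file):

run_analysis > out.tmp
if cmp -s out out.tmp; then
    rm out.tmp       # identical output: the symlink to the original survives
else
    mv out.tmp out   # rename replaces the symlink itself, not its target
fi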
How big is the data? Is it big enough that it's worth the effort to optimize the storage requirement? You can always save the full output, and run some analysis (in the simplest case, diff) after the fact to see if the data is the same.
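For example, with two complete trees on disk, something like this lists only the files that actually differ (original/ and run1/ are placeholder names):

diff -rq original/ run1/    # -r recurses, -q just names differing files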
If it is many GBs of data, you may want to look into a filesystem that will optimize the duplicate data for you (the feature is known as "de-duplication"). Or, you can use LVM's snapshot support, which allows cheap copy-on-write snapshots of a filesystem.
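For the LVM route, a snapshot looks roughly like this (the volume names and size are placeholders; the snapshot consumes space only for blocks that later diverge):

lvcreate --snapshot --size 5G --name run1-snap /dev/vg0/data
mkdir -p /mnt/run1 && mount /dev/vg0/run1-snap /mnt/run1    # run the new configuration here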
As a poor-man's de-duplicating file system, you can do something like this in bash:
for file in $output_files; do          # word splitting on the list is intentional
    # Key each unique file by its MD5 hash under db/.
    md5=$(md5sum "$file" | awk '{print $1}')
    if [ ! -f "db/$md5" ]; then
        mv "$file" "db/$md5"
    fi
    # Use an absolute link target so it resolves no matter where
    # $file sits relative to db/ (assumes db/ lives under $PWD).
    ln -sf "$PWD/db/$md5" "$file"
done
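A variant is ln without -s (hard links into db/), which avoids dangling links if db/ later moves, but note that hard links share the same truncate-through-the-link hazard the question started with: a program that truncates one name clobbers the shared data.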