How to tag a scientific data processing tool to ensure repeatability

2023-01-06 21:12 问答作者：

we develop a data processing tool to extract some scientific results out of a given set of raw data. In data science it is very important that you can re-obtain your results and repeat the calculations, that led to a result set

Since the tool is evolving, we need a way to find out which revision/build of our tool generated a given result set and how to find the corresponding source from which the tool was build.

The tool is written in C++ and Python; gluing together the C++ parts using Boost::Python. We use CMake as a build system generating Make files for Linux. Currently the project is stored in a subversion repo, but some of us already use git resp. hg and we are planning to migrate the whole project to one of them in the very near future.

What are the best practices in a scenario like this to get a unique mapping between source code, binary and result set?

Ideas we are already discussing:

Somehow injecting the global revision number
Using a build number generator
Storing the whole sourcecode inside the executable its开发者_如何学JAVAelf

This is a problem I spend a fair amount of time working on. To what @VonC has already written let me add a few thoughts.

I think that the topic of software configuration management is well understood and often carefully practiced in commercial environments. However, this general approach is often lacking in scientific data processing environments many of which either remain in, or have grown out of, academia. However, if you are in such a working environment, there are readily available sources of information and advice and lots of tools to help. I won't expand on this further.

I don't think that your suggestion of including the whole source code in an executable is, even if feasible, necessary. Indeed, if you get SCM right then one of the essential tests that you have done so, and continue to do so, is your ability to rebuild 'old' executables on demand. You should also be able to determine which revision of sources were used in each executable and version. These ought to make including the source code in an executable unnecessary.

The topic of tying result sets in to computations is also, as you say, essential. Here are some of the components of the solution that we are building:

We are moving away from the traditional unstructured text file that is characteristic of the output of a lot of scientific programs towards structured files, in our case we're looking at HDF5 and XML, in which both the data of interest and the meta-data is stored. The meta-data includes the identification of the program (and version) which was used to produce the results, the identification of the input data sets, job parameters and a bunch of other stuff.

We looked at using a DBMS to store our results; we'd like to go this way but we don't have the resources to do it this year, probably not next either. But businesses use DBMSs for a variety of reasons, and one of the reasons is their ability to roll-back, to provide an audit trail, that sort of thing.

We're also looking closely at which result sets need to be stored. A nice approach would be only ever to store original data sets captured from our field sensors. Unfortunately some of our computations take 1000s of CPU-hours to produce so it is infeasible to reproduce them ab-initio on demand. However, we will be storing far fewer intermediate data sets in future than we have in the past.

We are also making it much harder (I'd like to think impossible but am not sure we are there yet) for users to edit result sets directly. Once someone does that all the provenance information in the world is wrong and useless.

Finally, if you want to read more about the topic, try Googling for 'scientific workflow' and 'data provenance' similar topics.

EDIT: It's not clear from what I wrote above, but we have modified our programs so that they contain their own identification (we use Subversion's keyword capabilities for this with an extension or two of our own) and write this into any output that they produce.

You need to consider git submodules of hg subrepos.

The best practice in this scenario os to have a parent repo which will reference:

the sources of the tool
the result set generated from that tool
ideally the c++ compiler (won't evolve every day)
ideally the python distribution (won't evolve every day)

Each of those are a component, that is an independent repository (Git or Mercurial).
One precise revision of each component will be reference by a parent repository.

The all process is representative of a component-based approach, and is key in using an SCM (here Software Configuration Management) at its fullest.

继续阅读：build-process cmake scientific-computing version-control

How to tag a scientific data processing tool to ensure repeatability

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？