
Reproducibility in scientific programming

Along with producing incorrect results, one of the worst fears in scientific programming is not being able to reproduce the results you've generated. What best practices help ensure your analysis is reproducible?


  • Publish the original raw data online and make it freely available for download.
  • Make the code base open source and available online for download.
  • If randomization is used in optimization, either repeat the optimization several times and choose the best result, or use a fixed random seed so that the same results are reproduced.
  • Before performing your analysis, you should split the data into a "training/analysis" dataset and a "testing/validation" dataset. Perform your analysis on the "training" dataset, and make sure that the results you get still hold on the "validation" dataset, to ensure that your analysis is actually generalizable and isn't simply memorizing peculiarities of the dataset in question (a minimal sketch of such a split follows this list).
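
Here is a minimal sketch of such a split, assuming the data is a NumPy array of rows; the 80/20 ratio, the fixed seed, and the file name are illustrative choices, not anything prescribed above.

```python
import numpy as np

def split_train_validation(data, validation_fraction=0.2, rng_seed=0):
    """Shuffle the rows once and split into training and validation sets."""
    rng = np.random.default_rng(rng_seed)      # fixed seed so the split itself is reproducible
    indices = rng.permutation(len(data))       # random ordering of row indices
    n_validation = int(len(data) * validation_fraction)
    validation_idx = indices[:n_validation]
    training_idx = indices[n_validation:]
    return data[training_idx], data[validation_idx]

data = np.loadtxt("measurements.csv", delimiter=",")   # hypothetical input file
train, validation = split_train_validation(data)
# ... develop the analysis on `train`, then confirm the conclusions still hold on `validation`.
```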

The first two points are incredibly important, because making the dataset available allows others to perform their own analyses on the same data, which increases confidence in the validity of your own analyses. Additionally, making the dataset available online -- especially if you use linked data formats -- makes it possible for crawlers to aggregate your dataset with other datasets, enabling analyses with much larger sample sizes. In many types of research the sample size is often too small to be really confident about the results, but sharing your dataset makes it possible to construct very large combined datasets. And someone else could use your dataset to validate an analysis they performed on different data.

Additionally, making your code open source makes it possible for the code and procedure to be reviewed by your peers. Often such reviews lead to the discovery of flaws, or of opportunities for additional optimization and improvement. Most importantly, it allows other researchers to improve on your methods without having to implement everything you have already done from scratch. It greatly accelerates the pace of research when researchers can focus on improvements rather than on reinventing the wheel.

As for randomization... many algorithms rely on randomization to achieve their results. Stochastic and Monte Carlo methods are quite common, and while they have been proven to converge for certain cases, it is still possible to get different results from run to run. One way to ensure that you get consistent results is to have a loop in your code that invokes the computation some fixed number of times and chooses the best result. With enough repetitions you can expect to find global or near-global optima instead of getting stuck in local optima. Another possibility is to use a predetermined seed, although that is not, IMHO, as good an approach, since you could pick a seed that causes you to get stuck in a local optimum. In addition, there is no guarantee that random number generators on different platforms will generate the same results for the same seed value.
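
A minimal sketch of the restart approach, using a made-up one-dimensional objective; the function, the search range, and the restart count are all illustrative, not part of any particular method mentioned above.

```python
import numpy as np

def objective(x):
    """A bumpy function with many local minima (purely illustrative)."""
    return np.sin(5 * x) + 0.1 * x**2

def random_restart_minimize(n_restarts=20, n_steps=1000, seed=12345):
    """Run a crude random search several times and keep the best result.

    Seeding each restart from a master seed keeps the whole procedure
    repeatable while still exploring different starting points."""
    master_rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(n_restarts):
        rng = np.random.default_rng(master_rng.integers(2**32))
        x = rng.uniform(-10, 10)                  # random starting point
        for _ in range(n_steps):
            candidate = x + rng.normal(scale=0.1) # small random step
            if objective(candidate) < objective(x):
                x = candidate
        if objective(x) < best_f:
            best_x, best_f = x, objective(x)
    return best_x, best_f

print(random_restart_minimize())
```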


I'm a software engineer embedded in a team of research geophysicists and we're currently (as always) working to improve our ability to reproduce results upon demand. Here are a few pointers gleaned from our experience:

  1. Put everything under version control: source code, input data sets, makefiles, etc
  2. When building executables: we embed the compiler directives in the executables themselves, tag the build log with a UUID and tag the executable with the same UUID, compute checksums for the executables, autobuild everything, and auto-update a database (OK, it's just a flat file really) with the build details, etc. (A sketch of part of this appears after the list.)
  3. We use Subversion's keywords to include revision numbers (etc) in every piece of source, and these are written into any output files generated.
  4. We do lots of (semi-)automated regression testing to ensure that new versions of code, or new build variants, produce the same (or similar enough) results, and I'm working on a bunch of programs to quantify the changes which do occur.
  5. My geophysicist colleagues analyse the programs' sensitivity to changes in inputs. I analyse their (the codes', not the geos') sensitivity to compiler settings, to platform, and such like.
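
A rough sketch of the checksum/flat-file part of point 2 (it does not show embedding the UUID into the binary itself); the file names and record format are made up for illustration, not our actual tooling.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

def record_build(executable_path, build_log_path, database_path="builds.jsonl"):
    """Tag a build with a UUID, checksum the executable, and append the
    details to a flat-file 'database' (one JSON record per line)."""
    build_id = str(uuid.uuid4())
    checksum = hashlib.sha256(Path(executable_path).read_bytes()).hexdigest()
    record = {
        "build_id": build_id,
        "executable": str(executable_path),
        "sha256": checksum,
        "build_log": str(build_log_path),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(database_path, "a") as db:
        db.write(json.dumps(record) + "\n")
    return build_id

# Hypothetical usage after the autobuild step:
# record_build("bin/migrate_seismic", "logs/migrate_seismic.build.log")
```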

We're currently working on a workflow system which will record details of every job run: input datasets (including versions), output datasets, program (incl version and variant) used, parameters, etc -- what is commonly called provenance. Once this is up and running the only way to publish results will be by use of the workflow system. Any output datasets will contain details of their own provenance, though we haven't done the detailed design of this yet.
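
As a very rough illustration of what such a provenance record might contain -- every field name here is invented for the example, not our actual design, which as noted isn't finished:

```python
import json

# One job-run record; all values are illustrative.
provenance = {
    "job_id": "a3f1c2e4",                       # hypothetical identifier
    "program": {"name": "wave_migrate", "version": "r4812", "variant": "openmp"},
    "inputs": [
        {"dataset": "survey_2009_raw", "version": "r4790"},
        {"dataset": "velocity_model_v3", "version": "r4801"},
    ],
    "parameters": {"grid_spacing_m": 12.5, "max_frequency_hz": 60},
    "outputs": [{"dataset": "migrated_volume_017", "version": "r4812"}],
}

with open("job_a3f1c2e4.provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```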

We're quite (perhaps too) relaxed about reproducing numerical results to the least-significant digit. The science underlying our work, and the errors inherent in the measurements of our fundamental datasets, do not support the validity of any of our numerical results beyond 2 or 3 s.f.
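
For what it's worth, a regression comparison at that sort of precision might look like this; the 1e-3 relative tolerance standing in for "about 3 significant figures" and the file names are my choices for the example, not a standard.

```python
import numpy as np

old_result = np.loadtxt("run_r4790_output.txt")   # hypothetical regression baseline
new_result = np.loadtxt("run_r4812_output.txt")   # output from the new build

# Agreement to roughly 3 significant figures is treated as "the same".
if np.allclose(new_result, old_result, rtol=1e-3, atol=0.0):
    print("results agree to ~3 s.f.")
else:
    print("results differ beyond measurement-supported precision")
```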

We certainly won't be publishing either code or data for peer-review, we're in the oil business.


Plenty of good suggestions already. I'll add a couple more (both from bitter experience---before publication, thankfully!):

1) Check your results for stability:

  • try several different subsets of the data
  • rebin the input
  • rebin the output
  • tweak the grid spacing
  • try several random seeds (if applicable)

If it's not stable, you're not done.

Publish the results of the above testing (or at least, keep the evidence and mention that you did it).
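
A rough sketch of automating a couple of the checks above (random subsets and different seeds); `run_analysis` is a stand-in for whatever your pipeline actually does, and the 90% subsample size and trial count are arbitrary.

```python
import numpy as np

def run_analysis(data, seed):
    """Placeholder for the real analysis; returns the quantity of interest."""
    rng = np.random.default_rng(seed)
    return np.mean(data) + rng.normal(scale=1e-6)   # stand-in computation

def stability_report(data, n_trials=10, subset_fraction=0.9):
    """Re-run the analysis on random subsets with different seeds and
    report the spread of the results."""
    rng = np.random.default_rng(0)
    results = []
    for seed in range(n_trials):
        subset = rng.choice(data, size=int(len(data) * subset_fraction), replace=False)
        results.append(run_analysis(subset, seed))
    results = np.array(results)
    print(f"mean={results.mean():.4g}  spread={results.std():.2g}")
    return results

stability_report(np.random.default_rng(1).normal(size=1000))
```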

2) Spot check the intermediate results

Yes, you're probably going to develop the method on a small sample, then grind through the whole mess. Peek into the middle a few times while that grinding is going on. Better yet, where possible, collect statistics on the intermediate steps and look for signs of anomalies therein.
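
Something as simple as this, dropped between pipeline stages, is often enough; the stage name and the variable in the commented-out call are just examples.

```python
import numpy as np

def checkpoint_stats(stage_name, array):
    """Log simple summary statistics for an intermediate result so that
    anomalies (NaNs, wild ranges) show up while the long run is in progress."""
    array = np.asarray(array)
    print(f"[{stage_name}] shape={array.shape} "
          f"min={np.nanmin(array):.4g} max={np.nanmax(array):.4g} "
          f"mean={np.nanmean(array):.4g} NaNs={np.count_nonzero(np.isnan(array))}")

# e.g. after a hypothetical filtering stage:
# checkpoint_stats("after_bandpass_filter", filtered_traces)
```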

Again, any surprises and you've got to go back and do it again.

And, again, retain and/or publish this.


Things already mentioned that I like include

  • Source control---you need it for yourself anyway.
  • Logging of build environment. Publication of the same is nice.
  • Plan on making code and data available.

Another one no one has mentioned:

3) Document the code

Yes, you're busy writing it, and probably busy designing it as you go along. But I don't mean a detailed document as much as a good explanation for all the surprises. You're going to need to write those up anyway, so think of it as getting a head start on the paper. And you can keep the documentation in source control so that you can freely throw away chunks that don't apply anymore---they'll be there if you need them back.

It wouldn't hurt to build a little README with build instructions and a "How to run" blurb. If you're going to make the code available, people are going to ask about this stuff... Plus, for me, checking back with it helps me stay on track.


Publish the program code; make it available for review.

This is not directed at you by any means, but here is my rant:

If you do work sponsored by taxpayer money, and you publish the results in a peer-reviewed journal, provide the source code, under an open source license or in the public domain. I am tired of reading about this great algorithm somebody came up with, which they claim does X, with no way to verify or check the source code. If I cannot see the code, I cannot verify your results, because algorithm implementations can differ very drastically.

It is not moral, in my opinion, to keep work paid for by taxpayers out of reach of fellow researchers. It's against science to push papers yet provide no tangible benefit to the public in terms of usable work.


I think a lot of the previous answers missed the "scientific computing" part of your question, and answered with very general stuff that applies to any science (make the data and method public, specialized to CS).

What they're missing is that you have to be even more specialized: you have to specify which version of the compiler you used, which switches were used when compiling, which version of the operating system you used, which versions of all the libraries you linked against, what hardware you are using, what else was being run on your machine at the same time, and so forth. There are published papers out there where every one of these factors influenced the results in a non-trivial way.
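
For a Python-based analysis, even a crude environment dump like this captures a useful chunk of that information (compiler switches and detailed hardware info would still need to be recorded separately); the library list is illustrative.

```python
import json
import platform
import sys
from importlib.metadata import version

environment = {
    "python": sys.version,
    "os": platform.platform(),
    "machine": platform.machine(),
    "processor": platform.processor(),
    # list the libraries your analysis actually imports / links against
    "libraries": {pkg: version(pkg) for pkg in ["numpy", "scipy"]},
}

with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```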

For example (on Intel hardware) you could be using a library which uses the FPU's 80-bit floats, do an O/S upgrade, and now that library might only use 64-bit doubles, and your results can drastically change if your problem was the least bit ill-conditioned.

A compiler upgrade might change the default rounding behaviour, or a single optimization might flip the order in which two instructions get done, and again for ill-conditioned systems, boom, different results.
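
A tiny Python demonstration of how much evaluation order alone can matter in floating point (the numbers are contrived, but the effect is real):

```python
# Two mathematically identical sums, evaluated in different orders.
a = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed by 1e16, so the result is 0.0
b = (1e16 - 1e16) + 1.0   # reordered: the result is 1.0

print(a, b)   # 0.0 1.0
```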

Heck, there are some funky stories of sub-optimal algorithms showing up as 'best' in practical tests because they were tested on a laptop which automatically slowed down the CPU when it overheated (which is exactly what the optimal algorithm made it do).

None of these things are visible from the source code or the data.


Post code, data, and results on the Internet. Write the URL in the paper.

Also, submit your code to "contests". For example, in music information retrieval, there is MIREX.


Record configuration parameters somehow (e.g. if you can set a certain variable to a certain value). This may be in the data output, or in version control.

If you're changing your program all the time (I am!), make sure you record what version of your program you're using.
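
A quick sketch of doing both at once, assuming the code lives in a Git checkout (substitute the equivalent Subversion command if that's what you use); the parameter names and file name are placeholders.

```python
import json
import subprocess

def record_run_metadata(params, path="run_metadata.json"):
    """Write the program version (here: the Git commit) and the configuration
    parameters alongside the output data."""
    revision = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    with open(path, "w") as f:
        json.dump({"revision": revision, "parameters": params}, f, indent=2)

# Hypothetical configuration for one run:
record_run_metadata({"smoothing_window": 5, "threshold": 0.01})
```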


Perhaps this is slightly off topic, but to follow @Jacques Carette's lead regarding scientific computing specifics, it may be helpful to consult the Verification & Validation ("V&V") literature for some specific questions, especially those that blur the line between reproducibility and correctness. Now that cloud computing is becoming more of an option for large simulation problems, reproducibility across a random assortment of CPUs will be more of a concern. Additionally, I don't know whether it is possible to fully separate "correctness" from "reproducibility" of your results, because your results stem from your computational model. Even if your model seems to work on computational cluster A but not on cluster B, you need to follow some guidelines to guarantee that your process for building the model is sound. Specific to reproducibility, there is some buzz in the V&V community about incorporating reproducibility error into overall model uncertainty (I will let the reader investigate this on their own).

For example, for computational fluid dynamics (CFD) work, the gold standard is the ASME V&V guide. For the applied multiphysics modeling and simulation people especially (with its general concepts applicable to the greater scientific computing community), this is an important standard to internalize.
