
Unit Testing Machine Learning Code

I am writing a fairly complicated machine learning program for my thesis in computer vision. It's working fairly well, but I need to keep trying new things out and adding new functionality. This is problematic because I sometimes introduce bugs when I am extending the code or trying to simplify an algorithm.

Clearly the correct thing to do is to add unit tests, but it is not clear how to do this. Many components of my program produce a somewhat subjective answer, so I cannot automate sanity checks.

For example, I had some code that approximated a curve with a lower-resolution curve, so that I could do computationally intensive work on the lower-resolution curve. I accidentally introduced a bug into this code, and only found it through a painstaking search when the results of my entire program got slightly worse.

But when I tried to write a unit test for it, it was unclear what I should do. If I make a simple curve that has a clearly correct lower-resolution version, then I'm not really testing out everything that could go wrong. If I make a simple curve and then perturb the points slightly, my code starts producing different answers, even though this particular piece of code really seems to work fine now.


You may not appreciate the irony, but basically what you have there is legacy code: a chunk of software without any unit tests. Naturally you don't know where to begin. So you may find it helpful to read up on handling legacy code.

The definitive thought on this is Michael Feathers' book, Working Effectively with Legacy Code. There used to be a helpful summary of that on the ObjectMentor site, but alas the website has gone the way of the company. However WELC has left a legacy in reviews and other articles. Check them out (or just buy the book), although the key lessons are the ones which S.Lott and tvanfosson cover in their replies.


2019 update: I have fixed the link to the WELC summary with a version from the Wayback Machine web archive (thanks @milia).

Also - and despite knowing that answers which comprise mainly links to other sites are low quality answers :) - here is a link to a new (2019 new) Google tutorial on Testing and Debugging ML code. I hope this will be of illumination to future Seekers who stumble across this answer.


"then I'm not really testing out everything that could go wrong."

Correct.

The job of unit tests is not to test everything that could go wrong.

The job of unit tests is to test that what you have does the right thing, given specific inputs and specific expected results. The important part here is that specific, visible, external requirements are satisfied by specific test cases. Not that every possible thing that could go wrong is somehow prevented.

Nothing can test everything that could go wrong. You can write a proof, but you'll be hard-pressed to write tests for everything.

Choose your test cases wisely.

Further, the job of unit tests is to test that each small part of the overall application does the right thing -- in isolation.

Your "code that approximated a curve with a lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation. The integrated whole could also be tested to be sure that it works.

Your "computationally intensive work on the lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation.

The point of unit testing is to create small, correct units that are later assembled.
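
To make that concrete, here is a minimal sketch of a test for one such small unit, written in Python with unittest. The downsample_curve function and its keep-every-n-th-point behavior are invented stand-ins, not the asker's actual code; the point is that a specific input has one specific, checkable answer:

import unittest

def downsample_curve(points, keep_every):
    # Hypothetical unit: keep every n-th point of a polyline,
    # always retaining the final point so the endpoints survive.
    kept = points[::keep_every]
    if kept[-1] != points[-1]:
        kept.append(points[-1])
    return kept

class DownsampleCurveTest(unittest.TestCase):
    def test_straight_line_keeps_endpoints(self):
        points = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
        self.assertEqual(downsample_curve(points, 2),
                         [(0, 0), (2, 2), (4, 4)])

    def test_two_point_curve_is_unchanged(self):
        points = [(0, 0), (5, 5)]
        self.assertEqual(downsample_curve(points, 3), points)

if __name__ == "__main__":
    unittest.main()

Each test pins down one visible requirement (endpoints are preserved, a trivial curve passes through untouched) rather than trying to cover everything that could go wrong.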


Without seeing your code, it's hard to tell, but I suspect that you are attempting to write tests at too high a level. You might want to think about breaking your methods down into smaller components that are deterministic and testing these. Then test the methods that use these methods by providing mock implementations that return predictable values from the underlying methods (which are probably located on a different object). Then you can write tests that cover the domain of the various methods, ensuring that you have coverage of the full range of possible outcomes. For the small methods you do so by providing values that represent the domain of inputs. For the methods that depend on these, by providing mock implementations that return the range of outcomes from the dependencies.
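
As an illustration only (the names here are made up, not taken from the question), such a test might stub the lower-level dependency with Python's unittest.mock so the higher-level method is exercised against predictable values:

from unittest.mock import Mock

# Hypothetical higher-level routine that delegates to a curve simplifier.
def total_length_after_simplification(simplifier, curve):
    simplified = simplifier.simplify(curve)
    return sum(abs(b - a) for a, b in zip(simplified, simplified[1:]))

def test_uses_the_simplified_curve():
    simplifier = Mock()
    simplifier.simplify.return_value = [0.0, 3.0, 4.0]  # predictable stand-in output
    # 3.0 + 1.0, computed purely from the mocked values.
    assert total_length_after_simplification(simplifier, [0.0, 1.0, 2.0]) == 4.0
    simplifier.simplify.assert_called_once_with([0.0, 1.0, 2.0])

Because the mock fixes the dependency's output, the test is fully deterministic even if the real simplifier is not.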


Your unit tests need to employ some kind of fuzz factor, either by accepting approximations or by using probabilistic checks.

For example, if you have some function that returns a floating point result, it is almost impossible to write an exact-equality test that works correctly across all platforms. Your checks would need to allow for the approximation:

assert abs(result - 4.0) <= 0.1   # "almost equal": within a tolerance

The check above verifies that result is between 3.9 and 4.1 (for example).

Alternatively, if your machine learning algorithms are probabilistic, your tests will need to account for that by taking the average of multiple runs and expecting it to be within some range:

total = 0.0
for _ in range(100):
    total += result_probabilistic_test()

avg = total / 100
assert 10.0 <= avg <= 15.0   # the average of many runs must land in the expected range

Of course, the tests are non-deterministic, so you will need to tune them so that they are non-flaky with high probability (e.g., increase the number of trials, or widen the accepted range of error).

You can also use mocks for this (e.g., a mock random number generator for your probabilistic algorithms), and they usually help with deterministically testing specific code paths, but they are a lot of effort to maintain. Ideally, you would use a combination of fuzzy testing and mocks.
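
For instance, here is a sketch (with invented names) of a probabilistic step that takes its random source as a parameter, so a test can pass a seeded generator and pin down one specific, repeatable code path:

import random

# Hypothetical probabilistic step that accepts its random source as a parameter.
def pick_initial_centers(points, k, rng):
    return rng.sample(points, k)

def test_center_picking_is_reproducible_with_a_seeded_rng():
    # Two generators with the same fixed seed make the same choices.
    run_a = pick_initial_centers([1, 2, 3, 4, 5], 2, random.Random(42))
    run_b = pick_initial_centers([1, 2, 3, 4, 5], 2, random.Random(42))
    assert run_a == run_b
    assert len(run_a) == 2 and set(run_a) <= {1, 2, 3, 4, 5}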

HTH.


Generally, for statistical measures you would build an epsilon into your answer, i.e. the mean squared difference of your points would be < 0.01 or some such. Another option is to run the test several times; if it fails "too often" then you have an issue.
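
A sketch of that idea in Python (the reference values and the 0.01 epsilon are placeholders you would pick for your own data):

def mean_squared_difference(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def test_curve_matches_reference_within_epsilon():
    reference = [0.0, 0.5, 1.0, 1.5]   # known-good output stored with the test
    result = [0.0, 0.49, 1.02, 1.5]    # stand-in for your algorithm's output
    assert mean_squared_difference(result, reference) < 0.01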


  1. Get an appropriate test dataset (maybe a subset of what you're usually using)
  2. Calculate some metric on this dataset (e.g. the accuracy)
  3. Note down the value obtained (cross-validated)
  4. This should give an indication of where to set the threshold

Of course it can happen that when you make changes to your code the performance on the dataset increases a little, but if it ever decreases by a large amount, that is an indication that something is going wrong.
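
A minimal sketch of such a regression test, assuming a frozen evaluation subset and a run_model stand-in for your real pipeline (both are placeholders, not your actual code):

ACCURACY_THRESHOLD = 0.85   # set from the cross-validated value you noted down

def accuracy(predictions, labels):
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def run_model(data):
    # Placeholder for the real pipeline; it just echoes its input here
    # so that this demonstration actually runs.
    return data

def test_accuracy_has_not_regressed():
    data = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]    # frozen evaluation subset
    labels = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]  # its known labels
    assert accuracy(run_model(data), labels) >= ACCURACY_THRESHOLD

If a change makes the metric drop below the noted threshold, the test fails and points at the regression immediately, instead of leaving it to a painstaking search later.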
