Should I use real or sample data for unit tests?
I'm writing a parser for the output of a legacy application. There are no specs for the file syntax, so I've collected as many sample files as I could.
Now I'm writing the unit tests before implementing the parser (because there is no other sane way to do this), but I'm not sure whether I should:
- use the real files produced by the application, reading them and comparing the parser's output against expected output stored as JSON in a separate file;
- or create a sample string with the tokens and cases I want to test, plus a dict (this is Python) with the expected output, roughly like the sketch below.
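Something like this, where the token syntax and `parse_record` are invented placeholders just to show the shape of the test:

```python
from legacy_parser import parse_record  # module under test (name is a placeholder)

def test_parses_key_value_tokens():
    # Invented token syntax: semicolon-separated KEY=VALUE pairs.
    sample = "NAME=widget;QTY=3;PRICE=1.50"
    expected = {"NAME": "widget", "QTY": "3", "PRICE": "1.50"}
    assert parse_record(sample) == expected
```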
I'm inclined toward the second alternative because I would test only what I need to, without all the "real-world" noise included in the actual files, but I'm afraid I could forget to test one possibility or another.
What do you think?
My suggestion is to do both. Write a set of integration tests that run through all the files you have and check them against the expected outputs, then unit-test with hand-crafted inputs to isolate the parsing logic.
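A minimal sketch of the integration half, assuming the captured files live under `tests/samples/` with an expected-output JSON file next to each one (all paths and names here are assumptions):

```python
import json
from pathlib import Path

import pytest

from legacy_parser import parse_file  # module under test (name is a placeholder)

# One test per captured sample; foo.dat is checked against foo.json.
SAMPLES = sorted(Path("tests/samples").glob("*.dat"))

@pytest.mark.parametrize("sample", SAMPLES, ids=lambda p: p.name)
def test_full_file(sample):
    expected = json.loads(sample.with_suffix(".json").read_text())
    assert parse_file(sample) == expected
```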
I would recommend writing the integration tests first so you build your parser outside-in. It may be discouraging to see a bunch of failing tests at first, but it'll help you isolate your edge cases earlier.
By the way, I think this is a great question. I recently ran into a similar problem: transforming large XML feeds from an upstream system into a proprietary format. My solution was to write a set of black-box integration tests over the full feeds, checking things like record counts and other high-level success metrics, then break the inputs down into smaller and smaller chunks until I could test all the permutations of the data. Only then did I have a good understanding of how to build the parser.
You should be careful using production data in testing scenarios. It could be a disaster if all your users got an email from a test environment, for example. It's also probably unethical in some scenarios for developers to have access to prod data at all, even if there is no way for the users to know about it. Think medical records, banking, or college grades.
My answer is that you should use data that is close to prod data. If you want to use actual prod data, you need to scrub it first for the scenarios above.
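Scrubbing can be a simple pass over the captured files before committing them as fixtures. A minimal sketch, assuming email addresses and SSN-like numbers are the sensitive parts (the patterns are examples only; extend them for whatever your real data contains):

```python
import re

# Example patterns only; add whatever sensitive fields your data actually has.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub(text: str) -> str:
    """Replace sensitive values with obviously fake stand-ins."""
    text = EMAIL.sub("user@example.com", text)
    return SSN.sub("000-00-0000", text)
```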
Production data can be a good starting point (assuming it's not sensitive info), since there's a good chance you can't think of all the possible permutations yourself. However, once you get a good working set of data, save it somewhere static, like a file. Then have the tests get it from there instead of dynamically from the production environment. That way you can run the tests with a known set of inputs every time.
The alternative, pulling production data on the fly for test inputs, is fraught with problems: a test could pass one run and fail the next simply because the inputs changed, not because the code did.
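One way to do the freeze, sketched under the assumption that the legacy app writes its files to a known directory (both paths below are made up):

```python
import shutil
from pathlib import Path

LIVE_OUTPUT = Path("/var/legacy_app/output")  # where the app writes (assumed)
FIXTURES = Path("tests/samples")              # committed to version control

def freeze_samples() -> None:
    """One-off capture: copy real files into the test tree, then never
    read from the live system again."""
    FIXTURES.mkdir(parents=True, exist_ok=True)
    for f in sorted(LIVE_OUTPUT.glob("*.dat")):
        shutil.copy(f, FIXTURES / f.name)
```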
Don't forget to structure the tests so that you can add new possibilities (i.e., regression tests) as they become known.
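A table-driven layout makes that cheap: each newly discovered quirk becomes one more row. A sketch, with `parse_record` and both cases invented for illustration:

```python
import pytest

from legacy_parser import parse_record  # module under test (name is a placeholder)

# Append a row whenever a new real-world quirk turns up.
CASES = [
    ("NAME=widget;QTY=3", {"NAME": "widget", "QTY": "3"}),
    ("NAME=widget;;QTY=3", {"NAME": "widget", "QTY": "3"}),  # hypothetical doubled-separator quirk
]

@pytest.mark.parametrize("raw,expected", CASES)
def test_known_quirks(raw, expected):
    assert parse_record(raw) == expected
```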
Using the second solution you describe will let you control both the input and the expected output, which is ideal for unit testing. When creating automated tests, avoid manual interaction wherever possible; visually scanning the results is one of the practices to avoid (assuming that's what you meant by "comparing").