Bug Hunting Strategies?
Let's say you have a bug that was found in functional testing of a fairly complex part of the software. It could stem from bad/unexpected data in the database, middle-tier code, or something in the front-end.
Fine. We've all been there.
You have unit tests to write & run, debugging/logging statements to insert, SQL statements to write & run, things you want to check with Firebug, etc.
Let's say the first step is to come up with a list of potential causes that you want to investigate.
Now you have to decide what order to do things in.
Do you:
- Investigate causes in an order based on gut feeling?
- Investigate causes in order from the quickest to check to the slowest to check?
- Assume that the bug is specific to this feature, and investigate from the most feature-specific code to the least feature-specific code?
- Assume it's someone else's fault, and investigate from the most general code down to your specific code?
- Something else I haven't mentioned?
I have a feeling the first strategy is the most often used. Maybe that's just because I don't work with many junior developers, and more senior developers tend to have decent intuition. Or maybe we just think we have decent intuition but should really be using a more systematic approach.
Any thoughts?
I find the Rubber Duck Debugging strategy works well, too.
In my experience, it's probably best to go with gut feel (1) for 30 minutes or so.
If nothing comes out of that, talk to someone else about it.
It's quite amazing how much talking to someone else (even if they're non-technical) can help.
- Reproduce the bug in a debug environment.
- Examine system state at the point the bug occurs to find the inconsistent / incorrect / unexpected elements of state that directly, visibly led to the bug occurring. Often, just eyeballing the code and call stack will immediately tell you what the problem is.
- Add tests to all points where this state can be created / mutated within the normal flow of control.
- Treating failures of these tests as a new bug, return to step two.
Rinse, lather, repeat until the initial cause of the problem is found. Tedious and mechanical, but it will get you there.
Except... occasionally the tests in an iteration of step 3 don't fail; most commonly, because some unrelated system corrupting memory is what is leading to the invalid state, or because the problem is threading/timing/uninitialised data dependent and introducing the tests alters timings and/or data layout sufficiently to alter or hide the symptoms. At this point, for this poster at least, the process becomes a more intuitive one, alternating between replacing sanity tests with less intrusive forms and selectively disabling modules to rule out sources of corruption.
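A minimal Python sketch of step 3, adding a sanity test at each point where the suspect state is created or mutated. The `Order` class and its invariant are hypothetical, chosen only to show the check failing at the mutation that breaks the state rather than later, where the symptom is visible:

```python
class Order:
    """Hypothetical piece of state whose corruption shows up as the bug."""

    def __init__(self, quantity, unit_price):
        self.quantity = quantity
        self.unit_price = unit_price
        self._check("__init__")

    def _check(self, where):
        # Sanity test run after every creation/mutation of the state.
        if self.quantity < 0:
            raise AssertionError(f"negative quantity after {where}")
        if self.unit_price < 0:
            raise AssertionError(f"negative unit price after {where}")

    def remove_items(self, count):
        self.quantity -= count
        self._check("remove_items")

    def total(self):
        # Without the checks, a negative total here would be the
        # first visible symptom, far from the real cause.
        return self.quantity * self.unit_price


order = Order(quantity=2, unit_price=5)
failure = None
try:
    order.remove_items(3)  # the mutation that corrupts the state
except AssertionError as exc:
    failure = str(exc)
    print("caught at the source:", failure)
```

Each failing check then becomes the "new bug" to trace one step further back.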
I would say it doesn't matter, as long as it's documented and methodical. It's an odd little truism in programming that sometimes doing things in a random order is more efficient than spending a lot of time trying to figure out the "best" way.
Never underestimate the gut feeling; that's experience giving you a heads up. I almost always start with what you'd probably consider to be my "gut" feeling. I look at the error, and check the steps that I think are likely to cause that sort of problem.
My first step in a situation like that is usually to check things in the order that will most quickly reduce the number of things left to check. You could almost think of it as doing a binary search for the bug: "Well, the POST parameters look right, so I can rule out everything before the form submission," etc.
That said, if I have a strong feeling that the problem might be in a particular place, I'll check that first.
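A minimal sketch of that binary-search idea in Python. The stage names and the pass/fail observations are made up for illustration, and it assumes that once the data goes wrong at some stage it stays wrong at every later stage:

```python
def first_bad_stage(checks):
    """Return the index of the first stage whose output looks wrong.

    `checks` is a list of zero-argument callables, one per stage, in
    pipeline order. Each returns True if the data still looks right
    after that stage. Returns None if every stage looks fine.
    """
    lo, hi = 0, len(checks)
    while lo < hi:
        mid = (lo + hi) // 2
        if checks[mid]():
            lo = mid + 1  # everything up to and including mid is fine
        else:
            hi = mid      # the bug is at mid or earlier
    return lo if lo < len(checks) else None


# Hypothetical pipeline and what inspecting each stage would show.
stages = ["form submission", "POST parameters", "middle tier",
          "database write", "page render"]
observed_ok = [True, True, False, False, False]
checks = [lambda ok=ok: ok for ok in observed_ok]

culprit = first_bad_stage(checks)
print("first stage with bad data:", stages[culprit])
```

In practice each check is a manual inspection (a log line, a query, a breakpoint), so the point is the order of inspection: each look rules out roughly half of what is left.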
I tend to go with gut feeling, and a divide-and-conquer approach; isolating chunks of code of decreasing size where I think "the bug" is.
This doesn't work if you don't know, or don't understand the codebase - if that's the case, find someone who does, and go with their gut feeling.
First I try to understand the bug, then I do all the things you suggest, in an order based on gut feeling. This is really a trade-off between how certain you are of a specific cause and how easy that cause is to test.
In addition, when I investigate a cause I try to add the really quick checks directly, as I'm inspecting the code anyway (add some temporary debug output statements, add asserts and such).
Listen to how the experts debug software on Software Engineering radio:
Dave Thomas talks about software archaeology which has some really great tips on debugging.
Andreas Zeller appears in an episode devoted to debugging.
In general, I start with a subset of hypotheses that I consider the most likely culprits, then sort that subset by how easy each hypothesis is to disprove, and start with the easiest.
Regardless of the order, the important thing is what you do with your hypothesis. Start trying to disprove each hypothesis rather than to verify it and you'll cover more ground (see Psychology of Intelligence Analysis by Richards J. Heuer, Jr., free PDF).
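A rough Python sketch of that ordering: shortlist the most likely hypotheses, then work through the shortlist starting with the cheapest to disprove. The hypotheses, likelihoods, and time estimates below are invented for illustration, and `investigation_order` is a hypothetical helper name:

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    likelihood: float         # gut-feel estimate, 0.0 to 1.0
    minutes_to_disprove: int  # rough cost of ruling it out


def investigation_order(hypotheses, shortlist_size=3):
    # Step 1: keep only the most likely culprits.
    shortlist = sorted(hypotheses, key=lambda h: h.likelihood,
                       reverse=True)[:shortlist_size]
    # Step 2: within the shortlist, start with the easiest to disprove.
    return sorted(shortlist, key=lambda h: h.minutes_to_disprove)


candidates = [
    Hypothesis("bad row in the database", 0.50, 5),
    Hypothesis("middle-tier validation bug", 0.30, 30),
    Hypothesis("front end sends the wrong field", 0.15, 2),
    Hypothesis("bug in the framework", 0.05, 240),
]

order = investigation_order(candidates)
for h in order:
    print(f"{h.minutes_to_disprove:>4} min  {h.description}")
```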
I'm with @moonshadow, but I'll add that to some degree it depends on what the failure is. That is, some sorts of failure have fairly well-known causes, and I'd start with the known cause.
For example, on Windows systems "Access Violation" errors are almost always due to an attempt to use or look at (access) unallocated memory. To find the source of such an error, it's a good idea to look at all the places where memory is (or isn't) allocated.
If it's known that the "problem" is due to bad data, then the fix may require changes to data validation or acquisition, even once the error is traced to analysis.
One more point: while thinking through the bug, it's often well worth the effort to try to write a small program that reproduces it.
My order:
- Look at the code of 1-2 most likely causes (chosen based on gut feeling).
- If nothing is found, execute the code in a debugger (or, if that's not possible, insert debugging/logging statements into the code).
- If nothing is found, call somebody else and repeat steps 1 and 2 together with him/her.
Here are some helpful hints:
- If you use a language that generates a stack trace on exceptions, start from there.
- Get a copy of the original data that caused the problem if you can.
- Use a good debugger.
- If you have access to one, there are things like the ODB for various languages that can be helpful by allowing you to fast-forward or reverse through the execution after the event occurs.
- Exclude the impossible and you will be left with the solution!
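As an illustration of the first hint, here is a small Python sketch that starts from the stack trace of an exception. The functions `parse_price` and `load_order` and the bad input are hypothetical; only the standard `traceback` module is used:

```python
import traceback


def parse_price(raw):
    return float(raw)  # fails on input such as "12,50"


def load_order(row):
    return {"price": parse_price(row["price"])}


innermost = None
call_chain = []
try:
    load_order({"price": "12,50"})  # stand-in for the original bad data
except ValueError as exc:
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1]               # frame where the exception was raised
    call_chain = [f.name for f in frames]
    print("raised in:", innermost.name, "line", innermost.lineno)
    print("call chain:", " -> ".join(call_chain))
    print("message:", exc)
```

The innermost frame says where the failure surfaced, and the call chain says which path led there, which is usually enough to pick the first place to look.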
I normally do this:
1) Add one or more new functional test cases to the automated regression test system. I normally start a software project by building my own regression test system. Over the years these have been:
- Excel VBA + a C library to control a SCSI/IDE interface/device (13 years ago). The test report was an Excel spreadsheet.
- TCL Expect for complex network router system testing (6 years ago). The test report was a web page.
- Today I use Python/Expect. The test report is XML plus a Python-based XML analyzer.
The goal of all this work is to make sure that once a bug is found, it never shows up in checked-in code or in the production system again. It also makes random and long-term problems easier to reproduce.
Don't check in any code unless it has gone through an overnight automated regression test.
I typically write product code and test code at a 1:1 ratio: 20K lines of TCL Expect for 20K lines of C++ code (5 years ago). For example:
- The C code would implement a proxy that sets up tunnels and forwards TCP connections.
- TCL test cases: (a) Set up the connections and make sure the data passes through. (b) Set up the connections with different network elements. (c) Do that 10, 100, 1000 times and check for memory leaks, system resource issues, etc.
- Do this for every feature in the system, and one can see why the ratio of test code to product code is 1:1.
I don't want the QA team to do automated testing with my test system, since all my checked-in code already has to pass those tests. I usually run a two-week long-term regression test before I give the code to the QA team.
Having the QA team run manual test cases also makes sure my program has enough built-in diagnostic info to capture any future bugs. The goal is to have enough diagnostic info to solve 95% of bugs in < 2 hours. I was able to do that in my last project. (Video network equipment at RBG Networks.)
2) Add a diagnostic routine (web-based nowadays) to get at all the internal information (current state, logs, etc.). More than 50% of my code (C/C++ especially) is diagnostic code.
3) Add more detailed logging for trouble areas that I don't understand.
4) Analyze the info.
5) Try to fix the bug.
6) Run overnight / over-the-weekend regression tests. When I was in R&D, I typically asked for at least 5-10 test systems to run regression tests continuously, 24x7. That normally helps identify and solve memory, resource, and long-term performance problems before the code hits SQA.
Once, an embedded system failed to boot to the Linux prompt from time to time. I added a test case that power-cycled the system with a programmable outlet over and over again, checked that it could "see" the command prompt each time, and ran it overnight. We were able to quickly identify the FPGA code problem and make sure the system was always up after 5000 power cycles. The test case was added to the suite, and every time new Verilog code was checked in or FPGA code was built, this test case was run. It was never an issue again.
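A minimal Python sketch of that kind of repeat-until-it-breaks test. The real test drove a programmable outlet and watched for the command prompt; here the boot is simulated with a seeded random failure rate, so `make_simulated_boot` and its parameters are assumptions made purely for illustration:

```python
import random


def run_repeated(test_once, cycles):
    """Run `test_once(cycle)` for each cycle; return the cycles that failed."""
    failures = []
    for cycle in range(1, cycles + 1):
        if not test_once(cycle):
            failures.append(cycle)
    return failures


def make_simulated_boot(failure_rate, seed):
    # Stand-in for "power-cycle the box and wait for the prompt".
    rng = random.Random(seed)

    def boot(cycle):
        return rng.random() >= failure_rate  # True means the prompt was seen

    return boot


# An intermittent fault is very likely to show up given enough cycles.
flaky = run_repeated(make_simulated_boot(failure_rate=0.01, seed=1), cycles=5000)
print(f"{len(flaky)} failed boots out of 5000")

# After the fix, the same test doubles as the regression check.
fixed = run_repeated(make_simulated_boot(failure_rate=0.0, seed=1), cycles=5000)
print(f"{len(fixed)} failed boots out of 5000 after the fix")
```

Recording which cycles failed, rather than stopping at the first failure, gives a failure rate to compare before and after a candidate fix.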
I suggest reading Debugging By Thinking.
Andreas Zeller has also done some work in systematic debugging studies.