Solving random crashes
I am getting random crashes in my C++ application. It may not crash for a month and then crash 10 times in an hour; sometimes it crashes on launch, and sometimes after several hours of operation (or not at all).
I use GCC on GNU/Linux and MinGW on Windows, so I can't use the Visual Studio JIT debugger...
I have no idea how to proceed. Looking through the code at random would not work: the code is HUGE (a good part of it was not my work, and it contains a fair amount of legacy stuff), and I don't have a clue how to reproduce the crash.
EDIT: Lots of people mentioned that... how do I make a core dump, minidump or whateverdump? This is the first time I have needed post-mortem debugging.
EDIT2: Actually, DrMingw captured a call stack, but no memory info... Unfortunately, the call stack didn't help me much, because near the end it suddenly goes into some library (or something) that I don't have debug info for, resulting in only hexadecimal numbers... So I still need a decent dump that gives more information (especially about what was in memory... specifically, what was at the address that gave the "access violation" error).
Also, my application uses Lua and Luabind; maybe the error is being caused by a .lua script, but I have no idea how to debug that.
Try Valgrind (it's free, open-source):
The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler. It also includes two experimental tools: a heap/stack/global array overrun detector, and a SimPoint basic block vector generator. It runs on the following platforms: X86/Linux, AMD64/Linux, PPC32/Linux, PPC64/Linux, and X86/Darwin (Mac OS X).
Valgrind Frequently Asked Questions
The Memcheck part of the package is probably the place to start:
Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs.
Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.
Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.
Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[]
Overlapping src and dst pointers in memcpy and related functions.
Memory leaks.
First, you are lucky that your process crashes multiple times in a short time-period. That should make it easy to proceed.
This is how you proceed.
- Get a crash dump
- Isolate a set of potential suspicious functions
- Tighten up state checking
- Repeat
Get a crash dump
First, you really need to get a crash dump.
If you don't get crash dumps when it crashes, start with writing a test that produces reliable crash dumps.
Re-compile the binary with debug symbols or make sure that you can analyze the crash dump with debug symbols.
Find suspicious functions
Given that you have a crash dump, look at it in gdb or your favorite debugger and remember to show all threads (in gdb: thread apply all bt)! The buggy thread might not be the one gdb shows you first.
Looking at where gdb says your binary crashed, isolate some set of functions you think might cause the problem.
Looking at multiple crashes and isolating code sections that are commonly active in all of the crashes is a real time-saver.
Tighten up state checking
A crash usually happens because of some inconsistent state. The best way to proceed is often to tighten the state requirements. You do this in the following way.
For each function you think might cause the problem, document what legal state the input or the object must have on entry to the function. (Do the same for what legal state it must have on exit from the function, but that's not too important).
If the function contains a loop, document the legal state it needs to have at the beginning of each loop iteration.
Add asserts for all such expressions of legal state.
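As a rough sketch of what such asserts can look like (SampleBuffer and its members are made-up names for illustration, not anything from the question's code base), the entry state, exit state and loop state are each written down as a separate assert:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class, used only to illustrate state checking.
class SampleBuffer {
public:
    explicit SampleBuffer(std::size_t capacity) : data_(capacity), size_(0) {}

    // Legal state: the element count never exceeds the allocated storage.
    bool invariant_holds() const { return size_ <= data_.size(); }

    void push(int value) {
        assert(invariant_holds());     // legal state on entry
        assert(size_ < data_.size());  // precondition: room for one more element
        data_[size_++] = value;
        assert(invariant_holds());     // legal state on exit
    }

    int sum() const {
        assert(invariant_holds());     // legal state on entry
        int total = 0;
        for (std::size_t i = 0; i < size_; ++i) {
            assert(i < data_.size());  // legal state at the start of each iteration
            total += data_[i];
        }
        return total;
    }

    std::size_t size() const { return size_; }

private:
    std::vector<int> data_;
    std::size_t size_;
};
```

Keep in mind that assert compiles to nothing when NDEBUG is defined, so the build you ship to reproduce the crash has to be one where asserts are still active.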
Repeat
Then repeat the process. If it still crashes outside of your asserts, tighten the asserts further. At some point the process will fail on an assert rather than at some random place. At this point you can concentrate on trying to figure out what made your program go from a legal state on entry to the function, to an illegal state at the point where the assert happened.
If you pair the asserts with verbose logging it should be easier to follow what the program does.
If all else fails (particularly if performance under the debugger is unacceptable), use extensive logging. Start with the entry points -- is the app transactional? Log each transaction as it comes in. Log all the constructor calls for your key objects. Since the crash is so intermittent, log calls to all the functions that might not get called every day.
You'll at least start narrowing down where the crash could be.
Start the program under a debugger (gdb is available for both GCC on Linux and MinGW on Windows) and wait until it crashes there. At the point of the crash you will be able to see what specific action is failing, and look into the assembly code, registers and memory state - this will often help you find the cause of the problem.
Where I work, crashing programs usually generate a core dump file that can be loaded in windbg.
We then have an image of the memory at the time the program crashed. There's not much you can do with it, but at least it gives you the last call stack. Once you know the function which crashed, you might then be able to track down the problem, or at least reduce it to a more reproducible test case.
It sounds like your program is suffering from memory corruption. As already said your best option on Linux is probably valgrind. But here are two other options:
First of all, use a debug malloc. Nearly all C libraries offer a debug malloc implementation that initializes memory (normal malloc keeps "old" contents in memory), checks the boundaries of an allocated block for corruption, and so on. And if that's not enough, there is a wide choice of 3rd-party implementations.
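To make the idea concrete, here is a toy sketch of what a debug malloc does internally. It is not any particular library's implementation, and all names in the dbgalloc namespace are made up: each block is pre-filled with a recognizable pattern and surrounded by guard bytes that are checked when the block is released.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

namespace dbgalloc {  // hypothetical names, for illustration only

const std::size_t   kHeader    = 16;    // stores the requested size
const std::size_t   kGuard     = 16;    // guard bytes on each side of the block
const unsigned char kGuardByte = 0xFD;  // pattern expected in the guards
const unsigned char kFillByte  = 0xCD;  // pattern for fresh memory

// Allocate n usable bytes, pre-filled and surrounded by guard bytes.
void* allocate(std::size_t n) {
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(kHeader + kGuard + n + kGuard));
    if (raw == nullptr) return nullptr;
    std::memcpy(raw, &n, sizeof n);                               // remember size
    std::memset(raw + kHeader, kGuardByte, kGuard);               // leading guard
    std::memset(raw + kHeader + kGuard, kFillByte, n);            // user area
    std::memset(raw + kHeader + kGuard + n, kGuardByte, kGuard);  // trailing guard
    return raw + kHeader + kGuard;
}

// True if neither guard region of the block has been overwritten.
bool guards_intact(const void* p) {
    const unsigned char* user = static_cast<const unsigned char*>(p);
    const unsigned char* raw = user - kGuard - kHeader;
    std::size_t n;
    std::memcpy(&n, raw, sizeof n);
    for (std::size_t i = 0; i < kGuard; ++i) {
        if (raw[kHeader + i] != kGuardByte) return false;  // underrun
        if (user[n + i] != kGuardByte) return false;       // overrun
    }
    return true;
}

// Free the block and report whether its guards were still intact.
bool release(void* p) {
    if (p == nullptr) return true;
    const bool ok = guards_intact(p);
    std::free(static_cast<unsigned char*>(p) - kGuard - kHeader);
    return ok;
}

}  // namespace dbgalloc
```

A real debug allocator would typically also record where each block was allocated and abort (or log) at the point of detection. On glibc, for example, the MALLOC_CHECK_ and MALLOC_PERTURB_ environment variables are documented as switching on checks of this kind without recompiling; how they are enabled depends on the glibc version.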
You might want to have a look at VMware Workstation. I have not set it up that way, but according to their marketing materials they support a rather interesting way of debugging: run the debuggee in a "recording" virtual machine. When memory corruption occurs, set a memory breakpoint at the corrupted address and then turn back time in the VM to exactly the moment when that piece of memory was overwritten. See this PDF on how to set up replay debugging with Linux/gdb. I believe there is a 15- or 30-day demo for Workstation 7, which might be enough to shake out those bugs from your code.
These sorts of bugs are always tricky - unless you can reproduce the error then your only option is to make changes to your application so that extra information is logged, and then wait until the error happens again in the wild.
There is an excellent tool called Process Dumper that you can use to obtain a crash dump of a process that experiences an exception or exits unexpectedly - you could ask users to install that and configure rules for your application.
Alternatively if you don't want to ask users to install other applications you could have your application monitor for exceptions and create a dump itself by calling MiniDumpWriteDump.
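A minimal sketch of that second approach follows. The names install_crash_dump_handler and write_minidump and the file name crash.dmp are placeholders I made up; the Windows calls themselves (SetUnhandledExceptionFilter, CreateFileA, MiniDumpWriteDump) are the documented Win32/DbgHelp API. It assumes your MinGW installation ships dbghelp.h and that you link against dbghelp (-ldbghelp). On other platforms the function is a no-op.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <dbghelp.h>  // link with -ldbghelp (MinGW) or dbghelp.lib

// Called by Windows when an exception is not handled anywhere else.
static LONG WINAPI write_minidump(EXCEPTION_POINTERS* info) {
    HANDLE file = CreateFileA("crash.dmp", GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION mei;
        mei.ThreadId = GetCurrentThreadId();
        mei.ExceptionPointers = info;
        mei.ClientPointers = FALSE;
        // MiniDumpWithFullMemory includes the process memory, which makes
        // the file large but lets you inspect the faulting address later.
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpWithFullMemory, &mei, NULL, NULL);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;  // let the process terminate
}

// Install the handler once, early in main(). Returns true if installed.
bool install_crash_dump_handler() {
    SetUnhandledExceptionFilter(write_minidump);
    return true;
}
#else
// Not Windows: nothing to install, rely on core dumps instead.
bool install_crash_dump_handler() { return false; }
#endif
```

Writing the dump from inside the crashing process is the simple option, not the most robust one: if the heap or stack is already badly corrupted, the handler itself may fail, which is why dumping from a separate watchdog process is often recommended.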
The other option is to improve the logging, however figuring out what information to log (without just logging everything) can be tricky, and so it can take several iterations of crash - change logging to hunt down the problem.
As I said, these sorts of bugs are always tricky to diagnose - in my experience it generally involves hours and hours of peering through logs and crash dumps until suddenly you get that eureka moment where everything makes sense - the key is collecting the right information.
You've already heard how to handle this under Linux: inspect core dumps and run your code under valgrind. So your first step could be to find the errors under Linux and then check whether they vanish under MinGW. Since nobody has mentioned mudflap here, I will: use mudflap if your Linux distribution supplies it. mudflap helps you catch pointer misuse and buffer overflows by tracking where a pointer is actually allowed to point:
- http://gcc.gnu.org/wiki/Mudflap_Pointer_Debugging
And for Windows: there is a JIT debugger for MinGW, called DrMingw:
- http://code.google.com/p/jrfonseca/wiki/DrMingw
Run the application on Linux under valgrind to look for memory errors. Random crashes are usually down to corrupting memory.
Fix every error you find with valgrind's memcheck tool, and then hopefully the crash will go away.
If the whole program takes too long to run under valgrind, then split off functionality into unit tests and run those under valgrind; hopefully you'll find the memory errors that are causing the problems.
If it doesn't, then make sure core dumps are enabled (check the "core file size" line of ulimit -a; ulimit -c unlimited raises the limit for the current shell) and then when it crashes you'll be able to find out where with gdb.
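If you can't control the shell the program is started from, the process can also raise its own core-file limit at startup. A small sketch, assuming a POSIX system (enable_core_dumps is a name I made up; getrlimit and setrlimit are the standard calls):

```cpp
#include <sys/resource.h>

// Raise the soft core-file size limit to the hard limit for this process.
// Returns true on success. POSIX only.
bool enable_core_dumps() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }
    lim.rlim_cur = lim.rlim_max;  // the soft limit may go up to the hard limit
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

This only helps if the hard limit is non-zero; where the core file ends up (or whether it is handed to a crash-reporting service instead) depends on the system configuration.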
That sounds like something tricky like a race condition.
I'd suggest you create a debug build and use that. You should also make sure that a core dump is created when the program crashes.
The next time the program crashes, you can launch gdb on the core dump and see where the problem lies. It'll probably be a follow-on fault (a symptom of earlier damage rather than the root cause), but this should get you started.
The first thing I would do is debug the core dump with gdb (both Windows and Linux). The second would be running a program like Lint, PREfast (Windows), Clang Analyzer or some other static analysis program (be prepared for a lot of false positives). The third thing would be some kind of runtime check, like Valgrind (or its close variants), Microsoft Application Verifier, or Google Perftools.
And logging. Which doesn't have to go to disk. You could, for instance, log to a global std::list<std::string> which would be pruned to the last 100 entries. When an exception is caught, display the contents of that list.
Start logging. Put logging statements in places where you think the code is flaky. Focus on testing that code, and repeat until you narrow down the problem to a module or a function.
Put asserts everywhere!
While you are at it, only put one expression in each assert, so that a failure tells you exactly which condition was violated.
Write a unit test for the code you think is failing. That way you can exercise the code in isolation from the rest of your runtime environment.
Write more automated tests that exercise the problematic code.
Do not add more code on top of the bad code that is failing. That's just a dumb idea.
Learn how to write out mini-dumps and do post-mortem debugging. It looks like others here have explained that quite well.
Exercise the bad code in as many different ways as you can to make sure you can isolate the bug.
Use a debug build. Run the debug build under the debugger if possible.
Trim down your application by removing binaries, modules etc... if possible so that you can have an easier time attempting to reproduce the bug.
There are a lot of good answers here, but no one has yet touched on the Lua angle.
Lua is generally pretty well behaved, but it is still possible for it to cause memory corruption or crashing if e.g. the Lua stack overflows or underflows, or bad bytecode is executed.
One easy thing you can do that will detect many such errors is to define the lua_assert macro in luaconf.h. Defining this (to e.g. standard C's assert) will enable a variety of sanity checks inside the Lua core.
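For example, something like the following could be added to luaconf.h. This is a sketch based on how the Lua 5.1 sources are laid out; other Lua versions may use different macro names, so check llimits.h in your copy.

```cpp
/* In luaconf.h: route Lua's internal sanity checks to the standard assert.
   Lua itself must be rebuilt afterwards, not just the application. */
#include <assert.h>
#define lua_assert(x) assert(x)
```

Lua 5.1's luaconf.h also has a LUA_USE_APICHECK switch that, when defined at build time, turns on checks of the arguments passed to the C API; that may be worth enabling as well when binding code (Luabind or hand-written) is a suspect.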
You have probably made a memory error where you write values to unallocated space somehow. That is a common cause of random crashes: for a long time no one tries to use that memory, so there are no errors. Take a look at the places where you allocate memory and check where you use pointers extensively. Other than this, as others pointed out, you should use extensive logging, both on screen and to files.
Another basic check: Make sure you do a full rebuild of your project. If you've been tweaking various files (especially header files) and doing partial builds then things can get messy if your build dependencies aren't perfect. A full rebuild just removes that possibility.
Also for Windows check out Microsoft's Debugging tools for Windows, and particularly their gflags tool.
Two more pointers/ideas (besides core dump and valgrind on Linux):
1) Try Nokia's "Qt Creator". It supports MinGW and can act as a post-mortem debugger.
2) If it's feasible, maybe just run the application in gdb constantly?
If your application is not Windows specific, you may try compiling and running your program on other platforms such as Linux (different distribution, 32/64 bits, ... if you've the luxury). That may help trigger the bugs of your program. Of course you should use the tools mentioned in other posts such as gdb, valgrind, etc.