Causes for crash during garbage collection
I've been struggling for some time with a crash in a C# application that also uses a fair share of C++/CLI modules, most of which are wrappers around native libraries for accessing device drivers. The crash is not always easy to reproduce, but I was able to collect half a dozen crash dumps, which show that the program always crashes with an access violation during a garbage collection. This is the native call stack and the last event log:
0:000> k
ChildEBP RetAddr
0012d754 79f95a8f mscorwks!WKS::gc_heap::find_first_object+0x62
0012d7dc 79f933bb mscorwks!WKS::gc_heap::mark_through_cards_for_segments+0x493
0012d814 79f92cbf mscorwks!WKS::gc_heap::mark_phase+0xc3
0012d838 79f93245 mscorwks!WKS::gc_heap::gc1+0x62
0012d84c 79f92f5a mscorwks!WKS::gc_heap::garbage_collect+0x253
0012d878 79f94e26 mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x1a9
0012d904 79f926ce mscorwks!WKS::gc_heap::try_allocate_more_space+0x15b
0012d918 79f92769 mscorwks!WKS::gc_heap::allocate_more_space+0x11
0012d938 79e73291 mscorwks!WKS::GCHeap::Alloc+0x3b
0:000> .lastevent
Last event: 7e8.88: Access violation - code c0000005 (first/second chance not available)
debugger time: Mon Sep 26 11:34:53.646 2011 (UTC + 2:00)
So let me first ask my question and give more details below. My question is: besides a managed heap corruption is there any other cause for a crash during garbage collection?
Now elaborating a bit, the reason I ask this is because I'm having a really hard time trying to identify the code that is corrupting the managed heap and can't seem to find a pattern for the memory that is (supposedly) overwritten.
I already tried commenting out all the "dangerous" C++/CLI code (especially the parts that use pinned handles), but this didn't help. To look for a pattern in the memory that is overwritten, I looked at the disassembled code at the point of the crash:
0:000> u .-a .+a
mscorwks!WKS::gc_heap::find_first_object+0x54:
79f935b9 89450c mov dword ptr [ebp+0Ch],eax
79f935bc 8bd0 mov edx,eax
79f935be 8b02 mov eax,dword ptr [edx]
79f935c0 83e0fe and eax,0FFFFFFFEh
79f935c3 f70000000080 test dword ptr [eax],80000000h <<<<CRASH
79f935c9 0f84b1000000 je mscorwks!WKS::gc_heap::find_first_object+0x73
0:000> r
eax=00000000 ebx=01c81000 ecx=01c80454 edx=01c82fe0 esi=012f0000 edi=000027e1
eip=79f935c3 esp=0012d738 ebp=0012d754 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
mscorwks!WKS::gc_heap::find_first_object+0x62:
79f935c3 f70000000080 test dword ptr [eax],80000000h ds:0023:00000000=????????
The crash happens when trying to dereference the EAX register which is null. Now from what I see EAX was loaded from contents pointed to by the EDX register so I looked at the address stored there:
0:000> dd @edx-10
01c82fd0 06542778 00000000 00000000 01c82494
01c82fe0 00000000 00000000 00000000 00000000
01c82ff0 01b641d0 00000000 00000000 01c82380
EDIT: I now see that my analysis was wrong; what I lacked was an understanding of x86 addressing modes.
So I can see that starting at address 01c82fe0 (the value stored in EDX) the next 16 bytes are null. But when looking at another similar crash dump I see the following:
0:000> dd @edx-10
018defd4 00000000 00000000 00000000 00000000
018defe4 00000000 00000000 018b468c 01742354
018deff4 00e0907f 00000000 00000000 00000000
So here the 16 bytes before the address pointed to by EDX and the next 8 bytes from there are null. The same happens in the other crash dumps that I have. I don't see a pattern here, i.e. it doesn't seem that some piece of code is simply overwriting this region of memory.
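For readers following the disassembly, the three instructions leading up to the fault can be modeled in plain C++ roughly as below. This is only a sketch: `probe_object` is a made-up name, and the interpretation that the first word of an object is a type pointer whose low bit the GC borrows as a flag is an assumption about CLR internals, not something shown in the dumps. The point is simply that a zeroed first word makes the masked value null, and the real code dereferences it without checking.

```cpp
#include <cstdint>

// Hypothetical model of the faulting sequence (names are illustrative only):
//   mov  eax, dword ptr [edx]
//   and  eax, 0FFFFFFFEh
//   test dword ptr [eax], 80000000h
// 'object' plays the role of EDX. Returns false in the case where the real
// code would raise the access violation.
bool probe_object(const std::uintptr_t* object, bool* high_bit_set)
{
    std::uintptr_t header = *object;               // mov eax,[edx]
    header &= ~static_cast<std::uintptr_t>(1);     // and eax,0FFFFFFFEh (clear low bit)
    if (header == 0)
        return false;                              // the GC has no such check; it faults here
    const std::uint32_t* target =
        reinterpret_cast<const std::uint32_t*>(header);
    *high_bit_set = (*target & 0x80000000u) != 0;  // test dword ptr [eax],80000000h
    return true;
}
```

Called with a word that holds a valid pointer (with or without the low bit set), the function reads the target and reports its top bit; called with a word that is all zeros, as at 01c82fe0 in the first dump, it returns false where the runtime would crash.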
Going back to the question, what I would like to know is whether there is some other explanation for the crash besides one piece of code overwriting memory that it shouldn't. Or any advice at all on how to proceed; I'm really lost on this one.
(Could the pinned handles cause a problem? We have quite a few of them, and what I find funny is that I always see 137 pinned handles, no more and no less, with !gchandles at the point of the crash. It's a strange coincidence to me.)
EDIT: forgot to mention that we're using version 3.5 of the .Net framework. I see reports of similar crashes in .Net 4 when the background GC is active (somewhere there is a mention that this is a bug in .Net) but I don't think that this is relevant here since AFAIK there is no background GC in .Net 3.5.
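Not sure if this helps, but as a general rule don't leave it to the GC alone to release unmanaged memory. Use the Dispose pattern instead: put the cleanup code in the finalizer and have the destructor call it (in C++/CLI, ~MyClass() is what gets compiled to IDisposable.Dispose(), and !MyClass() is the finalizer):
ref class MyClass
{
    UnsafeObject data; // assumed to be a pointer or handle type
public:
    MyClass()
    {
        data = CreateUnsafeDataObject();
    }
    ~MyClass() // Destructor, compiled to IDisposable.Dispose()
    {
        this->!MyClass(); // run the same cleanup deterministically
    }
    !MyClass() // Finalizer, run by the GC if Dispose was never called
    {
        DeleteUnsafeDataObject(data);
    }
};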
This will implement the IDisposable pattern on the object. Call Dispose to clear unmanaged data, and you'll at worst have a better chance of figuring out what exactly happens.
So unfortunately my question was a bit misleading, since I was looking for alternative explanations besides a managed heap corruption, which turned out to be the problem in the end (caused by an unsafe copy of an unmanaged struct to a managed struct). The problem is now solved and I'm posting my findings here in a separate answer; I hope that this is OK.
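The post does not say what exactly was wrong with the copy, so the following is only a guess at one common way such a copy goes bad: the two struct layouts differ in size and a raw byte copy writes past the destination. The sketch below, in plain C++, shows a copy helper that refuses to copy when the sizes disagree. `NativeRecord`, `MirrorRecord` and `checked_copy` are hypothetical names, and the layouts are invented for illustration.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical layouts, not taken from the post.
struct NativeRecord { int id; double value; char tag[8]; };
struct MirrorRecord { int id; double value; };  // deliberately shorter

// Copies src into dst byte by byte, but only when both types have the same
// size. Returns false and leaves dst untouched otherwise.
template <typename Dst, typename Src>
bool checked_copy(Dst& dst, const Src& src)
{
    static_assert(std::is_trivially_copyable<Dst>::value &&
                  std::is_trivially_copyable<Src>::value,
                  "byte-wise copy requires trivially copyable types");
    if (sizeof(Dst) != sizeof(Src))
        return false;                       // a raw memcpy here would overrun dst
    std::memcpy(&dst, &src, sizeof(Src));
    return true;
}
```

With these layouts, copying a `NativeRecord` into a `MirrorRecord` is rejected, while copying between two `NativeRecord` values succeeds. A size check alone does not catch differences in field order or padding, so it is a first line of defense rather than a guarantee.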
You probably have an exception in one of your finalizers. I believe you need to check them one by one, because there is no room for errors on the finalization queue. If you don't have unmanaged code, it's better not to have a finalizer at all and just call Dispose manually.