How to debug Page Faults in the Linux Kernel
I am currently facing some ugly kernel oopses followed by a reboot. I am running an MPC5200-based custom design, and I get oops messages like this:
VM: Either in interrupt or mm = NULL. mm=0xc0196520 in interrupt: 1
VM: Access of bad area @0x6e615c75
Oops: kernel access of bad area, sig: 11
NIP: C00302E4 XER: 20000000 LR: C00F15D4 SP: C6207B30 REGS: c6207a80 TRAP: 0300 Not tainted
MSR: 00009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
DAR: 6E615C75, DSISR: 20000000
TASK = c6206000[4778] 'SimpleServer2' Last syscall: 102
last math c6206000 last altivec 00000000
GPR00: 7C74696B C6207B30 C6206000 6E615C5D 00000000 00000000 C01BFE68 00000001
GPR08: F0000500 C7CD1600 FFFFFFE3 C7CD1600 00000001 10152540 10100000 10100000
GPR16: C01B0000 00000000 C6207E48 000016D0 00001032 06207BF0 00000000 C0005CC0
GPR24: C0006DCC C6207EA0 C01B0000 C0190000 C0190000 C01D0000 C56A2220 00000001
Call backtrace:
C0018034 C00F1608 C00F6738 C0017D08 C0006EFC C0005CC0 C6207EA0
C011040C C012FEC4 C00EDC7C C00EF078 C00EF518 C0005A7C 10089C18
1001DFAC 10015660 10000608 10003E68 1000804C 10085A0C 100BC020
Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
<0>Rebooting in 1 seconds..
These oops traces occur under high network load. My main problem is that the do_page_fault function is called by the MMU exception mechanism, so the stack context shown in gdb is not reliable. After debugging and adding printouts, I figured out that the CPU seems to be in interrupt context, and therefore this error is not recoverable.
As far as I understand the oops trace, the address that causes the oops is stored in the DAR register: DAR: 6E615C75.
How can I get information from this address? I've tried to disassemble the address in gdb, but it is not mapped to any function.
In case anyone is wondering about the oops format: it is generated by an outdated 2.4.25 kernel, but I think the mechanism should be the same as in kernel 2.6.
By definition, if you page faulted on this address in interrupt context, there is nothing useful in it (i.e. there's no point in trying to figure out the data pointed to by a bad pointer). You need to disassemble the code leading up to NIP (C00302E4) and see where it got that address and what it was trying to do.
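As a rough sketch of that first step, the snippet below pulls NIP and DAR out of the register lines quoted in the question and works out a small address window ending at NIP. `KERNEL_BASE = 0xC0000000` is an assumption (a common default for 32-bit PowerPC Linux, check your own configuration), and `parse_regs` and the 8-instruction window are just illustrative choices.

```python
import re

# Register lines copied from the oops in the question.
OOPS = """\
NIP: C00302E4 XER: 20000000 LR: C00F15D4 SP: C6207B30 REGS: c6207a80 TRAP: 0300 Not tainted
DAR: 6E615C75, DSISR: 20000000
"""

# Assumption: kernel virtual addresses start here (common 32-bit PowerPC default).
KERNEL_BASE = 0xC0000000


def parse_regs(text):
    """Return {name: value} for every 'NAME: <8 hex digits>' pair in the text."""
    pairs = re.findall(r"\b([A-Z]+): ([0-9A-Fa-f]{8})\b", text)
    return {name: int(value, 16) for name, value in pairs}


regs = parse_regs(OOPS)
nip, dar = regs["NIP"], regs["DAR"]

# NIP is a code address inside the kernel; DAR is the data address that faulted.
print(f"NIP {nip:#010x} in kernel range: {nip >= KERNEL_BASE}")
print(f"DAR {dar:#010x} in kernel range: {dar >= KERNEL_BASE}")

# PowerPC instructions are 4 bytes each, so this covers the 8 instructions
# before NIP plus the faulting instruction itself.
start, stop = nip - 8 * 4, nip + 4
print(f"disassemble {start:#010x} .. {stop:#010x}")
```

With these values DAR falls below the assumed kernel base, which fits with gdb not mapping it to any function: it is a data address, not code. The printed range is what you would hand to your disassembler (for example objdump's `--start-address` / `--stop-address` options) against the matching vmlinux.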
Note that the value in DAR looks suspiciously like a fragment of an ASCII string. In fact, it looks like an offset of 24 from the value in GPR03, 0x6E615C5D == "na\]". I suspect you have a string overflowing a struct pointer, and the faulting instruction is dereferencing a member of that structure that is at offset 24.
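The arithmetic behind that guess is easy to check. This is only a sketch using the register values from the oops above; `as_ascii` is a made-up helper, and big-endian byte order is assumed, matching the PowerPC core.

```python
def as_ascii(value):
    """Render a 32-bit value as 4 big-endian bytes, '.' for non-printables."""
    raw = value.to_bytes(4, "big")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# Values taken from the oops in the question.
DAR = 0x6E615C75
GPR3 = 0x6E615C5D
GPR0 = 0x7C74696B

offset = DAR - GPR3
print(offset)          # 24
print(as_ascii(GPR3))  # na\]
print(as_ascii(GPR0))  # |tik
```

GPR00 also decodes to printable characters, which is at least consistent with string data having ended up where pointers were expected, though on its own it proves nothing.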