
[microblaze-uclinux] bad page fault kernel panics



Hi,

I'm using the PetaLogix subversion snapshot with MMU support enabled on
the Spartan-3A DSP 1800 Board with MicroBlaze 7.10.d, and I keep
running into bad page fault panics - sometimes during boot, sometimes
later. In the meantime I found a way to force this panic by running

  # while [ -d / ] ; do ls ; done

which ends in said bad page fault panic sooner or later (usually
within the first minute of running).


Registers, Stack and Call Trace look like this:

Stack:
  c04cce68 c04d43a0 c04cced0 c7df52b4 00000000 00000000 00000000 c0227a7c
  00000000 c0227b4c 00000001 00000007 00010000 00000001 c01ee55c 00000001
  c0226000 00000000 00000008 00000000 00000000 c0019ba4 00000000 00000000
Call Trace:
[<c0019ba4>] update_process_times+0x94/0xa0
[<c00098c0>] account_system_time+0x10/0x16c
[<c0001c20>] timer_interrupt+0x60/0x98
[<c002cf64>] handle_IRQ_event+0x54/0xb4
[<c0001c2c>] timer_interrupt+0x6c/0x98
[<c002d070>] __do_IRQ+0xac/0x140
[<c0068334>] new_inode+0xc/0x9c
[<c002cf64>] handle_IRQ_event+0x54/0xb4
[<c00019fc>] do_IRQ+0x3c/0x94
[<c00880c0>] pid_revalidate+0x24/0x10c
[<c0001bfc>] timer_interrupt+0x3c/0x98
[<c0059428>] do_lookup+0x84/0x1e4
[<c0005ec4>] irq_call+0x0/0x8
[<c005b2f0>] __link_path_walk+0x844/0xee8
[<c005b394>] __link_path_walk+0x8e8/0xee8
[<c0089d08>] proc_pid_lookup+0x30/0x1d4
[<c00f6e4c>] number+0x2c0/0x3f4
[<c00f7a98>] __umodsi3+0x8c/0xbc
[<c005b2f0>] __link_path_walk+0x844/0xee8
[<c00f730c>] vsnprintf+0x38c/0x6dc
[<c0001bfc>] timer_interrupt+0x3c/0x98
[<c005ba94>] link_path_walk+0x100/0x28c
[<c0019b84>] update_process_times+0x74/0xa0
[<c00f7730>] sprintf+0x30/0x44
[<c0001c2c>] timer_interrupt+0x6c/0x98
[<c0088a1c>] proc_self_follow_link+0x28/0x54
[<c006855c>] touch_atime+0xc0/0x150
[<c005ad64>] __link_path_walk+0x2b8/0xee8
[<c005ba94>] link_path_walk+0x100/0x28c
[<c00074d4>] do_page_fault+0x2e0/0x444
[<c005bea0>] do_path_lookup+0xac/0x2a0
[<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
[<c0056cc4>] do_execve+0x3c/0x238
[<c005cdf0>] __path_lookup_intent_open+0x64/0xd8
[<c005cdc4>] __path_lookup_intent_open+0x38/0xd8
[<c005cf00>] path_lookup_open+0x8/0x1c
[<c0056694>] open_exec+0x28/0x108
[<c0056cd8>] do_execve+0x50/0x238
[<c0005240>] sys_execve_wrapper+0x0/0x10
[<c0056cc4>] do_execve+0x3c/0x238
[<c0001f9c>] sys_execve+0x54/0x94
[<c0001f6c>] sys_execve+0x24/0x94
[<c0004f78>] sc+0x10/0x18
[<c0004f78>] sc+0x10/0x18

Oops: kernel access of bad area, sig: 11
 Registers dump: mode=1
 r1=C02279CC, r2=00000000, r3=00000861, r4=0000010F
 r5=00000007, r6=00000800, r7=C02279E4, r8=00000018
 r9=00000001, r10=C0226000, r11=000041AA, r12=C0003AF0
 r13=00000000, r14=C022796C, r15=C0005AE8, r16=49CAF500
 r17=00000001, r18=00000000, r19=00000007, r20=4817FFF4
 r21=00000000, r22=00000000, r23=00000000, r24=00000001
 r25=00000000, r26=FFFFFFFF, r27=00000002, r28=0000000A
 r29=00000003, r30=0000000E, r31=C03287B0, rPC=C0003B08
 msr=000041AA, ear=0000010F, esr=000000B2, fsr=C6AA9E80
Kernel panic - not syncing: Aiee, killing interrupt handler!
 <0>Rebooting in 120 seconds..


The Program Counter is always pointing to the same address, which is
inside the _unaligned_data_exception code as objdump shows:

c0003af0 <_unaligned_data_exception>:
c0003af0:   a50303e0    andi    r8, r3, 992
c0003af4:   65080002    bsrli   r8, r8, 2
c0003af8:   a4c30400    andi    r6, r3, 1024
c0003afc:   be260068    bneid   r6, 104     // c0003b64
c0003b00:   a4c30800    andi    r6, r3, 2048

c0003b04 <ex_lw_vm>:
c0003b04:   be060034    beqid   r6, 52      // c0003b38
c0003b08:   e0a40000    lbui    r5, r4, 0   <--- here it is
c0003b0c:   b000c01e    imm -16354
c0003b10:   30c0ce40    addik   r6, r0, -12736
c0003b14:   f0a60000    sbi r5, r6, 0
c0003b18:   e0a40001    lbui    r5, r4, 1
c0003b1c:   f0a60001    sbi r5, r6, 1
c0003b20:   e0a40002    lbui    r5, r4, 2
c0003b24:   f0a60002    sbi r5, r6, 2
c0003b28:   e0a40003    lbui    r5, r4, 3
c0003b2c:   f0a60003    sbi r5, r6, 3
c0003b30:   b8100020    brid    32      // c0003b50
c0003b34:   e8660000    lwi r3, r6, 0
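
To make the listing easier to follow, here is a minimal Python sketch of
what the first instructions of the handler compute. It assumes that r3
holds the ESR value captured at exception entry, which the masks suggest
but the listing itself does not prove. Reading the masked field as a
register number is my interpretation of the ESR layout, and the function
and field names are made up for illustration.

```python
# Mirrors the first instructions of _unaligned_data_exception.
# Assumption: r3 holds the ESR value captured at exception entry.

RX_MASK = 992      # 0x3E0, from "andi r8, r3, 992"
STORE_BIT = 1024   # 0x400, from "andi r6, r3, 1024"
WORD_BIT = 2048    # 0x800, from "andi r6, r3, 2048"


def decode_unaligned_esr(r3):
    """Return the fields the handler extracts from r3."""
    r8 = (r3 & RX_MASK) >> 2  # andi followed by bsrli r8, r8, 2
    return {
        "r8": r8,                           # field value times 8
        "reg": r8 // 8,                     # register number (my reading)
        "is_store": bool(r3 & STORE_BIT),   # nonzero takes the bneid branch
        "is_word": bool(r3 & WORD_BIT),     # zero takes the beqid branch
    }


fields = decode_unaligned_esr(0x861)  # r3 from the register dump
print(fields)
```

With r3=0x861 this gives r8=0x18, which matches r8 in the register dump:
a word-sized load, not a store. Neither branch is then taken, and the
lbui at c0003b08 executes in either case, because it sits in the delay
slot of the beqid.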


I'm a bit confused, as the ESR value is 0xb2, which indicates a Data TLB
Miss Exception, but execution gets stuck in the handler for the
Unaligned Data Exception. That in turn might be caused by the 0x10f in
r4, which shouldn't be a valid address anyway (other times I got 0xff,
0x10b, 0x102, 0xffffff79 etc. in r4).
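
As a cross-check, the exception cause can be read from the low five
bits of an ESR value. The following is a rough Python sketch. The cause
codes are the ones I remember from the MicroBlaze Processor Reference
Guide and should be verified against the guide for the core version in
use. The table is deliberately incomplete and the names are only for
illustration.

```python
# Decode the exception cause (EC) field of a MicroBlaze ESR value.
# Assumption: EC occupies the five least significant bits.

EC_MASK = 0x1F

# Partial table, from memory of the reference guide (please verify).
EXCEPTION_CAUSES = {
    0x01: "unaligned data access",
    0x10: "data storage",
    0x11: "instruction storage",
    0x12: "data TLB miss",
    0x13: "instruction TLB miss",
}


def exception_cause(esr):
    """Return a readable name for the EC field of esr."""
    code = esr & EC_MASK
    return EXCEPTION_CAUSES.get(code, "other (code 0x%02x)" % code)


for value in (0xB2, 0x861):  # esr and r3 from the register dump
    print("0x%03x -> %s" % (value, exception_cause(value)))
```

Under these assumptions 0xb2 decodes to a data TLB miss, while the
0x861 left in r3 decodes to an unaligned data access. That would fit a
TLB miss taken inside the unaligned handler, but it is only one reading
of the dump.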


After the 120 seconds, when the kernel calls its reboot function, the
Stack and Call Trace look like the following, where you can see the
exception handler functions:

Stack:
  c001ed7c c01ad2c4 00003c96 00001998 00003330 00008000 00000000 c000cc18
  c000cbec 4817fff4 00000000 c03287b0 c0227934 0000000b c0010d6c c01ae8d8
  00000078 c0375ac8 00000000 00005000 00000000 c02277c4 c0226000 0000000b
Call Trace:
[<c001ed7c>] emergency_restart+0xc/0x20
[<c000cc18>] panic+0x154/0x1dc
[<c000cbec>] panic+0x128/0x1dc
[<c0010d6c>] do_exit+0x624/0x93c
[<c0004060>] die+0x90/0x98
[<c000404c>] die+0x7c/0x98
[<c00071e8>] bad_page_fault+0xcc/0xd8
[<c0003b08>] ex_lw_vm+0x4/0x34
[<c0007328>] do_page_fault+0x134/0x444
[<c0003b08>] ex_lw_vm+0x4/0x34
[<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
[<c00081dc>] task_running_tick+0x17c/0x2a8
[<c0003af0>] _unaligned_data_exception+0x0/0x14
[<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
[<c0003b08>] ex_lw_vm+0x4/0x34
[<c0019ba4>] update_process_times+0x94/0xa0
[<c00098c0>] account_system_time+0x10/0x16c
[<c0001c20>] timer_interrupt+0x60/0x98
...


So my guess is that the actual panic is caused by a missing entry in the
__ex_table section for that address (0xc0003b08), but I doubt that this
is the real reason - as mentioned above, the values in r4 don't look
like valid addresses. However, the Call Trace always starts with an
interrupt handling function - currently always some timer interrupt
handler, but I also had a PS/2 keyboard driver, and pressing some keys
forced this panic as well, showing the keyboard ISR functions. To me
it looks like some bad interaction between page fault handling and
interrupts, but that's just another guess.
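
To illustrate the __ex_table guess: on a kernel-mode fault the page
fault code looks up the faulting PC in a sorted table of (instruction
address, fixup address) pairs and only ends up in bad_page_fault when
nothing matches. The kernel does this in C (search_exception_tables());
the Python below is only a model of the idea, and the table entries are
invented for illustration, not taken from my kernel image.

```python
import bisect


def find_fixup(ex_table, fault_pc):
    """Look up fault_pc in a table of (insn_addr, fixup_addr) pairs.

    ex_table must be sorted by insn_addr. Returns the fixup address,
    or None when the faulting instruction has no entry.
    """
    insns = [insn for insn, _ in ex_table]
    i = bisect.bisect_left(insns, fault_pc)
    if i < len(insns) and insns[i] == fault_pc:
        return ex_table[i][1]
    return None


# Hypothetical entries; a real table comes from the kernel image.
ex_table = [
    (0xC0001000, 0xC0001F00),
    (0xC0004000, 0xC0004F00),
]

fixup = find_fixup(ex_table, 0xC0003B08)  # PC from the oops
if fixup is None:
    print("no fixup entry -> bad_page_fault")
else:
    print("resume at fixup 0x%08x" % fixup)
```

An entry for 0xc0003b08 would let the handler recover instead of
panicking, but it would not explain where the bogus address in r4 comes
from in the first place.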

My question is whether this is a known issue, or at least whether
someone else can reproduce the situation, or whether the problem might
be within my hardware configuration. Any hint is welcome. If more
information is needed, let me know.

Thanks in advance,
Sven

___________________________
microblaze-uclinux mailing list
microblaze-uclinux@xxxxxxxxxxxxxx
Project Home Page : http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
Mailing List Archive : http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/