Hardware error caused crash

[683650.031028] BUG: unable to handle kernel paging request at 000000000001b790-----------------------------地址错误
[683650.031060] IP: [<ffffffff94b13520>] native_queued_spin_lock_slowpath+0x110/0x200
[683650.031089] PGD 8000005f702d4067 PUD 510c27a067 PMD 0
[683650.031109] Oops: 0002 [#1] SMP
[683650.031567] CPU: 40 PID: 474165 Comm: dfget Kdump: loaded Tainted: G ------------ T 3.10.0-957.27.2.el7.x86_64 #1
[683650.031599] Hardware name: Dell Inc. PowerEdge R640/0RJCR7, BIOS 1.6.13 12/17/2018
[683650.031621] task: ffff8d37d86ac100 ti: ffff8d4adbbdc000 task.ti: ffff8d4adbbdc000
[683650.031643] RIP: 0010:[<ffffffff94b13520>] [<ffffffff94b13520>] native_queued_spin_lock_slowpath+0x110/0x200
[683,650.031673] RSP: 0018: ffff8d4adbbdfea0 the EFLAGS: 00,010,006
[683,650.031689] RAX: 000000000000175f RBX: ffff8d37d86ac100 RCX: 0000000001410000
[683,650.031709] RDX: 000000000001b790 RSI: 00000000bafb5140 RDI: ffff8d88afab1048 ---------- This is sighand_struct.siglock
[683,650.031729] RBP: ffff8d4adbbdfea0 R08: ffff8d58bdf1b780 R09: 0000000000000000
[683,650.031749] R10: 0,000,000,000,000,008 R11: 0000000000000206 R12: ffff8d4adbbdfef0
[683,650.031769] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
[683,650.031789] the FS: 00007fb5b67fc700 (0000) GS: ffff8d58bdf00000 (0000 ) knlGS: 0000000000000000
[683,650.031812] CS: DS 0010: 0000 ES: 0000 CR0: 0000000080050033
[683,650.031828] CR2: CR3 000000000001b790: 0000005f757a8000 CR4: 00000000007607e0
[683650.031849] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[683650.031869] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[683650.031889] PKRU: 55555554
[683650.031898] Call Trace:
[683650.031912] [<ffffffff9515e2cb>] queued_spin_lock_slowpath+0xb/0xf
[683650.031935] [<ffffffff9516c6a8>] _raw_spin_lock_irq+0x28/0x30
[683650.031955] [<ffffffff94ab0a97>] __set_current_blocked+0x37/0x70
[683650.031974] [<ffffffff94ab0c67>] sigprocmask+0x77/0xb0
[683650.032768] [<ffffffff94ab0d32>] SyS_rt_sigprocmask+0x92/0x100
[683650.033480] [<ffffffff95176ddb>] system_call_fastpath+0x22/0x27
[683650.034184] Code: 87 47 02 c1 e0 10 45 31 c9 85 c0 74 44 48 89 c2 c1 e8 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 80 b7 01 00 48 03 14 c5 a0 bf 74 95 <4c> 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40
[683650.035684] RIP [<ffffffff94b13520>] native_queued_spin_lock_slowpath+0x110/0x200
[683650.036404] RSP <ffff8d4adbbdfea0>
[683650.037151] CR2: 000000000001b790

 

The kernel code analysis, corresponding sighand_struct.siglock incorrect value, causes an exception.

 

crash> sighand_struct.siglock 0xffff8d88afab0840 -x
siglock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x1415140
}
}
}
}
}

Qspinlock according to the layout of the last eight should be 1, then the number recorded in the cpu if someone waiting for the lock, pending bit should position 1, there will be waiting for the next. However, according to years of experience in the investigation of the crash, this value is obviously unusual.

Since this is a public sighand lock so other threads print lock as follows:

crash> task_struct.sighand ffff8d88983da080
sighand = 0xffff8d88afeb0840-------------------------e就是1110
crash> task_struct.sighand ffff8d88983d8000
sighand = 0xffff8d88afeb0840
crash> task_struct.sighand ffff8d37d86ac100
sighand = 0xffff8d88afab0840------------------------a就是1010

More careful we can see, this lock occurs bit bit flip. This case and other locks, just like in the bus station waiting for a boat, it may occasionally be able to wait until Wuhan.

In addition, this bit flip ipmi and in none mce see an exception.

Guess you like

Origin www.cnblogs.com/10087622blog/p/12149572.html