[32] A bug introduced in pciehp after Linux 5.0

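While validating pciehp on Linux 5.2, I found that pressing the attention button triggers the power-on/power-off sequence over and over.
After building pciehp as a standalone kernel module (.ko) and adding some debug prints, I found that the pciehp flow refactored in Linux 5.0 lets the IRQ thread and the interrupt handler run concurrently. 5.2 carries a few patches that ignore the corresponding events once the thread has finished its work, but they do not actually help. The interrupt still interrupts the thread, and the power-on/power-off path may only get around to ignoring the event after it has already been fetched.
I have to complain about the in-kernel pciehp and AER flows. Our in-house pciehp implementation used to be what we gave new hires to learn from (the other flows are too complex; hotplug is the simplest). It has run on several hundred thousand devices for nearly 10 years and has rarely had problems.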
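The trace below prints raw Slot Status and Slot Control register values. As a reading aid, here is a small Python sketch that decodes them. The bit values follow the `PCI_EXP_SLTSTA_*` and `PCI_EXP_SLTCTL_PCC` definitions in `include/uapi/linux/pci_regs.h`; the helper names `decode_sltsta` and `slot_powered_off` are made up for this post and are not kernel functions.

```python
# Slot Status bits, values as in include/uapi/linux/pci_regs.h
SLTSTA_BITS = [
    (0x0001, "ABP"),    # Attention Button Pressed
    (0x0002, "PFD"),    # Power Fault Detected
    (0x0004, "MRLSC"),  # MRL Sensor Changed
    (0x0008, "PDC"),    # Presence Detect Changed
    (0x0010, "CC"),     # Command Completed
    (0x0020, "MRLSS"),  # MRL Sensor State
    (0x0040, "PDS"),    # Presence Detect State (card present)
    (0x0080, "EIS"),    # Electromechanical Interlock Status
    (0x0100, "DLLSC"),  # Data Link Layer State Changed
]

# Slot Control bit 10, Power Controller Control: 1 means power is off
SLTCTL_PCC = 0x0400


def decode_sltsta(value):
    """Return the names of all Slot Status bits set in value (hypothetical helper)."""
    return [name for bit, name in SLTSTA_BITS if value & bit]


def slot_powered_off(sltctl):
    """True if the Slot Control value has bit 10 set (hypothetical helper)."""
    return bool(sltctl & SLTCTL_PCC)


# Values that appear in the trace
for reg in (0x41, 0x50, 0x150, 0x140):
    print(hex(reg), decode_sltsta(reg))
print(slot_powered_off(0x16f1), slot_powered_off(0x11f1))
```

For example, 0x41 decodes to ABP plus PDS (button pressed, card present), and 0x150 to CC, PDS and DLLSC.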

//[ 4056.157393] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 0, slot status reg 0x41
//Current state is OFF_STATE (0). Slot status 0x41: attention button pressed interrupt, card is present
[ 4056.157398] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0001 from Slot Status
//About to pend the button press event (0x1)
[ 4056.157400] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl state 0, hotplug pending_events 0x0001
//Button press event (0x1) is now pending; control passes to the IRQ thread next
[ 4056.157437] pcieport 0000:00:03.1: pciehp: pciehp_ist: ctrl state 0, hotplug events 0x0001
//Fetch the button press event
//[ 4056.157440] pcieport 0000:00:03.1: pciehp: pciehp_ist: ctrl state 0, old event 1, hotplug pending_events 0x0000
//After the button press event is fetched, pending_events is cleared to zero
//[ 4056.157443] pcieport 0000:00:03.1: pciehp: pciehp_ist: Slot(3): ctrl state 0, Attention button pressed
//Start handling the attention button
//[ 4056.157446] pcieport 0000:00:03.1: pciehp: pciehp_handle_button_press: Slot(3) ctrl state 1 Powering on due to button press
//Set ctrl state to BLINKINGON_STATE (1): blink the green indicator, turn off the amber indicator, and queue the 5 s delayed work
[ 4056.591254] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 1, slot status reg 0x50
//Command Completed interrupt interrupts the thread
[ 4056.591259] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0010 from Slot Status
//About to pend the Command Completed event
//[ 4056.925847] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 1, slot status reg 0x50
//Command Completed interrupt interrupts the thread
[ 4056.925852] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0010 from Slot Status
//[ 4061.751430] pcieport 0000:00:03.1: pciehp: pciehp_queue_pushbutton_work: Slot(3) ctrl state 1 Powering on due to button press, pend event 0x8
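//The 5 s cancel window has expired with no second button press, so the press is acted on: PCI_EXP_SLTSTA_PDC (0x8) is about to be pended, which converts the button press into a presence change event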
//[ 4061.751483] pcieport 0000:00:03.1: pciehp: pciehp_request: Slot(3) ctrl state 1 current pending event 0x0, pend event 0x8.
//ctrl state is BLINKINGON_STATE (1); PCI_EXP_SLTSTA_PDC (0x8) is now pending
//[ 4061.751498] pcieport 0000:00:03.1: pciehp: pciehp_ist: ctrl state 1, hotplug events 0x0008
//Fetched the PCI_EXP_SLTSTA_PDC event
[ 4061.751501] pcieport 0000:00:03.1: pciehp: pciehp_ist: ctrl state 1, old event 8, hotplug pending_events 0x0000
//PCI_EXP_SLTSTA_PDC event fetched; pending_events is cleared
//[ 4061.751506] pcieport 0000:00:03.1: pciehp: pciehp_handle_presence_or_link_change: ctrl state 1, event 8, hotplug pending_events 0x0000, present 1 link active 0.
//The card is present and the link is down, so pciehp_enable_slot is called
//[ 4061.751508] pcieport 0000:00:03.1: pciehp: Slot(3): Card present
//Power-on begins
[ 4061.751547] pcieport 0000:00:03.1: pciehp: pciehp_get_power_status: SLOTCTRL 70 value read 16f1
//Bit 10 is 1: the slot is powered off
[ 4061.949323] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 3, slot status reg 0x50
//Command Completed interrupt interrupts the thread
[ 4061.949328] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0010 from Slot Status
//[ 4061.949356] pcieport 0000:00:03.1: pciehp: pciehp_power_on_slot: SLOTCTRL 70 write cmd 0
//board_added->pciehp_power_on_slot: power on the slot
[ 4061.949361] pcieport 0000:00:03.1: pciehp: __pciehp_link_set: lnk_ctrl = 40
//board_added->pciehp_link_enable->__pciehp_link_set: enable the link

[ 4062.083474] pci 0000:05:00.0: [8086:1521] type 00 class 0x020000
[ 4062.083504] pci 0000:05:00.0: reg 0x10: [mem 0x00000000-0x0007ffff]
[ 4062.083517] pci 0000:05:00.0: reg 0x18: [io 0x0000-0x001f]
[ 4062.083526] pci 0000:05:00.0: reg 0x1c: [mem 0x00000000-0x00003fff]
[ 4062.083544] pci 0000:05:00.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
[ 4062.083554] pci 0000:05:00.0: Max Payload Size set to 512 (was 128, max 512)
[ 4062.083648] pci 0000:05:00.0: PME# supported from D0 D3hot D3cold
[ 4062.083812] pci 0000:05:00.1: [8086:1521] type 00 class 0x020000
[ 4062.083836] pci 0000:05:00.1: reg 0x10: [mem 0x00000000-0x0007ffff]
[ 4062.083849] pci 0000:05:00.1: reg 0x18: [io 0x0000-0x001f]
[ 4062.083856] pci 0000:05:00.1: reg 0x1c: [mem 0x00000000-0x00003fff]
[ 4062.083874] pci 0000:05:00.1: reg 0x30: [mem 0x00000000-0x0007ffff pref]
[ 4062.083883] pci 0000:05:00.1: Max Payload Size set to 512 (was 128, max 512)
[ 4062.083957] pci 0000:05:00.1: PME# supported from D0 D3hot D3cold
[ 4062.084070] pcieport 0000:00:03.1: ASPM: current common clock configuration is broken, reconfiguring
[ 4062.095486] pci 0000:05:00.0: BAR 0: assigned [mem 0xef500000-0xef57ffff]
[ 4062.095493] pci 0000:05:00.0: BAR 6: assigned [mem 0xef580000-0xef5fffff pref]
[ 4062.095496] pci 0000:05:00.1: BAR 0: assigned [mem 0xef600000-0xef67ffff]
[ 4062.095501] pci 0000:05:00.1: BAR 6: assigned [mem 0xef680000-0xef6fffff pref]
[ 4062.095503] pci 0000:05:00.0: BAR 3: assigned [mem 0xef700000-0xef703fff]
[ 4062.095508] pci 0000:05:00.1: BAR 3: assigned [mem 0xef704000-0xef707fff]
[ 4062.095513] pci 0000:05:00.0: BAR 2: assigned [io 0x1000-0x101f]
[ 4062.095517] pci 0000:05:00.1: BAR 2: assigned [io 0x1020-0x103f]
[ 4062.095523] pcieport 0000:00:03.1: PCI bridge to [bus 05]
[ 4062.095525] pcieport 0000:00:03.1: bridge window [io 0x1000-0x1fff]
[ 4062.095529] pcieport 0000:00:03.1: bridge window [mem 0xef500000-0xef7fffff]
[ 4062.095532] pcieport 0000:00:03.1: bridge window [mem 0x10040000000-0x100401fffff 64bit pref]
//[ 4062.095632] igb 0000:05:00.0: enabling device (0000 -> 0002)
//Reached the igb probe; igb enables the device
[ 4062.150619] igb 0000:05:00.0: added PHC on eth0
[ 4062.150622] igb 0000:05:00.0: Intel® Gigabit Ethernet Network Connection
[ 4062.150625] igb 0000:05:00.0: eth0: (PCIe:5.0Gb/s:Width x4) e8:61:1f:1e:30:0e
[ 4062.150700] igb 0000:05:00.0: eth0: PBA No: 106300-000
[ 4062.150702] igb 0000:05:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
//[ 4062.150815] igb 0000:05:00.1: enabling device (0000 -> 0002)
[ 4062.152863] igb 0000:05:00.0 ens3f0: renamed from eth0
[ 4062.205994] igb 0000:05:00.1: added PHC on eth0
[ 4062.205998] igb 0000:05:00.1: Intel® Gigabit Ethernet Network Connection
[ 4062.206002] igb 0000:05:00.1: eth0: (PCIe:5.0Gb/s:Width x4) e8:61:1f:1e:30:0f
[ 4062.206076] igb 0000:05:00.1: eth0: PBA No: 106300-000
[ 4062.206079] igb 0000:05:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[ 4062.208189] igb 0000:05:00.1 ens3f1: renamed from eth0
//[ 4062.283971] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 3, slot status reg 0x150
//Data link change / Command Completed interrupt interrupts the power-on thread; the card is present. Current ctrl state is POWERON_STATE (3)

[ 4062.283975] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0110 from Slot Status
//About to pend the data link change and Command Completed events; current ctrl state is POWERON_STATE (3)
[ 4062.283981] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl state 3, hotplug pending_events 0x0100
//Data link change event is now pending; exit the interrupt handler. Current ctrl state is POWERON_STATE (3)
//[ 4062.314015] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 3, slot status reg 0x50
//Command Completed interrupt interrupts the thread
//[ 4062.314019] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0010 from Slot Status
[ 4062.314048] pcieport 0000:00:03.1: pciehp: pciehp_ist: ctrl state 5, hotplug events 0x0100
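//Fetched the data link change event. ctrl state is now ON_STATE (5), which means the power-on path pciehp_enable_slot has set ctrl->state to ON_STATE
//Handling of the button press is complete and should return at this point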
[ 4062.314051] pcieport 0000:00:03.1: pciehp: pciehp_ist: ctrl state 5, old event 100, hotplug pending_events 0x0000
//pending_events is cleared
[ 4062.314053] pcieport 0000:00:03.1: pciehp: Slot(3): Link Down
//pciehp_handle_presence_or_link_change sees that ctrl state is ON_STATE (5), takes the link-down path and calls pciehp_disable_slot. From here on the hotplug flow is caught in a loop.

//[ 4062.314123] pcieport 0000:00:03.1: pciehp: pciehp_get_power_status: SLOTCTRL 70 value read 11f1
//pciehp_disable_slot->pciehp_get_power_status: the slot is currently powered on
//[ 4062.314357] igb 0000:05:00.1: removed PHC on ens3f1
//remove_board->pciehp_unconfigure_device: entered igb's remove

[ 4062.344095] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 4, slot status reg 0x50
[ 4062.344099] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0010 from Slot Status
//[ 4062.500043] igb 0000:05:00.0: removed PHC on ens3f0
//remove_board->pciehp_unconfigure_device: entered igb's remove

[ 4062.714755] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 4, slot status reg 0x50
[ 4062.714759] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0010 from Slot Status
[ 4062.714784] pcieport 0000:00:03.1: pciehp: pciehp_power_off_slot: SLOTCTRL 70 write cmd 400
//remove_board->pciehp_power_off_slot: the slot is powered off

[ 4062.734792] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl 4, slot status reg 0x140
//Data link change interrupt interrupts the power-off path; current ctrl state is POWEROFF_STATE (4)
[ 4062.734797] pcieport 0000:00:03.1: pciehp: pending interrupts 0x0100 from Slot Status
[ 4062.734800] pcieport 0000:00:03.1: pciehp: pciehp_isr: ctrl state 4, hotplug pending_events 0x0100
//[ 4063.735393] pcieport 0000:00:03.1: pciehp: pciehp_handle_presence_or_link_change: ctrl state 0, event 100, hotplug pending_events 0x0000, present 1 link active 0.
//
//[ 4063.735396] pcieport 0000:00:03.1: pciehp: Slot(3): Card present
//The power-on path, pciehp_enable_slot, is about to run again
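
The loop visible in the trace above can be reproduced with a toy model. This is only a sketch of the event ordering seen in the log, not the kernel code: `ToySlot` and its methods are hypothetical names loosely modelled on the function names in the debug prints. It assumes that the link-up DLLSC interrupt is latched while the thread is still in POWERON_STATE and is never ignored, while the DLLSC caused by power-off is ignored (in the trace, pending_events goes from 0x0100 back to 0x0000 during power-off).

```python
from enum import IntEnum

PDC = 0x0008     # PCI_EXP_SLTSTA_PDC, presence detect changed
DLLSC = 0x0100   # PCI_EXP_SLTSTA_DLLSC, data link layer state changed


class State(IntEnum):
    OFF_STATE = 0
    BLINKINGON_STATE = 1
    BLINKINGOFF_STATE = 2
    POWERON_STATE = 3
    POWEROFF_STATE = 4
    ON_STATE = 5


class ToySlot:
    """Hypothetical stand-in for one slot; not the kernel's data structure."""

    def __init__(self):
        self.state = State.BLINKINGON_STATE
        self.pending_events = 0
        self.present = True      # the card stays seated throughout, as in the trace
        self.log = []

    def isr(self, events):
        # The hard interrupt handler only latches events for the thread.
        self.pending_events |= events

    def enable_slot(self):
        self.state = State.POWERON_STATE
        self.log.append("enable")
        # Assumption: link-up is signalled while the thread is still powering on,
        # and nothing ignores it before the state becomes ON_STATE.
        self.isr(DLLSC)
        self.state = State.ON_STATE

    def disable_slot(self):
        self.state = State.POWEROFF_STATE
        self.log.append("disable")
        self.isr(DLLSC)                          # link drops during power-off
        self.pending_events &= ~(DLLSC | PDC)    # this one is ignored
        self.state = State.OFF_STATE

    def handle_presence_or_link_change(self):
        if self.state == State.ON_STATE:
            self.disable_slot()                  # stale DLLSC read as "Link Down"
        if self.present and self.state in (State.OFF_STATE, State.BLINKINGON_STATE):
            self.enable_slot()                   # "Card present": power on again

    def ist(self, max_passes):
        """Run the thread until nothing is pending or max_passes is reached."""
        passes = 0
        while self.pending_events and passes < max_passes:
            events, self.pending_events = self.pending_events, 0
            if events & (PDC | DLLSC):
                self.handle_presence_or_link_change()
            passes += 1
        return passes


slot = ToySlot()
slot.isr(PDC)                    # the 5 s button work turns the press into PDC
passes = slot.ist(max_passes=4)
print(passes, slot.log)
```

Under these assumptions the thread never runs out of work: every pass ends with a fresh DLLSC pending, so the log alternates enable, disable, enable until the pass limit stops it.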


Reposted from blog.csdn.net/linjiasen/article/details/99288464