2025-03-19  Merge branch 'net-bring-back-dev_addr_sem'  [Paolo Abeni]
Stanislav Fomichev says: ==================== net: bring back dev_addr_sem Kohei reports an issue with dev_addr_sem conversion to netdev instance lock in [0]. Based on the discussion, switching to netdev instance lock to protect the address might not work for the devices that are not using netdev ops lock. Bring dev_addr_sem instance lock back but fix the ordering. 0: https://lore.kernel.org/netdev/20250308203835.60633-2-enjuk@amazon.com ==================== Link: https://patch.msgid.link/20250312190513.1252045-1-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  net: reorder dev_addr_sem lock  [Stanislav Fomichev]
Lockdep complains about a circular locking dependency across chains 1 -> 2 -> 3 (see below). Change the lock ordering to be:

  - rtnl_lock
  - dev_addr_sem
  - netdev_ops (only for lower devices!)
  - team_lock (or other per-upper device lock)

1. rtnl_lock -> netdev_ops -> dev_addr_sem

   rtnl_setlink
     rtnl_lock
     do_setlink IFLA_ADDRESS on lower
       netdev_ops
       dev_addr_sem

2. rtnl_lock -> team_lock -> netdev_ops

   rtnl_newlink
     rtnl_lock
     do_setlink IFLA_MASTER on lower
       do_set_master
         team_add_slave
           team_lock
           team_port_add
             dev_set_mtu
               netdev_ops

3. rtnl_lock -> dev_addr_sem -> team_lock

   rtnl_newlink
     rtnl_lock
     do_setlink IFLA_ADDRESS on upper
       dev_addr_sem
       netif_set_mac_address
         team_set_mac_address
           team_lock

4. rtnl_lock -> netdev_ops -> dev_addr_sem

   rtnl_lock
     dev_ifsioc
       dev_set_mac_address_user

   __tun_chr_ioctl
     rtnl_lock
     dev_set_mac_address_user

   tap_ioctl
     rtnl_lock
     dev_set_mac_address_user

   dev_set_mac_address_user
     netdev_lock_ops
     netif_set_mac_address_user
       dev_addr_sem

v2:
- move lock reorder to happen after kmalloc (Kuniyuki)

Cc: Kohei Enju <enjuk@amazon.com> Fixes: df43d8bf1031 ("net: replace dev_addr_sem with netdev instance lock") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250312190513.1252045-3-sdf@fomichev.me Tested-by: Lei Yang <leiyang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
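For illustration, a minimal sketch of the resulting order in the user MAC address path, assuming dev_addr_sem is a rw_semaphore and reusing the helper names from chain 4 above (the exact signature and the helper internals are simplified assumptions, not the actual patch):

	int dev_set_mac_address_user(struct net_device *dev, struct sockaddr *sa,
				     struct netlink_ext_ack *extack)
	{
		int ret;

		/* dev_addr_sem is now taken before the per-device ops lock */
		down_write(&dev_addr_sem);
		netdev_lock_ops(dev);
		ret = netif_set_mac_address(dev, sa, extack);
		netdev_unlock_ops(dev);
		up_write(&dev_addr_sem);

		return ret;
	}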
2025-03-19Revert "net: replace dev_addr_sem with netdev instance lock"Stanislav Fomichev
This reverts commit df43d8bf10316a7c3b1e47e3cc0057a54df4a5b8. Cc: Kohei Enju <enjuk@amazon.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Fixes: df43d8bf1031 ("net: replace dev_addr_sem with netdev instance lock") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250312190513.1252045-2-sdf@fomichev.me Tested-by: Lei Yang <leiyang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  tc-tests: Update tc police action tests for tc buffer size rounding fixes.  [Jonathan Lennox]
Before tc's recent change to fix rounding errors, several tests which specified a burst size of "1m" would translate back to 1048574 bytes (2 bytes less than 1MB), which sprint_size prints as "1024Kb". With the tc fix, the burst size is instead correctly reported as 1048576 bytes (exactly 1MB), which sprint_size prints as "1Mb". This updates the expected output in the tests' matchPattern values to accept either the old or the new output. Signed-off-by: Jonathan Lennox <jonathan.lennox@8x8.com> Link: https://patch.msgid.link/20250312174804.313107-1-jonathan.lennox@8x8.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  block: correct locking order for protecting blk-wbt parameters  [Nilay Shroff]
The commit '245618f8e45f ("block: protect wbt_lat_usec using q-> elevator_lock")' introduced q->elevator_lock to protect updates to blk-wbt parameters when writing to the sysfs attribute wbt_ lat_usec and the cgroup attribute io.cost.qos. However, both these attributes also acquire q->rq_qos_mutex, leading to the following lockdep warning: ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc5+ #138 Not tainted ------------------------------------------------------ bash/5902 is trying to acquire lock: c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238 but task is already holding lock: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&q->elevator_lock){+.+.}-{4:4}: __mutex_lock+0xf0/0xa58 ioc_qos_write+0x16c/0x85c cgroup_file_write+0xc4/0x32c kernfs_fop_write_iter+0x1b8/0x29c vfs_write+0x410/0x584 ksys_write+0x84/0x140 system_call_exception+0x134/0x360 system_call_vectored_common+0x15c/0x2ec -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}: __lock_acquire+0x1b6c/0x2ae0 lock_acquire+0x140/0x430 __mutex_lock+0xf0/0xa58 wbt_init+0x164/0x238 queue_wb_lat_store+0x1dc/0x20c queue_attr_store+0x12c/0x164 sysfs_kf_write+0x6c/0xb0 kernfs_fop_write_iter+0x1b8/0x29c vfs_write+0x410/0x584 ksys_write+0x84/0x140 system_call_exception+0x134/0x360 system_call_vectored_common+0x15c/0x2ec other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->elevator_lock); lock(&q->rq_qos_mutex); lock(&q->elevator_lock); lock(&q->rq_qos_mutex); *** DEADLOCK *** 6 locks held by bash/5902: #0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140 #1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c #2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c #3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 #4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 #5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c stack backtrace: CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138 Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries Call Trace: [c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable) [c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604 [c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c [c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0 [c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430 [c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58 [c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238 [c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c [c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164 [c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0 [c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c [c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584 [c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140 [c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360 [c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec >From the above log it's 
apparent that method which writes to sysfs attr wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_ qos_mutex. However the another method which writes to io.cost.qos, acquires q->rq_qos_mutex first, and then acquires q->rq_qos_mutex. So this could potentially cause the deadlock. A closer look at ioc_qos_write shows that correcting the lock order is non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev and released in blkg_conf_exit. The function blkg_conf_open_bdev is responsible for parsing user input and finding the corresponding block device (bdev) from the user provided major:minor number. Since we do not know the bdev until blkg_conf_open_bdev completes, we cannot simply move q->elevator_lock acquisition before blkg_conf_open_ bdev. So to address this, we intoduce new helpers blkg_conf_open_bdev_ frozen and blkg_conf_exit_frozen which are just wrappers around blkg_ conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_ open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally freezes the queue, acquires q->elevator_lock and ensures the correct locking order is followed between q->elevator_lock and q->rq_qos_mutex. Similarly another helper blkg_conf_exit_frozen in addition to unfreezing the queue ensures that we release the locks in correct order. By using these helpers, now we maintain the same locking order in all code paths where we update blk-wbt parameters. Fixes: 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250319105518.468941-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
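To make the intended usage concrete, a rough caller-side sketch (error handling omitted; the exact signatures and the returned freeze state are assumptions, not the actual implementation):

	struct blkg_conf_ctx ctx;
	unsigned long memflags;

	blkg_conf_init(&ctx, input);

	/* freezes the queue and takes q->elevator_lock before
	 * q->rq_qos_mutex, matching the wbt_lat_usec path */
	memflags = blkg_conf_open_bdev_frozen(&ctx);

	/* ... update blk-wbt / qos parameters ... */

	/* drops the locks in the reverse order and unfreezes the queue */
	blkg_conf_exit_frozen(&ctx, memflags);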
2025-03-19  block: release q->elevator_lock in ioc_qos_write  [Nilay Shroff]
The ioc_qos_write method acquires q->elevator_lock to protect updates to blk-wbt parameters. Once these updates are complete, the lock should be released before returning from ioc_qos_write. However, in one code path, the release of q->elevator_lock was mistakenly omitted, potentially leading to a lock leak. This commit fixes the issue by ensuring that q->elevator_lock is properly released in all return paths of ioc_qos_write. Fixes: 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250319105518.468941-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
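Illustratively, the fix boils down to the usual early-return pattern (a simplified sketch with a hypothetical helper, not the actual code):

	mutex_lock(&q->elevator_lock);
	ret = update_wbt_params(q);	/* hypothetical update step */
	if (ret) {
		/* this early-return path previously leaked the lock */
		mutex_unlock(&q->elevator_lock);
		return ret;
	}
	mutex_unlock(&q->elevator_lock);
	return 0;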
2025-03-19  net: stmmac: dwmac-rk: Provide FIFO sizes for DWMAC 1000  [Chen-Yu Tsai]
The DWMAC 1000 DMA capabilities register does not provide actual FIFO sizes, nor does the driver really care. If they are not provided via some other means, the driver will work fine, only disallowing changing the MTU setting. Provide the FIFO sizes through the driver's platform data to enable MTU changes. The FIFO sizes are confirmed to be the same across RK3288, RK3328, RK3399 and PX30, based on their respective manuals. It is likely that Rockchip synthesized their DWMAC 1000 with the same parameters on all their chips that have it. Signed-off-by: Chen-Yu Tsai <wens@csie.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250312163426.2178314-1-wens@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  cgroup/rstat: avoid disabling irqs for O(num_cpu)  [Eric Dumazet]
cgroup_rstat_flush_locked() grabs the irq-safe cgroup_rstat_lock while iterating all possible cpus. It only drops the lock if there is scheduler or spin lock contention. If neither, then interrupts can be disabled for a long time. On large machines this can disable interrupts for a long enough time to drop network packets. On 400+ CPU machines I've seen interrupts disabled for over 40 msec.

Prevent rstat from disabling interrupts while processing all possible cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This approach was previously discussed in https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, though this was in the context of a non-irq rstat spin lock.

Benchmark this change with:
1) a single stat_reader process with 400 threads, each reading a test memcg's memory.stat repeatedly for 10 seconds.
2) 400 memory hog processes running in the test memcg and repeatedly charging memory until oom killed. Then they repeat charging and oom killing.

v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds interrupts are disabled by rstat for 45341 usec:

 # => started at: _raw_spin_lock_irq
 # => ended at:   cgroup_rstat_flush
 #
 #
 #                    _------=> CPU#
 #                   / _-----=> irqs-off/BH-disabled
 #                  | / _----=> need-resched
 #                  || / _---=> hardirq/softirq
 #                  ||| / _--=> preempt-depth
 #                  |||| / _-=> migrate-disable
 #                  ||||| /     delay
 #  cmd     pid     |||||| time  |   caller
 #     \   /        ||||||  \    |    /
 stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
 stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
 stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
 stat_rea-96532    52d.... 45343us : <stack trace>
  => memcg1_stat_format
  => memory_stat_format
  => memory_stat_show
  => seq_read_iter
  => vfs_read
  => ksys_read
  => do_syscall_64
  => entry_SYSCALL_64_after_hwframe

With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the longest holder. The longest irqs-off holder has irqs disabled for 4142 usec, a huge reduction from the previous 45341 usec rstat finding.

Running the stat_reader memory.stat reader for 10 seconds:
- without memory hogs: 9.84M accesses => 12.7M accesses
- with memory hogs: 9.46M accesses => 11.1M accesses
The throughput of memory.stat access improves.

The mode of memory.stat access latency, after grouping into power-of-2 buckets:
- without memory hogs: 64 usec => 16 usec
- with memory hogs: 64 usec => 8 usec
The memory.stat latency improves.

Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Greg Thelen <gthelen@google.com> Tested-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
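Conceptually, the flushing loop changes roughly as follows (a simplified sketch; the per-cpu flush helper name is illustrative, not the exact implementation):

	/* before: one irq-disabled section spanning all CPUs */
	spin_lock_irq(&cgroup_rstat_lock);
	for_each_possible_cpu(cpu)
		cgroup_rstat_flush_cpu(cgrp, cpu);	/* hypothetical helper */
	spin_unlock_irq(&cgroup_rstat_lock);

	/* after: interrupts are re-enabled between CPUs */
	for_each_possible_cpu(cpu) {
		spin_lock_irq(&cgroup_rstat_lock);
		cgroup_rstat_flush_cpu(cgrp, cpu);
		spin_unlock_irq(&cgroup_rstat_lock);
	}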
2025-03-19  selftests: drv-net: use defer in the ping test  [Jakub Kicinski]
Make sure the test cleans up after itself. The XDP off statements at the end of the test may not be reached. Fixes: 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250312131040.660386-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  Merge branch 'net-mlx5-hw-steering-cleanups'  [Paolo Abeni]
Tariq Toukan says: ==================== net/mlx5: HW Steering cleanups This short series by Yevgeny contains several small HW Steering cleanups: - Patch 1: removing unused FW commands - Patch 2: using list_move() instead of list_del/add - Patch 3: printing the unsupported combination of match fields ==================== Link: https://patch.msgid.link/1741780194-137519-1-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  net/mlx5: HWS, log the unsupported mask in definer  [Yevgeny Kliteynik]
If a user requested to match on an unsupported combination of fields, print the unsupported combination in the error message. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741780194-137519-4-git-send-email-tariqt@nvidia.com Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  net/mlx5: HWS, use list_move() instead of del/add  [Yevgeny Kliteynik]
Wherever applicable, use list_move function instead of list_del + list_add. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741780194-137519-3-git-send-email-tariqt@nvidia.com Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
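For reference, the pattern being replaced (a generic example, not a specific call site in the driver):

	/* before */
	list_del(&entry->node);
	list_add(&entry->node, &new_list);

	/* after: same effect with a single call */
	list_move(&entry->node, &new_list);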
2025-03-19  net/mlx5: HWS, remove unused code for alias flow tables  [Yevgeny Kliteynik]
Alias flow tables are not in use by HWS - remove the unused code. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741780194-137519-2-git-send-email-tariqt@nvidia.com Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  Merge branch 'net-stmmac-deprecate-snps-en-tx-lpi-clockgating-property'  [Paolo Abeni]
Russell King says: ==================== net: stmmac: deprecate "snps,en-tx-lpi-clockgating" property This series deprecates the "snps,en-tx-lpi-clockgating" property for stmmac. MII Transmit clock gating, where the MAC hardware supports gating this clock, is a function of the connected PHY capabilities, which it reports through its status register. GMAC versions that support transmit clock gating twiddle the LPITCSE bit accordingly in the LPI control/status register, which is handled by the GMAC core specific code. So, "snps,en-tx-lpi-clockgating" is not something that is a GMAC property, but is a work-around for phylib not providing an interface to determine whether the PHY allows the transmit clock to be disabled. This series converts the two SoCs that make use of this property (which, I hasten to add, is set in the SoC code) to use the PHY capability bit instead of a DT property, then removes the DT property from the .dtsi, deprecates it in the snps,dwmac binding, and finally in the stmmac code. I am expecting some discussion on how to merge this, as I think the order in which these changes are made is important - we don't want to deprecate the old way until the new code has landed. ==================== Link: https://patch.msgid.link/Z9FVHEf3uUqtKzyt@shell.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  net: stmmac: deprecate "snps,en-tx-lpi-clockgating" property  [Russell King (Oracle)]
Whether the MII transmit clock can be stopped is primarily a property of the PHY (there is a capability bit that should be checked first.) Whether the MAC is capable of stopping the transmit clock is a separate issue, but this is already handled by the core DesignWare MAC code. Therefore, snps,en-tx-lpi-clockgating is technically incorrect, and this commit adds a warning should a DT be encountered with the property present. However, we keep backwards compatibility. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Link: https://patch.msgid.link/E1tsIUK-005vGk-H7@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
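A sketch of what such a deprecation warning might look like in the DT parsing code (illustrative only; the exact message and location may differ):

	/* still honoured for existing DTs, but flagged as deprecated */
	if (of_property_read_bool(np, "snps,en-tx-lpi-clockgating"))
		dev_warn(&pdev->dev,
			 "snps,en-tx-lpi-clockgating is deprecated, use the PHY clock-stop capability instead\n");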
2025-03-19  dt-bindings: deprecate "snps,en-tx-lpi-clockgating" property  [Russell King (Oracle)]
Whether the MII transmit clock can be stopped is primarily a property of the PHY (there is a capability bit that should be checked first.) Whether the MAC is capable of stopping the transmit clock is a separate issue, but this is already handled by the core DesignWare MAC code. Therefore, snps,en-tx-lpi-clockgating is technically incorrect, so this commit deprecates the property in the binding. Acked-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1tsIUF-005vGd-C5@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  ARM: dts: stm32: remove "snps,en-tx-lpi-clockgating" property  [Russell King (Oracle)]
Whether the MII transmit clock can be stopped is primarily a property of the PHY (there is a capability bit that should be checked first.) Whether the MAC is capable of stopping the transmit clock is a separate issue, but this is already handled by the core DesignWare MAC code. As commit "net: stmmac: stm32: use PHY capability for TX clock stop" adds the flag to use the PHY capability, remove the DT property that is now unnecessary. Cc: Samin Guo <samin.guo@starfivetech.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1tsIUA-005vGX-8A@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  riscv: dts: starfive: remove "snps,en-tx-lpi-clockgating" property  [Russell King (Oracle)]
Whether the MII transmit clock can be stopped is primarily a property of the PHY (there is a capability bit that should be checked first.) Whether the MAC is capable of stopping the transmit clock is a separate issue, but this is already handled by the core DesignWare MAC code. As commit "net: stmmac: starfive: use PHY capability for TX clock stop" adds the flag to use the PHY capability, remove the DT property that is now unnecessary. Cc: Samin Guo <samin.guo@starfivetech.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1tsIU5-005vGR-4c@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  net: stmmac: stm32: use PHY capability for TX clock stop  [Russell King (Oracle)]
Whether the MII transmit clock can be stopped is primarily a property of the PHY (there is a capability bit that should be checked first.) Whether the MAC is capable of stopping the transmit clock is a separate issue, but this is already handled by the core DesignWare MAC code. Add the flag to allow the stmmac core to use the PHY capability. Cc: Christophe Roullier <christophe.roullier@st.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1tsIU0-005vGL-17@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  net: stmmac: starfive: use PHY capability for TX clock stop  [Russell King (Oracle)]
Whether the MII transmit clock can be stopped is primarily a property of the PHY (there is a capability bit that should be checked first.) Whether the MAC is capable of stopping the transmit clock is a separate issue, but this is already handled by the core DesignWare MAC code. Add the flag to allow the stmmac core to use the PHY capability. Cc: Samin Guo <samin.guo@starfivetech.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1tsITu-005vGF-TM@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  net: stmmac: allow platforms to use PHY tx clock stop capability  [Russell King (Oracle)]
Allow platform glue to instruct stmmac to make use of the PHY transmit clock stop capability when deciding whether to allow the transmit clock from the DWMAC core to be stopped. Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1tsITp-005vG9-Px@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
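For illustration, platform glue would opt in along these lines (the flag name here is an assumption used for illustration, not necessarily the identifier introduced by this patch):

	/* let the stmmac core honour the PHY's advertised clock-stop capability */
	plat_dat->flags |= STMMAC_FLAG_EN_TX_LPI_CLK_PHY_CAP;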
2025-03-19  Merge tag 'ieee802154-for-net-next-2025-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan-next  [Paolo Abeni]
Stefan Schmidt says: ==================== pull-request: ieee802154-next 2025-03-10 An update from ieee802154 for your *net-next* tree: Andy Shevchenko reworked the ca8210 driver to use the gpiod API and fixed a few problems of the driver along the way.

* tag 'ieee802154-for-net-next-2025-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan-next:
  dt-bindings: ieee802154: ca8210: Update polarity of the reset pin
  ieee802154: ca8210: Switch to using gpiod API
  ieee802154: ca8210: Get platform data via dev_get_platdata()
  ieee802154: ca8210: Use proper setters and getters for bitwise types

==================== Link: https://patch.msgid.link/20250310185752.2683890-1-stefan@datenfreihafen.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  ata: ahci: simplify init function  [Tomas Henzl]
This patch moves all the IRQ vector allocations into a single function. Instead of having the allocations spread out over two separate call sites, everything is now handled in ahci_init_irq. A direct call into pci(m)_intx is also removed. The main part of this change is adding the PCI_IRQ_INTX flag to an already existing pci_alloc_irq_vectors invocation. Internally, pci_alloc_irq_vectors tries MSI-X -> MSI -> legacy IRQ, and whichever step succeeds first stops the chain. That makes it impossible to merge all instances into a single pci_alloc_irq_vectors call, since the driver's current order of attempts is: multiple MSI-X, a single MSI, a single MSI-X, a legacy IRQ. The last two steps, however, can be merged into a single one covering both the MSI-X and legacy IRQ options. When the PCI_IRQ_INTX flag is set, pci_alloc_irq_vectors succeeds in almost all cases, which makes it possible to convert ahci_init_irq(msi) into a void function. The exception is when dev->irq is zero: pci_alloc_irq_vectors may then return an error code, and since pci_intx isn't called from pci_alloc_irq_vectors, certain PCI calls aren't performed. That is a negligible issue, as later in ahci_init_one the (zero) value of dev->irq is assigned to hpriv->irq via pci_irq_vector. That value is then tested in ahci_host_activate->ata_host_activate, where it is greeted with a WARN_ON, IRQ setup fails, and the probe function (ahci_init_one) fails. (A zero value would normally mean that polling mode is being set up, which isn't the case here.) No functional change. Signed-off-by: Tomas Henzl <thenzl@redhat.com> Link: https://lore.kernel.org/r/20250319155030.16410-1-thenzl@redhat.com Signed-off-by: Niklas Cassel <cassel@kernel.org>
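A condensed sketch of the merged allocation step (illustrative; vector counts and the surrounding logic are simplified):

	/* one call now covers both the single MSI-X and the legacy INTx options;
	 * with PCI_IRQ_INTX this practically always succeeds */
	nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX | PCI_IRQ_INTX);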
2025-03-19  io_uring/cmd: don't expose entire cmd async data  [Pavel Begunkov]
io_uring needs private bits in cmd's ->async_data, and they should never be exposed to drivers as it'd certainly be abused. Leave struct io_uring_cmd_data for the drivers but wrap it into a structure. It's a prep patch and doesn't do anything useful yet. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20250319061251.21452-3-sidong.yang@furiosa.ai Signed-off-by: Jens Axboe <axboe@kernel.dk>
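A rough sketch of the wrapping being described (names other than io_uring_cmd_data are illustrative):

	/* io_uring-private async data; only the embedded part stays visible
	 * to drivers */
	struct io_async_cmd {
		struct io_uring_cmd_data	data;
		/* private io_uring members are added here by later patches */
	};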
2025-03-19  io_uring: rename the data cmd cache  [Pavel Begunkov]
Pick a more descriptive name for the cmd async data cache. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20250319061251.21452-2-sidong.yang@furiosa.ai Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19  Merge branch 'resilient-queued-spin-lock'  [Alexei Starovoitov]
Kumar Kartikeya Dwivedi says: ==================== Resilient Queued Spin Lock Changelog: ---------- v3 -> v4 v4: https://lore.kernel.org/bpf/20250303152305.3195648-1-memxor@gmail.com * Fix bisectability problem by reordering locktorture commit before Makefile commit. * Add EXPORT_SYMBOL_GPL to all used symbols and variables by consumers. * Skip BPF selftest when nrprocs < 2. * Fix kdoc to describe return value for res_spin_lock, slowpath. * Move kernel/locking/rqspinlock.{c,h} to kernel/bpf/rqspinlock.{c,h}. v2 -> v3 v2: https://lore.kernel.org/bpf/20250206105435.2159977-1-memxor@gmail.com * Add ifdef's to fallback to Ankur's patch when it gets in, until then copy-paste the implementation. * Change the meaning of RES_DEF_TIMEOUT from two critical section lengths to one for clarity, and use RES_DEF_TIMEOUT * 2 where needed. * Use NSEC_PER_SEC as timeout for TAS fallback. * Add Closes: tags for known syzbot reports. * Change timeout for TAS fallback to 1 second. * Fix more kernel test robot errors. * More comments about smp_wmb in release_held_lock_entry interaction. * Change RES_NR_HELD to 31. * Address comments from Peter, Eduard, Alexei. v1 -> v2 v1: https://lore.kernel.org/bpf/20250107140004.2732830-1-memxor@gmail.com * Address nits from Waiman and Peter * Fix arm64 WFE bug pointed out by Peter. * Fix incorrect memory ordering in release_held_lock_entry, and document subtleties. Explain why release is sufficient in unlock but not in release_held_lock_entry. * Remove dependence on CONFIG_QUEUED_SPINLOCKS and introduce a test-and-set fallback when queued spinlock support is missing on an architecture. * Enforce FIFO ordering for BPF program spin unlocks. * Address comments from Eduard on verifier plumbing. * Add comments as suggested by Waiman. * Refactor paravirt TAS lock to use the implemented TAS fallback. * Use rqspinlock_t as the type throughout so that it can be replaced with a non-qspinlock type in case of fallback. * Testing and benchmarking on arm64, added numbers to the cover letter. * Fix kernel test robot errors. * Fix a BPF selftest bug leading to spurious failures on arm64. Introduction ------------ This patch set introduces Resilient Queued Spin Lock (or rqspinlock with res_spin_lock() and res_spin_unlock() APIs). This is a qspinlock variant which recovers the kernel from a stalled state when the lock acquisition path cannot make forward progress. This can occur when a lock acquisition attempt enters a deadlock situation (e.g. AA, or ABBA), or more generally, when the owner of the lock (which we’re trying to acquire) isn’t making forward progress. The cover letter provides an overview of the motivation, design, and alternative approaches. We then provide evaluation numbers showcasing that while rqspinlock incurs overhead, the performance of rqspinlock approaches that of the normal qspinlock used by the kernel. The evaluations for rqspinlock were performed by replacing the default qspinlock implementation with it and booting the kernel to run the experiments. Support for locktorture is also included with numbers in this series. The cover letter's design section provides an overview of the algorithmic approach. A technical document describing the implementation in more detail is available here: https://github.com/kkdwivedi/rqspinlock/blob/main/rqspinlock.pdf We have a WIP TLA+ proof for liveness and mutual exclusion of rqspinlock built on top of the qspinlock TLA+ proof from Catalin Marinas [3]. We will share more details and the links in the near future. 
Motivation ---------- In regular kernel code, usage of locks is assumed to be correct, so as to avoid deadlocks and stalls by construction, however, the same is not true for BPF programs. Users write normal C code and the in-kernel eBPF runtime ensures the safety of the kernel by rejecting unsafe programs. Users can upload programs that use locks in an improper fashion, and may cause deadlocks when these programs run inside the kernel. The verifier is responsible for rejecting such programs from being loaded into the kernel. Until now, the eBPF verifier ensured deadlock safety by only permitting one lock acquisition at a time, and by preventing any functions to be called from within the critical section. Additionally, only a few restricted program types are allowed to call spin locks. As the usage of eBPF grows (e.g. with sched_ext) beyond its conventional application in networking, tracing, and security, the limitations on locking are becoming a bottleneck for users. The rqspinlock implementation allows us to permit more flexible locking patterns in BPF programs, without limiting them to the subset that can be proven safe statically (which is fairly small, and requires complex static analysis), while ensuring that the kernel will recover in case we encounter a locking violation at runtime. We make a tradeoff here by accepting programs that may potentially have deadlocks, and recover the kernel quickly at runtime to ensure availability. Additionally, eBPF programs attached to different parts of the kernel can introduce new control flow into the kernel, which increases the likelihood of deadlocks in code not written to handle reentrancy. There have been multiple syzbot reports surfacing deadlocks in internal kernel code due to the diverse ways in which eBPF programs can be attached to different parts of the kernel. By switching the BPF subsystem’s lock usage to rqspinlock, all of these issues can be mitigated at runtime. This spin lock implementation allows BPF maps to become safer and remove mechanisms that have fallen short in assuring safety when nesting programs in arbitrary ways in the same context or across different contexts. The red diffs due to patches 16-18 demonstrate this simplification. > kernel/bpf/hashtab.c | 102 ++++++++++++++++++++++++++++++++--------------------------... > kernel/bpf/lpm_trie.c | 25 ++++++++++++++----------- > kernel/bpf/percpu_freelist.c | 113 +++++++++++++++++++++++++---------------------------------... > kernel/bpf/percpu_freelist.h | 4 ++-- > 4 files changed, 73 insertions(+), 171 deletions(-) Design ------ Deadlocks mostly manifest as stalls in the waiting loops of the qspinlock slow path. Thus, using stalls as a signal for deadlocks avoids introducing cost to the normal fast path, and ensures bounded termination of the waiting loop. Our recovery algorithm is focused on terminating the waiting loops of the qspinlock algorithm when it gets stuck, and implementing bespoke recovery procedures for each class of waiter to restore the lock to a usable state. Deadlock detection is the main mechanism used to provide faster recovery, with the timeout mechanism acting as a final line of defense. Deadlock Detection ~~~~~~~~~~~~~~~~~~ We handle two cases of deadlocks: AA deadlocks (attempts to acquire the same lock again), and ABBA deadlocks (attempts to acquire two locks in the opposite order from two distinct threads). Variants of ABBA deadlocks may be encountered with more than two locks being held in the incorrect order. 
These are not diagnosed explicitly, as they reduce to ABBA deadlocks. Deadlock detection is triggered immediately when beginning the waiting loop of a lock slow path. While timeouts ensure that any waiting loops in the locking slow path terminate and return to the caller, it can be excessively long in some situations. While the default timeout is short (0.5s), a stall for this duration inside the kernel can set off alerts for latency-critical services with strict SLOs. Ideally, the kernel should recover from an undesired state of the lock as soon as possible. A multi-step strategy is used to recover the kernel from waiting loops in the locking algorithm which may fail to terminate in a bounded amount of time. * Each CPU maintains a table of held locks. Entries are inserted and removed upon entry into lock, and exit from unlock, respectively. * Deadlock detection for AA locks is thus simple: we have an AA deadlock if we find a held lock entry for the lock we’re attempting to acquire on the same CPU. * During deadlock detection for ABBA, we search through the tables of all other CPUs to find situations where we are holding a lock the remote CPU is attempting to acquire, and they are holding a lock we are attempting to acquire. Upon encountering such a condition, we report an ABBA deadlock. * We divide the duration between entry time point into the waiting loop and the timeout time point into intervals of 1 ms, and perform deadlock detection until timeout happens. Upon entry into the slow path, and then completion of each 1 ms interval, we perform detection of both AA and ABBA deadlocks. In the event that deadlock detection yields a positive result, the recovery happens sooner than the timeout. Otherwise, it happens as a last resort upon completion of the timeout. Timeouts ~~~~~~~~ Timeouts act as final line of defense against stalls for waiting loops. The ‘ktime_get_mono_fast_ns’ function is used to poll for the current time, and it is compared to the timestamp indicating the end time in the waiter loop. Each waiting loop is instrumented to check an extra condition using a macro. Internally, the macro implementation amortizes the checking of the timeout to avoid sampling the clock in every iteration. Precisely, the timeout checks are invoked every 64k iterations. Recovery ~~~~~~~~ There is extensive literature in academia on designing locks that support timeouts [0][1], as timeouts can be used as a proxy for detecting the presence of deadlocks and recovering from them, without maintaining explicit metadata to construct a waits-for relationship between two threads at runtime. In case of rqspinlock, the key simplification in our algorithm comes from the fact that upon a timeout, waiters always leave the queue in FIFO order. As such, the timeout is only enforced by the head of the wait queue, while other waiters rely on the head to signal them when a timeout has occurred and when they need to exit. We don’t have to implement complex algorithms and do not need extra synchronization for waiters in the middle of the queue timing out before their predecessor or successor, unlike previous approaches [0][1]. There are three forms of waiters in the original queued spin lock algorithm. The first is the waiter which acquires the pending bit and spins on the lock word without forming a wait queue. The second is the head waiter that is the first waiter heading the wait queue. 
The third form is of all the non-head waiters queued behind the head, waiting to be signalled through their MCS node to overtake the responsibility of the head. In rqspinlock's recovery algorithm, we are concerned with the second and third kind. First, we augment the waiting loop of the head of the wait queue with a timeout. When this timeout happens, all waiters part of the wait queue will abort their lock acquisition attempts. This happens in three steps. * First, the head breaks out of its loop waiting for pending and locked bits to turn to 0, and non-head waiters break out of their MCS node spin (more on that later). * Next, every waiter (head or non-head) attempts to check whether they are also the tail waiter, in such a case they attempt to zero out the tail word and allow a new queue to be built up for this lock. If they succeed, they have no one to signal next in the queue to stop spinning. * Otherwise, they signal the MCS node of the next waiter to break out of its spin and try resetting the tail word back to 0. This goes on until the tail waiter is found. In case of races, the new tail will be responsible for performing the same task, as the old tail will then fail to reset the tail word and wait for its next pointer to be updated before it signals the new tail to do the same. Timeout Bound ~~~~~~~~~~~~~ The timeout is applied by two types of waiters: the pending bit waiter and the wait queue head waiter. As such, for the pending waiter, only the lock owner is ahead of it, and for the wait queue head waiter, only the lock owner and the pending waiter take precedence in executing their critical sections. We define the timeout value to span at most 1 critical section length, and then use the appropriate value (default, or default x 2) depending on if we are the pending waiter or head of wait queue. Therefore, the waiting loop wait can span at most 2 critical section lengths, and thus, it is unaffected by the amount of contention or the number of CPUs on the host. Non-head waiters simply wait for the wait queue head to signal them on a timeout. In Meta's production, we have noticed uncore PMU reads and SMIs consuming tens of msecs. While these events are rare, a 0.25 second timeout should absorb such tail events and not raise false alarms for timeouts. We will continue monitoring this in production and adjust the timeout if necessary in the future. More details of the recovery algorithm is described in patch 9 and a detailed description is available at [2]. Alternatives ------------ Lockdep: We do not rely on the lockdep facility for reporting violations for primarily two reasons: * Overhead: The lockdep infrastructure can add significant overhead to the lock acquisition path, and is not recommended for use in production due to this reason. While the report is more useful and exhaustive, the overhead can be prohibitive, especially as BPF programs run in hot paths of the kernel. Moreover, it also increases the size of the lock word to store extra metadata, which is not feasible for BPF spin locks that are 4-bytes in size today (similar to qspinlock). * Debug Tool: Lockdep is intended to be used as a debugging facility, providing extra context to the user about the locking violations occurring during runtime. It is always turned off on all production kernels, therefore isn’t available most of the time. We require a mechanism for detecting common variants of deadlocks that is always available in production kernels and never turned off. 
At the same time, it must not introduce overhead in terms of time (for the slow path) and memory (for the lock word size). Evaluation ---------- We run benchmarks that stress locking scalability and perform comparison against the baseline (qspinlock). For the rqspinlock case, we replace the default qspinlock with it in the kernel, such that all spin locks in the kernel use the rqspinlock slow path. As such, benchmarks that stress kernel spin locks end up exercising rqspinlock. Evaluation setup ~~~~~~~~~~~~~~~~ We set the CPU governor to performance for all experiments. Note: Numbers for arm64 have been obtained without the no-WFE fallback in this series, to perform a fair comparison with the WFE using qspinlock baseline. x86_64: Intel Xeon Platinum 8468 (Sapphire Rapids) 96 cores (48 x 2 sockets) 2 threads per core, 0-95, siblings from 96-191 2 NUMA nodes (every 48 cores), 2 LLCs (every 48 cores), 1 LLC per NUMA node Hyperthreading enabled arm64: Ampere Max Neoverse-N1 256-Core Processor 256 cores (128 cores x 2 sockets) 1 thread per core 2 NUMA nodes (every 128 cores), 1 L2 per core (256 instances), no shared L3 No hyperthreading available The locktorture experiment is run for 30 seconds. Average of 25 runs is used for will-it-scale, after an initial warm up. More information on the locks contended in the will-it-scale experiments is available in the evaluation section of the CNA paper, in table 1 [4]. Legend: QL - qspinlock (avg. throughput) RQL - rqspinlock (avg. throughput) Results ~~~~~~~ locktorture - x86_64 Threads QL RQL Speedup ----------------------------------------------- 1 46910437 45057327 0.96 2 29871063 25085034 0.84 4 13876024 19242776 1.39 8 14638499 13346847 0.91 16 14380506 14104716 0.98 24 17278144 15293077 0.89 32 19494283 17826675 0.91 40 27760955 21002910 0.76 48 28638897 26432549 0.92 56 29336194 26512029 0.9 64 30040731 27421403 0.91 72 29523599 27010618 0.91 80 28846738 27885141 0.97 88 29277418 25963753 0.89 96 28472339 27423865 0.96 104 28093317 26634895 0.95 112 29914000 27872339 0.93 120 29199580 26682695 0.91 128 27755880 27314662 0.98 136 30349095 27092211 0.89 144 29193933 27805445 0.95 152 28956663 26071497 0.9 160 28950009 28183864 0.97 168 29383520 28135091 0.96 176 28475883 27549601 0.97 184 31958138 28602434 0.89 192 31342633 33394385 1.07 will-it-scale open1_threads - x86_64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 1396323.92 7373.12 0.53 1366616.8 4152.08 0.3 0.98 2 1844403.8 3165.26 0.17 1700301.96 2396.58 0.14 0.92 4 2370590.6 24545.54 1.04 1655872.32 47938.71 2.9 0.7 8 2185227.04 9537.9 0.44 1691205.16 9783.25 0.58 0.77 16 2110672.36 10972.99 0.52 1781696.24 15021.43 0.84 0.84 24 1655042.72 18037.23 1.09 2165125.4 5422.54 0.25 1.31 32 1738928.24 7166.64 0.41 1829468.24 9081.59 0.5 1.05 40 1854430.52 6148.24 0.33 1731062.28 3311.95 0.19 0.93 48 1766529.96 5063.86 0.29 1749375.28 2311.27 0.13 0.99 56 1303016.28 6168.4 0.47 1452656 7695.29 0.53 1.11 64 1169557.96 4353.67 0.37 1287370.56 8477.2 0.66 1.1 72 1036023.4 7116.53 0.69 1135513.92 9542.55 0.84 1.1 80 1097913.64 11356 1.03 1176864.8 6771.41 0.58 1.07 88 1123907.36 12843.13 1.14 1072416.48 7412.25 0.69 0.95 96 1166981.52 9402.71 0.81 1129678.76 9499.14 0.84 0.97 104 1108954.04 8171.46 0.74 1032044.44 7840.17 0.76 0.93 112 1000777.76 8445.7 0.84 1078498.8 6551.47 0.61 1.08 120 1029448.4 6992.29 0.68 1093743 8378.94 0.77 1.06 128 1106670.36 10102.15 0.91 1241438.68 23212.66 1.87 
1.12 136 1183776.88 6394.79 0.54 1116799.64 18111.38 1.62 0.94 144 1201122 25917.69 2.16 1301779.96 15792.6 1.21 1.08 152 1099737.08 13567.82 1.23 1053647.2 12704.29 1.21 0.96 160 1031186.32 9048.07 0.88 1069961.4 8293.18 0.78 1.04 168 1068817 16486.06 1.54 1096495.36 14021.93 1.28 1.03 176 966633.96 9623.27 1 1081129.84 9474.81 0.88 1.12 184 1004419.04 12111.11 1.21 1037771.24 12001.66 1.16 1.03 192 1088858.08 16522.93 1.52 1027943.12 14238.57 1.39 0.94 will-it-scale open2_threads - x86_64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 1337797.76 4649.19 0.35 1332609.4 3813.14 0.29 1 2 1598300.2 1059.93 0.07 1771891.36 5667.12 0.32 1.11 4 1736573.76 13025.33 0.75 1396901.2 2682.46 0.19 0.8 8 1794367.84 4879.6 0.27 1917478.56 3751.98 0.2 1.07 16 1990998.44 8332.78 0.42 1864165.56 9648.59 0.52 0.94 24 1868148.56 4248.23 0.23 1710136.68 2760.58 0.16 0.92 32 1955180 6719 0.34 1936149.88 1980.87 0.1 0.99 40 1769646.4 4686.54 0.26 1729653.68 4551.22 0.26 0.98 48 1724861.16 4056.66 0.24 1764900 971.11 0.06 1.02 56 1318568 7758.86 0.59 1385660.84 7039.8 0.51 1.05 64 1143290.28 5351.43 0.47 1316686.6 5597.69 0.43 1.15 72 1196762.68 10655.67 0.89 1230173.24 9858.2 0.8 1.03 80 1126308.24 6901.55 0.61 1085391.16 7444.34 0.69 0.96 88 1035672.96 5452.95 0.53 1035541.52 8095.33 0.78 1 96 1030203.36 6735.71 0.65 1020113.48 8683.13 0.85 0.99 104 1039432.88 6583.59 0.63 1083902.48 5775.72 0.53 1.04 112 1113609.04 4380.62 0.39 1072010.36 8983.14 0.84 0.96 120 1109420.96 7183.5 0.65 1079424.12 10929.97 1.01 0.97 128 1095400.04 4274.6 0.39 1095475.2 12042.02 1.1 1 136 1071605.4 11103.73 1.04 1114757.2 10516.55 0.94 1.04 144 1104147.2 9714.75 0.88 1044954.16 7544.2 0.72 0.95 152 1164280.24 13386.15 1.15 1101213.92 11568.49 1.05 0.95 160 1084892.04 7941.25 0.73 1152273.76 9593.38 0.83 1.06 168 983654.76 11772.85 1.2 1111772.28 9806.83 0.88 1.13 176 1087544.24 11262.35 1.04 1077507.76 9442.02 0.88 0.99 184 1101682.4 24701.68 2.24 1095223.2 16707.29 1.53 0.99 192 983712.08 13453.59 1.37 1051244.2 15662.05 1.49 1.07 will-it-scale lock1_threads - x86_64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 4307484.96 3959.31 0.09 4252908.56 10375.78 0.24 0.99 2 7701844.32 4169.88 0.05 7219233.52 6437.11 0.09 0.94 4 14781878.72 22854.85 0.15 15260565.12 37305.71 0.24 1.03 8 12949698.64 99270.42 0.77 9954660.4 142805.68 1.43 0.77 16 12947690.64 72977.27 0.56 10865245.12 49520.31 0.46 0.84 24 11142990.64 33200.39 0.3 11444391.68 37884.46 0.33 1.03 32 9652335.84 22369.48 0.23 9344086.72 21639.22 0.23 0.97 40 9185931.12 5508.96 0.06 8881506.32 5072.33 0.06 0.97 48 9084385.36 10871.05 0.12 8863579.12 4583.37 0.05 0.98 56 6595540.96 33100.59 0.5 6640389.76 46619.96 0.7 1.01 64 5946726.24 47160.5 0.79 6572155.84 91973.73 1.4 1.11 72 6744894.72 43166.65 0.64 5991363.36 80637.56 1.35 0.89 80 6234502.16 118983.16 1.91 5157894.32 73592.72 1.43 0.83 88 5053879.6 199713.75 3.95 4479758.08 36202.27 0.81 0.89 96 5184302.64 99199.89 1.91 5249210.16 122348.69 2.33 1.01 104 4612391.92 40803.05 0.88 4850209.6 26813.28 0.55 1.05 112 4809209.68 24070.68 0.5 4869477.84 27489.04 0.56 1.01 120 5130746.4 34265.5 0.67 4620047.12 44229.54 0.96 0.9 128 5376465.28 95028.05 1.77 4781179.6 43700.93 0.91 0.89 136 5453742.4 86718.87 1.59 5412457.12 40339.68 0.75 0.99 144 5805040.72 84669.31 1.46 5595382.48 68701.65 1.23 0.96 152 
5842897.36 31120.33 0.53 5787587.12 43521.68 0.75 0.99 160 5837665.12 14179.44 0.24 5118808.72 45193.23 0.88 0.88 168 5660332.72 27467.09 0.49 5104959.04 40891.75 0.8 0.9 176 5180312.24 28656.39 0.55 4718407.6 58734.13 1.24 0.91 184 4706824.16 50469.31 1.07 4692962.64 92266.85 1.97 1 192 5126054.56 51082.02 1 4680866.8 58743.51 1.25 0.91 will-it-scale lock2_threads - x86_64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 4316091.2 4933.28 0.11 4293104 30369.71 0.71 0.99 2 3500046.4 19852.62 0.57 4507627.76 23667.66 0.53 1.29 4 3639098.96 26370.65 0.72 3673166.32 30822.71 0.84 1.01 8 3714548.56 49953.44 1.34 4055818.56 71630.41 1.77 1.09 16 4188724.64 105414.49 2.52 4316077.12 68956.15 1.6 1.03 24 3737908.32 47391.46 1.27 3762254.56 55345.7 1.47 1.01 32 3820952.8 45207.66 1.18 3710368.96 52651.92 1.42 0.97 40 3791280.8 28630.55 0.76 3661933.52 37671.27 1.03 0.97 48 3765721.84 59553.83 1.58 3604738.64 50861.36 1.41 0.96 56 3175505.76 64336.17 2.03 2771022.48 66586.99 2.4 0.87 64 2620294.48 71651.34 2.73 2650171.68 44810.83 1.69 1.01 72 2861893.6 86542.61 3.02 2537437.2 84571.75 3.33 0.89 80 2976297.2 83566.43 2.81 2645132.8 85992.34 3.25 0.89 88 2547724.8 102014.36 4 2336852.16 80570.25 3.45 0.92 96 2945310.32 82673.25 2.81 2513316.96 45741.81 1.82 0.85 104 3028818.64 90643.36 2.99 2581787.52 52967.48 2.05 0.85 112 2546264.16 102605.82 4.03 2118812.64 62043.19 2.93 0.83 120 2917334.64 112220.01 3.85 2720418.64 64035.96 2.35 0.93 128 2906621.84 69428.1 2.39 2795310.32 56736.87 2.03 0.96 136 2841833.76 105541.11 3.71 3063404.48 62288.94 2.03 1.08 144 3032822.32 134796.56 4.44 3169985.6 149707.83 4.72 1.05 152 2557694.96 62218.15 2.43 2469887.6 68343.78 2.77 0.97 160 2810214.72 61468.79 2.19 2323768.48 54226.71 2.33 0.83 168 2651146.48 76573.27 2.89 2385936.64 52433.98 2.2 0.9 176 2720616.32 89026.19 3.27 2941400.08 59296.64 2.02 1.08 184 2696086 88541.24 3.28 2598225.2 76365.7 2.94 0.96 192 2908194.48 87023.91 2.99 2377677.68 53299.82 2.24 0.82 locktorture - arm64 Threads QL RQL Speedup ----------------------------------------------- 1 43320464 44718174 1.03 2 21056971 29255448 1.39 4 16040120 11563981 0.72 8 12786398 12838909 1 16 13646408 13436730 0.98 24 13597928 13669457 1.01 32 16456220 14600324 0.89 40 16667726 13883101 0.83 48 14347691 14608641 1.02 56 15624580 15180758 0.97 64 18105114 16009137 0.88 72 16606438 14772256 0.89 80 16550202 14124056 0.85 88 16716082 15930618 0.95 96 16489242 16817657 1.02 104 17915808 17165324 0.96 112 17217482 21343282 1.24 120 20449845 20576123 1.01 128 18700902 20286275 1.08 136 17913378 21142921 1.18 144 18225673 18971921 1.04 152 18374206 19229854 1.05 160 23136514 20129504 0.87 168 21096269 17167777 0.81 176 21376794 21594914 1.01 184 23542989 20638298 0.88 192 22793754 20655980 0.91 200 20933027 19628316 0.94 208 23105684 25572720 1.11 216 24158081 23173848 0.96 224 23388984 22485353 0.96 232 21916401 23899343 1.09 240 22292129 22831784 1.02 248 25812762 22636787 0.88 256 24294738 26127113 1.08 will-it-scale open1_threads - arm64 Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup ----------------------------------------------------------------------------------------------- 1 844452.32 801 0.09 804936.92 900.25 0.11 0.95 2 1309419.08 9495.78 0.73 1265080.24 3171.13 0.25 0.97 4 2113074.24 5363.19 0.25 2041158.28 7883.65 0.39 0.97 8 1916650.96 15749.86 0.82 2039850.04 7562.87 0.37 1.06 16 1835540.72 12940.45 0.7 1937398.56 11461.15 
0.59 1.06
24 1876760.48 12581.67 0.67 1966659.16 10012.69 0.51 1.05
32 1834525.6 5571.08 0.3 1929180.4 6221.96 0.32 1.05
40 1851592.76 7848.18 0.42 1937504.44 5991.55 0.31 1.05
48 1845067 4118.68 0.22 1773331.56 6068.23 0.34 0.96
56 1742709.36 6874.03 0.39 1716184.92 6713.16 0.39 0.98
64 1685339.72 6688.91 0.4 1676046.16 5844.06 0.35 0.99
72 1694838.84 2433.41 0.14 1821189.6 2906.89 0.16 1.07
80 1738778.68 2916.74 0.17 1729212.6 3714.41 0.21 0.99
88 1753131.76 2734.34 0.16 1713294.32 4652.82 0.27 0.98
96 1694112.52 4449.69 0.26 1714438.36 5621.66 0.33 1.01
104 1780279.76 2420.52 0.14 1767679.12 3067.66 0.17 0.99
112 1700284.72 9796.23 0.58 1796674.6 4066.06 0.23 1.06
120 1760466.72 3978.65 0.23 1704706.08 4080.04 0.24 0.97
128 1634067.96 5187.94 0.32 1764115.48 3545.02 0.2 1.08
136 1170303.84 7602.29 0.65 1227188.04 8090.84 0.66 1.05
144 953186.16 7859.02 0.82 964822.08 10536.61 1.09 1.01
152 818893.96 7238.86 0.88 853412.44 5932.25 0.7 1.04
160 707460.48 3868.26 0.55 746985.68 10363.03 1.39 1.06
168 658380.56 4938.77 0.75 672101.12 5442.95 0.81 1.02
176 614692.04 3137.74 0.51 615143.36 6197.19 1.01 1
184 574808.88 4741.61 0.82 592395.08 8840.92 1.49 1.03
192 548142.92 6116.31 1.12 571299.68 8388.56 1.47 1.04
200 511621.96 2182.33 0.43 532144.88 5467.04 1.03 1.04
208 506583.32 6834.39 1.35 521427.08 10318.65 1.98 1.03
216 480438.04 3608.96 0.75 510697.76 8086.47 1.58 1.06
224 470644.96 3451.35 0.73 467433.92 5008.59 1.07 0.99
232 466973.72 6599.97 1.41 444345.92 2144.96 0.48 0.95
240 442927.68 2351.56 0.53 440503.56 4289.01 0.97 0.99
248 432991.16 5829.92 1.35 445462.6 5944.03 1.33 1.03
256 409455.44 1430.5 0.35 422219.4 4007.04 0.95 1.03

will-it-scale open2_threads - arm64

Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 818645.4 1097.02 0.13 774110.24 1562.45 0.2 0.95
2 1281013.04 2188.78 0.17 1238346.24 2149.97 0.17 0.97
4 2058514.16 13105.36 0.64 1985375 3204.48 0.16 0.96
8 1920414.8 16154.63 0.84 1911667.92 8882.98 0.46 1
16 1943729.68 8714.38 0.45 1978946.72 7465.65 0.38 1.02
24 1915846.88 7749.9 0.4 1914442.72 9841.71 0.51 1
32 1964695.92 8854.83 0.45 1914650.28 9357.82 0.49 0.97
40 1845071.12 5103.26 0.28 1891685.44 4278.34 0.23 1.03
48 1838897.6 5123.61 0.28 1843498.2 5391.94 0.29 1
56 1823768.32 3214.14 0.18 1736477.48 5675.49 0.33 0.95
64 1627162.36 3528.1 0.22 1685727.16 6102.63 0.36 1.04
72 1725320.16 4709.83 0.27 1710174.4 6707.54 0.39 0.99
80 1692288.44 9110.89 0.54 1773676.24 4327.94 0.24 1.05
88 1725496.64 4249.71 0.25 1695173.84 5097.14 0.3 0.98
96 1766093.08 2280.09 0.13 1732782.64 3606.1 0.21 0.98
104 1647753 2926.83 0.18 1710876.4 4416.04 0.26 1.04
112 1763785.52 3838.26 0.22 1803813.76 1859.2 0.1 1.02
120 1684095.16 2385.31 0.14 1766903.08 3258.34 0.18 1.05
128 1733528.56 2800.62 0.16 1677446.32 3201.14 0.19 0.97
136 1179187.84 6804.86 0.58 1241839.52 10698.51 0.86 1.05
144 969456.36 6421.85 0.66 1018441.96 8732.19 0.86 1.05
152 839295.64 10422.66 1.24 817531.92 6778.37 0.83 0.97
160 743010.72 6957.98 0.94 749291.16 9388.47 1.25 1.01
168 666049.88 13159.73 1.98 689408.08 10192.66 1.48 1.04
176 609185.56 5685.18 0.93 653744.24 10847.35 1.66 1.07
184 602232.08 12089.72 2.01 597718.6 13856.45 2.32 0.99
192 563919.32 9870.46 1.75 560080.4 8388.47 1.5 0.99
200 522396.28 4155.61 0.8 539168.64 10456.64 1.94 1.03
208 520328.28 9353.14 1.8 510011.4 6061.19 1.19 0.98
216 479797.72 5824.58 1.21 486955.32 4547.05 0.93 1.01
224 467943.8 4484.86 0.96 473252.76 5608.58 1.19 1.01
232 456914.24 3129.5 0.68 457463.2 7474.83 1.63 1
240 450535 5149.78 1.14 437653.56 4604.92 1.05 0.97
248 435475.2 2350.87 0.54 435589.24 6176.01 1.42 1
256 416737.88 2592.76 0.62 424178.28 3932.2 0.93 1.02

will-it-scale lock1_threads - arm64

Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 2512077.52 3026.1 0.12 2085365.92 1612.44 0.08 0.83
2 4840180.4 3646.31 0.08 4326922.24 3802.17 0.09 0.89
4 9358779.44 6673.07 0.07 8467588.56 5577.05 0.07 0.9
8 9374436.88 18826.26 0.2 8635110.16 4217.66 0.05 0.92
16 9527184.08 14111.94 0.15 8561174.16 3258.6 0.04 0.9
24 8873099.76 17242.32 0.19 9286778.72 4124.51 0.04 1.05
32 8457640.4 10790.92 0.13 8700401.52 5110 0.06 1.03
40 8478771.76 13250.8 0.16 8746198.16 7606.42 0.09 1.03
48 8329097.76 7958.92 0.1 8774265.36 6082.08 0.07 1.05
56 8330143.04 11586.93 0.14 8472426.48 7402.13 0.09 1.02
64 8334684.08 10478.03 0.13 7979193.52 8436.63 0.11 0.96
72 7941815.52 16031.38 0.2 8016885.52 12640.56 0.16 1.01
80 8042221.68 10219.93 0.13 8072222.88 12479.54 0.15 1
88 8190336.8 10751.38 0.13 8432977.6 11865.67 0.14 1.03
96 8235010.08 7267.8 0.09 8022101.28 11910.63 0.15 0.97
104 8154434.08 7770.8 0.1 7987812 7647.42 0.1 0.98
112 7738464.56 11067.72 0.14 7968483.92 20632.93 0.26 1.03
120 8228919.36 10395.79 0.13 8304329.28 11913.76 0.14 1.01
128 7798646.64 8877.8 0.11 8197938.4 7527.81 0.09 1.05
136 5567293.68 66259.82 1.19 5642017.12 126584.59 2.24 1.01
144 4425655.52 55729.96 1.26 4519874.64 82996.01 1.84 1.02
152 3871300.8 77793.78 2.01 3850025.04 80167.3 2.08 0.99
160 3558041.68 55108.3 1.55 3495924.96 83626.42 2.39 0.98
168 3302042.72 45011.89 1.36 3298002.8 59393.64 1.8 1
176 3066165.2 34896.54 1.14 3063027.44 58219.26 1.9 1
184 2817899.6 43585.27 1.55 2859393.84 45258.03 1.58 1.01
192 2690403.76 42236.77 1.57 2630652.24 35953.13 1.37 0.98
200 2563141.44 28145.43 1.1 2539964.32 38556.52 1.52 0.99
208 2502968.8 27687.81 1.11 2477757.28 28240.81 1.14 0.99
216 2474917.76 24128.71 0.97 2483161.44 32198.37 1.3 1
224 2386874.72 32954.66 1.38 2398068.48 37667.29 1.57 1
232 2379248.24 27413.4 1.15 2327601.68 24565.28 1.06 0.98
240 2302146.64 19914.19 0.87 2236074.64 20968.17 0.94 0.97
248 2241798.32 21542.52 0.96 2173312.24 26498.36 1.22 0.97
256 2198765.12 20832.66 0.95 2136159.52 25027.96 1.17 0.97

will-it-scale lock2_threads - arm64

Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 2499414.32 1932.27 0.08 2075704.8 24589.71 1.18 0.83
2 3887820 34198.36 0.88 4057432.64 11896.04 0.29 1.04
4 3445307.6 7958.3 0.23 3869960.4 3788.5 0.1 1.12
8 4310597.2 14405.9 0.33 3931319.76 5845.33 0.15 0.91
16 3995159.84 22621.85 0.57 3953339.68 15668.9 0.4 0.99
24 4048456.88 22956.51 0.57 3887812.64 30584.77 0.79 0.96
32 3974808.64 20465.87 0.51 3718778.08 27407.24 0.74 0.94
40 3941154.88 15136.68 0.38 3551464.24 33378.67 0.94 0.9
48 3725436.32 17090.67 0.46 3714356.08 19035.26 0.51 1
56 3558449.44 10123.46 0.28 3449656.08 36476.87 1.06 0.97
64 3514616.08 16470.99 0.47 3493197.04 25639.82 0.73 0.99
72 3461700.88 16780.97 0.48 3376565.04 16930.19 0.5 0.98
80 3797008.64 17599.05 0.46 3505856.16 34320.34 0.98 0.92
88 3737459.44 10774.93 0.29 3631757.68 24231.29 0.67 0.97
96 3612816.16 21865.86 0.61 3545354.56 16391.15 0.46 0.98
104 3765167.36 17763.8 0.47 3466467.12 22235.45 0.64 0.92
112 3713386 15455.21 0.42 3402210 18349.66 0.54 0.92
120 3699986.08 15153.08 0.41 3580303.92 19823.01 0.55 0.97
128 3648694.56 11891.62 0.33 3426445.28 22993.32 0.67 0.94
136 800046.88 6039.73 0.75 784412.16 9062.03 1.16 0.98
144 769483.36 5231.74 0.68 714132.8 8953.57 1.25 0.93
152 821081.52 4249.12 0.52 743694.64 8155.18 1.1 0.91
160 789040.16 9187.4 1.16 834865.44 6159.29 0.74 1.06
168 867742.4 8967.66 1.03 734905.36 15582.75 2.12 0.85
176 838650.32 7949.72 0.95 846939.68 8959.8 1.06 1.01
184 854984.48 19475.51 2.28 794549.92 11924.54 1.5 0.93
192 846262.32 13795.86 1.63 899915.12 8639.82 0.96 1.06
200 942602.16 12665.42 1.34 900385.76 8592.23 0.95 0.96
208 954183.68 12853.22 1.35 1166186.96 13045.03 1.12 1.22
216 929319.76 10157.79 1.09 926773.76 10577.01 1.14 1
224 967896.56 9819.6 1.01 951144.32 12343.83 1.3 0.98
232 990621.12 7771.97 0.78 916361.2 17878.44 1.95 0.93
240 995285.04 20104.22 2.02 972119.6 12856.42 1.32 0.98
248 1029436 20404.97 1.98 965301.28 11102.95 1.15 0.94
256 1038724.8 19201.03 1.85 1029942.08 12563.07 1.22 0.99

Written By
----------
Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi <memxor@gmail.com>

  [0]: https://www.cs.rochester.edu/research/synchronization/pseudocode/timeout.html
  [1]: https://dl.acm.org/doi/10.1145/571825.571830
  [2]: https://github.com/kkdwivedi/rqspinlock/blob/main/rqspinlock.pdf
  [3]: https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/plain/qspinlock.tla
  [4]: https://arxiv.org/pdf/1810.05600
====================

Link: https://patch.msgid.link/20250316040541.108729-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19selftests/bpf: Add tests for rqspinlockKumar Kartikeya Dwivedi
Introduce selftests that trigger AA and ABBA deadlocks, and test the edge case where the held locks table runs out of entries, since we then fall back to the timeout as the final line of defense. Also exercise the verifier's AA detection where applicable. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-26-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19bpf: Maintain FIFO property for rqspinlock unlockKumar Kartikeya Dwivedi
Since out-of-order unlocks are unsupported for rqspinlock, and irqsave variants enforce strict FIFO ordering anyway, make the same change for normal non-irqsave variants, such that FIFO ordering is enforced. Two new verifier state fields (active_lock_id, active_lock_ptr) are used to denote the top of the stack, and prev_id and prev_ptr are ascertained whenever popping the topmost entry through an unlock. Take special care to make these fields part of the state comparison in refsafe. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-25-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19bpf: Implement verifier support for rqspinlockKumar Kartikeya Dwivedi
Introduce verifier-side support for rqspinlock kfuncs. The first step is allowing the bpf_res_spin_lock type to be defined in map values and allocated objects, so the BTF side is updated with a new BPF_RES_SPIN_LOCK field to recognize and validate. An object cannot have both bpf_spin_lock and bpf_res_spin_lock; only one of them (and at most one per object, as before) may be present. The bpf_res_spin_lock can also be used to protect objects that require lock protection for their kfuncs, like BPF rbtree and linked list. The verifier plumbing to simulate success and failure cases when calling the kfuncs is done by pushing a new verifier state to the verifier state stack, which will verify the failure case upon calling the kfunc. The path where success is indicated creates all lock reference state and IRQ state (if necessary for irqsave variants). In the case of failure, the state clears the registers r0-r5, sets the return value, and skips kfunc processing, proceeding to the next instruction. When marking the return value for the success case, the value is marked as 0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program, whenever the user checks the return value as 'if (ret)' or 'if (ret < 0)', the verifier never traverses such branches for success cases, and is aware that the lock is not held in such cases. We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs are invoked. We introduce a kfunc_class state to avoid mixing lock irqrestore kfuncs with IRQ state created by bpf_local_irq_save. With all this infrastructure, these kfuncs become usable in programs while satisfying all safety properties required by the kernel. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-24-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
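To make the BTF-side change concrete, a map value embedding the new lock type could look roughly like the sketch below. Only 'struct bpf_res_spin_lock' is taken from the patch; the map and field names are invented for illustration.

    /* Sketch of a BPF object file fragment; node_data/node_map are hypothetical. */
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    struct node_data {
            struct bpf_res_spin_lock lock;  /* recognized/validated as BPF_RES_SPIN_LOCK via BTF */
            __u64 counter;                  /* data guarded by the lock */
    };

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, int);
            __type(value, struct node_data);
    } node_map SEC(".maps");

    char LICENSE[] SEC("license") = "GPL";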
2025-03-19bpf: Introduce rqspinlock kfuncsKumar Kartikeya Dwivedi
Introduce four new kfuncs, bpf_res_spin_lock and bpf_res_spin_unlock, and their irqsave/irqrestore variants, which wrap the rqspinlock APIs. bpf_res_spin_lock returns a conditional result, depending on whether the lock was acquired (NULL is returned when lock acquisition succeeds, non-NULL upon failure). The memory pointed to by the returned pointer upon failure can be dereferenced after the NULL check to obtain the error code. Instead of using the old bpf_spin_lock type, introduce a new type with the same layout and the same alignment, but a different name, to avoid type confusion. Preemption is disabled upon successful lock acquisition; however, IRQs are not. Special kfuncs can be introduced later to allow disabling IRQs when taking a spin lock. Resilient locks are safe against AA deadlocks, hence not disabling IRQs currently does not allow violation of kernel safety. The __irq_flag annotation is used to accept IRQ flags for the IRQ variants, with the same semantics as the existing bpf_local_irq_{save, restore}. These kfuncs will require additional verifier-side support in subsequent commits, to allow programs to hold multiple locks at the same time. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-23-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
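Reusing the hypothetical node_map from the sketch above, taking and releasing the lock in a program might look roughly as follows. The kfunc prototypes shown here are assumptions (an integer error return, matching the 'if (ret)' pattern described in the verifier patch above), not quotes from the patch.

    /* Assumed prototypes for the kfuncs; not copied from the patch. */
    extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym;
    extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym;

    SEC("tc")
    int poke_counter(struct __sk_buff *ctx)
    {
            struct node_data *val;
            int key = 0, ret;

            val = bpf_map_lookup_elem(&node_map, &key);
            if (!val)
                    return 0;

            ret = bpf_res_spin_lock(&val->lock);
            if (ret)                /* not acquired: deadlock or timeout detected */
                    return 0;       /* verifier knows the lock is not held on this path */

            val->counter++;         /* critical section */
            bpf_res_spin_unlock(&val->lock);
            return 0;
    }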
2025-03-19bpf: Convert lpm_trie.c to rqspinlockKumar Kartikeya Dwivedi
Convert all LPM trie usage of raw_spinlock to rqspinlock. Note that rcu_dereference_protected in trie_delete_elem is switched over to plain rcu_dereference; the RCU read lock should be held from the BPF program side or the eBPF syscall path, and the trie->lock is just acquired before the dereference. It is not clear from the commit history why the protected variant was used, but the above reasoning makes sense, so switch over. Closes: https://lore.kernel.org/lkml/000000000000adb08b061413919e@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-22-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19bpf: Convert percpu_freelist.c to rqspinlockKumar Kartikeya Dwivedi
Convert the percpu_freelist.c code to use rqspinlock, and remove the extralist fallback and trylock-based acquisitions to avoid deadlocks. Key thing to note is the retained while (true) loop to search through other CPUs when failing to push a node due to locking errors. This retains the behavior of the old code, where it would keep trying until it would be able to successfully push the node back into the freelist of a CPU. Technically, we should start iteration for this loop from raw_smp_processor_id() + 1, but to avoid hitting the edge of nr_cpus, we skip execution in the loop body instead. Closes: https://lore.kernel.org/bpf/CAPPBnEa1_pZ6W24+WwtcNFvTUHTHO7KUmzEbOcMqxp+m2o15qQ@mail.gmail.com Closes: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-21-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19bpf: Convert hashtab.c to rqspinlockKumar Kartikeya Dwivedi
Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed per-cpu counter crud from the code base which is no longer necessary. Closes: https://lore.kernel.org/bpf/675302fd.050a0220.2477f.0004.GAE@google.com Closes: https://lore.kernel.org/bpf/000000000000b3e63e061eed3f6b@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-20-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
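The practical effect of such a conversion is that bucket locking can now fail, so the update/delete paths have to propagate an error instead of assuming acquisition. A rough sketch of that shape is below; the struct and function names are illustrative, and the raw_res_spin_lock_irqsave() name is an assumption based on the macros patch later in this log.

    /* Illustrative shape of the conversion, not the actual diff. */
    struct bucket_sketch {
            struct hlist_nulls_head head;
            rqspinlock_t lock;      /* was raw_spinlock_t before the conversion */
    };

    static int htab_lock_bucket_sketch(struct bucket_sketch *b, unsigned long *pflags)
    {
            unsigned long flags;
            int ret;

            /* Acquisition can now fail (timeout or deadlock detected)... */
            ret = raw_res_spin_lock_irqsave(&b->lock, flags);
            if (ret)
                    return ret;
            *pflags = flags;
            return 0;
    }
    /* ...so update/delete paths check the return value and bail out with the
     * error instead of assuming the bucket lock was taken. */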
2025-03-19rqspinlock: Add locktorture supportKumar Kartikeya Dwivedi
Introduce locktorture support for rqspinlock using the newly added macros as the first in-kernel user and consumer. Guard the code with CONFIG_BPF_SYSCALL ifdef since rqspinlock is not available otherwise. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-19-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add entry to Makefile, MAINTAINERSKumar Kartikeya Dwivedi
Ensure that the rqspinlock code is only built when the BPF subsystem is compiled in. Depending on queued spinlock support, we may or may not end up building the queued spinlock slowpath, and instead fall back to the test-and-set implementation. Also add entries to the MAINTAINERS file. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-18-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add macros for rqspinlock usageKumar Kartikeya Dwivedi
Introduce helper macros that wrap around the rqspinlock slow path and provide an interface analogous to the raw_spin_lock API. Note that in case of error conditions, preemption and IRQ disabling is automatically unrolled before returning the error back to the caller. Ensure that in absence of CONFIG_QUEUED_SPINLOCKS support, we fallback to the test-and-set implementation. Add some comments describing the subtle memory ordering logic during unlock, and why it's safe. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-17-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
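From a caller's perspective, "an interface analogous to the raw_spin_lock API" ends up looking roughly like the sketch below. The *_res_* macro names are assumptions based on this description; the essential difference from raw_spin_lock is the error return that must be checked.

    /* Sketch of a caller; assumed macro names, illustrative only. */
    static int update_stats_sketch(rqspinlock_t *lock, u64 *counter)
    {
            unsigned long flags;
            int ret;

            ret = raw_res_spin_lock_irqsave(lock, flags);
            if (ret)
                    return ret;     /* preemption/IRQ state already unrolled by the macro */

            (*counter)++;           /* critical section */
            raw_res_spin_unlock_irqrestore(lock, flags);
            return 0;
    }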
2025-03-19rqspinlock: Add basic support for CONFIG_PARAVIRTKumar Kartikeya Dwivedi
We ripped out PV and virtualization related bits from rqspinlock in an earlier commit, however, a fair lock performs poorly within a virtual machine when the lock holder is preempted. As such, retain the virt_spin_lock fallback to test and set lock, but with timeout and deadlock detection. We can do this by simply depending on the resilient_tas_spin_lock implementation from the previous patch. We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that requires more involved algorithmic changes and introduces more complexity. It can be done when the need arises in the future. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-15-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add a test-and-set fallbackKumar Kartikeya Dwivedi
Include a test-and-set fallback when queued spinlock support is not available. Introduce a rqspinlock type to act as a fallback when qspinlock support is absent. Include ifdef guards to ensure the slow path in this file is only compiled when CONFIG_QUEUED_SPINLOCKS=y. Subsequent patches will add further logic to ensure fallback to the test-and-set implementation when queued spinlock support is unavailable on an architecture. Unlike other waiting loops in rqspinlock code, the one for test-and-set has no theoretical upper bound under contention, therefore we need a longer timeout than usual. Bump it up to a second in this case. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
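To make the fallback concrete, here is a deliberately simplified sketch of a test-and-set acquisition loop with a deadline; the real implementation samples the clock far less often and plugs into the deadlock-detection machinery added elsewhere in this series.

    /* Sketch only: assumes rqspinlock_t keeps a qspinlock-compatible atomic
     * 'val' word and that roughly one second is the fallback timeout. */
    static int resilient_tas_spin_lock_sketch(rqspinlock_t *lock)
    {
            u64 deadline = ktime_get_mono_fast_ns() + NSEC_PER_SEC;

            while (atomic_cmpxchg_acquire(&lock->val, 0, 1) != 0) {
                    if (ktime_get_mono_fast_ns() > deadline)
                            return -ETIMEDOUT;      /* give up instead of stalling forever */
                    cpu_relax();
            }
            return 0;
    }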
2025-03-19rqspinlock: Add deadlock detection and recoveryKumar Kartikeya Dwivedi
While the timeout logic provides guarantees for the waiter's forward progress, the time until a stalling waiter unblocks can still be long. The default timeout of 1/4 sec can be excessively long for some use cases. Additionally, custom timeouts may exacerbate recovery time. Introduce logic to detect common cases of deadlocks and perform quicker recovery. This is done by dividing the time from entry into the locking slow path until the timeout into intervals of 1 ms. Then, after each interval elapses, deadlock detection is performed, while also polling the lock word to ensure we can quickly break out of the detection logic and proceed with lock acquisition. A 'held_locks' table is maintained per CPU, where the entry at the bottom denotes a lock being waited for or already taken. Entries coming before it denote locks that are already held. The current CPU's table can thus be looked at to detect AA deadlocks. The tables from other CPUs can be looked at to discover ABBA situations. Finally, when a matching entry for the lock being taken on the current CPU is found on some other CPU, a deadlock situation is detected. This function can take a long time, so the lock word is constantly polled in each loop iteration to ensure we can preempt detection and proceed with lock acquisition, using the is_lock_released check. We set the 'spin' member of the rqspinlock_timeout struct to 0 to trigger deadlock checks immediately and perform faster recovery. Note: Extending the lock word size by 4 bytes to record the owner CPU could allow faster detection for ABBA, since it is typically the owner which participates in an ABBA situation. However, to keep compatibility with existing lock words in the kernel (struct qspinlock), and given that deadlocks are a rare event triggered by bugs, we choose to favor compatibility over faster detection. The release_held_lock_entry function requires an smp_wmb, while the release store on unlock will provide the necessary ordering for us. Add comments to document the subtleties of why this is correct. It is possible for stores to be reordered still, but in the context of the deadlock detection algorithm, a release barrier is sufficient and needn't be stronger for unlock's case. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-13-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
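A rough sketch of the per-CPU table and the AA check implied by the description above; all names and the layout are invented for illustration, and the in-tree structures differ.

    #define RES_NR_HELD_SKETCH 31

    struct rqspinlock_held_sketch {
            int cnt;                        /* last entry = lock being acquired */
            void *locks[RES_NR_HELD_SKETCH];
    };

    /* AA: the lock we are about to take is already held on this CPU. */
    static bool check_deadlock_AA_sketch(struct rqspinlock_held_sketch *t, void *lock)
    {
            for (int i = 0; i < t->cnt - 1; i++)    /* skip the bottom entry (us) */
                    if (t->locks[i] == lock)
                            return true;
            return false;
    }

    /* ABBA would walk every other CPU's table, looking for a CPU that holds
     * 'lock' while waiting on something recorded in our own table, re-reading
     * the lock word between steps so detection can be abandoned as soon as the
     * lock is released (the is_lock_released check mentioned above). */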
2025-03-19rqspinlock: Protect waiters in trylock fallback from stallsKumar Kartikeya Dwivedi
When we run out of maximum rqnodes, the original queued spin lock slow path falls back to a try lock. In such a case, we are again susceptible to stalls in case the lock owner fails to make progress. We use the timeout as a fallback to break out of this loop and return to the caller. This is a fallback for an extreme edge case, when on the same CPU we run out of all 4 rqnodes. When could this happen? We are in the slow path in task context, we get interrupted by an IRQ, which while in the slow path gets interrupted by an NMI, which in the slow path gets another nested NMI, which enters the slow path. All of the interruptions happen after node->count++. We use RES_DEF_TIMEOUT as our spinning duration, but in the case of this fallback, no fairness is guaranteed, so the duration may be too small for contended cases, as the waiting time is not bounded. Since this is an extreme corner case, let's just prefer timing out instead of attempting to spin for longer. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Protect waiters in queue from stallsKumar Kartikeya Dwivedi
Implement the wait queue cleanup algorithm for rqspinlock. There are three forms of waiters in the original queued spin lock algorithm. The first is the waiter which acquires the pending bit and spins on the lock word without forming a wait queue. The second is the head waiter that is the first waiter heading the wait queue. The third form is of all the non-head waiters queued behind the head, waiting to be signalled through their MCS node to overtake the responsibility of the head. In this commit, we are concerned with the second and third kind. First, we augment the waiting loop of the head of the wait queue with a timeout. When this timeout happens, all waiters part of the wait queue will abort their lock acquisition attempts. This happens in three steps. First, the head breaks out of its loop waiting for pending and locked bits to turn to 0, and non-head waiters break out of their MCS node spin (more on that later). Next, every waiter (head or non-head) attempts to check whether they are also the tail waiter, in such a case they attempt to zero out the tail word and allow a new queue to be built up for this lock. If they succeed, they have no one to signal next in the queue to stop spinning. Otherwise, they signal the MCS node of the next waiter to break out of its spin and try resetting the tail word back to 0. This goes on until the tail waiter is found. In case of races, the new tail will be responsible for performing the same task, as the old tail will then fail to reset the tail word and wait for its next pointer to be updated before it signals the new tail to do the same. We terminate the whole wait queue because of two main reasons. Firstly, we eschew per-waiter timeouts with one applied at the head of the wait queue. This allows everyone to break out faster once we've seen the owner / pending waiter not responding for the timeout duration from the head. Secondly, it avoids complicated synchronization, because when not leaving in FIFO order, prev's next pointer needs to be fixed up etc. Lastly, all of these waiters release the rqnode and return to the caller. This patch underscores the point that rqspinlock's timeout does not apply to each waiter individually, and cannot be relied upon as an upper bound. It is possible for the rqspinlock waiters to return early from a failed lock acquisition attempt as soon as stalls are detected. The head waiter cannot directly WRITE_ONCE the tail to zero, as it may race with a concurrent xchg and a non-head waiter linking its MCS node to the head's MCS node through 'prev->next' assignment. One notable thing is that we must use RES_DEF_TIMEOUT * 2 as our maximum duration for the waiting loop (for the wait queue head), since we may have both the owner and pending bit waiter ahead of us, and in the worst case, need to span their maximum permitted critical section lengths. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-11-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Protect pending bit owners from stallsKumar Kartikeya Dwivedi
The pending bit is used to avoid queueing in case the lock is uncontended, and has demonstrated benefits for the 2 contender scenario, esp. on x86. In case the pending bit is acquired and we wait for the locked bit to disappear, we may get stuck due to the lock owner not making progress. Hence, this waiting loop must be protected with a timeout check. To perform a graceful recovery once we decide to abort our lock acquisition attempt in this case, we must unset the pending bit since we own it. All waiters undoing their changes and exiting gracefully allows the lock word to be restored to the unlocked state once all participants (owner, waiters) have been recovered, and the lock remains usable. Hence, set the pending bit back to zero before returning to the caller. Introduce a lockevent (rqspinlock_lock_timeout) to capture timeout event statistics. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-10-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
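The waiting step described above might look roughly like the sketch below, which leans loosely on qspinlock's internal lock-word masks (_Q_LOCKED_MASK, _Q_PENDING_VAL); it is illustrative, not the in-tree code.

    /* Simplified sketch: wait for the owner with a deadline, and surrender
     * the pending bit we own if the deadline passes. */
    static int res_wait_for_owner_sketch(struct qspinlock *lock, u64 timeout_ns)
    {
            u64 deadline = ktime_get_mono_fast_ns() + timeout_ns;

            while (atomic_read(&lock->val) & _Q_LOCKED_MASK) {
                    if (ktime_get_mono_fast_ns() > deadline) {
                            /* Undo our change so the lock word can return to the
                             * unlocked state once all participants recover. */
                            atomic_andnot(_Q_PENDING_VAL, &lock->val);
                            return -ETIMEDOUT;
                    }
                    cpu_relax();
            }
            return 0;       /* owner released the lock; proceed to take it */
    }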
2025-03-19rqspinlock: Hardcode cond_acquire loops for arm64Kumar Kartikeya Dwivedi
Currently, for rqspinlock usage, the implementations of smp_cond_load_acquire (and thus atomic_cond_read_acquire) are susceptible to stalls on arm64, because they do not guarantee that the conditional expression will be repeatedly invoked if the address being loaded from is not written to by other CPUs. When support for event-streams is absent (which unblocks stuck WFE-based loops every ~100us), we may end up being stuck forever. This causes a problem for us, as we need to repeatedly invoke RES_CHECK_TIMEOUT in the spin loop to break out when the timeout expires. Let us import the smp_cond_load_acquire_timewait implementation Ankur is proposing in [0], and then fall back to it once it is merged. While we rely on the implementation to amortize the cost of sampling check_timeout for us, it will not happen when event stream support is unavailable. This is not the common case, and it would be difficult to fit our logic in the time_expr_ns >= time_limit_ns comparison, hence just let it be. [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com Cc: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add support for timeoutsKumar Kartikeya Dwivedi
Introduce policy macro RES_CHECK_TIMEOUT which can be used to detect when the timeout has expired for the slow path to return an error. It depends on being passed two variables initialized to 0: ts, ret. The 'ts' parameter is of type rqspinlock_timeout. This macro resolves to the (ret) expression so that it can be used in statements like smp_cond_load_acquire to break the waiting loop condition. The 'spin' member is used to amortize the cost of checking time by dispatching to the implementation every 64k iterations. The 'timeout_end' member is used to keep track of the timestamp that denotes the end of the waiting period. The 'ret' parameter denotes the status of the timeout, and can be checked in the slow path to detect timeouts after waiting loops. The 'duration' member is used to store the timeout duration for each waiting loop. The default timeout value defined in the header (RES_DEF_TIMEOUT) is 0.25 seconds. This macro will be used as a condition for waiting loops in the slow path. Since each waiting loop applies a fresh timeout using the same rqspinlock_timeout, we add a new RES_RESET_TIMEOUT as well to ensure the values can be easily reinitialized to the default state. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
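A sketch of the shape such a macro and its timeout struct could take, based only on the description above; the in-tree definitions differ in detail, and the helper name below is invented.

    struct rqspinlock_timeout {
            u64 timeout_end;        /* absolute expiry; 0 = not armed yet */
            u64 duration;           /* length of the current waiting loop */
            u16 spin;               /* amortizes the clock reads below */
    };

    static int check_timeout_sketch(struct rqspinlock_timeout *ts)
    {
            u64 now = ktime_get_mono_fast_ns();

            if (!ts->timeout_end) {         /* first check: arm the deadline */
                    ts->timeout_end = now + ts->duration;
                    return 0;
            }
            return now > ts->timeout_end ? -ETIMEDOUT : 0;
    }

    #define RES_CHECK_TIMEOUT_SKETCH(ts, ret)                               \
            ({                                                              \
                    if (!((ts).spin++ & 0xffff))    /* once every 64k spins */ \
                            (ret) = check_timeout_sketch(&(ts));            \
                    (ret);          /* usable directly as a loop condition */ \
            })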
2025-03-19rqspinlock: Drop PV and virtualization supportKumar Kartikeya Dwivedi
Changes to rqspinlock in subsequent commits will be algorithmic modifications, which won't remain in agreement with the implementations of paravirt spin lock and virt_spin_lock support. These future changes include measures for terminating waiting loops in slow path after a certain point. While using a fair lock like qspinlock directly inside virtual machines leads to suboptimal performance under certain conditions, we cannot use the existing virtualization support before we make it resilient as well. Therefore, drop it for now. Note that we need to drop qspinlock_stat.h, as it's only relevant in case of CONFIG_PARAVIRT_SPINLOCKS=y, but we need to keep lock_events.h in the includes, which was indirectly pulled in before. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add rqspinlock.h headerKumar Kartikeya Dwivedi
This header contains the public declarations usable in the rest of the kernel for rqspinlock. Let's also type alias qspinlock to rqspinlock_t to ensure consistent use of the new lock type. We want to remove dependence on the qspinlock type in later patches as we need to provide a test-and-set fallback, hence begin abstracting away from now onwards. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
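The alias itself is essentially a one-liner along these lines:

    /* For now rqspinlock_t is just the qspinlock layout under a new name, so it
     * can later be redefined when the test-and-set fallback is introduced. */
    typedef struct qspinlock rqspinlock_t;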
2025-03-19locking: Copy out qspinlock.c to kernel/bpf/rqspinlock.cKumar Kartikeya Dwivedi
In preparation for introducing a new lock implementation, Resilient Queued Spin Lock, or rqspinlock, we first begin our modifications by using the existing qspinlock.c code as the base. Simply copy the code to a new file and rename functions and variables from 'queued' to 'resilient_queued'. Since we place the file in kernel/bpf, include needs to be relative. This helps each subsequent commit in clearly showing how and where the code is being changed. The only change after a literal copy in this commit is renaming the functions where necessary, and rename qnodes to rqnodes. Let's also use EXPORT_SYMBOL_GPL for rqspinlock slowpath. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19locking: Allow obtaining result of arch_mcs_spin_lock_contendedKumar Kartikeya Dwivedi
To support upcoming changes that require inspecting the return value once the conditional waiting loop in arch_mcs_spin_lock_contended terminates, modify the macro to preserve the result of smp_cond_load_acquire. This enables checking the return value as needed, which will help disambiguate the MCS node’s locked state in future patches. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
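Conceptually, the change turns a statement-shaped macro into an expression whose value is the result of the acquire load, roughly as sketched below (not the exact diff):

    /* before: the value of the acquire load is discarded */
    #define arch_mcs_spin_lock_contended(l)                                 \
    do {                                                                    \
            smp_cond_load_acquire(l, VAL);                                  \
    } while (0)

    /* after: the loaded value becomes the macro's result, so callers can
     * distinguish a normal hand-off from, e.g., an abort sentinel */
    #define arch_mcs_spin_lock_contended(l)                                 \
            ({ smp_cond_load_acquire(l, VAL); })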
2025-03-19locking: Move common qspinlock helpers to a private headerKumar Kartikeya Dwivedi
Move qspinlock helper functions that encode, decode tail word, set and clear the pending and locked bits, and other miscellaneous definitions and macros to a private header. To this end, create a qspinlock.h header file in kernel/locking. Subsequent commits will introduce a modified qspinlock slow path function, thus moving shared code to a private header will help minimize unnecessary code duplication. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19RISC-V: KVM: Teardown riscv specific bits after kvm_exitAtish Patra
During a module removal, kvm_exit invokes the arch-specific disable call, which disables AIA. However, we invoke aia_exit before kvm_exit, resulting in the following warning. The KVM kernel module can't be inserted afterwards due to the inconsistent IRQ state.

[25469.031389] percpu IRQ 31 still enabled on CPU0!
[25469.031732] WARNING: CPU: 3 PID: 943 at kernel/irq/manage.c:2476 __free_percpu_irq+0xa2/0x150
[25469.031804] Modules linked in: kvm(-)
[25469.031848] CPU: 3 UID: 0 PID: 943 Comm: rmmod Not tainted 6.14.0-rc5-06947-g91c763118f47-dirty #2
[25469.031905] Hardware name: riscv-virtio,qemu (DT)
[25469.031928] epc : __free_percpu_irq+0xa2/0x150
[25469.031976] ra : __free_percpu_irq+0xa2/0x150
[25469.032197] epc : ffffffff8007db1e ra : ffffffff8007db1e sp : ff2000000088bd50
[25469.032241] gp : ffffffff8131cef8 tp : ff60000080b96400 t0 : ff2000000088baf8
[25469.032285] t1 : fffffffffffffffc t2 : 5249207570637265 s0 : ff2000000088bd90
[25469.032329] s1 : ff60000098b21080 a0 : 037d527a15eb4f00 a1 : 037d527a15eb4f00
[25469.032372] a2 : 0000000000000023 a3 : 0000000000000001 a4 : ffffffff8122dbf8
[25469.032410] a5 : 0000000000000fff a6 : 0000000000000000 a7 : ffffffff8122dc10
[25469.032448] s2 : ff60000080c22eb0 s3 : 0000000200000022 s4 : 000000000000001f
[25469.032488] s5 : ff60000080c22e00 s6 : ffffffff80c351c0 s7 : 0000000000000000
[25469.032582] s8 : 0000000000000003 s9 : 000055556b7fb490 s10: 00007ffff0e12fa0
[25469.032621] s11: 00007ffff0e13e9a t3 : ffffffff81354ac7 t4 : ffffffff81354ac7
[25469.032664] t5 : ffffffff81354ac8 t6 : ffffffff81354ac7
[25469.032698] status: 0000000200000100 badaddr: ffffffff8007db1e cause: 0000000000000003
[25469.032738] [<ffffffff8007db1e>] __free_percpu_irq+0xa2/0x150
[25469.032797] [<ffffffff8007dbfc>] free_percpu_irq+0x30/0x5e
[25469.032856] [<ffffffff013a57dc>] kvm_riscv_aia_exit+0x40/0x42 [kvm]
[25469.033947] [<ffffffff013b4e82>] cleanup_module+0x10/0x32 [kvm]
[25469.035300] [<ffffffff8009b150>] __riscv_sys_delete_module+0x18e/0x1fc
[25469.035374] [<ffffffff8000c1ca>] syscall_handler+0x3a/0x46
[25469.035456] [<ffffffff809ec9a4>] do_trap_ecall_u+0x72/0x134
[25469.035536] [<ffffffff809f5e18>] handle_exception+0x148/0x156

Invoke aia_exit and other arch-specific cleanup functions after kvm_exit so that disable gets a chance to be called first before exit. Fixes: 54e43320c2ba ("RISC-V: KVM: Initial skeletal support for AIA") Fixes: eded6754f398 ("riscv: KVM: add basic support for host vs guest profiling") Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250317-kvm_exit_fix-v1-1-aa5240c5dbd2@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
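In module-exit terms, the fix amounts to reordering the teardown roughly as follows; the function names are taken from the trace above, but the body is a sketch rather than the actual arch/riscv/kvm code.

    static void __exit riscv_kvm_exit(void)
    {
            /* Let kvm_exit() run first: it invokes the arch disable hook on each
             * CPU, which still needs the AIA per-CPU IRQ to be set up. */
            kvm_exit();

            /* Only now tear down the riscv-specific bits, freeing the per-CPU IRQ
             * after it has been disabled (other arch cleanup would follow here). */
            kvm_riscv_aia_exit();
    }
    module_exit(riscv_kvm_exit);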