summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-04-24netfilter: xt_IDLETIMER: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-24KVM: x86: Do not use kvm_rip_read() unconditionally for KVM_PROFILINGAdrian Hunter
Not all VMs allow access to RIP. Check guest_state_protected before calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for TDX VMs. Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Do not use kvm_rip_read() unconditionally in KVM tracepointsAdrian Hunter
Not all VMs allow access to RIP. Check guest_state_protected before calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for TDX VMs. Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250415104821.247234-2-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: WARN if an invalid posted interrupt IRTE entry is addedSean Christopherson
Now that the AMD IOMMU doesn't signal success incorrectly, WARN if KVM attempts to track an AMD IRTE entry without metadata. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24iommu/amd: WARN if KVM attempts to set vCPU affinity without posted intrruptsSean Christopherson
WARN if KVM attempts to set vCPU affinity when posted interrupts aren't enabled, as KVM shouldn't try to enable posting when they're unsupported, and the IOMMU driver darn well should only advertise posting support when AMD_IOMMU_GUEST_IR_VAPIC() is true. Note, KVM consumes is_guest_mode only on success. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTESean Christopherson
Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is invoked without use_vapic; lying to KVM about whether or not the IRTE was configured to post IRQs is all kinds of bad. Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producerSean Christopherson
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure irqfd->producer isn't modified while kvm_irq_routing_update() is running. The only lock held when a producer is added/removed is irqbypass's mutex. Fixes: 872768800652 ("KVM: x86: select IRQ_BYPASS_MANAGER") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Explicitly treat routing entry type changes as changesSean Christopherson
Explicitly treat type differences as GSI routing changes, as comparing MSI data between two entries could get a false negative, e.g. if userspace changed the type but left the type-specific data as-is. Fixes: 515a0c79e796 ("kvm: irqfd: avoid update unmodified entries of the routing") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Reset IRTE to host control if *new* route isn't postableSean Christopherson
Restore an IRTE back to host control (remapped or posted MSI mode) if the *new* GSI route prevents posting the IRQ directly to a vCPU, regardless of the GSI routing type. Updating the IRTE if and only if the new GSI is an MSI results in KVM leaving an IRTE posting to a vCPU. The dangling IRTE can result in interrupts being incorrectly delivered to the guest, and in the worst case scenario can result in use-after-free, e.g. if the VM is torn down, but the underlying host IRQ isn't freed. Fixes: efc644048ecd ("KVM: x86: Update IRTE for posted-interrupts") Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: Allocate IR data using atomic allocationSean Christopherson
Allocate SVM's interrupt remapping metadata using GFP_ATOMIC as svm_ir_list_add() is called with IRQs are disabled and irqfs.lock held when kvm_irq_routing_update() reacts to GSI routing changes. Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: Don't update IRTEs if APICv/AVIC is disabledSean Christopherson
Skip IRTE updates if AVIC is disabled/unsupported, as forcing the IRTE into remapped mode (kvm_vcpu_apicv_active() will never be true) is unnecessary and wasteful. The IOMMU driver is responsible for putting IRTEs into remapped mode when an IRQ is allocated by a device, long before that device is assigned to a VM. I.e. the kernel as a whole has major issues if the IRTE isn't already in remapped mode. Opportunsitically kvm_arch_has_irq_bypass() to query for APICv/AVIC, so so that all checks in KVM x86 incorporate the same information. Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250401161804.842968-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: arm64, x86: make kvm_arch_has_irq_bypass() inlinePaolo Bonzini
kvm_arch_has_irq_bypass() is a small function and even though it does not appear in any *really* hot paths, it's also not entirely rare. Make it inline---it also works out nicely in preparation for using it in kvm-intel.ko and kvm-amd.ko, since the function is not currently exported. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24block: don't autoload drivers on blk-cgroup configurationChristoph Hellwig
Loading a driver just to configure blk-cgroup doesn't make sense, as that assumes and already existing device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: don't autoload drivers on statChristoph Hellwig
blkdev_get_no_open can trigger the legacy autoload of block drivers. A simple stat of a block device has not historically done that, so disable this behavior again. Fixes: 9abcfbd235f5 ("block: Add atomic write support for statx") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: remove the backing_inode variable in bdev_statxChristoph Hellwig
backing_inode is only used once, so remove it and update the comment describing the bdev lookup to be a bit more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: move blkdev_{get,put} _no_open prototypes out of blkdev.hChristoph Hellwig
These are only to be used by block internal code. Remove the comment as we grew more users due to reworking block device node opening. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: never reduce ra_pages in blk_apply_bdi_limitsChristoph Hellwig
When the user increased the read-ahead size through sysfs this value currently get lost if the device is reprobe, including on a resume from suspend. As there is no hardware limitation for the read-ahead size there is no real need to reset it or track a separate hardware limitation like for max_sectors. This restores the pre-atomic queue limit behavior in the sd driver as sd did not use blk_queue_io_opt and thus never updated the read ahead size to the value based of the optimal I/O, but changes behavior for all other drivers. As the new behavior seems useful and sd is the driver for which the readahead size tweaks are most useful that seems like a worthwhile trade off. Fixes: 804e498e0496 ("sd: convert to the atomic queue limits API") Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutilsUday Shankar
Some distributions, such as centos stream 9, still have a version of coreutils which does not yet support the %Hr and %Lr formats for stat(1) [1, 2]. Running ublk selftests on these distributions results in the following error in tests that use the _get_disk_dev_t helper: line 23: ?r: syntax error: operand expected (error token is "?r") To better accommodate older distributions, rewrite _get_disk_dev_t to use the much older %t and %T formats for stat instead. [1] https://github.com/coreutils/coreutils/blob/v9.0/NEWS#L114 [2] https://pkgs.org/download/coreutils Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250423-ublk_selftests-v1-2-7d060e260e76@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24io_uring: don't duplicate flushing in io_req_post_cqePavel Begunkov
io_req_post_cqe() sets submit_state.cq_flush so that *flush_completions() can take care of batch commiting CQEs. Don't commit it twice by using __io_cq_unlock_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/41c416660c509cee676b6cad96081274bcb459f3.1745493861.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24Merge tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme into block-6.15Jens Axboe
Pull NVMe fix from Christoph: "nvme fixes for Linux 6.15 - fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger)" * tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme: nvmet: fix out-of-bounds access in nvmet_enable_port
2025-04-24spi: spi-qpic-snand: propagate errors from qcom_spi_block_erase()Gabor Juhos
The qcom_spi_block_erase() function returns with error in case of failure. Change the qcom_spi_send_cmdaddr() function to propagate these errors to the callers instead of returning with success. Fixes: 7304d1909080 ("spi: spi-qpic: add driver for QCOM SPI NAND flash Interface") Signed-off-by: Gabor Juhos <j4g8y7@gmail.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://patch.msgid.link/20250423-qpic-snand-propagate-error-v1-1-4b26ed45fdb5@gmail.com Reviewed-by: Md Sadre Alam <quic_mdalam@quicinc.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2025-04-24ASoC: renesas: rz-ssi: Use NOIRQ_SYSTEM_SLEEP_PM_OPS()Claudiu Beznea
In the latest kernel versions system crashes were noticed occasionally during suspend/resume. This occurs because the RZ SSI suspend trigger (called from snd_soc_suspend()) is executed after rz_ssi_pm_ops->suspend() and it accesses IP registers. After the rz_ssi_pm_ops->suspend() is executed the IP clocks are disabled and its reset line is asserted. Since snd_soc_suspend() is invoked through snd_soc_pm_ops->suspend(), snd_soc_pm_ops is associated with soc_driver (defined in sound/soc/soc-core.c), and there is no parent-child relationship between soc_driver and rz_ssi_driver the power management subsystem does not enforce a specific suspend/resume order between the RZ SSI platform driver and soc_driver. To ensure that the suspend/resume function of rz-ssi is executed after snd_soc_suspend(), use NOIRQ_SYSTEM_SLEEP_PM_OPS(). Fixes: 1fc778f7c833 ("ASoC: renesas: rz-ssi: Add suspend to RAM support") Cc: stable@vger.kernel.org Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Link: https://patch.msgid.link/20250410141525.4126502-1-claudiu.beznea.uj@bp.renesas.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-04-24net: phy: marvell-88q2xxx: Enable temperature sensor for mv88q211xNiklas Söderlund
The temperature sensor enabled for mv88q222x devices also functions for mv88q211x based devices. Unify the two devices probe functions to enable the sensors for all devices supported by this driver. The same oddity as for mv88q222x devices exists, the PHY link must be up for a correct temperature reading to be reported. # cat /sys/class/hwmon/hwmon9/temp1_input -75000 # ifconfig end5 up # cat /sys/class/hwmon/hwmon9/temp1_input 59000 Worth noting is that while the temperature register offsets and layout are the same between mv88q211x and mv88q222x devices their names in the datasheets are different. This change keeps the mv88q222x names for the mv88q211x support. Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Dimitri Fedrau <dima.fedrau@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250418145800.2420751-1-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24Merge branch 'net-stmmac-fix-timestamp-snapshots-on-dwmac1000'Paolo Abeni
Alexis Lothore says: ==================== net: stmmac: fix timestamp snapshots on dwmac1000 this is the v2 of a small series containing two small fixes for the timestamp snapshot feature on stmmac, especially on dwmac1000 version. Those issues have been detected on a socfpga (Cyclone V) platform. They kind of follow the big rework sent by Maxime at the end of last year to properly split this feature support between different versions of the DWMAC IP. v1: https://lore.kernel.org/r/20250422-stmmac_ts-v1-0-b59c9f406041@bootlin.com Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> ==================== Link: https://patch.msgid.link/20250423-stmmac_ts-v2-0-e2cf2bbd61b1@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24net: stmmac: fix multiplication overflow when reading timestampAlexis Lothoré
The current way of reading a timestamp snapshot in stmmac can lead to integer overflow, as the computation is done on 32 bits. The issue has been observed on a dwmac-socfpga platform returning chaotic timestamp values due to this overflow. The corresponding multiplication is done with a MUL instruction, which returns 32 bit values. Explicitly casting the value to 64 bits replaced the MUL with a UMLAL, which computes and returns the result on 64 bits, and so returns correctly the timestamps. Prevent this overflow by explicitly casting the intermediate value to u64 to make sure that the whole computation is made on u64. While at it, apply the same cast on the other dwmac variant (GMAC4) method for snapshot retrieval. Fixes: 477c3e1f6363 ("net: stmmac: Introduce dwmac1000 timestamping operations") Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250423-stmmac_ts-v2-2-e2cf2bbd61b1@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24net: stmmac: fix dwmac1000 ptp timestamp status offsetAlexis Lothore
When a PTP interrupt occurs, the driver accesses the wrong offset to learn about the number of available snapshots in the FIFO for dwmac1000: it should be accessing bits 29..25, while it is currently reading bits 19..16 (those are bits about the auxiliary triggers which have generated the timestamps). As a consequence, it does not compute correctly the number of available snapshots, and so possibly do not generate the corresponding clock events if the bogus value ends up being 0. Fix clock events generation by reading the correct bits in the timestamp register for dwmac1000. Fixes: 477c3e1f6363 ("net: stmmac: Introduce dwmac1000 timestamping operations") Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250423-stmmac_ts-v2-1-e2cf2bbd61b1@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24net: dp83822: Fix OF_MDIO config checkJohannes Schneider
When CONFIG_OF_MDIO is set to be a module the code block is not compiled. Use the IS_ENABLED macro that checks for both built in as well as module. Fixes: 5dc39fd5ef35 ("net: phy: DP83822: Add ability to advertise Fiber connection") Signed-off-by: Johannes Schneider <johannes.schneider@leica-geosystems.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250423044724.1284492-1-johannes.schneider@leica-geosystems.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24selftests/fs/mount-notify: test also remove/flush of mntns marksAmir Goldstein
Regression test for FAN_MARK_MNTFS | FAN_MARK_FLUSH bug. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250418193903.2607617-3-amir73il@gmail.com
2025-04-24fanotify: fix flush of mntns marksAmir Goldstein
fanotify_mark(fd, FAN_MARK_FLUSH | FAN_MARK_MNTNS, ...) incorrectly ends up causing removal inode marks. Fixes: 0f46d81f2bce ("fanotify: notify on mount attach and detach") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250418193903.2607617-2-amir73il@gmail.com
2025-04-24Merge branch 'ipv6-no-rtnl-for-ipv6-routing-table'Paolo Abeni
Kuniyuki Iwashima says: ==================== ipv6: No RTNL for IPv6 routing table. IPv6 routing tables are protected by each table's lock and work in the interrupt context, which means we basically don't need RTNL to modify an IPv6 routing table itself. Currently, the control paths require RTNL because we may need to perform device and nexthop lookups; we must prevent dev/nexthop from going away from the netns. This, however, can be achieved by RCU as well. If we are in the RCU critical section while adding an IPv6 route, synchronize_net() in __dev_change_net_namespace() and unregister_netdevice_many_notify() guarantee that the dev will not be moved to another netns or removed. Also, nexthop is guaranteed not to be freed during the RCU grace period. If we care about a race between nexthop removal and IPv6 route addition, we can get rid of RTNL from the control paths. Patch 1 moves a validation for RTA_MULTIPATH earlier. Patch 2 removes RTNL for SIOCDELRT and RTM_DELROUTE. Patch 3 ~ 11 moves validation and memory allocation earlier. Patch 12 prevents a race between two requests for the same table. Patch 13 & 14 prevents the nexthop race mentioned above. Patch 15 removes RTNL for SIOCADDRT and RTM_NEWROUTE. Test: The script [0] lets each CPU-X create 100000 routes on table-X in a batch. On c7a.metal-48xl EC2 instance with 192 CPUs, without this series: $ sudo ./route_test.sh start adding routes added 19200000 routes (100000 routes * 192 tables). total routes: 19200006 Time elapsed: 191577 milliseconds. with this series: $ sudo ./route_test.sh start adding routes added 19200000 routes (100000 routes * 192 tables). total routes: 19200006 Time elapsed: 62854 milliseconds. I changed the number of routes (1000 ~ 100000 per CPU/table) and consistently saw it finish 3x faster with this series. [0] mkdir tmp NS="test" ip netns add $NS ip -n $NS link add veth0 type veth peer veth1 ip -n $NS link set veth0 up ip -n $NS link set veth1 up TABLES=() for i in $(seq $(nproc)); do TABLES+=("$i") done ROUTES=() for i in {1..100}; do for j in {1..1000}; do ROUTES+=("2001:$i:$j::/64") done done for TABLE in "${TABLES[@]}"; do ( FILE="./tmp/batch-table-$TABLE.txt" > $FILE for ROUTE in "${ROUTES[@]}"; do echo "route add $ROUTE dev veth0 table $TABLE" >> $FILE done ) & done wait echo "start adding routes" START_TIME=$(date +%s%3N) for TABLE in "${TABLES[@]}"; do ip -n $NS -6 -batch "./tmp/batch-table-$TABLE.txt" & done wait END_TIME=$(date +%s%3N) ELAPSED_TIME=$((END_TIME - START_TIME)) echo "added $((${#ROUTES[@]} * ${#TABLES[@]})) routes (${#ROUTES[@]} routes * ${#TABLES[@]} tables)." echo "total routes: $(ip -n $NS -6 route show table all | wc -l)" # Just for debug echo "Time elapsed: ${ELAPSED_TIME} milliseconds." ip netns del $NS rm -fr ./tmp/ v2: https://lore.kernel.org/netdev/20250409011243.26195-1-kuniyu@amazon.com/ v1: https://lore.kernel.org/netdev/20250321040131.21057-1-kuniyu@amazon.com/ ==================== Link: https://patch.msgid.link/20250418000443.43734-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.Kuniyuki Iwashima
Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE. The remaining things to do are 1. pass false to lwtunnel_valid_encap_type_attr() 2. use rcu_dereference_rtnl() in fib6_check_nexthop() 3. place rcu_read_lock() before ip6_route_info_create_nh(). Let's complete the RTNL-free conversion. When each CPU-X adds 100000 routes on table-X in a batch concurrently on c7a.metal-48xl EC2 instance with 192 CPUs, without this series: $ sudo ./route_test.sh ... added 19200000 routes (100000 routes * 192 tables). time elapsed: 191577 milliseconds. with this series: $ sudo ./route_test.sh ... added 19200000 routes (100000 routes * 192 tables). time elapsed: 62854 milliseconds. I changed the number of routes in each table (1000 ~ 100000) and consistently saw it finish 3x faster with this series. Note that now every caller of lwtunnel_valid_encap_type() passes false as the last argument, and this can be removed later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-16-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Protect nh->f6i_list with spinlock and flag.Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. Then, we may be going to add a route tied to a dying nexthop. The nexthop itself is not freed during the RCU grace period, but if we link a route after __remove_nexthop_fib() is called for the nexthop, the route will be leaked. To avoid the race between IPv6 route addition under RCU vs nexthop deletion under RTNL, let's add a dead flag and protect it and nh->f6i_list with a spinlock. __remove_nexthop_fib() acquires the nexthop's spinlock and sets false to nh->dead, then calls ip6_del_rt() for the linked route one by one without the spinlock because fib6_purge_rt() acquires it later. While adding an IPv6 route, fib6_add() acquires the nexthop lock and checks the dead flag just before inserting the route. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-15-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().Kuniyuki Iwashima
The next patch adds per-nexthop spinlock which protects nh->f6i_list. When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock. fib6_add_rt2node() could call fib6_purge_rt() for another route, which could holds another nexthop lock. Then, deadlock could happen between two nexthops. Let's defer fib6_purge_rt() after fib6_add_rt2node(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-14-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Protect fib6_link_table() with spinlock.Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. If the request specifies a new table ID, fib6_new_table() is called to create a new routing table. Two concurrent requests could specify the same table ID, so we need a lock to protect net->ipv6.fib_table_hash[h]. Let's add a spinlock to protect the hash bucket linkage. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-13-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Factorise ip6_route_multipath_add().Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely on RCU to guarantee dev and nexthop lifetime. Then, the RCU section will start before ip6_route_info_create_nh() in ip6_route_multipath_add(), but ip6_route_info_create() is called in the same loop and will sleep. Let's split the loop into ip6_route_mpath_info_create() and ip6_route_mpath_info_create_nh(). Note that ip6_route_info_append() is now integrated into ip6_route_mpath_info_create_nh() because we need to call different free functions for nexthops that passed ip6_route_info_create_nh(). In case of failure, the remaining nexthops that ip6_route_info_create_nh() has not been called for will be freed by ip6_route_mpath_info_cleanup(). OTOH, if a nexthop passes ip6_route_info_create_nh(), it will be linked to a local temporary list, which will be spliced back to rt6_nh_list. In case of failure, these nexthops will be released by fib6_info_release() in ip6_route_multipath_add(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-12-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Rename rt6_nh.next to rt6_nh.list.Kuniyuki Iwashima
ip6_route_multipath_add() allocates struct rt6_nh for each config of multipath routes to link them to a local list rt6_nh_list. struct rt6_nh.next is the list node of each config, so the name is quite misleading. Let's rename it to list. Suggested-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/netdev/c9bee472-c94e-4878-8cc2-1512b2c54db5@redhat.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-11-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Don't pass net to ip6_route_info_append().Kuniyuki Iwashima
net is not used in ip6_route_info_append() after commit 36f19d5b4f99 ("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case"). Let's remove the argument. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-10-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().Kuniyuki Iwashima
ip6_route_info_create_nh() will be called under RCU. It calls fib_nh_common_init() and allocates nhc->nhc_pcpu_rth_output. As with the reason for rt->fib6_nh->rt6i_pcpu, we want to avoid GFP_ATOMIC allocation for nhc->nhc_pcpu_rth_output under RCU. Let's preallocate it in ip6_route_info_create(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-9-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().Kuniyuki Iwashima
ip6_route_info_create_nh() will be called under RCU. Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation is very likely to fail with GFP_ATOMIC while bulk-adding IPv6 routes and we would see a bunch of this message in dmesg. percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left Let's preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create(). If something fails before the original memory allocation in fib6_nh_init(), ip6_route_info_create_nh() calls fib6_info_release(), which releases the preallocated per-cpu memory. Note that rt->fib6_nh->rt6i_pcpu is not preallocated when called via ipv6_stub, so we still need alloc_percpu_gfp() in fib6_nh_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-8-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Split ip6_route_info_create().Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely on RCU to guarantee dev and nexthop lifetime. Then, we want to allocate as much as possible before entering the RCU section. The RCU section will start in the middle of ip6_route_info_create(), and this is problematic for ip6_route_multipath_add() that calls ip6_route_info_create() multiple times. Let's split ip6_route_info_create() into two parts; one for memory allocation and another for nexthop setup. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-7-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Move nexthop_find_by_id() after fib6_info_alloc().Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. Then, we must perform two lookups for nexthop and dev under RCU to guarantee their lifetime. ip6_route_info_create() calls nexthop_find_by_id() first if RTA_NH_ID is specified, and then allocates struct fib6_info. nexthop_find_by_id() must be called under RCU, but we do not want to use GFP_ATOMIC for memory allocation here, which will be likely to fail in ip6_route_multipath_add(). Let's move nexthop_find_by_id() after the memory allocation so that we can later split ip6_route_info_create() into two parts: the sleepable part and the RCU part. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-6-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().Kuniyuki Iwashima
In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each entry. If it returns false, the request fails. rt6_qualify_for_ecmp() returns false if either of the conditions below is true: 1. f6i->fib6_flags has RTF_ADDRCONF 2. f6i->nh is not NULL 3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC 1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF to cfg->fc_flags. 2. is equivalent with cfg->fc_nh_id. 3. can be replaced by checking RTF_GATEWAY in the base and each multipath entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always false. These checks do not require RCU and can be done earlier. Let's perform the equivalent checks in rtm_to_fib6_multipath_config(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-5-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().Kuniyuki Iwashima
ip6_route_info_create() is called from 3 functions: * ip6_route_add() * ip6_route_multipath_add() * addrconf_f6i_alloc() addrconf_f6i_alloc() does not need validation for struct fib6_config in ip6_route_info_create(). ip6_route_multipath_add() calls ip6_route_info_create() for multiple routes with slightly different fib6_config instances, which is copied from the base config passed from userspace. So, we need not validate the same config repeatedly. Let's move such validation into rtm_to_fib6_config(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-4-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.Kuniyuki Iwashima
Basically, removing an IPv6 route does not require RTNL because the IPv6 routing tables are protected by per table lock. inet6_rtm_delroute() calls nexthop_find_by_id() to check if the nexthop specified by RTA_NH_ID exists. nexthop uses rbtree and the top-down walk can be safely performed under RCU. ip6_route_del() already relies on RCU and the table lock, but we need to extend the RCU critical section a bit more to cover __ip6_del_rt(). For example, nexthop_for_each_fib6_nh() and inet6_rt_notify() needs RCU. Let's call nexthop_find_by_id() and __ip6_del_rt() under RCU and get rid of RTNL from inet6_rtm_delroute() and SIOCDELRT. Even if the nexthop is removed after rcu_read_unlock() in inet6_rtm_delroute(), __remove_nexthop_fib() cleans up the routes tied to the nexthop, and ip6_route_del() returns -ESRCH. So the request was at least valid as of nexthop_find_by_id(), and it's just a matter of timing. Note that we need to pass false to lwtunnel_valid_encap_type_attr(). The following patches also use the newroute bool. Note also that fib6_get_table() does not require RCU because once allocated fib6_table is not freed until netns dismantle. I will post a follow-up series to convert such callers to RCU-lockless version. [0] Link: https://lore.kernel.org/netdev/20250417174557.65721-1-kuniyu@amazon.com/ #[0] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-3-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().Kuniyuki Iwashima
We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then we want to perform some validation out of the RCU scope. When creating / removing an IPv6 route with RTA_MULTIPATH, inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY in each multipath entry. Let's do that in rtm_to_fib6_config(). Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with 0 entries, which was accepted but should result in -EINVAL as RTM_NEWROUTE. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-2-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24powerpc/boot: Fix dash warningMadhavan Srinivasan
'commit b2accfe7ca5b ("powerpc/boot: Check for ld-option support")' suppressed linker warnings, but the expressed used did not go well with POSIX shell (dash) resulting with this warning arch/powerpc/boot/wrapper: 237: [: 0: unexpected operator ld: warning: arch/powerpc/boot/zImage.epapr has a LOAD segment with RWX permissions Fix the check to handle the reported warning. Patch also fixes couple of shellcheck reported errors for the same line. In arch/powerpc/boot/wrapper line 237: if [ $(${CROSS}ld -v --no-warn-rwx-segments &>/dev/null; echo $?) -eq 0 ]; then ^-- SC2046 (warning): Quote this to prevent word splitting. ^------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^---------^ SC3020 (warning): In POSIX sh, &> is undefined. Fixes: b2accfe7ca5b ("powerpc/boot: Check for ld-option support") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250423082154.30625-1-maddy@linux.ibm.com
2025-04-23Merge branch 'pds_core-updates-and-fixes'Jakub Kicinski
Shannon Nelson says: ==================== pds_core: updates and fixes This patchset has fixes for issues seen in recent internal testing of error conditions and stress handling. Note that the first patch in this series is a leftover from an earlier patchset that was abandoned: Link: https://lore.kernel.org/netdev/20250129004337.36898-2-shannon.nelson@amd.com/ ==================== Link: https://patch.msgid.link/20250421174606.3892-1-shannon.nelson@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23pds_core: make wait_context part of q_infoShannon Nelson
Make the wait_context a full part of the q_info struct rather than a stack variable that goes away after pdsc_adminq_post() is done so that the context is still available after the wait loop has given up. There was a case where a slow development firmware caused the adminq request to time out, but then later the FW finally finished the request and sent the interrupt. The handler tried to complete_all() the completion context that had been created on the stack in pdsc_adminq_post() but no longer existed. This caused bad pointer usage, kernel crashes, and much wailing and gnashing of teeth. Fixes: 01ba61b55b20 ("pds_core: Add adminq processing and commands") Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250421174606.3892-5-shannon.nelson@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23pds_core: Remove unnecessary check in pds_client_adminq_cmd()Brett Creeley
When the pds_core driver was first created there were some race conditions around using the adminq, especially for client drivers. To reduce the possibility of a race condition there's a check against pf->state in pds_client_adminq_cmd(). This is problematic for a couple of reasons: 1. The PDSC_S_INITING_DRIVER bit is set during probe, but not cleared until after everything in probe is complete, which includes creating the auxiliary devices. For pds_fwctl this means it can't make any adminq commands until after pds_core's probe is complete even though the adminq is fully up by the time pds_fwctl's auxiliary device is created. 2. The race conditions around using the adminq have been fixed and this path is already protected against client drivers calling pds_client_adminq_cmd() if the adminq isn't ready, i.e. see pdsc_adminq_post() -> pdsc_adminq_inc_if_up(). Fix this by removing the pf->state check in pds_client_adminq_cmd() because invalid accesses to pds_core's adminq is already handled by pdsc_adminq_post()->pdsc_adminq_inc_if_up(). Fixes: 10659034c622 ("pds_core: add the aux client API") Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250421174606.3892-4-shannon.nelson@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL resultBrett Creeley
If the FW doesn't support the PDS_CORE_CMD_FW_CONTROL command the driver might at the least print garbage and at the worst crash when the user runs the "devlink dev info" devlink command. This happens because the stack variable fw_list is not 0 initialized which results in fw_list.num_fw_slots being a garbage value from the stack. Then the driver tries to access fw_list.fw_names[i] with i >= ARRAY_SIZE and runs off the end of the array. Fix this by initializing the fw_list and by not failing completely if the devcmd fails because other useful information is printed via devlink dev info even if the devcmd fails. Fixes: 45d76f492938 ("pds_core: set up device and adminq") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250421174606.3892-3-shannon.nelson@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>