Age Commit message Author
2025-04-25crypto: scompress - increment scomp_scratch_users when already allocatedSabrina Dubroca
Commit ddd0a42671c0 only increments scomp_scratch_users when it was 0, causing a panic when using ipcomp: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 UID: 0 PID: 619 Comm: ping Tainted: G N 6.15.0-rc3-net-00032-ga79be02bba5c #41 PREEMPT(full) Tainted: [N]=TEST Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:inflate_fast+0x5a2/0x1b90 [...] Call Trace: <IRQ> zlib_inflate+0x2d60/0x6620 deflate_sdecompress+0x166/0x350 scomp_acomp_comp_decomp+0x45f/0xa10 scomp_acomp_decompress+0x21/0x120 acomp_do_req_chain+0x3e5/0x4e0 ipcomp_input+0x212/0x550 xfrm_input+0x2de2/0x72f0 [...] Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Instead, let's keep the old increment, and decrement back to 0 if the scratch allocation fails. Fixes: ddd0a42671c0 ("crypto: scompress - Fix scratch allocation failure handling") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
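The counting rule described above can be modelled outside the kernel. The following is a minimal Python sketch, not the kernel code: `ScratchPool`, `acquire`, `release` and the `fail` flag are hypothetical names used only for illustration. It shows the user count being incremented on every acquisition and rolled back to 0 only when the first allocation fails.

```python
class ScratchPool:
    """Toy model of a shared scratch buffer guarded by a user count."""

    def __init__(self):
        self.users = 0        # stands in for scomp_scratch_users
        self.scratch = None   # stands in for the shared scratch buffers

    def _allocate(self, fail):
        # Hypothetical allocator; 'fail' simulates an allocation failure.
        if fail:
            raise MemoryError("scratch allocation failed")
        self.scratch = bytearray(16)

    def acquire(self, fail=False):
        # Always count the new user, even when the scratch already exists.
        self.users += 1
        if self.users > 1:
            return
        try:
            self._allocate(fail)
        except MemoryError:
            # First user and the allocation failed: undo the increment.
            self.users -= 1
            raise

    def release(self):
        self.users -= 1
        if self.users == 0:
            self.scratch = None


pool = ScratchPool()
pool.acquire()
pool.acquire()          # the second user is counted as well
pool.release()          # one user is left, so the buffer must survive
still_allocated = pool.scratch is not None
```

If the second `acquire()` did not increment the count, the single `release()` would drop it to 0 and free a buffer that is still in use, which is the kind of failure the oops above reports.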
2025-04-24ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmdMing Lei
ublk_cancel_cmd() calls io_uring_cmd_done() to complete the uring_cmd, but task work may already have been scheduled via io_uring_cmd_complete_in_task() to dispatch the request, in which case a kernel crash can be triggered. Fix it by not trying to cancel the command if the ublk block request has been started. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Reported-by: Jared Holzman <jholzman@nvidia.com> Tested-by: Jared Holzman <jholzman@nvidia.com> Closes: https://lore.kernel.org/linux-block/d2179120-171b-47ba-b664-23242981ef19@nvidia.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATAMing Lei
We call io_uring_cmd_complete_in_task() to schedule task_work for handling UBLK_U_IO_NEED_GET_DATA. This way is really not necessary because the current context is exactly the ublk queue context, so call ublk_dispatch_req() directly for handling UBLK_U_IO_NEED_GET_DATA. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Tested-by: Jared Holzman <jholzman@nvidia.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24Merge branch 'mlx5-misc-fixes-2025-04-23'Jakub Kicinski
Mark Bloch says: ==================== mlx5 misc fixes 2025-04-23 This patchset includes misc fixes from the team for the mlx5 core and Ethernet drivers. ==================== Link: https://patch.msgid.link/20250423083611.324567-1-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24net/mlx5: E-switch, Fix error handling for enabling roceChris Mi
The cited commit assumes enabling roce always succeeds. But it is not true. Add error handling for it. Fixes: 80f09dfc237f ("net/mlx5: Eswitch, enable RoCE loopback traffic") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250423083611.324567-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24net/mlx5e: Fix lock order in mlx5e_tx_reporter_ptpsq_unhealthy_recoverCosmin Ratiu
RTNL needs to be acquired before state_lock. Fixes: fdce06bda7e5 ("net/mlx5e: Acquire RTNL lock before RQs/SQs activation/deactivation") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250423083611.324567-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24net/mlx5e: TC, Continue the attr process even if encap entry is invalidJianbo Liu
Previously the offload of the rule with header rewrite and mirror to both internal and external destinations is skipped if the encap entry is not valid. But it shouldn't because driver will try to offload it again if neighbor is updated and encap entry is valid, to replace the old FTE added for slow path. But the extra split attr doesn't exist at that time as the process is skipped, driver then fails to offload it. To fix this issue, remove the checking and continue the attr process if encap entry is invalid. Fixes: b11bde56246e ("net/mlx5e: TC, Offload rewrite and mirror to both internal and external dests") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250423083611.324567-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24net/mlx5: E-Switch, Initialize MAC Address for Default GIDMaor Gottlieb
Initialize the source MAC address when creating the default GID entry. Since this entry is used only for loopback traffic, it only needs to be a unicast address. A zeroed-out MAC address is sufficient for this purpose. Without this fix, random bits would be assigned as the source address. If these bits formed a multicast address, the firmware would return an error, preventing the user from switching to switchdev mode: Error: mlx5_core: Failed setting eswitch to offloads. kernel answers: Invalid argument Fixes: 80f09dfc237f ("net/mlx5: Eswitch, enable RoCE loopback traffic") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250423083611.324567-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
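Why a zeroed MAC address is safe while random bits are not can be checked with a few lines. This is a small Python sketch under the usual Ethernet convention that the least significant bit of the first octet marks a group (multicast) address; `is_multicast` and the sample addresses are illustrative names and values, not part of the driver.

```python
def is_multicast(mac: bytes) -> bool:
    """Return True if the 6-byte MAC address has the group (multicast) bit set."""
    if len(mac) != 6:
        raise ValueError("a MAC address is 6 bytes long")
    # The least significant bit of the first octet is the individual/group bit.
    return bool(mac[0] & 0x01)


zero_mac = bytes(6)                                        # 00:00:00:00:00:00
garbage_mac = bytes([0x01, 0x00, 0x5E, 0x00, 0x00, 0x01])  # first octet is odd

zero_is_unicast = not is_multicast(zero_mac)
garbage_is_multicast = is_multicast(garbage_mac)
```

Any uninitialized value whose first octet happens to be odd would be treated as multicast, which matches the intermittent failure described above.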
2025-04-24net/mlx5e: Use custom tunnel header for vxlan gbpVlad Dogaru
Symbolic (e.g. "vxlan") and custom (e.g. "tunnel_header_0") tunnels cannot be combined, but the match params interface does not have fields for matching on vxlan gbp. To match vxlan gbp, the tc_tun layer uses tunnel_header_0. Allow matching on both VNI and GBP by matching the VNI with a custom tunnel header instead of the symbolic field name. Matching solely on the VNI continues to use the symbolic field name. Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling") Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250423083611.324567-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24xsk: Fix offset calculation in unaligned modee.kubanski
Bring back previous offset calculation behaviour in AF_XDP unaligned umem mode. In unaligned mode, upper 16 bits should contain data offset, lower 48 bits should contain only specific chunk location without offset. Remove pool->headroom duplication into 48bit address. Signed-off-by: Eryk Kubanski <e.kubanski@partner.samsung.com> Fixes: bea14124bacb ("xsk: Get rid of xdp_buff_xsk::orig_addr") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://patch.msgid.link/20250416112925.7501-1-e.kubanski@partner.samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
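The address layout described above (offset in the upper 16 bits, chunk location in the lower 48 bits) can be illustrated as follows. This is a Python sketch, not the kernel implementation: the two constants mirror the names in the AF_XDP uapi header, while `pack_addr` and `unpack_addr` are hypothetical helpers introduced here.

```python
XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48
XSK_UNALIGNED_BUF_ADDR_MASK = (1 << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1


def pack_addr(base: int, offset: int) -> int:
    """Combine a 48-bit chunk address and a 16-bit data offset into one 64-bit value."""
    if not 0 <= base <= XSK_UNALIGNED_BUF_ADDR_MASK:
        raise ValueError("base address must fit in 48 bits")
    if not 0 <= offset < (1 << 16):
        raise ValueError("offset must fit in 16 bits")
    return (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT) | base


def unpack_addr(addr: int) -> tuple:
    """Split a 64-bit descriptor address into (base, offset)."""
    base = addr & XSK_UNALIGNED_BUF_ADDR_MASK
    offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT
    return base, offset


# The headroom/offset travels only in the upper bits; the base stays untouched.
addr = pack_addr(base=0x1000, offset=256)
base, offset = unpack_addr(addr)
```

Adding the offset into the lower 48 bits as well would make `base` point past the start of the chunk, which is the duplication the commit removes.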
2025-04-24xsk: Fix race condition in AF_XDP generic RX pathe.kubanski
Move rx_lock from xsk_socket to xsk_buff_pool. Fix synchronization for shared umem mode in generic RX path where multiple sockets share single xsk_buff_pool. RX queue is exclusive to xsk_socket, while FILL queue can be shared between multiple sockets. This could result in race condition where two CPU cores access RX path of two different sockets sharing the same umem. Protect both queues by acquiring spinlock in shared xsk_buff_pool. Lock contention may be minimized in the future by some per-thread FQ buffering. It's safe and necessary to move spin_lock_bh(rx_lock) after xsk_rcv_check(): * xs->pool and spinlock_init is synchronized by xsk_bind() -> xsk_is_bound() memory barriers. * xsk_rcv_check() may return true at the moment of xsk_release() or xsk_unbind_dev(), however this will not cause any data races or race conditions. xsk_unbind_dev() removes xdp socket from all maps and waits for completion of all outstanding rx operations. Packets in RX path will either complete safely or drop. Signed-off-by: Eryk Kubanski <e.kubanski@partner.samsung.com> Fixes: bf0bdd1343efb ("xdp: fix race on generic receive path") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://patch.msgid.link/20250416101908.10919-1-e.kubanski@partner.samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24bcachefs: Rework fiemap transaction restart handlingKent Overstreet
Restart handling in the previous patch was incorrect, so: move btree operations into a separate helper, and run it with a lockrestart_do(). Additionally, clarify whether pagecache or the btree takes precedence. Right now, the btree takes precedence: this is incorrect, but it's needed to pass fstests. Add a giant comment explaining why. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24bcachefs: add fiemap delalloc extent detectionBrian Foster
bcachefs currently populates fiemap data from the extents btree. This works correctly when the fiemap sync flag is provided, but if not, it skips all delalloc extents that have not yet been flushed. This is because delalloc extents from buffered writes are first stored as reservation in the pagecache, and only become resident in the extents btree after writeback completes. Update the fiemap implementation to process holes between extents by scanning pagecache for data, via seek data/hole. If a valid data range is found over a hole in the extent btree, fake up an extent key and flag the extent as delalloc for reporting to userspace. Note that this does not necessarily change behavior for the case where there is dirty pagecache over already written extents, where when in COW mode, writeback will allocate new blocks for the underlying ranges. The existing behavior is consistent with btrfs and it is recommended to use the sync flag for the most up to date extent state from fiemap. Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24bcachefs: refactor fiemap processing into extent helper and structBrian Foster
The bulk of the loop in bch2_fiemap() involves processing the current extent key from the iter, including following indirections and trimming the extent size and such. This patch makes a few changes to reduce the size of the loop and facilitate future changes to support delalloc extents. Define a new bch_fiemap_extent structure to wrap the bkey buffer that holds the extent key to report to userspace along with associated fiemap flags. Update bch2_fill_extent() to take the bch_fiemap_extent as a param instead of the individual fields. Finally, lift the bulk of the extent processing into a bch2_fiemap_extent() helper that takes the current key and formats the bch_fiemap_extent appropriately for the fill function. No functional changes intended by this patch. Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24bcachefs: track current fiemap offset in start variableBrian Foster
Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24bcachefs: drop duplicate fiemap sync flagBrian Foster
FIEMAP_FLAG_SYNC handling was deliberately moved into core code in commit 45dd052e67ad ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"), released in kernel v5.8. Update bcachefs accordingly. Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24bcachefs: Fix btree_iter_peek_prev() at end of inodeKent Overstreet
At the end of the inode, on an extents iterator, peek_slot() has to advance to the next position to avoid returning a 0 size extent, which is not allowed. Changing iter->pos confuses peek_prev(), but we don't need to call peek_slot() in this case. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24bcachefs: Make btree_iter_peek_prev() assert more preciseKent Overstreet
The issue this assert is guarding against is that in BTREE_ITER_filter_snapshots mode we only want to be iterating within a single inode number - if we iterate into another inode number with keys for a different snapshot tree, we'll loop arbitrarily long before finding a key we can return. This comes up in the unit tests, where we're using inode 0 for our test keys. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24bcachefs: Unit test fixesKent Overstreet
The peek_end() tests expect an empty btree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24bcachefs: Print mount opts earlierKent Overstreet
If we aren't mounting with the correct degraded option, it's helpful to know that before we fail to mount degraded. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24bcachefs: unlink: casefold d_invalidateKent Overstreet
casefolding results in additional aliases on lookup for the non-casefolded names - these need invalidating on unlink. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24bcachefs: Fix casefold lookupsKent Overstreet
Add casefolding to bch2_lookup_trans: During the delay between when casefolding was written and when it was merged, the main filesystem lookup path grew self healing - which meant it was no longer using bch2_dirent_lookup_trans(), where casefolding on lookups happens. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24bcachefs: Casefold is now a regular opts.h optionKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24riscv: uprobes: Add missing fence.i after building the XOL bufferBjörn Töpel
The XOL (execute out-of-line) buffer is used to single-step the replaced instruction(s) for uprobes. The RISC-V port was missing a proper fence.i (i$ flushing) after constructing the XOL buffer, which can result in incorrect execution of stale/broken instructions. This was found running the BPF selftests "test_progs: uprobe_autoattach, attach_probe" on the Spacemit K1/X60, where the uprobes tests randomly blew up. Reviewed-by: Guo Ren <guoren@kernel.org> Fixes: 74784081aac8 ("riscv: Add uprobes supported") Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250419111402.1660267-2-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-04-24riscv: Replace function-like macro by static inline functionBjörn Töpel
The flush_icache_range() function is implemented as a "function-like macro with unused parameters", which can result in "unused variables" warnings. Replace the macro with a static inline function, as advised by Documentation/process/coding-style.rst. Fixes: 08f051eda33b ("RISC-V: Flush I$ when making a dirty page executable") Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250419111402.1660267-1-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-04-24Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "The single core change is an obvious bug fix (and falls within the LF guidelines for patches from sanctioned entities). The other driver changes are a bit larger but likewise pretty obvious" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: mpi3mr: Add level check to control event logging scsi: ufs: core: Add NULL check in ufshcd_mcq_compl_pending_transfer() scsi: core: Clear flags for scsi_cmnd that did not complete scsi: ufs: Introduce quirk to extend PA_HIBERN8TIME for UFS devices scsi: ufs: qcom: Add quirks for Samsung UFS devices scsi: target: iscsi: Fix timeout on deleted connection scsi: mpi3mr: Reset the pending interrupt flag scsi: mpi3mr: Fix pending I/O counter scsi: ufs: mcq: Add NULL check in ufshcd_mcq_abort()
2025-04-24Merge tag 'landlock-6.15-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux Pull landlock fixes from Mickaël Salaün: "Fix some Landlock audit issues, add related tests, and update documentation" * tag 'landlock-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: landlock: Update log documentation landlock: Fix documentation for landlock_restrict_self(2) landlock: Fix documentation for landlock_create_ruleset(2) selftests/landlock: Add PID tests for audit records selftests/landlock: Factor out audit fixture in audit_test landlock: Log the TGID of the domain creator landlock: Remove incorrect warning
2025-04-24x86/insn: Fix CTEST instruction decodingKirill A. Shutemov
insn_decoder_test found a problem with decoding APX CTEST instructions: Found an x86 instruction decoder bug, please report this. ffffffff810021df 62 54 94 05 85 ff ctestneq objdump says 6 bytes, but insn_get_length() says 5 It happens because x86-opcode-map.txt doesn't specify arguments for the instruction and the decoder doesn't expect to see ModRM byte. Fixes: 690ca3a3067f ("x86/insn: Add support for APX EVEX instructions to the opcode map") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org # v6.10+ Link: https://lore.kernel.org/r/20250423065815.2003231-1-kirill.shutemov@linux.intel.com
2025-04-24perf/x86: Fix non-sampling (counting) events on certain x86 platformsLuo Gengkun
Perf doesn't work at perf stat for hardware events on certain x86 platforms: $perf stat -- sleep 1 Performance counter stats for 'sleep 1': 16.44 msec task-clock # 0.016 CPUs utilized 2 context-switches # 121.691 /sec 0 cpu-migrations # 0.000 /sec 54 page-faults # 3.286 K/sec <not supported> cycles <not supported> instructions <not supported> branches <not supported> branch-misses The reason is that the check in x86_pmu_hw_config() for sampling events is unexpectedly applied to counting events as well. It should only impact x86 platforms with limit_period used for non-PEBS events. For Intel platforms, it should only impact some older platforms, e.g., HSW, BDW and NHM. Fixes: 88ec7eedbbd2 ("perf/x86: Fix low freqency setting issue") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250423064724.3716211-1-luogengkun@huaweicloud.com
2025-04-24vxlan: vnifilter: Fix unlocked deletion of default FDB entryIdo Schimmel
When a VNI is deleted from a VXLAN device in 'vnifilter' mode, the FDB entry associated with the default remote (assuming one was configured) is deleted without holding the hash lock. This is wrong and will result in a warning [1] being generated by the lockdep annotation that was added by commit ebe642067455 ("vxlan: Create wrappers for FDB lookup"). Reproducer: # ip link add vx0 up type vxlan dstport 4789 external vnifilter local 192.0.2.1 # bridge vni add vni 10010 remote 198.51.100.1 dev vx0 # bridge vni del vni 10010 dev vx0 Fix by acquiring the hash lock before the deletion and releasing it afterwards. Blame the original commit that introduced the issue rather than the one that exposed it. [1] WARNING: CPU: 3 PID: 392 at drivers/net/vxlan/vxlan_core.c:417 vxlan_find_mac+0x17f/0x1a0 [...] RIP: 0010:vxlan_find_mac+0x17f/0x1a0 [...] Call Trace: <TASK> __vxlan_fdb_delete+0xbe/0x560 vxlan_vni_delete_group+0x2ba/0x940 vxlan_vni_del.isra.0+0x15f/0x580 vxlan_process_vni_filter+0x38b/0x7b0 vxlan_vnifilter_process+0x3bb/0x510 rtnetlink_rcv_msg+0x2f7/0xb70 netlink_rcv_skb+0x131/0x360 netlink_unicast+0x426/0x710 netlink_sendmsg+0x75a/0xc20 __sock_sendmsg+0xc1/0x150 ____sys_sendmsg+0x5aa/0x7b0 ___sys_sendmsg+0xfc/0x180 __sys_sendmsg+0x121/0x1b0 do_syscall_64+0xbb/0x1d0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250423145131.513029-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24Merge tag 'wireless-2025-04-24' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Some more fixes, notably: * iwlwifi: various regression and iwlmld fixes * mac80211: fix TX frames in monitor mode * brcmfmac: error handling for firmware load * tag 'wireless-2025-04-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: iwlwifi: restore missing initialization of async_handlers_list wifi: brcm80211: fmac: Add error handling for brcmf_usb_dl_writeimage() wifi: plfxlc: Remove erroneous assert in plfxlc_mac_release wifi: iwlwifi: fix the check for the SCRATCH register upon resume wifi: iwlwifi: don't warn if the NIC is gone in resume wifi: iwlwifi: mld: fix BAID validity check wifi: iwlwifi: back off on continuous errors wifi: iwlwifi: mld: only create debugfs symlink if it does not exist wifi: iwlwifi: mld: inform trans on init failure wifi: iwlwifi: mld: properly handle async notification in op mode start Revert "wifi: iwlwifi: make no_160 more generic" Revert "wifi: iwlwifi: add support for BE213" wifi: mac80211: restore monitor for outgoing frames ==================== Link: https://patch.msgid.link/20250424120535.56499-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24Merge tag 'kvmarm-fixes-6.15-2' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.15, round #2 - Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for everyone's favorite pile of sand: Cavium ThunderX
2025-04-24io_uring: fix 'sync' handling of io_fallback_tw()Jens Axboe
A previous commit added a 'sync' parameter to io_fallback_tw(), which if true, means the caller wants to wait on the fallback thread handling it. But the logic is somewhat messed up; ensure that ctxs are swapped and flushed appropriately. Cc: stable@vger.kernel.org Fixes: dfbe5561ae93 ("io_uring: flush offloaded and delayed task_work on exit") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24x86/boot: Work around broken busybox 'truncate' toolArd Biesheuvel
The GNU coreutils version of truncate, which is the original, accepts a % prefix for the -s size argument which means the file in question should be padded to a multiple of the given size. This is currently used to pad the setup block of bzImage to a multiple of 4k before appending the decompressor. busybox reimplements truncate but does not support this idiom, and therefore fails the build since commit 9c54baab4401 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it") Since very little build code within the kernel depends on the 'truncate' utility, work around this incompatibility by avoiding truncate altogether, and relying on dd to perform the padding. Fixes: 9c54baab4401 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it") Reported-by: <phasta@kernel.org> Tested-by: Philipp Stanner <phasta@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250424101917.1552527-2-ardb+git@google.com
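The padding step itself is plain round-up arithmetic, independent of which tool performs it. Here is a minimal Python sketch of that arithmetic; `padded_size`, `pad` and the 5000-byte sample are hypothetical and only illustrate padding a blob to a multiple of 4k, not the actual build rule.

```python
def padded_size(size: int, block: int = 4096) -> int:
    """Round size up to the next multiple of block."""
    if size < 0 or block <= 0:
        raise ValueError("size must be >= 0 and block must be > 0")
    return (size + block - 1) // block * block


def pad(data: bytes, block: int = 4096) -> bytes:
    """Append zero bytes so that the result is a multiple of block bytes long."""
    return data + bytes(padded_size(len(data), block) - len(data))


setup_block = b"\x90" * 5000   # stand-in for a setup block of arbitrary size
padded = pad(setup_block)      # zero-filled up to the next 4k boundary
```

A size that is already a multiple of the block is left unchanged, so applying the padding twice is harmless.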
2025-04-24Merge tag 'net-6.15-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "No fixes from any subtree. Current release - regressions: - net: fix the missing unlock for detached devices Previous releases - regressions: - sched: fix UAF vulnerability in HFSC qdisc - lwtunnel: disable BHs when required - mptcp: pm: defer freeing of MPTCP userspace path manager entries - tipc: fix NULL pointer dereference in tipc_mon_reinit_self() - eth: virtio-net: disable delayed refill when pausing rx Previous releases - always broken: - phylink: fix suspend/resume with WoL enabled and link down - eth: - mlx5: fix null-ptr-deref in mlx5_create_{inner_,}ttc_table() - xen-netfront: handle NULL returned by xdp_convert_buff_to_frame() - enetc: fix frame corruption on bpf_xdp_adjust_head/tail() and XDP_PASS - stmmac: fix dwmac1000 ptp timestamp status offset - pds_core: prevent possible adminq overflow/stuck condition Misc: - a bunch of MAINTAINERS updates" * tag 'net-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (32 commits) net: stmmac: fix multiplication overflow when reading timestamp net: stmmac: fix dwmac1000 ptp timestamp status offset net: dp83822: Fix OF_MDIO config check pds_core: make wait_context part of q_info pds_core: Remove unnecessary check in pds_client_adminq_cmd() pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result pds_core: Prevent possible adminq overflow/stuck condition net: dsa: mt7530: sync driver-specific behavior of MT7531 variants selftests/tc-testing: Add test for HFSC queue emptying during peek operation net_sched: hfsc: Fix a potential UAF in hfsc_dequeue() too net_sched: hfsc: Fix a UAF vulnerability in class handling selftests: mptcp: diag: use mptcp_lib_get_info_value mptcp: pm: Defer freeing of MPTCP userspace path manager entries net: ethernet: mtk_eth_soc: net: revise NETSYSv3 hardware configuration tipc: fix NULL pointer dereference in tipc_mon_reinit_self() virtio-net: disable delayed refill when pausing rx net: phy: leds: fix memory leak net: phylink: mac_link_(up|down)() clarifications net: phylink: fix suspend/resume with WoL enabled and link down net: lwtunnel: disable BHs when required ...
2025-04-24Merge tag 'v6.15-p5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: - Revert acomp multibuffer tests which were buggy - Fix off-by-one regression in new scomp code - Lower quality setting on atmel-sha204a as it may not be random * tag 'v6.15-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: atmel-sha204a - Set hwrng quality to lowest possible crypto: scomp - Fix off-by-one bug when calculating last page Revert "crypto: testmgr - Add multibuffer acomp testing"
2025-04-24KVM: x86: Do not use kvm_rip_read() unconditionally for KVM_PROFILINGAdrian Hunter
Not all VMs allow access to RIP. Check guest_state_protected before calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for TDX VMs. Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Do not use kvm_rip_read() unconditionally in KVM tracepointsAdrian Hunter
Not all VMs allow access to RIP. Check guest_state_protected before calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for TDX VMs. Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250415104821.247234-2-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: WARN if an invalid posted interrupt IRTE entry is addedSean Christopherson
Now that the AMD IOMMU doesn't signal success incorrectly, WARN if KVM attempts to track an AMD IRTE entry without metadata. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24iommu/amd: WARN if KVM attempts to set vCPU affinity without posted intrruptsSean Christopherson
WARN if KVM attempts to set vCPU affinity when posted interrupts aren't enabled, as KVM shouldn't try to enable posting when they're unsupported, and the IOMMU driver darn well should only advertise posting support when AMD_IOMMU_GUEST_IR_VAPIC() is true. Note, KVM consumes is_guest_mode only on success. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTESean Christopherson
Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is invoked without use_vapic; lying to KVM about whether or not the IRTE was configured to post IRQs is all kinds of bad. Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producerSean Christopherson
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure irqfd->producer isn't modified while kvm_irq_routing_update() is running. The only lock held when a producer is added/removed is irqbypass's mutex. Fixes: 872768800652 ("KVM: x86: select IRQ_BYPASS_MANAGER") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: x86: Explicitly treat routing entry type changes as changesSean Christopherson
Explicitly treat type differences as GSI routing changes, as comparing MSI data between two entries could get a false negative, e.g. if userspace changed the type but left the type-specific data as-is. Fixes: 515a0c79e796 ("kvm: irqfd: avoid update unmodified entries of the routing") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
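The false negative mentioned above is easy to reproduce in a toy model. The following Python sketch is an illustration only: `RoutingEntry`, its string `type` field and both comparison helpers are hypothetical stand-ins, not the KVM data structures. It contrasts a payload-only comparison with one that checks the entry type first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingEntry:
    type: str    # e.g. "msi" or "irqchip" (illustrative labels)
    data: bytes  # type-specific payload, left opaque here


def payload_only_changed(old: RoutingEntry, new: RoutingEntry) -> bool:
    """Comparison that looks only at the payload and ignores the type."""
    return old.data != new.data


def routing_changed(old: RoutingEntry, new: RoutingEntry) -> bool:
    """Treat a type difference as a change before comparing payloads."""
    if old.type != new.type:
        return True
    return old.data != new.data


old = RoutingEntry("msi", b"\x01\x02")
new = RoutingEntry("irqchip", b"\x01\x02")  # type changed, payload untouched

missed = not payload_only_changed(old, new)  # the change goes unnoticed
caught = routing_changed(old, new)           # the type check catches it
```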
2025-04-24KVM: x86: Reset IRTE to host control if *new* route isn't postableSean Christopherson
Restore an IRTE back to host control (remapped or posted MSI mode) if the *new* GSI route prevents posting the IRQ directly to a vCPU, regardless of the GSI routing type. Updating the IRTE if and only if the new GSI is an MSI results in KVM leaving an IRTE posting to a vCPU. The dangling IRTE can result in interrupts being incorrectly delivered to the guest, and in the worst case scenario can result in use-after-free, e.g. if the VM is torn down, but the underlying host IRQ isn't freed. Fixes: efc644048ecd ("KVM: x86: Update IRTE for posted-interrupts") Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: Allocate IR data using atomic allocationSean Christopherson
Allocate SVM's interrupt remapping metadata using GFP_ATOMIC as svm_ir_list_add() is called with IRQs disabled and irqfds.lock held when kvm_irq_routing_update() reacts to GSI routing changes. Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: SVM: Don't update IRTEs if APICv/AVIC is disabledSean Christopherson
Skip IRTE updates if AVIC is disabled/unsupported, as forcing the IRTE into remapped mode (kvm_vcpu_apicv_active() will never be true) is unnecessary and wasteful. The IOMMU driver is responsible for putting IRTEs into remapped mode when an IRQ is allocated by a device, long before that device is assigned to a VM. I.e. the kernel as a whole has major issues if the IRTE isn't already in remapped mode. Opportunistically use kvm_arch_has_irq_bypass() to query for APICv/AVIC, so that all checks in KVM x86 incorporate the same information. Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250401161804.842968-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24KVM: arm64, x86: make kvm_arch_has_irq_bypass() inlinePaolo Bonzini
kvm_arch_has_irq_bypass() is a small function and even though it does not appear in any *really* hot paths, it's also not entirely rare. Make it inline---it also works out nicely in preparation for using it in kvm-intel.ko and kvm-amd.ko, since the function is not currently exported. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24block: don't autoload drivers on blk-cgroup configurationChristoph Hellwig
Loading a driver just to configure blk-cgroup doesn't make sense, as that assumes an already existing device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: don't autoload drivers on statChristoph Hellwig
blkdev_get_no_open can trigger the legacy autoload of block drivers. A simple stat of a block device has not historically done that, so disable this behavior again. Fixes: 9abcfbd235f5 ("block: Add atomic write support for statx") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: remove the backing_inode variable in bdev_statxChristoph Hellwig
backing_inode is only used once, so remove it and update the comment describing the bdev lookup to be a bit more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>