linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-03-07	Merge tag 'drm-misc-fixes-2025-03-06' of ↵	Dave Airlie
	https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes A Kconfig fix for nouveau, locking and timestamp fixes for imagination, a header guard fix for sched and a DPMS regression fix for bochs. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250306-antelope-of-imminent-anger-bca19e@houat
2025-03-06	x86/boot: Sanitize boot params before parsing command line	Ard Biesheuvel
	The 5-level paging code parses the command line to look for the 'no5lvl' string, and does so very early, before sanitize_boot_params() has been called and has been given the opportunity to wipe bogus data from the fields in boot_params that are not covered by struct setup_header, and are therefore supposed to be initialized to zero by the bootloader. This triggers an early boot crash when using syslinux-efi to boot a recent kernel built with CONFIG_X86_5LEVEL=y and CONFIG_EFI_STUB=n, as the 0xff padding that now fills the unused PE/COFF header is copied into boot_params by the bootloader, and interpreted as the top half of the command line pointer. Fix this by sanitizing the boot_params before use. Note that there is no harm in calling this more than once; subsequent invocations are able to spot that the boot_params have already been cleaned up. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> # v6.1+ Link: https://lore.kernel.org/r/20250306155915.342465-2-ardb+git@google.com Closes: https://lore.kernel.org/all/202503041549.35913.ulrich.gemkow@ikr.uni-stuttgart.de
2025-03-06	Merge tag 'net-6.14-rc6' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bluetooth and wireless. Current release - new code bugs: - wifi: nl80211: disable multi-link reconfiguration Previous releases - regressions: - gso: fix ownership in __udp_gso_segment - wifi: iwlwifi: - fix A-MSDU TSO preparation - free pages allocated when failing to build A-MSDU - ipv6: fix dst ref loop in ila lwtunnel - mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr - bluetooth: add check for mgmt_alloc_skb() in mgmt_device_connected() - ethtool: allow NULL nlattrs when getting a phy_device - eth: be2net: fix sleeping while atomic bugs in be_ndo_bridge_getlink Previous releases - always broken: - core: support TCP GSO case for a few missing flags - wifi: mac80211: - fix vendor-specific inheritance - cleanup sta TXQs on flush - llc: do not use skb_get() before dev_queue_xmit() - eth: ipa: nable checksum for IPA_ENDPOINT_AP_MODEM_{RX,TX} for v4.7" * tag 'net-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (41 commits) net: ipv6: fix missing dst ref drop in ila lwtunnel net: ipv6: fix dst ref loop in ila lwtunnel mctp i3c: handle NULL header address net: dsa: mt7530: Fix traffic flooding for MMIO devices net-timestamp: support TCP GSO case for a few missing flags vlan: enforce underlying device type mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device ppp: Fix KMSAN uninit-value warning with bpf net: ipa: Enable checksum for IPA_ENDPOINT_AP_MODEM_{RX,TX} for v4.7 net: ipa: Fix QSB data for v4.7 net: ipa: Fix v4.7 resource group names net: hns3: make sure ptp clock is unregister and freed if hclge_ptp_get_cycle returns an error wifi: nl80211: disable multi-link reconfiguration net: dsa: rtl8366rb: don't prompt users for LED control be2net: fix sleeping while atomic bugs in be_ndo_bridge_getlink llc: do not use skb_get() before dev_queue_xmit() wifi: cfg80211: regulatory: improve invalid hints checking caif_virtio: fix wrong pointer check in cfv_probe() net: gso: fix ownership in __udp_gso_segment ...
2025-03-06	Merge tag 'v6.14-rc5-smb3-fixes' of git://git.samba.org/ksmbd	Linus Torvalds
	Pull smb fixes from Steve French: "Five SMB server fixes, two related client fixes, and minor MAINTAINERS update: - Two SMB3 lock fixes fixes (including use after free and bug on fix) - Fix to race condition that can happen in processing IPC responses - Four ACL related fixes: one related to endianness of num_aces, and two related fixes to the checks for num_aces (for both client and server), and one fixing missing check for num_subauths which can cause memory corruption - And minor update to email addresses in MAINTAINERS file" * tag 'v6.14-rc5-smb3-fixes' of git://git.samba.org/ksmbd: cifs: fix incorrect validation for num_aces field of smb_acl ksmbd: fix incorrect validation for num_aces field of smb_acl smb: common: change the data type of num_aces to le16 ksmbd: fix bug on trap in smb2_lock ksmbd: fix use-after-free in smb2_lock ksmbd: fix type confusion via race condition when using ipc_msg_send_request ksmbd: fix out-of-bounds in parse_sec_desc() MAINTAINERS: update email address in cifs and ksmbd entry
2025-03-06	Merge tag 'exfat-for-6.14-rc6' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fixes from Namjae Jeon: - Optimize new cluster allocation by correctly find empty entry slot - Add a check to prevent excessive bitmap clearing due to invalid data size of file/dir entry - Fix incorrect error return for zero-byte writes * tag 'exfat-for-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: add a check for invalid data size exfat: short-circuit zero-byte writes in exfat_file_write_iter exfat: fix soft lockup in exfat_clear_bitmap exfat: fix just enough dentries but allocate a new cluster to dir
2025-03-06	Merge tag 'vfs-6.14-rc6.fixes' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix spelling mistakes in idmappings.rst - Fix RCU warnings in override_creds()/revert_creds() - Create new pid namespaces with default limit now that pid_max is namespaced * tag 'vfs-6.14-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: pid: Do not set pid_max in new pid namespaces doc: correcting two prefix errors in idmappings.rst cred: Fix RCU warnings in override/revert_creds
2025-03-06	fs/pipe: fix pipe buffer index use in FUSE	Linus Torvalds
	This was another case that Rasmus pointed out where the direct access to the pipe head and tail pointers broke on 32-bit configurations due to the type changes. As with the pipe FIONREAD case, fix it by using the appropriate helper functions that deal with the right pipe index sizing. Reported-by: Rasmus Villemoes <ravi@prevas.dk> Link: https://lore.kernel.org/all/878qpi5wz4.fsf@prevas.dk/ Fixes: 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex")Cc: Oleg > Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-06	fs/pipe: do not open-code pipe head/tail logic in FIONREAD	Linus Torvalds
	Rasmus points out that we do indeed have other cases of breakage from the type changes that were introduced on 32-bit targets in order to read the pipe head and tail values atomically (commit 3d252160b818: "fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex"). Fix it up by using the proper helper functions that now deal with the pipe buffer index types properly. This makes the code simpler and more obvious. The compiler does the CSE and loop hoisting of the pipe ring size masking that we used to do manually, so open-coding this was never a good idea. Reported-by: Rasmus Villemoes <ravi@prevas.dk> Link: https://lore.kernel.org/all/87cyeu5zgk.fsf@prevas.dk/ Fixes: 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex")Cc: Oleg Nesterov <oleg@redhat.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-06	fs/pipe: express 'pipe_empty()' in terms of 'pipe_occupancy()'	Linus Torvalds
	That's what 'pipe_full()' does, so it's more consistent. But more importantly it gets the type limits right when the pipe head and tail are no longer necessarily 'unsigned int'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-06	usb: typec: ucsi: Fix NULL pointer access	Andrei Kuchynski
	Resources should be released only after all threads that utilize them have been destroyed. This commit ensures that resources are not released prematurely by waiting for the associated workqueue to complete before deallocating them. Cc: stable <stable@kernel.org> Fixes: b9aa02ca39a4 ("usb: typec: ucsi: Add polling mechanism for partner tasks like alt mode checking") Signed-off-by: Andrei Kuchynski <akuchynski@chromium.org> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://lore.kernel.org/r/20250305111739.1489003-2-akuchynski@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: quirks: Add DELAY_INIT and NO_LPM for Prolific Mass Storage Card Reader	Miao Li
	When used on Huawei hisi platforms, Prolific Mass Storage Card Reader which the VID:PID is in 067b:2731 might fail to enumerate at boot time and doesn't work well with LPM enabled, combination quirks: USB_QUIRK_DELAY_INIT + USB_QUIRK_NO_LPM fixed the problems. Signed-off-by: Miao Li <limiao@kylinos.cn> Cc: stable <stable@kernel.org> Link: https://lore.kernel.org/r/20250304070757.139473-1-limiao870622@163.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	xhci: Handle spurious events on Etron host isoc enpoints	Mathias Nyman
	Unplugging a USB3.0 webcam from Etron hosts while streaming results in errors like this: [ 2.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 2.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 [ 2.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 2.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 Etron xHC generates two transfer events for the TRB if an error is detected while processing the last TRB of an isoc TD. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success. The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event. Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario. This issue is similar to the XHCI_SPURIOUS_SUCCESS case where a success event follows a 'short transfer' event, but the TD the event points to is already given back. Expand the spurious success 'short transfer' event handling to cover the spurious success after error on Etron hosts. Kuangyi Chiang reported this issue and submitted a different solution based on using error_mid_td. This commit message is mostly taken from that patch. Reported-by: Kuangyi Chiang <ki.chiang65@gmail.com> Closes: https://lore.kernel.org/linux-usb/20241028025337.6372-6-ki.chiang65@gmail.com/ Tested-by: Kuangyi Chiang <ki.chiang65@gmail.com> Tested-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-16-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: Unify duplicate inc_enq() code	Michal Pecio
	Extract a block of code copied from inc_enq() into a separate function and call it from inc_enq() and the other function which used this code. Remove the pointless 'next' variable which only aliases ring->enqueue. Note: I don't know if any 0.95 xHC ever reached series production, but "AMD 0.96 host" appears to be the "Llano" family APU. Example dmesg at https://linux-hardware.org/?probe=79d5cfd4fd&log=dmesg pci 0000:00:10.0: [1022:7812] type 00 class 0x0c0330 hcc params 0x014042c3 hci version 0x96 quirks 0x0000000000000608 Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-15-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: Apply the link chain quirk on NEC isoc endpoints	Michal Pecio
	Two clearly different specimens of NEC uPD720200 (one with start/stop bug, one without) were seen to cause IOMMU faults after some Missed Service Errors. Faulting address is immediately after a transfer ring segment and patched dynamic debug messages revealed that the MSE was received when waiting for a TD near the end of that segment: [ 1.041954] xhci_hcd: Miss service interval error for slot 1 ep 2 expected TD DMA ffa08fe0 [ 1.042120] xhci_hcd: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffa09000 flags=0x0000] [ 1.042146] xhci_hcd: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0005 address=0xffa09040 flags=0x0000] It gets even funnier if the next page is a ring segment accessible to the HC. Below, it reports MSE in segment at ff1e8000, plows through a zero-filled page at ff1e9000 and starts reporting events for TRBs in page at ff1ea000 every microframe, instead of jumping to seg ff1e6000. [ 7.041671] xhci_hcd: Miss service interval error for slot 1 ep 2 expected TD DMA ff1e8fe0 [ 7.041999] xhci_hcd: Miss service interval error for slot 1 ep 2 expected TD DMA ff1e8fe0 [ 7.042011] xhci_hcd: WARN: buffer overrun event for slot 1 ep 2 on endpoint [ 7.042028] xhci_hcd: All TDs skipped for slot 1 ep 2. Clear skip flag. [ 7.042134] xhci_hcd: WARN: buffer overrun event for slot 1 ep 2 on endpoint [ 7.042138] xhci_hcd: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 31 [ 7.042144] xhci_hcd: Looking for event-dma 00000000ff1ea040 trb-start 00000000ff1e6820 trb-end 00000000ff1e6820 [ 7.042259] xhci_hcd: WARN: buffer overrun event for slot 1 ep 2 on endpoint [ 7.042262] xhci_hcd: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 31 [ 7.042266] xhci_hcd: Looking for event-dma 00000000ff1ea050 trb-start 00000000ff1e6820 trb-end 00000000ff1e6820 At some point completion events change from Isoch Buffer Overrun to Short Packet and the HC finally finds cycle bit mismatch in ff1ec000. [ 7.098130] xhci_hcd: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13 [ 7.098132] xhci_hcd: Looking for event-dma 00000000ff1ecc50 trb-start 00000000ff1e6820 trb-end 00000000ff1e6820 [ 7.098254] xhci_hcd: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13 [ 7.098256] xhci_hcd: Looking for event-dma 00000000ff1ecc60 trb-start 00000000ff1e6820 trb-end 00000000ff1e6820 [ 7.098379] xhci_hcd: Overrun event on slot 1 ep 2 It's possible that data from the isochronous device were written to random buffers of pending TDs on other endpoints (either IN or OUT), other devices or even other HCs in the same IOMMU domain. Lastly, an error from a different USB device on another HC. Was it caused by the above? I don't know, but it may have been. The disk was working without any other issues and generated PCIe traffic to starve the NEC of upstream BW and trigger those MSEs. The two HCs shared one x1 slot by means of a commercial "PCIe splitter" board. [ 7.162604] usb 10-2: reset SuperSpeed USB device number 3 using xhci_hcd [ 7.178990] sd 9:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s [ 7.179001] sd 9:0:0:0: [sdb] tag#0 CDB: opcode=0x28 28 00 04 02 ae 00 00 02 00 00 [ 7.179004] I/O error, dev sdb, sector 67284480 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0 Fortunately, it appears that this ridiculous bug is avoided by setting the chain bit of Link TRBs on isochronous rings. Other ancient HCs are known which also expect the bit to be set and they ignore Link TRBs if it's not. Reportedly, 0.95 spec guaranteed that the bit is set. The bandwidth-starved NEC HC running a 32KB/uframe UVC endpoint reports tens of MSEs per second and runs into the bug within seconds. Chaining Link TRBs allows the same workload to run for many minutes, many times. No negative side effects seen in UVC recording and UAC playback with a few devices at full speed, high speed and SuperSpeed. The problem doesn't reproduce on the newer Renesas uPD720201/uPD720202 and on old Etron EJ168 and VIA VL805 (but the VL805 has other bug). [shorten line length of log snippets in commit messge -Mathias] Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-14-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	xhci: Prevent early endpoint restart when handling STALL errors.	Mathias Nyman
	Ensure that an endpoint halted due to device STALL is not restarted before a Clear_Feature(ENDPOINT_HALT) request is sent to the device. The host side of the endpoint may otherwise be started early by the 'Set TR Deq' command completion handler which is called if dequeue is moved past a cancelled or halted TD. Prevent this with a new flag set for bulk and interrupt endpoints when a Stall Error is received. Clear it in hcd->endpoint_reset() which is called after Clear_Feature(ENDPOINT_HALT) is sent. Also add a debug message if a class driver queues a new URB after the STALL. Note that class driver might not be aware of the STALL yet when it submits the URB as URBs are given back in BH. Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-13-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: move debug capabilities from trb_in_td() to handle_tx_event()	Niklas Neronin
	Function trb_in_td() currently includes debug capabilities that are triggered when its debug argument is set to true. The only consumer of these debug capabilities is handle_tx_event(), which calls trb_in_td() twice, once for its primary functionality and a second time solely for debugging purposes if the first call returns 'NULL'. This approach is inefficient and can lead to confusion, as trb_in_td() executes the same code with identical arguments twice, differing only in the debug output during the second execution. To enhance clarity and efficiency, move the debug capabilities out of trb_in_td() and integrates them directly into handle_tx_event(). This change reduces the argument count of trb_in_td() and ensures that debug steps are executed only when necessary, streamlining the function's operation. Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-12-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: refactor trb_in_td() to be static	Niklas Neronin
	Relocate trb_in_td() and marks it as static, as it's exclusively utilized in xhci-ring.c. This adjustment lays the groundwork for future rework of the function. The function's logic remains unchanged; only its access specifier is altered to static and a redundant "else" is removed on line 325 (due to checkpatch.pl complaining). Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-11-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: set page size to the xHCI-supported size	Niklas Neronin
	The current xHCI driver does not validate whether a page size of 4096 bytes is supported. Address the issue by setting the page size to the value supported by the xHCI controller, as read from the Page Size register. In the event of an unexpected value; default to a 4K page size. Additionally, this commit removes unnecessary debug messages and instead prints the supported and used page size once. The xHCI controller supports page sizes of (2^{(n+12)}) bytes, where 'n' is the Page Size Bit. Only one page size is supported, with a maximum page size of 128 KB. Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-10-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: correct debug message page size calculation	Niklas Neronin
	The ffs() function returns the index of the first set bit, starting from 1. If no bits are set, it returns zero. This behavior causes an off-by-one page size in the debug message, as the page size calculation [1] is zero-based, while ffs() is one-based. Fix this by subtracting one from the result of ffs(). Note that since variable 'val' is unsigned, subtracting one from zero will result in the maximum unsigned integer value. Consequently, the condition 'if (val < 16)' will still function correctly. [1], Page size: (2^(n+12)), where 'n' is the set page size bit. Fixes: 81720ec5320c ("usb: host: xhci: use ffs() in xhci_mem_init()") Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-9-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: Skip only one TD on Ring Underrun/Overrun	Michal Pecio
	If skipping is deferred to events other than Missed Service Error itsef, it means we are running on an xHCI 1.0 host and don't know how many TDs were missed until we reach some ordinary transfer completion event. And in case of ring xrun, we can't know where the xrun happened either. If we skip all pending TDs, we may prematurely give back TDs added after the xrun had occurred, risking data loss or buffer UAF by the xHC. If we skip none, a driver may become confused and stop working when all its URBs are missed and appear to be "in flight" forever. Skip exactly one TD on each xrun event - the first one that was missed, as we can now be sure that the HC has finished processing it. Provided that one more TD is queued before any subsequent doorbell ring, it will become safe to skip another TD by the time we get an xrun again. Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-8-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: Expedite skipping missed isoch TDs on modern HCs	Michal Pecio
	xHCI spec rev. 1.0 allowed the TRB pointer of Missed Service events to be NULL. Having no idea which of the queued TDs were missed and which are waiting, we can only set a flag to skip missed TDs later. But HCs are also allowed to give us pointer to the last missed TRB, and this became mandatory in spec rev. 1.1 and later. Use this pointer, if available, to immediately skip all missed TDs. This reduces latency and risk of skipping-related bugs, because we can now leave the skip flag cleared for future events. Handle Missed Service Error events as 'error mid TD', if applicable, because rev. 1.0 spec excplicitly says so in notes to 4.10.3.2 and later revs in 4.10.3.2 and 4.11.2.5.2. Notes to 4.9.1 seem to apply. Tested on ASM1142 and ASM3142 v1.1 xHCs which provide TRB pointers. Tested on AMD, Etron, Renesas v1.0 xHCs which provide TRB pointers. Tested on NEC v0.96 and VIA v1.0 xHCs which send a NULL pointer. Change inspired by a discussion about realtime USB audio. Link: https://lore.kernel.org/linux-usb/76e1a191-020d-4a76-97f6-237f9bd0ede0@gmx.net/T/ Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-7-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: Fix isochronous Ring Underrun/Overrun event handling	Michal Pecio
	The TRB pointer of these events points at enqueue at the time of error occurrence on xHCI 1.1+ HCs or it's NULL on older ones. By the time we are handling the event, a new TD may be queued at this ring position. I can trigger this race by rising interrupt moderation to increase IRQ handling delay. Similar delay may occur naturally due to system load. If this ever happens after a Missed Service Error, missed TDs will be skipped and the new TD processed as if it matched the event. It could be given back prematurely, risking data loss or buffer UAF by the xHC. Don't complete TDs on xrun events and don't warn if queued TDs don't match the event's TRB pointer, which can be NULL or a link/no-op TRB. Don't warn if there are no queued TDs at all. Now that it's safe, also handle xrun events if the skip flag is clear. This ensures completion of any TD stuck in 'error mid TD' state right before the xrun event, which could happen if a driver submits a finite number of URBs to a buggy HC and then an error occurs on the last TD. Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-6-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: Complete 'error mid TD' transfers when handling Missed Service	Michal Pecio
	Missed Service Error after an error mid TD means that the failed TD has already been passed by the xHC without acknowledgment of the final TRB, a known hardware bug. So don't wait any more and give back the TD. Reproduced on NEC uPD720200 under conditions of ludicrously bad USB link quality, confirmed to behave as expected using dynamic debug. Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-5-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: Don't skip on Stopped - Length Invalid	Michal Pecio
	Up until commit d56b0b2ab142 ("usb: xhci: ensure skipped isoc TDs are returned when isoc ring is stopped") in v6.11, the driver didn't skip missed isochronous TDs when handling Stoppend and Stopped - Length Invalid events. Instead, it erroneously cleared the skip flag, which would cause the ring to get stuck, as future events won't match the missed TD which is never removed from the queue until it's cancelled. This buggy logic seems to have been in place substantially unchanged since the 3.x series over 10 years ago, which probably speaks first and foremost about relative rarity of this case in normal usage, but by the spec I see no reason why it shouldn't be possible. After d56b0b2ab142, TDs are immediately skipped when handling those Stopped events. This poses a potential problem in case of Stopped - Length Invalid, which occurs either on completed TDs (likely already given back) or Link and No-Op TRBs. Such event won't be recognized as matching any TD (unless it's the rare Link TRB inside a TD) and will result in skipping all pending TDs, giving them back possibly before they are done, risking isoc data loss and maybe UAF by HW. As a compromise, don't skip and don't clear the skip flag on this kind of event. Then the next event will skip missed TDs. A downside of not handling Stopped - Length Invalid on a Link inside a TD is that if the TD is cancelled, its actual length will not be updated to account for TRBs (silently) completed before the TD was stopped. I had no luck producing this sequence of completion events so there is no compelling demonstration of any resulting disaster. It may be a very rare, obscure condition. The sole motivation for this patch is that if such unlikely event does occur, I'd rather risk reporting a cancelled partially done isoc frame as empty than gamble with UAF. This will be fixed more properly by looking at Stopped event's TRB pointer when making skipping decisions, but such rework is unlikely to be backported to v6.12, which will stay around for a few years. Fixes: d56b0b2ab142 ("usb: xhci: ensure skipped isoc TDs are returned when isoc ring is stopped") Cc: stable@vger.kernel.org Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-4-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	usb: xhci: remove redundant update_ring_for_set_deq_completion() function	Niklas Neronin
	The function is a remnant from a previous implementation and is now redundant. There is no longer a need to search for the dequeue pointer, as both the TRB and segment dequeue pointers are saved within 'queued_deq_seg' and 'queued_deq_ptr'. Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-3-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	xhci: show correct U1 and U2 timeout values in debug messages	Mathias Nyman
	U2 value is encoded in 256 microsecond intervals, show in microseconds. U1 value is in microseconds. debug message incorrectly showed "ms" Unwrap debug messages while we anyway modify them. Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20250306144954.3507700-2-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-06	gpio: rcar: Fix missing of_node_put() call	Fabrizio Castro
	of_parse_phandle_with_fixed_args() requires its caller to call into of_node_put() on the node pointer from the output structure, but such a call is currently missing. Call into of_node_put() to rectify that. Fixes: 159f8a0209af ("gpio-rcar: Add DT support") Signed-off-by: Fabrizio Castro <fabrizio.castro.jz@renesas.com> Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20250305163753.34913-2-fabrizio.castro.jz@renesas.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-03-06	btrfs: fix a leaked chunk map issue in read_one_chunk()	Haoxiang Li
	Add btrfs_free_chunk_map() to free the memory allocated by btrfs_alloc_chunk_map() if btrfs_add_chunk_map() fails. Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps") CC: stable@vger.kernel.org Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-06	Merge tag 'nvme-6.14-2025-03-05' of git://git.infradead.org/nvme into block-6.14	Jens Axboe
	Pull NVMe fixe from Keith: "nvme fixes for Linux 6.14 - TCP use after free fix on polling (Sagi) - Controller memory buffer cleanup fixes (Icenowy) - Free leaking requests on bad user passthrough commands (Keith) - TCP error message fix (Maurizio) - TCP corruption fix on partial PDU (Maurizio) - TCP memory ordering fix for weakly ordered archs (Meir) - Type coercion fix on message error for TCP (Dan)" * tag 'nvme-6.14-2025-03-05' of git://git.infradead.org/nvme: nvme-tcp: fix signedness bug in nvme_tcp_init_connection() nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu() nvme-tcp: Fix a C2HTermReq error message nvmet: remove old function prototype nvme-ioctl: fix leaked requests on mapping error nvme-pci: skip CMB blocks incompatible with PCI P2P DMA nvme-pci: clean up CMBMSC when registering CMB fails nvme-tcp: fix possible UAF in nvme_tcp_poll
2025-03-06	kbuild: install-extmod-build: Fix build when specifying KBUILD_OUTPUT	Inochi Amaoto
	Since commit 5f73e7d0386d ("kbuild: refactor cross-compiling linux-headers package"), the linux-headers pacman package fails to build when "O=" is set. The build system complains: /mnt/chroot/linux/scripts/Makefile.build:41: mnt/chroots/linux-mainline/pacman/linux-upstream/pkg/linux-upstream-headers/usr//lib/modules/6.14.0-rc3-00350-g771dba31fffc/build/scripts/Makefile: No such file or directory This is because the "srcroot" variable is set to "." and the "build" variable is set to the absolute path. This makes the "src" variables point to wrong directory. Change the "build" variable to a relative path to "." to fix build. Fixes: 5f73e7d0386d ("kbuild: refactor cross-compiling linux-headers package") Signed-off-by: Inochi Amaoto <inochiama@gmail.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-03-06	net: ipv6: fix missing dst ref drop in ila lwtunnel	Justin Iurman
	Add missing skb_dst_drop() to drop reference to the old dst before adding the new dst to the skb. Fixes: 79ff2fc31e0f ("ila: Cache a route to translated address") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250305081655.19032-1-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-06	net: ipv6: fix dst ref loop in ila lwtunnel	Justin Iurman
	This patch follows commit 92191dd10730 ("net: ipv6: fix dst ref loops in rpl, seg6 and ioam6 lwtunnels") and, on a second thought, the same patch is also needed for ila (even though the config that triggered the issue was pathological, but still, we don't want that to happen). Fixes: 79ff2fc31e0f ("ila: Cache a route to translated address") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250304181039.35951-1-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-06	mctp i3c: handle NULL header address	Matt Johnston
	daddr can be NULL if there is no neighbour table entry present, in that case the tx packet should be dropped. saddr will usually be set by MCTP core, but check for NULL in case a packet is transmitted by a different protocol. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Fixes: c8755b29b58e ("mctp i3c: MCTP I3C driver") Link: https://patch.msgid.link/20250304-mctp-i3c-null-v1-1-4416bbd56540@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-06	sched/rt: Update limit of sched_rt sysctl in documentation	Shrikanth Hegde
	By default fair_server dl_server allocates 5% of the bandwidth to the root domain. Due to this writing any value less than 5% fails due to -EBUSY: $ cat /proc/sys/kernel/sched_rt_period_us 1000000 $ echo 49999 > /proc/sys/kernel/sched_rt_runtime_us -bash: echo: write error: Device or resource busy $ echo 50000 > /proc/sys/kernel/sched_rt_runtime_us $ Since the sched_rt_runtime_us allows -1 as the minimum, put this restriction in the documentation. One should check average of runtime/period in /sys/kernel/debug/sched/fair_server/cpuX/* for exact value. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20250306052954.452005-3-sshegde@linux.ibm.com
2025-03-06	sched/deadline: Use online cpus for validating runtime	Shrikanth Hegde
	The ftrace selftest reported a failure because writing -1 to sched_rt_runtime_us returns -EBUSY. This happens when the possible CPUs are different from active CPUs. Active CPUs are part of one root domain, while remaining CPUs are part of def_root_domain. Since active cpumask is being used, this results in cpus=0 when a non active CPUs is used in the loop. Fix it by looping over the online CPUs instead for validating the bandwidth calculations. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20250306052954.452005-2-sshegde@linux.ibm.com
2025-03-06	pid: Do not set pid_max in new pid namespaces	Michal Koutný
	It is already difficult for users to troubleshoot which of multiple pid limits restricts their workload. The per-(hierarchical-)NS pid_max would contribute to the confusion. Also, the implementation copies the limit upon creation from parent, this pattern showed cumbersome with some attributes in legacy cgroup controllers -- it's subject to race condition between parent's limit modification and children creation and once copied it must be changed in the descendant. Let's do what other places do (ucounts or cgroup limits) -- create new pid namespaces without any limit at all. The global limit (actually any ancestor's limit) is still effectively in place, we avoid the set/unshare race and bumps of global (ancestral) limit have the desired effect on pid namespace that do not care. Link: https://lore.kernel.org/r/20240408145819.8787-1-mkoutny@suse.com/ Link: https://lore.kernel.org/r/20250221170249.890014-1-mkoutny@suse.com/ Fixes: 7863dcc72d0f4 ("pid: allow pid_max to be set per pid namespace") Signed-off-by: Michal Koutný <mkoutny@suse.com> Link: https://lore.kernel.org/r/20250305145849.55491-1-mkoutny@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-06	drm/bochs: Fix DPMS regression	Takashi Iwai
	The recent rewrite with the use of regular atomic helpers broke the DPMS unblanking on X11. Fix it by moving the call of bochs_hw_blank(false) from CRTC mode_set_nofb() to atomic_enable(). Fixes: 2037174993c8 ("drm/bochs: Use regular atomic helpers") Link: https://bugzilla.suse.com/show_bug.cgi?id=1238209 Signed-off-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20250304134203.20534-1-tiwai@suse.de
2025-03-05	mm/page_alloc: fix uninitialized variable	Hao Zhang
	The variable "compact_result" is not initialized in function __alloc_pages_slowpath(). It causes should_compact_retry() to use an uninitialized value. Initialize variable "compact_result" with the value COMPACT_SKIPPED. BUG: KMSAN: uninit-value in __alloc_pages_slowpath+0xee8/0x16c0 mm/page_alloc.c:4416 __alloc_pages_slowpath+0xee8/0x16c0 mm/page_alloc.c:4416 __alloc_frozen_pages_noprof+0xa4c/0xe00 mm/page_alloc.c:4752 alloc_pages_mpol+0x4cd/0x890 mm/mempolicy.c:2270 alloc_frozen_pages_noprof mm/mempolicy.c:2341 [inline] alloc_pages_noprof mm/mempolicy.c:2361 [inline] folio_alloc_noprof+0x1dc/0x350 mm/mempolicy.c:2371 filemap_alloc_folio_noprof+0xa6/0x440 mm/filemap.c:1019 __filemap_get_folio+0xb9a/0x1840 mm/filemap.c:1970 grow_dev_folio fs/buffer.c:1039 [inline] grow_buffers fs/buffer.c:1105 [inline] __getblk_slow fs/buffer.c:1131 [inline] bdev_getblk+0x2c9/0xab0 fs/buffer.c:1431 getblk_unmovable include/linux/buffer_head.h:369 [inline] ext4_getblk+0x3b7/0xe50 fs/ext4/inode.c:864 ext4_bread_batch+0x9f/0x7d0 fs/ext4/inode.c:933 __ext4_find_entry+0x1ebb/0x36c0 fs/ext4/namei.c:1627 ext4_lookup_entry fs/ext4/namei.c:1729 [inline] ext4_lookup+0x189/0xb40 fs/ext4/namei.c:1797 __lookup_slow+0x538/0x710 fs/namei.c:1793 lookup_slow+0x6a/0xd0 fs/namei.c:1810 walk_component fs/namei.c:2114 [inline] link_path_walk+0xf29/0x1420 fs/namei.c:2479 path_openat+0x30f/0x6250 fs/namei.c:3985 do_filp_open+0x268/0x600 fs/namei.c:4016 do_sys_openat2+0x1bf/0x2f0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x2a1/0x310 fs/open.c:1454 x64_sys_call+0x36f5/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:258 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Local variable compact_result created at: __alloc_pages_slowpath+0x66/0x16c0 mm/page_alloc.c:4218 __alloc_frozen_pages_noprof+0xa4c/0xe00 mm/page_alloc.c:4752 Link: https://lkml.kernel.org/r/tencent_ED1032321D6510B145CDBA8CBA0093178E09@qq.com Reported-by: syzbot+0cfd5e38e96a5596f2b6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0cfd5e38e96a5596f2b6 Signed-off-by: Hao Zhang <zhanghao1@kylinos.cn> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	rapidio: add check for rio_add_net() in rio_scan_alloc_net()	Haoxiang Li
	The return value of rio_add_net() should be checked. If it fails, put_device() should be called to free the memory and give up the reference initialized in rio_add_net(). Link: https://lkml.kernel.org/r/20250227041131.3680761-1-haoxiang_li2024@163.com Fixes: e6b585ca6e81 ("rapidio: move net allocation into core code") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	rapidio: fix an API misues when rio_add_net() fails	Haoxiang Li
	rio_add_net() calls device_register() and fails when device_register() fails. Thus, put_device() should be used rather than kfree(). Add "mport->net = NULL;" to avoid a use after free issue. Link: https://lkml.kernel.org/r/20250227073409.3696854-1-haoxiang_li2024@163.com Fixes: e8de370188d0 ("rapidio: add mport char device driver") Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Yang Yingliang <yangyingliang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	MAINTAINERS: .mailmap: update Sumit Garg's email address	Sumit Garg
	Update Sumit Garg's email address to @kernel.org. Link: https://lkml.kernel.org/r/20250227113228.1809449-1-sumit.garg@linaro.org Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jens Wiklander <jens.wiklander@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] ↵	Gabriel Krisman Bertazi
	for empty zone" Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone") removes the protection of lower zones from allocations targeting memory-less high zones. This had an unintended impact on the pattern of reclaims because it makes the high-zone-targeted allocation more likely to succeed in lower zones, which adds pressure to said zones. I.e, the following corresponding checks in zone_watermark_ok/zone_watermark_fast are less likely to trigger: if (free_pages <= min + z->lowmem_reserve[highest_zoneidx]) return false; As a result, we are observing an increase in reclaim and kswapd scans, due to the increased pressure. This was initially observed as increased latency in filesystem operations when benchmarking with fio on a machine with some memory-less zones, but it has since been associated with increased contention in locks related to memory reclaim. By reverting this patch, the original performance was recovered on that machine. The original commit was introduced as a clarification of the /proc/zoneinfo output, so it doesn't seem there are usecases depending on it, making the revert a simple solution. For reference, I collected vmstat with and without this patch on a freshly booted system running intensive randread io from an nvme for 5 minutes. I got: rpm-6.12.0-slfo.1.2 -> pgscan_kswapd 5629543865 Patched -> pgscan_kswapd 33580844 33M scans is similar to what we had in kernels predating this patch. These numbers is fairly representative of the workload on this machine, as measured in several runs. So we are talking about a 2-order of magnitude increase. Link: https://lkml.kernel.org/r/20250226032258.234099-1-krisman@suse.de Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	mm: fix finish_fault() handling for large folios	Brian Geffon
	When handling faults for anon shmem finish_fault() will attempt to install ptes for the entire folio. Unfortunately if it encounters a single non-pte_none entry in that range it will bail, even if the pte that triggered the fault is still pte_none. When this situation happens the fault will be retried endlessly never making forward progress. This patch fixes this behavior and if it detects that a pte in the range is not pte_none it will fall back to setting a single pte. [bgeffon@google.com: tweak whitespace] Link: https://lkml.kernel.org/r/20250227133236.1296853-1-bgeffon@google.com Link: https://lkml.kernel.org/r/20250226162341.915535-1-bgeffon@google.com Fixes: 43e027e41423 ("mm: memory: extend finish_fault() to support large folio") Signed-off-by: Brian Geffon <bgeffon@google.com> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Marek Maslanka <mmaslanka@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	mm: don't skip arch_sync_kernel_mappings() in error paths	Ryan Roberts
	Fix callers that previously skipped calling arch_sync_kernel_mappings() if an error occurred during a pgtable update. The call is still required to sync any pgtable updates that may have occurred prior to hitting the error condition. These are theoretical bugs discovered during code review. Link: https://lkml.kernel.org/r/20250226121610.2401743-1-ryan.roberts@arm.com Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Fixes: 0c95cba49255 ("mm: apply_to_pte_range warn and fail if a large pte is encountered") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christop Hellwig <hch@infradead.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	mm: shmem: remove unnecessary warning in shmem_writepage()	Ricardo Cañuelo Navarro
	Although the scenario where shmem_writepage() is called with info->flags & VM_LOCKED is unlikely to happen, it's still possible, as evidenced by syzbot [1]. However, the warning in this case isn't necessary because the situation is already handled correctly [2]. [2] https://lore.kernel.org/lkml/8afe1f7f-31a2-4fc0-1fbd-f9ba8a116fe3@google.com/ Link: https://lkml.kernel.org/r/20250226-20250221-warning-in-shmem_writepage-v1-1-5ad19420e17e@igalia.com Fixes: 9a976f0c847b ("shmem: skip page split if we're not reclaiming") Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Closes: https://lore.kernel.org/lkml/ZZ9PShXjKJkVelNm@xpf.sh.intel.com/ [1] Suggested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Florent Revest <revest@chromium.org> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Florent Revest <revest@chromium.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	userfaultfd: fix PTE unmapping stack-allocated PTE copies	Suren Baghdasaryan
	Current implementation of move_pages_pte() copies source and destination PTEs in order to detect concurrent changes to PTEs involved in the move. However these copies are also used to unmap the PTEs, which will fail if CONFIG_HIGHPTE is enabled because the copies are allocated on the stack. Fix this by using the actual PTEs which were kmap()ed. Link: https://lkml.kernel.org/r/20250226185510.2732648-3-surenb@google.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	userfaultfd: do not block on locking a large folio with raised refcount	Suren Baghdasaryan
	Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock state when it goes into split_folio() with raised folio refcount. split_folio() expects the reference count to be exactly mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails with EAGAIN otherwise. If multiple processes are trying to move the same large folio, they raise the refcount (all tasks succeed in that) then one of them succeeds in locking the folio, while others will block in folio_lock() while keeping the refcount raised. The winner of this race will proceed with calling split_folio() and will fail returning EAGAIN to the caller and unlocking the folio. The next competing process will get the folio locked and will go through the same flow. In the meantime the original winner will be retried and will block in folio_lock(), getting into the queue of waiting processes only to repeat the same path. All this results in a livelock. An easy fix would be to avoid waiting for the folio lock while holding folio refcount, similar to madvise_free_huge_pmd() where folio lock is acquired before raising the folio refcount. Since we lock and take a refcount of the folio while holding the PTE lock, changing the order of these operations should not break anything. Modify move_pages_pte() to try locking the folio first and if that fails and the folio is large then return EAGAIN without touching the folio refcount. If the folio is single-page then split_folio() is not called, so we don't have this issue. Lokesh has a reproducer [1] and I verified that this change fixes the issue. [1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock [akpm@linux-foundation.org: reflow comment to 80 cols, s/end/end up/] Link: https://lkml.kernel.org/r/20250226185510.2732648-2-surenb@google.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Lokesh Gidra <lokeshgidra@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pages	Sun YangKai
	This is currently the only atomic_long_t variable initialized by ATOMIC_INIT macro found in the kernel by using `grep -r atomic_long_t \| grep ATOMIC_INIT` This was introduced in 6e1fa555ec77, in which we modified the type of zswap_stored_pages to atomic_long_t, but didn't change the initialization. Link: https://lkml.kernel.org/r/20250226153253.19179-1-sunk67188@gmail.com Fixes: 6e1fa555ec77 ("mm: zswap: modify zswap_stored_pages to be atomic_long_t") Signed-off-by: Sun YangKai <sunk67188@gmail.com> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	mm: shmem: fix potential data corruption during shmem swapin	Baolin Wang
	Alex and Kairui reported some issues (system hang or data corruption) when swapping out or swapping in large shmem folios. This is especially easy to reproduce when the tmpfs is mount with the 'huge=within_size' parameter. Thanks to Kairui's reproducer, the issue can be easily replicated. The root cause of the problem is that swap readahead may asynchronously swap in order 0 folios into the swap cache, while the shmem mapping can still store large swap entries. Then an order 0 folio is inserted into the shmem mapping without splitting the large swap entry, which overwrites the original large swap entry, leading to data corruption. When getting a folio from the swap cache, we should split the large swap entry stored in the shmem mapping if the orders do not match, to fix this issue. Link: https://lkml.kernel.org/r/2fe47c557e74e9df5fe2437ccdc6c9115fa1bf70.1740476943.git.baolin.wang@linux.alibaba.com Fixes: 809bc86517cc ("mm: shmem: support large folio swap out") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Reported-by: Kairui Song <ryncsn@gmail.com> Closes: https://lore.kernel.org/all/1738717785.im3r5g2vxc.none@localhost/ Tested-by: Kairui Song <kasong@tencent.com> Cc: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcow <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05	mm: fix kernel BUG when userfaultfd_move encounters swapcache	Barry Song
	userfaultfd_move() checks whether the PTE entry is present or a swap entry. - If the PTE entry is present, move_present_pte() handles folio migration by setting: src_folio->index = linear_page_index(dst_vma, dst_addr); - If the PTE entry is a swap entry, move_swap_pte() simply copies the PTE to the new dst_addr. This approach is incorrect because, even if the PTE is a swap entry, it can still reference a folio that remains in the swap cache. This creates a race window between steps 2 and 4. 1. add_to_swap: The folio is added to the swapcache. 2. try_to_unmap: PTEs are converted to swap entries. 3. pageout: The folio is written back. 4. Swapcache is cleared. If userfaultfd_move() occurs in the window between steps 2 and 4, after the swap PTE has been moved to the destination, accessing the destination triggers do_swap_page(), which may locate the folio in the swapcache. However, since the folio's index has not been updated to match the destination VMA, do_swap_page() will detect a mismatch. This can result in two critical issues depending on the system configuration. If KSM is disabled, both small and large folios can trigger a BUG during the add_rmap operation due to: page_pgoff(folio, page) != linear_page_index(vma, address) [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 [ 13.337716] memcg:ffff00000405f000 [ 13.337849] anon flags: 0x3fffc0000020459(locked\|uptodate\|dirty\|owner_priv_1\|head\|swapbacked\|node=0\|zone=0\|lastcpupid=0xffff) [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) [ 13.340190] ------------[ cut here ]------------ [ 13.340316] kernel BUG at mm/rmap.c:1380! [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 13.340969] Modules linked in: [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 [ 13.341470] Hardware name: linux,dummy-virt (DT) [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 [ 13.342018] sp : ffff80008752bb20 [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f [ 13.343876] Call trace: [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 [ 13.344333] do_swap_page+0x1060/0x1400 [ 13.344417] __handle_mm_fault+0x61c/0xbc8 [ 13.344504] handle_mm_fault+0xd8/0x2e8 [ 13.344586] do_page_fault+0x20c/0x770 [ 13.344673] do_translation_fault+0xb4/0xf0 [ 13.344759] do_mem_abort+0x48/0xa0 [ 13.344842] el0_da+0x58/0x130 [ 13.344914] el0t_64_sync_handler+0xc4/0x138 [ 13.345002] el0t_64_sync+0x1ac/0x1b0 [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) [ 13.345504] ---[ end trace 0000000000000000 ]--- [ 13.345715] note: a.out[107] exited with irqs disabled [ 13.345954] note: a.out[107] exited with preempt_count 2 If KSM is enabled, Peter Xu also discovered that do_swap_page() may trigger an unexpected CoW operation for small folios because ksm_might_need_to_copy() allocates a new folio when the folio index does not match linear_page_index(vma, addr). This patch also checks the swapcache when handling swap entries. If a match is found in the swapcache, it processes it similarly to a present PTE. However, there are some differences. For example, the folio is no longer exclusive because folio_try_share_anon_rmap_pte() is performed during unmapping. Furthermore, in the case of swapcache, the folio has already been unmapped, eliminating the risk of concurrent rmap walks and removing the need to acquire src_folio's anon_vma or lock. Note that for large folios, in the swapcache handling path, we directly return -EBUSY since split_folio() will return -EBUSY regardless if the folio is under writeback or unmapped. This is not an urgent issue, so a follow-up patch may address it separately. [v-songbaohua@oppo.com: minor cleanup according to Peter Xu] Link: https://lkml.kernel.org/r/20250226024411.47092-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20250226001400.9129-1-21cnbao@gmail.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>