summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-03-18net: ethtool: Export the link_mode_params definitionsMaxime Chevallier
link_mode_params contains a lookup table of all 802.3 link modes that are currently supported with structured data about each mode's speed, duplex, number of lanes and mediums. As a preparation for a port representation, export that table for the rest of the net stack to use. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-2-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18efivarfs: fix NULL dereference on resumeJames Bottomley
LSMs often inspect the path.mnt of files in the security hooks, and this causes a NULL deref in efivarfs_pm_notify() because the path is constructed with a NULL path.mnt. Fix by obtaining from vfs_kern_mount() instead, and being very careful to ensure that deactivate_super() (potentially triggered by a racing userspace umount) is not called directly from the notifier, because it would deadlock when efivarfs_kill_sb() tried to unregister the notifier chain. [ Al notes: Umm... That's probably safe, but not as a long-term solution - it's too intimately dependent upon fs/super.c internals. The reasons why you can't run into ->s_umount deadlock here are non-trivial... ] Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Link: https://lore.kernel.org/r/e54e6a2f-1178-4980-b771-4d9bafc2aa47@tnxip.de Link: https://lore.kernel.org/r/3e998bf87638a442cbc6864cdcd3d8d9e08ce3e3.camel@HansenPartnership.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-03-17Merge tag 'mm-hotfixes-stable-2025-03-17-20-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "15 hotfixes. 7 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 13 are for MM and the other two are for squashfs and procfs. All are singletons. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2025-03-17-20-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/page_alloc: fix memory accept before watermarks gets initialized mm: decline to manipulate the refcount on a slab page memcg: drain obj stock on cpu hotplug teardown mm/huge_memory: drop beyond-EOF folios with the right number of refs selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculation mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT mm: memcontrol: fix swap counter leak from offline cgroup mm/vma: do not register private-anon mappings with khugepaged during mmap squashfs: fix invalid pointer dereference in squashfs_cache_delete mm/migrate: fix shmem xarray update during migration mm/hugetlb: fix surplus pages in dissolve_free_huge_page() mm/damon/core: initialize damos->walk_completed in damon_new_scheme() mm/damon: respect core layer filters' allowance decision on ops layer filemap: move prefaulting out of hot write path proc: fix UAF in proc_get_inode()
2025-03-17MAINTAINERS: Remove myselfEric W. Biederman
Unfortunately I no longer have time to meaningfully take part in the linux kernel development. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-17Merge tag 'soc-fixes-6.14-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC fixes from Arnd Bergmann: "The majority of these last fixes are for devicetree files. These address two important regressions for the Qualcomm SMMU and the Raspberry Pi 4 USB controller, as well as a larger number of patches fixing minor mistakes in board specific files for Rockchips, i.MX, starfive and broadcom. The non-DT changes are - A fix for an old boot regression on Renesas shmobile chips - Another boot time regression for for the Qualcomm PDR SoC driver, among a few other Qualcomm firmware driver fixes for efivars and tzmem - Minor Kconfig fixes for davinci and OMAP1 - Minor code fixes for sparx5 reset controllers, OMAP memory controller, i.MX SCU, cpufreq and SoC drivers and a Hisilicon SoC driver - One more update to the Asahi maintainers, adding Neal Gompa as a reviewer" * tag 'soc-fixes-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (35 commits) ARM: davinci: da850: fix selecting ARCH_DAVINCI_DA8XX soc: hisilicon: kunpeng_hccs: Fix incorrect string assembly memory: omap-gpmc: drop no compatible check reset: mchp: sparx5: Fix for lan966x ARM: shmobile: smp: Enforce shmobile_smp_* alignment MAINTAINERS: Add myself (Neal Gompa) as a reviewer for ARM Apple support MAINTAINERS: Add apple-spi driver & binding files arm64: dts: rockchip: slow down emmc freq for rock 5 itx ARM: dts: BCM5301X: Fix switch port labels of ASUS RT-AC3200 ARM: dts: BCM5301X: Fix switch port labels of ASUS RT-AC5300 ARM: dts: bcm2711: Don't mark timer regs unconfigured ARM: OMAP1: select CONFIG_GENERIC_IRQ_CHIP arm64: dts: rockchip: Add missing PCIe supplies to RockPro64 board dtsi arm64: dts: rockchip: Add avdd HDMI supplies to RockPro64 board dtsi arm64: dts: rockchip: Remove undocumented sdmmc property from lubancat-1 arm64: dts: rockchip: fix pinmux of UART5 for PX30 Ringneck on Haikou arm64: dts: rockchip: fix pinmux of UART0 for PX30 Ringneck on Haikou arm64: dts: rockchip: fix u2phy1_host status for NanoPi R4S arm64: dts: bcm2712: PL011 UARTs are actually r1p5 ARM: dts: bcm2711: PL011 UARTs are actually r1p5 ...
2025-03-17Merge tag 'probes-fixes-v6.14-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Clean up tprobe correctly when module unload Tracepoint probes do not set TRACEPOINT_STUB on the 'tpoint' pointer when unloading a module, thus they show as a normal 'fprobe' instead of 'tprobe' and never come back - Fix leakage of tprobe module refcount When a tprobe's target module is loaded, it gets the module's refcount in the module notifier but forgot to put it after registering the probe on it. Fix it by getting the refcount only when registering tprobe. * tag 'probes-fixes-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: tprobe-events: Fix leakage of module refcount tracing: tprobe-events: Fix to clean up tprobe correctly when module unload
2025-03-17Merge branch ↵Paolo Abeni
'net-stmmac-avoid-unnecessary-work-in-stmmac_release-stmmac_dvr_remove' Russell King says: ==================== net: stmmac: avoid unnecessary work in stmmac_release()/stmmac_dvr_remove() This small series is a subset of a RFC I sent earlier. These two patches remove code that is unnecessary and/or wrong in these paths. Details in each commit. ==================== Link: https://patch.msgid.link/Z87bpDd7QYYVU0ML@shell.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17net: stmmac: remove unnecessary stmmac_mac_set() in stmmac_release()Russell King (Oracle)
stmmac_release() calls phylink_stop() and then goes on to call stmmac_mac_set(, false). However, phylink_stop() will call stmmac_mac_link_down() before returning, which will do this work. Remove this unnecessary call. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Furong Xu <0x1207@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1trcI6-005rn8-GV@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17net: stmmac: remove redundant racy tear-down in stmmac_dvr_remove()Russell King (Oracle)
While the network device is registered, it is published to userspace, and thus userspace can change its state. This means calling functions such as stmmac_stop_all_dma() and stmmac_mac_set() are racy. Moreover, unregister_netdev() will unpublish the network device, and then if appropriate call the .ndo_stop() method, which is stmmac_release(). This will first call phylink_stop() which will synchronously take the link down, resulting in stmmac_mac_link_down() and stmmac_mac_set(, false) being called. stmmac_release() will also call stmmac_stop_all_dma(). Consequently, neither of these two functions need to called prior to unregister_netdev() as that will safely call paths that will result in this work being done if necessary. Remove these redundant racy calls. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Furong Xu <0x1207@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1trcI1-005rn2-CZ@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17net: phylink: expand on .pcs_config() method documentationRussell King (Oracle)
Expand on the requirements of the .pcs_config() method documentation, specifically mentioning that it should cause minimal disruption to an established link, and that it should return a positive non-zero value when requiring the .pcs_an_restart() method to be called. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1trb24-005oVq-Is@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17cdc_ether|r8152: ThinkPad Hybrid USB-C/A Dock quirkPhilipp Hahn
Lenovo ThinkPad Hybrid USB-C with USB-A Dock (17ef:a359) is affected by the same problem as the Lenovo Powered USB-C Travel Hub (17ef:721e): Both are based on the Realtek RTL8153B chip used to use the cdc_ether driver. However, using this driver, with the system suspended the device constantly sends pause-frames as soon as the receive buffer fills up. This causes issues with other devices, where some Ethernet switches stop forwarding packets altogether. Using the Realtek driver (r8152) fixes this issue. Pause frames are no longer sent while the host system is suspended. Cc: Leon Schuermann <leon@is.currently.online> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Oliver Neukum <oliver@neukum.org> (maintainer:USB CDC ETHERNET DRIVER) Cc: netdev@vger.kernel.org (open list:NETWORKING DRIVERS) Link: https://git.kernel.org/netdev/net/c/cb82a54904a9 Link: https://git.kernel.org/netdev/net/c/2284bbd0cf39 Link: https://www.lenovo.com/de/de/p/accessories-and-software/docking/docking-usb-docks/40af0135eu Signed-off-by: Philipp Hahn <phahn-oss@avm.de> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/484336aad52d14ccf061b535bc19ef6396ef5120.1741601523.git.p.hahn@avm.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17stmmac: intel: Fix warning message for return value in ↵Choong Yong Liang
intel_tsn_lane_is_available() Fix the warning "warn: missing error code? 'ret'" in the intel_tsn_lane_is_available() function. The function now returns 0 to indicate that a TSN lane was found and returns -EINVAL when it is not found. Fixes: a42f6b3f1cc1 ("net: stmmac: configure SerDes according to the interface mode") Signed-off-by: Choong Yong Liang <yong.liang.choong@linux.intel.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250310050835.808870-1-yong.liang.choong@linux.intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17Merge branch 'net-phy-clean-up-phy-package-mmd-access-functions'Paolo Abeni
Heiner Kallweit says: ==================== net: phy: clean up PHY package MMD access functions Move declarations of the functions with users to phylib.h, and remove unused functions. ==================== Link: https://patch.msgid.link/b624fcb7-b493-461a-a0b5-9ca7e9d767bc@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17net: phy: remove unused functions phy_package_[read|write]_mmdHeiner Kallweit
These functions have never had a user, so remove them. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/5792e2cd-6f0a-4f7d-a5ef-b932f94d82f3@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17net: phy: move PHY package MMD access function declarations from phy.h to ↵Heiner Kallweit
phylib.h These functions are used by PHY drivers only, therefore move their declaration to phylib.h. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/406c8a20-b62e-4ee3-b174-b566724a0876@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17Merge branch 'mlx5-support-hws-flow-meter-sampler-actions-in-fs-core'Paolo Abeni
Tariq Toukan says: ==================== mlx5: Support HWS flow meter/sampler actions in FS core This series by Moshe adds support for flow meter and flow sampler HW Steering actions in FS core level. As these actions can be shared by multiple rules, these patches use refcounts to manage the HWS actions sharing in FS core level. ==================== Link: https://patch.msgid.link/1741543663-22123-1-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17net/mlx5: fs, add support for dest flow sampler HWS actionMoshe Shemesh
Add support for HW Steering action of flow sampler destination. For each flow sampler created cache the hws action by sampler id as a key. Hold refcount for each rule using the cached action. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1741543663-22123-4-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17net/mlx5: fs, add support for flow meters HWS actionMoshe Shemesh
Add support for HW Steering action of flow meter range. Flow meters range can use one HWS action for the whole range. Thus, share a cached HWS action among rules that use same flow meter object range. Hold refcount for each rule using the cached action. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1741543663-22123-3-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17net/mlx5: fs, add API for sharing HWS action by refcountMoshe Shemesh
Counters HWS actions are shared using refcount, to create action on demand by flow steering rule and destroy only when no rules are using the action. The method is extensible to other HWS action types, such as flow meter and sampler actions, in the downstream patches. Add an API to facilitate the reuse of get/put logic for HWS actions shared by refcount. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1741543663-22123-2-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-17efivarfs: use I_MUTEX_CHILD nested lock to traverse variables on resumeArd Biesheuvel
syzbot warns about a potential deadlock, but this is a false positive resulting from a missing lockdep annotation: iterate_dir() locks the parent whereas the inode_lock() it warns about locks the child, which is guaranteed to be a different lock. So use inode_lock_nested() instead with the appropriate lock class. Reported-by: syzbot+019072ad24ab1d948228@syzkaller.appspotmail.com Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-03-17Merge branch 'tcp-accecn'David S. Miller
Chia-Yu Chang says: ==================== AccECN protocol preparation patch series Please find the v7 v7 (03-Mar-2025) - Move 2 new patches added in v6 to the next AccECN patch series v6 (27-Dec-2024) - Avoid removing removing the potential CA_ACK_WIN_UPDATE in ack_ev_flags of patch #1 (Eric Dumazet <edumazet@google.com>) - Add reviewed-by tag in patches #2, #3, #4, #5, #6, #7, #8, #12, #14 - Foloiwng 2 new pathces are added after patch #9 (Patch that adds SKB_GSO_TCP_ACCECN) * New patch #10 to replace exisiting SKB_GSO_TCP_ECN with SKB_GSO_TCP_ACCECN in the driver to avoid CWR flag corruption * New patch #11 adds AccECN for virtio by adding new negotiation flag (VIRTIO_NET_F_HOST/GUEST_ACCECN) in feature handshake and translating Accurate ECN GSO flag between virtio_net_hdr (VIRTIO_NET_HDR_GSO_ACCECN) and skb header (SKB_GSO_TCP_ACCECN) - Add detailed changelog and comments in #13 (Eric Dumazet <edumazet@google.com>) - Move patch #14 to the next AccECN patch series (Eric Dumazet <edumazet@google.com>) v5 (5-Nov-2024) - Add helper function "tcp_flags_ntohs" to preserve last 2 bytes of TCP flags of patch #4 (Paolo Abeni <pabeni@redhat.com>) - Fix reverse X-max tree order of patches #4, #11 (Paolo Abeni <pabeni@redhat.com>) - Rename variable "delta" as "timestamp_delta" of patch #2 fo clariety - Remove patch #14 in this series (Paolo Abeni <pabeni@redhat.com>, Joel Granados <joel.granados@kernel.org>) v4 (21-Oct-2024) - Fix line length warning of patches #2, #4, #8, #10, #11, #14 - Fix spaces preferred around '|' (ctx:VxV) warning of patch #7 - Add missing CC'ed of patches #4, #12, #14 v3 (19-Oct-2024) - Fix build error in v2 v2 (18-Oct-2024) - Fix warning caused by NETIF_F_GSO_ACCECN_BIT in patch #9 (Jakub Kicinski <kuba@kernel.org>) The full patch series can be found in https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: Pass flags to __tcp_send_ackIlpo Järvinen
Accurate ECN needs to send custom flags to handle IP-ECN field reflection during handshake. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: add new TCP_TW_ACK_OOW state and allow ECN bits in TOSIlpo Järvinen
ECN bits in TOS are always cleared when sending in ACKs in TW. Clearing them is problematic for TCP flows that used Accurate ECN because ECN bits decide which service queue the packet is placed into (L4S vs Classic). Effectively, TW ACKs are always downgraded from L4S to Classic queue which might impact, e.g., delay the ACK will experience on the path compared with the other packets of the flow. Change the TW ACK sending code to differentiate: - In tcp_v4_send_reset(), commit ba9e04a7ddf4f ("ip: fix tos reflection in ack and reset packets") cleans ECN bits for TW reset and this is not affected. - In tcp_v4_timewait_ack(), ECN bits for all TW ACKs are cleaned. But now only ECN bits of ACKs for oow data or paws_reject are cleaned, and ECN bits of other ACKs will not be cleaned. - In tcp_v4_reqsk_send_ack(), commit 66b13d99d96a1 ("ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT") did not clean ECN bits of ACKs for oow data or paws_reject. But now the ECN bits rae cleaned for these ACKs. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: AccECN support to tcp_add_backlogIlpo Järvinen
AE flag needs to be preserved for AccECN. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17gro: prevent ACE field corruption & better AccECN handlingIlpo Järvinen
There are important differences in how the CWR field behaves in RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter and its changes are important so adjust the flags changed mask accordingly. Also, if CWR is there, set the Accurate ECN GSO flag to avoid corrupting CWR flag somewhere. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17gso: AccECN supportIlpo Järvinen
Handling the CWR flag differs between RFC 3168 ECN and AccECN. With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared starting from 2nd segment which is incompatible how AccECN handles the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN. With AccECN, CWR flag (or more accurately, the ACE field that also includes ECE & AE flags) changes only when new packet(s) with CE mark arrives so the flag should not be changed within a super-skb. The new skb/feature flags are necessary to prevent such TSO engines corrupting AccECN ACE counters by clearing the CWR flag (if the CWR handling feature cannot be turned off). If NIC is completely unaware of RFC3168 ECN (doesn't support NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag despite supporting also NETIF_F_TSO_ECN, TSO could be safely used with AccECN on such NIC. This should be evaluated per NIC basis (not done in this patch series for any NICs). For the cases, where TSO cannot keep its hands off the CWR flag, a GSO fallback is provided by this patch. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: helpers for ECN mode handlingIlpo Järvinen
Create helpers for TCP ECN modes. No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()Ilpo Järvinen
Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is called only for data segments, not for ACKs (with AccECN, also ACKs may get ECN bits). The extra "layer" in tcp_ecn_check_ce() function just checks for ECN being enabled, that can be moved into tcp_ecn_field_check rather than having the __ variant. No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: extend TCP flags to allow AE bit/ACE fieldIlpo Järvinen
With AccECN, there's one additional TCP flag to be used (AE) and ACE field that overloads the definition of AE, CWR, and ECE flags. As tcp_flags was previously only 1 byte, the byte-order stuff needs to be added to it's handling. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: use BIT() macro in include/net/tcp.hChia-Yu Chang
Use BIT() macro for TCP flags field and TCP congestion control flags that will be used by the congestion control algorithm. No functional changes. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Ilpo Järvinen <ij@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: create FLAG_TS_PROGRESSIlpo Järvinen
Whenever timestamp advances, it declares progress which can be used by the other parts of the stack to decide that the ACK is the most recent one seen so far. AccECN will use this flag when deciding whether to use the ACK to update AccECN state or not. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()Ilpo Järvinen
- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce() out of it - Move tcp_in_ack_event() later - While at it, remove the inline from tcp_in_ack_event() and let the compiler to decide Accurate ECN's heuristics does not know if there is going to be ACE field based CE counter increase or not until after rtx queue has been processed. Only then the number of ACKed bytes/pkts is available. As CE or not affects presence of FLAG_ECE, that information for tcp_in_ack_event is not yet available in the old location of the call to tcp_in_ack_event(). Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17hwmon: (nct6775-core) Fix out of bounds access for NCT679{8,9}Tasos Sahanidis
pwm_num is set to 7 for these chips, but NCT6776_REG_PWM_MODE and NCT6776_PWM_MODE_MASK only contain 6 values. Fix this by adding another 0 to the end of each array. Signed-off-by: Tasos Sahanidis <tasos@tasossah.com> Link: https://lore.kernel.org/r/20250312030832.106475-1-tasos@tasossah.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2025-03-17MAINTAINERS: correct list and scope of LTC4286 HARDWARE MONITORWolfram Sang
This entry has a wrong list, i2c instead of hwmon. Also, it states to maintain Kconfig and Makefile which is not suitable for a single driver. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Fixes: 7707cf82e138 ("dt-bindings: hwmon: Add lltc ltc4286 driver bindings") Link: https://lore.kernel.org/r/20250317091459.41462-2-wsa+renesas@sang-engineering.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2025-03-17mmc: sdhci-brcmstb: add cqhci suspend/resume to PM opsKamal Dasu
cqhci timeouts observed on brcmstb platforms during suspend: ... [ 164.832853] mmc0: cqhci: timeout for tag 18 ... Adding cqhci_suspend()/resume() calls to disable cqe in sdhci_brcmstb_suspend()/resume() respectively to fix CQE timeouts seen on PM suspend. Fixes: d46ba2d17f90 ("mmc: sdhci-brcmstb: Add support for Command Queuing (CQE)") Cc: stable@vger.kernel.org Signed-off-by: Kamal Dasu <kamal.dasu@broadcom.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20250311165946.28190-1-kamal.dasu@broadcom.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2025-03-16mm/page_alloc: fix memory accept before watermarks gets initializedKirill A. Shutemov
Watermarks are initialized during the postcore initcall. Until then, all watermarks are set to zero. This causes cond_accept_memory() to incorrectly skip memory acceptance because a watermark of 0 is always met. This can lead to a premature OOM on boot. To ensure progress, accept one MAX_ORDER page if the watermark is zero. Link: https://lkml.kernel.org/r/20250310082855.2587122-1-kirill.shutemov@linux.intel.com Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Reported-by: Farrah Chen <farrah.chen@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: decline to manipulate the refcount on a slab pageMatthew Wilcox (Oracle)
Slab pages now have a refcount of 0, so nobody should be trying to manipulate the refcount on them. Doing so has little effect; the object could be freed and reallocated to a different purpose, although the slab itself would not be until the refcount was put making it behave rather like TYPESAFE_BY_RCU. Unfortunately, __iov_iter_get_pages_alloc() does take a refcount. Fix that to not change the refcount, and make put_page() silently not change the refcount. get_page() warns so that we can fix any other callers that need to be changed. Long-term, networking needs to stop taking a refcount on the pages that it uses and rely on the caller to hold whatever references are necessary to make the memory stable. In the medium term, more page types are going to hav a zero refcount, so we'll want to move get_page() and put_page() out of line. Link: https://lkml.kernel.org/r/20250310143544.1216127-1-willy@infradead.org Fixes: 9aec2fb0fd5e (slab: allocate frozen pages) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Hannes Reinecke <hare@suse.de> Closes: https://lore.kernel.org/all/08c29e4b-2f71-4b6d-8046-27e407214d8c@suse.com/ Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16memcg: drain obj stock on cpu hotplug teardownShakeel Butt
Currently on cpu hotplug teardown, only memcg stock is drained but we need to drain the obj stock as well otherwise we will miss the stats accumulated on the target cpu as well as the nr_bytes cached. The stats include MEMCG_KMEM, NR_SLAB_RECLAIMABLE_B & NR_SLAB_UNRECLAIMABLE_B. In addition we are leaking reference to struct obj_cgroup object. Link: https://lkml.kernel.org/r/20250310230934.2913113-1-shakeel.butt@linux.dev Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/huge_memory: drop beyond-EOF folios with the right number of refsZi Yan
When an after-split folio is large and needs to be dropped due to EOF, folio_put_refs(folio, folio_nr_pages(folio)) should be used to drop all page cache refs. Otherwise, the folio will not be freed, causing memory leak. This leak would happen on a filesystem with blocksize > page_size and a truncate is performed, where the blocksize makes folios split to >0 order ones, causing truncated folios not being freed. Link: https://lkml.kernel.org/r/20250310155727.472846-1-ziy@nvidia.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Hugh Dickins <hughd@google.com> Closes: https://lore.kernel.org/all/fcbadb7f-dd3e-21df-f9a7-2853b53183c4@google.com/ Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculationRafael Aquini
We noticed that uffd-stress test was always failing to run when invoked for the hugetlb profiles on x86_64 systems with a processor count of 64 or bigger: ... # ------------------------------------ # running ./uffd-stress hugetlb 128 32 # ------------------------------------ # ERROR: invalid MiB (errno=9, @uffd-stress.c:459) ... # [FAIL] not ok 3 uffd-stress hugetlb 128 32 # exit=1 ... The problem boils down to how run_vmtests.sh (mis)calculates the size of the region it feeds to uffd-stress. The latter expects to see an amount of MiB while the former is just giving out the number of free hugepages halved down. This measurement discrepancy ends up violating uffd-stress' assertion on number of hugetlb pages allocated per CPU, causing it to bail out with the error above. This commit fixes that issue by adjusting run_vmtests.sh's half_ufd_size_MB calculation so it properly renders the region size in MiB, as expected, while maintaining all of its original constraints in place. Link: https://lkml.kernel.org/r/20250218192251.53243-1-aquini@redhat.com Fixes: 2e47a445d7b3 ("selftests/mm: run_vmtests.sh: fix hugetlb mem size calculation") Signed-off-by: Rafael Aquini <raquini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: fix error handling in __filemap_get_folio() with FGP_NOWAITRaphael S. Carvalho
original report: https://lore.kernel.org/all/CAKhLTr1UL3ePTpYjXOx2AJfNk8Ku2EdcEfu+CH1sf3Asr=B-Dw@mail.gmail.com/T/ When doing buffered writes with FGP_NOWAIT, under memory pressure, the system returned ENOMEM despite there being plenty of available memory, to be reclaimed from page cache. The user space used io_uring interface, which in turn submits I/O with FGP_NOWAIT (the fast path). retsnoop pointed to iomap_get_folio: 00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721 (reactor-1/combined_tests): entry_SYSCALL_64_after_hwframe+0x76 do_syscall_64+0x82 __do_sys_io_uring_enter+0x265 io_submit_sqes+0x209 io_issue_sqe+0x5b io_write+0xdd xfs_file_buffered_write+0x84 iomap_file_buffered_write+0x1a6 32us [-ENOMEM] iomap_write_begin+0x408 iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… pos=0 len=4096 foliop=0xffffb32c296b7b80 ! 4us [-ENOMEM] iomap_get_folio iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… pos=0 len=4096 This is likely a regression caused by 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio"), which moved error handling from io_map_get_folio() to __filemap_get_folio(), but broke FGP_NOWAIT handling, so ENOMEM is being escaped to user space. Had it correctly returned -EAGAIN with NOWAIT, either io_uring or user space itself would be able to retry the request. It's not enough to patch io_uring since the iomap interface is the one responsible for it, and pwritev2(RWF_NOWAIT) and AIO interfaces must return the proper error too. The patch was tested with scylladb test suite (its original reproducer), and the tests all pass now when memory is pressured. Link: https://lkml.kernel.org/r/20250224143700.23035-1-raphaelsc@scylladb.com Fixes: 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio") Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: memcontrol: fix swap counter leak from offline cgroupMuchun Song
Commit 6769183166b3 removed the parameter of id from swap_cgroup_record() and get the memcg id from mem_cgroup_id(folio_memcg(folio)). However, the caller of it may update a different memcg's counter instead of folio_memcg(folio). E.g. in the caller of mem_cgroup_swapout(), @swap_memcg could be different with @memcg and update the counter of @swap_memcg, but swap_cgroup_record() records the wrong memcg's ID. When it is uncharged from __mem_cgroup_uncharge_swap(), the swap counter will leak since the wrong recorded ID. Fix it by bringing the parameter of id back. Link: https://lkml.kernel.org/r/20250306023133.44838-1-songmuchun@bytedance.com Fixes: 6769183166b3 ("mm/swap_cgroup: decouple swap cgroup recording and clearing") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/vma: do not register private-anon mappings with khugepaged during mmapDev Jain
We already are registering private-anon VMAs with khugepaged during fault time, in do_huge_pmd_anonymous_page(). Commit "register suitable readonly file vmas for khugepaged" moved the khugepaged registration logic from shmem_mmap to the generic mmap path. The userspace-visible effect should be this: khugepaged will unnecessarily scan mm's which haven't yet faulted in. Note that it won't actually collapse because all PTEs are none. Now that I think about it, the mm is going to have a file VMA anyways during fork+exec, so the mm already gets registered during mmap due to the non-anon case (I *think*), so at least one of either the mmap registration or fault-time registration is redundant. Make this logic specific for non-anon mappings. Link: https://lkml.kernel.org/r/20250306063037.16299-1-dev.jain@arm.com Fixes: 613bec092fe7 ("mm: mmap: register suitable readonly file vmas for khugepaged") Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16squashfs: fix invalid pointer dereference in squashfs_cache_deleteZhiyu Zhang
When mounting a squashfs fails, squashfs_cache_init() may return an error pointer (e.g., -ENOMEM) instead of NULL. However, squashfs_cache_delete() only checks for a NULL cache, and attempts to dereference the invalid pointer. This leads to a kernel crash (BUG: unable to handle kernel paging request in squashfs_cache_delete). This patch fixes the issue by checking IS_ERR(cache) before accessing it. Link: https://lkml.kernel.org/r/20250306132855.2030-1-zhiyuzhang999@gmail.com Fixes: 49ff29240ebb ("squashfs: make squashfs_cache_init() return ERR_PTR(-ENOMEM)") Signed-off-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Closes: https://lore.kernel.org/linux-fsdevel/CALf2hKvaq8B4u5yfrE+BYt7aNguao99mfWxHngA+=o5hwzjdOg@mail.gmail.com/ Tested-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/migrate: fix shmem xarray update during migrationZi Yan
A shmem folio can be either in page cache or in swap cache, but not at the same time. Namely, once it is in swap cache, folio->mapping should be NULL, and the folio is no longer in a shmem mapping. In __folio_migrate_mapping(), to determine the number of xarray entries to update, folio_test_swapbacked() is used, but that conflates shmem in page cache case and shmem in swap cache case. It leads to xarray multi-index entry corruption, since it turns a sibling entry to a normal entry during xas_store() (see [1] for a userspace reproduction). Fix it by only using folio_test_swapcache() to determine whether xarray is storing swap cache entries or not to choose the right number of xarray entries to update. [1] https://lore.kernel.org/linux-mm/Z8idPCkaJW1IChjT@casper.infradead.org/ Note: In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is used to get swap_cache address space, but that ignores the shmem folio in swap cache case. It could lead to NULL pointer dereferencing when a in-swap-cache shmem folio is split at __xa_store(), since !folio_test_anon() is true and folio->mapping is NULL. But fortunately, its caller split_huge_page_to_list_to_order() bails out early with EBUSY when folio->mapping is NULL. So no need to take care of it here. Link: https://lkml.kernel.org/r/20250305200403.2822855-1-ziy@nvidia.com Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Liu Shixin <liushixin2@huawei.com> Closes: https://lore.kernel.org/all/28546fb4-5210-bf75-16d6-43e1f8646080@huawei.com/ Suggested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: fix surplus pages in dissolve_free_huge_page()Jinjiang Tu
In dissolve_free_huge_page(), free huge pages are dissolved without adjusting surplus count. However, free huge pages may be accounted as surplus pages, and will lead to wrong surplus count. I reproduce this issue on qemu. The steps are: 1) Node1 is memory-less at first. Hot-add memory to node1 by executing the two commands in qemu monitor: object_add memory-backend-ram,id=mem1,size=1G device_add pc-dimm,id=dimm1,memdev=mem1,node=1 2) online one memory block of Node1 with: echo online_movable > /sys/devices/system/node/node1/memoryX/state 3) create 64 huge pages for node1 4) run a program to reserve (don't consume) all the huge pages 5) echo 0 > nr_huge_pages for node1. After this step, free huge pages in Node1 are surplus. 6) create 80 huge pages for node0 7) offline memory of node1, The memory range to offline contains the free surplus huge pages created in step3) ~ step5) echo offline > /sys/devices/system/node/node1/memoryX/state 8) kill the program in step 4) The result: Node0 Node1 total 80 0 free 80 0 surplus 0 61 To fix it, adjust surplus when destroying huge pages if the node has surplus pages in dissolve_free_hugetlb_folio(). The result with this patch: Node0 Node1 total 80 0 free 80 0 surplus 0 0 Link: https://lkml.kernel.org/r/20250304132106.2872754-1-tujinjiang@huawei.com Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/damon/core: initialize damos->walk_completed in damon_new_scheme()SeongJae Park
The function for allocating and initialize a 'struct damos' object, damon_new_scheme(), is not initializing damos->walk_completed field. Only damos_walk_complete() is setting the field. Hence the field will be eventually set and used correctly from second damos_walk() call for the scheme. But the first damos_walk() could mistakenly not walk on the regions. Actually, a common usage of DAMOS for taking an access pattern snapshot is installing a monitoring-purpose DAMOS scheme, doing damos_walk() to retrieve the snapshot, and then removing the scheme. DAMON user-space tool (damo) also gets runtime snapshot in the way. Hence the problem can continuously happen in such use cases. Initialize it properly in the allocation function. Link: https://lkml.kernel.org/r/20250228174450.41472-1-sj@kernel.org Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/damon: respect core layer filters' allowance decision on ops layerSeongJae Park
Filtering decisions are made in filters evaluation order. Once a decision is made by a filter, filters that scheduled to be evaluated after the decision-made filter should just respect it. This is the intended and documented behavior. Since core layer-handled filters are evaluated before operations layer-handled filters, decisions made on core layer should respected by ops layer. In case of reject filters, the decision is respected, since core layer-rejected regions are not passed to ops layer. But in case of allow filters, ops layer filters don't know if the region has passed to them because it was allowed by core filters or just because it didn't match to any core layer. The current wrong implementation assumes it was due to not matched by any core filters. As a reuslt, the decision is not respected. Pass the missing information to ops layer using a new filed in 'struct damos', and make the ops layer filters respect it. Link: https://lkml.kernel.org/r/20250228175336.42781-1-sj@kernel.org Fixes: 491fee286e56 ("mm/damon/core: support damos_filter->allow") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16filemap: move prefaulting out of hot write pathDave Hansen
There is a generic anti-pattern that shows up in the VFS and several filesystems where the hot write paths touch userspace twice when they could get away with doing it once. Dave Chinner suggested that they should all be fixed up[1]. I agree[2]. But, the series to do that fixup spans a bunch of filesystems and a lot of people. This patch fixes common code that absolutely everyone uses. It has measurable performance benefits[3]. I think this patch can go in and not be held up by the others. I will post them separately to their separate maintainers for consideration. But, honestly, I'm not going to lose any sleep if the maintainers don't pick those up. 1. https://lore.kernel.org/all/Z5f-x278Z3wTIugL@dread.disaster.area/ 2. https://lore.kernel.org/all/20250129181749.C229F6F3@davehans-spike.ostc.intel.com/ 3. https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/ This patch: There is a bit of a sordid history here. I originally wrote 998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages") to fix a performance issue that showed up on early SMAP hardware. But that was reverted with 00a3d660cbac because it exposed an underlying filesystem bug. This is a reimplementation of the original commit along with some simplification and comment improvements. The basic problem is that the generic write path has two userspace accesses: one to prefault the write source buffer and then another to perform the actual write. On x86, this means an extra STAC/CLAC pair. These are relatively expensive instructions because they function as barriers. Keep the prefaulting behavior but move it into the slow path that gets run when the write did not make any progress. This avoids livelocks that can happen when the write's source and destination target the same folio. Contrary to the existing comments, the fault-in does not prevent deadlocks. That's accomplished by using an "atomic" usercopy that disables page faults. The end result is that the generic write fast path now touches userspace once instead of twice. 0day has shown some improvements on a couple of microbenchmarks: https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/ Cc: Ted Ts'o <tytso@mit.edu> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16proc: fix UAF in proc_get_inode()Ye Bin
Fix race between rmmod and /proc/XXX's inode instantiation. The bug is that pde->proc_ops don't belong to /proc, it belongs to a module, therefore dereferencing it after /proc entry has been registered is a bug unless use_pde/unuse_pde() pair has been used. use_pde/unuse_pde can be avoided (2 atomic ops!) because pde->proc_ops never changes so information necessary for inode instantiation can be saved _before_ proc_register() in PDE itself and used later, avoiding pde->proc_ops->... dereference. rmmod lookup sys_delete_module proc_lookup_de pde_get(de); proc_get_inode(dir->i_sb, de); mod->exit() proc_remove remove_proc_subtree proc_entry_rundown(de); free_module(mod); if (S_ISREG(inode->i_mode)) if (de->proc_ops->proc_read_iter) --> As module is already freed, will trigger UAF BUG: unable to handle page fault for address: fffffbfff80a702b PGD 817fc4067 P4D 817fc4067 PUD 817fc0067 PMD 102ef4067 PTE 0 Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 26 UID: 0 PID: 2667 Comm: ls Tainted: G Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:proc_get_inode+0x302/0x6e0 RSP: 0018:ffff88811c837998 EFLAGS: 00010a06 RAX: dffffc0000000000 RBX: ffffffffc0538140 RCX: 0000000000000007 RDX: 1ffffffff80a702b RSI: 0000000000000001 RDI: ffffffffc0538158 RBP: ffff8881299a6000 R08: 0000000067bbe1e5 R09: 1ffff11023906f20 R10: ffffffffb560ca07 R11: ffffffffb2b43a58 R12: ffff888105bb78f0 R13: ffff888100518048 R14: ffff8881299a6004 R15: 0000000000000001 FS: 00007f95b9686840(0000) GS:ffff8883af100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff80a702b CR3: 0000000117dd2000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> proc_lookup_de+0x11f/0x2e0 __lookup_slow+0x188/0x350 walk_component+0x2ab/0x4f0 path_lookupat+0x120/0x660 filename_lookup+0x1ce/0x560 vfs_statx+0xac/0x150 __do_sys_newstat+0x96/0x110 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e [adobriyan@gmail.com: don't do 2 atomic ops on the common path] Link: https://lkml.kernel.org/r/3d25ded0-1739-447e-812b-e34da7990dcf@p183 Fixes: 778f3dd5a13c ("Fix procfs compat_ioctl regression") Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David S. Miller <davem@davemloft.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>