linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-03-18	wifi: iwlwifi: mld: KUnit: create chanctx with a custom width	Miri Korenblit
	Currently iwlmld_kunit_add_chanctx receives a band, picks a predefined static chandef, and creates the chanctx from it. Change it to receive a bandwidth as well. Otherwise, the bandwidth in the chanctx/phy will be different than what test specified in the iwl_mld_kunit_link. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://patch.msgid.link/20250313002008.85a1285d34cd.Ia71cdcd4241fe73501bc93e3cb2c6bb3f631b9ec@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: iwlwifi: mld: KUnit: introduce iwl_mld_kunit_link	Miri Korenblit
	To allow setting up association/EMLSR states with more flexibility, change the relevant functions to receive a new struct, iwl_mld_kunit_link, which will contain all the link parameters (for now just link id, band and bandwidth). Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://patch.msgid.link/20250313002008.f336491ccc4e.I6b727765eb394a3dbb78cea71e356be1bdc4a17c@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: iwlwifi: mld: allow EMLSR for unequal bandwidth	Miri Korenblit
	Allow EMLSR if the bandwidths of the links are unequal if one of the following conditions is true: 1. in low latency mode 2. bandwidth of the secondary link is greater than the bandwidth of the primary 3. the primary link is active and is loaded enough to justify EMLSR Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://patch.msgid.link/20250313002008.150c330711c4.Ifd72d2e076783991852a7f1756948b4f0efb9fea@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: iwlwifi: mld: prevent toggling EMLSR due to FW requests	Miri Korenblit
	We exit EMLSR mode if the FW requested to do so. To prevent repeated toggling of the EMLSR mode (frequent entry and exit), add this exit reason to the EMLSR prevention mechanism. This mechanism avoids re-entering EMLSR for a certain period of time after multiple exits caused by the same reason. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://patch.msgid.link/20250313002008.f0e74a7f99af.I447c8788afba85a2a5040ae2c1213b6e05ec14f3@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: iwlwifi: mld: remove IWL_MLD_EMLSR_BLOCKED_FW	Miri Korenblit
	The channel load logic moves from the FW to the driver. - Implement the logic: allow EMLSR only if the candidate primary link is active and if its average channel load exceeds the threshold. - Remove IWL_MLD_EMLSR_BLOCKED_FW. Instead, treat ESR_RECOMMEND_LEAVE in the EMLSR_RECOMMENDATION notif as an EXIT reason. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://patch.msgid.link/20250313002008.6729a8d67815.Iab39bf0982d8cdbb0db701d31854101c2fcf3b64@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: iwlwifi: mld: add support for DHC_TOOLS_UMAC_GET_TAS_STATUS command	Pagadala Yesu Anjaneyulu
	Add debugfs file in mld to retrieve TAS status per radio, TAS block list, current mcc, OEM name and OEM allowed list. This will add ability to get TAS status to user application via debugfs and required for debugging. Add the required API definitions and some debug host command utils. Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250313002008.66524c6ea198.I1625135284fc075148a55dd9ac629e94ca881fe4@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: iwlwifi: mld: Ensure wiphy lock is held during debugfs read operations	Pagadala Yesu Anjaneyulu
	The WIPHY_DEBUGFS_READ_WRITE_FILE_OPS_MLD macro is intended to call read/write handlers with the wiphy lock held. However, the current implementation uses the MLD_DEBUGFS_READ_WRAPPER macro, which does not hold the wiphy lock during read operations. This fix updates the WIPHY_DEBUGFS_READ_WRITE_FILE_OPS_MLD macro to use the WIPHY_DEBUGFS_READ_WRAPPER_MLD macro instead, ensuring that the wiphy lock is held during both read and write operations. Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250313002008.2001d2335e9d.I607a8bd12efc6d1190cef1fca44279dbdd2756ea@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: iwlwifi: mld: Add support for WIPHY_DEBUGFS_READ_FILE_OPS_MLD macro	Pagadala Yesu Anjaneyulu
	Introduced the WIPHY_DEBUGFS_READ_FILE_OPS_MLD macro to enable reading data from the driver while holding the wiphy lock. This will enable read operations with wiphy locked. Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250313002008.b0ddb6b0a144.I1fab63f2c6f52fea61cc5d7b27775aed58adfd8d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: iwlwifi: mld: Rename WIPHY_DEBUGFS_HANDLER_WRAPPER to ↵	Pagadala Yesu Anjaneyulu
	WIPHY_DEBUGFS_WRITE_HANDLER_WRAPPER Renamed the macro WIPHY_DEBUGFS_HANDLER_WRAPPER to WIPHY_DEBUGFS_WRITE_HANDLER_WRAPPER to better reflect its purpose as a write handler. Additionally, updated the corresponding macro WIPHY_DEBUGFS_HANDLER_WRAPPER_MLD to WIPHY_DEBUGFS_WRITE_HANDLER_WRAPPER_MLD for consistency. This change does not alter the functionality but enhances the maintainability of the code. Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250313002008.bb8a1d7907c8.I53325f2f37ccaad2b212d35d10616e06c1555e48@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	wifi: nl80211: store chandef on the correct link when starting CAC	Aditya Kumar Singh
	Link ID to store chandef is still being used as 0 even in case of MLO which is incorrect. This leads to issue during CAC completion where link 0 as well gets stopped. Fixes: 0b7798232eee ("wifi: cfg80211/mac80211: use proper link ID for DFS") Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com> Link: https://patch.msgid.link/20250314-fix_starting_cac_during_mlo-v1-1-3b51617d7ea5@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	LoongArch: KVM: Register perf callbacks for guest	Bibo Mao
	Add selection for GUEST_PERF_EVENTS if KVM is enabled, also add perf callback register when KVM module is loading. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-18	LoongArch: KVM: Implement arch-specific functions for guest perf	Bibo Mao
	Three architecture specific functions are added for the guest perf feature, they are kvm_arch_vcpu_in_kernel(), kvm_arch_vcpu_get_ip() and kvm_arch_pmi_in_guest(). Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-18	LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()	Bibo Mao
	Pause-Loop Exiting is not supported by LoongArch hardware, nor is pv spinlock feature. So function kvm_vcpu_on_spin() is not used. Function kvm_arch_vcpu_preempted_in_kernel() is defined as a stub function here since it is only called by unused function kvm_vcpu_on_spin(). Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-18	LoongArch: KVM: Remove PGD saving during VM context switch	Bibo Mao
	PGD table for primary mmu keeps unchanged once VM is created, it is not necessary to save PGD table pointer during VM context switch. And it can be acquired when VM is created. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-18	LoongArch: KVM: Remove unnecessary header include path	Masahiro Yamada
	arch/loongarch/kvm/ includes local headers with the double-quote form (#include "..."). Also, TRACE_INCLUDE_PATH in arch/loongarch/kvm/trace.h is relative to include/trace/. Hence, the local header search path is unneeded. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-18	Merge net-next/main to resolve conflicts	Johannes Berg
	There are a few conflicts between the work that went into wireless and that's here now, resolve them. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18	rust: optimize rust symbol generation for SeqFile	Kunwu Chan
	When build the kernel using the llvm-18.1.3-rust-1.85.0-x86_64 with ARCH=arm64, the following symbols are generated: $nm vmlinux \| grep ' _R'.*SeqFile \| rustfilt ffff8000805b78ac T <kernel::seq_file::SeqFile>::call_printf This Rust symbol is trivial wrappers around the C functions seq_printf. It doesn't make sense to go through a trivial wrapper for its functions, so mark it inline. Link: https://github.com/Rust-for-Linux/linux/issues/1145 Suggested-by: Alice Ryhl <aliceryhl@google.com> Co-developed-by: Grace Deng <Grace.Deng006@Gmail.com> Signed-off-by: Grace Deng <Grace.Deng006@Gmail.com> Signed-off-by: Kunwu Chan <kunwu.chan@hotmail.com> Link: https://lore.kernel.org/r/20250317030418.2371265-1-kunwu.chan@linux.dev Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-18	rust: file: optimize rust symbol generation for FileDescriptorReservation	Kunwu Chan
	When build the kernel using the llvm-18.1.3-rust-1.85.0-x86_64 with ARCH=arm64, the following symbols are generated: $ nm vmlinux \| grep ' _R'.*FileDescriptorReservation \| rustfilt ... T <kernel::fs::file::FileDescriptorReservation>::fd_install ... T <kernel::fs::file::FileDescriptorReservation>::get_unused_fd_flags ... T <kernel::fs::file::FileDescriptorReservation as core::ops::drop::Drop>::drop These Rust symbols are trivial wrappers around the C functions fd_install, put_unused_fd and put_task_struct. It doesn't make sense to go through a trivial wrapper for these functions, so mark them inline. Link: https://github.com/Rust-for-Linux/linux/issues/1145 Suggested-by: Alice Ryhl <aliceryhl@google.com> Co-developed-by: Grace Deng <Grace.Deng006@Gmail.com> Signed-off-by: Grace Deng <Grace.Deng006@Gmail.com> Signed-off-by: Kunwu Chan <kunwu.chan@hotmail.com> Link: https://lore.kernel.org/r/20250317023702.2360726-1-kunwu.chan@linux.dev Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-18	Merge branch 'net-phy-rework-linkmodes-handling-in-a-dedicated-file'	Paolo Abeni
	Maxime Chevallier says: ==================== net: phy: Rework linkmodes handling in a dedicated file This is V5 of the phy_caps series. In a nutshell, this series reworks the way we maintain the list of speed/duplex capablities for each linkmode so that we no longer have multiple definition of these associations. That will help making sure that when people add new linkmodes in include/uapi/linux/ethtool.h, they don't have to update phylib and phylink as well, making the process more straightforward and less error-prone. It also generalises the phy_caps interface to be able to lookup linkmodes from phy_interface_t, which is needed for the multi-port work I've been working on for a while. This V5 addresse Russell's and Paolo's reviews, namely : - Error out when encountering an unknown SPEED_XXX setting It prints an error and fails to initialize phylib. I've tested by introducing a dummy 1.6T speed, I guess it's only a matter of time before that actually happens :) - Deal more gracefully with the fixed-link settings, keeping some level of compatibility with what we had before by making sure we report a single BaseT mode like before. V1 : https://lore.kernel.org/netdev/20250222142727.894124-1-maxime.chevallier@bootlin.com/ V2 : https://lore.kernel.org/netdev/20250226100929.1646454-1-maxime.chevallier@bootlin.com/ V3 : https://lore.kernel.org/netdev/20250228145540.2209551-1-maxime.chevallier@bootlin.com/ V4 : https://lore.kernel.org/netdev/20250303090321.805785-1-maxime.chevallier@bootlin.com/ ==================== Link: https://patch.msgid.link/20250307173611.129125-1-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	ata: libata: Fix NCQ Non-Data log not supported print	Niklas Cassel
	Currently, both ata_dev_config_ncq_send_recv() - which checks for NCQ Send/Recv Log (Log Address 13h) and ata_dev_config_ncq_non_data() - which checks for NCQ Non-Data Log (Log Address 12h), uses the same print when the log is not supported: "NCQ Send/Recv Log not supported" This seems like a copy paste error, since NCQ Non-Data Log is actually a separate log. Fix the print to reference the correct log. Fixes: 284b3b77ea88 ("libata: NCQ encapsulation for ZAC MANAGEMENT OUT") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250317111754.1666084-2-cassel@kernel.org Signed-off-by: Niklas Cassel <cassel@kernel.org>
2025-03-18	net: phylink: Use phy_caps to get an interface's capabilities and modes	Maxime Chevallier
	Phylink has internal code to get the MAC capabilities of a given PHY interface (what are the supported speed and duplex). Extract that into phy_caps, but use the link_capa for conversion. Add an internal phylink helper for the link caps -> mac caps conversion, and use this in phylink_caps_to_linkmodes(). Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-14-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phylink: Convert capabilities to linkmodes using phy_caps	Maxime Chevallier
	phylink_caps_to_linkmodes() is used to derive a list of linkmodes that can be conceivably exposed using a given set of speeds and duplex through phylink's MAC capabilities. This list can be derived from the link_caps array in phy_caps, provided we convert the MAC capabilities into a LINK_CAPA bitmask first. Introduce an internal phylink helper phylink_caps_to_link_caps() to convert from MAC capabilities into phy_caps, then phy_caps_linkmodes() to do the link_caps -> linkmodes conversion. This avoids having to update phylink for every new linkmode. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-13-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phylink: Add a mapping between MAC_CAPS and LINK_CAPS	Maxime Chevallier
	phylink allows MAC drivers to report the capabilities in terms of speed, duplex and pause support. This is done through a dedicated set of enum values in the form of the MAC_ capabilities. They are very close to what the LINK_CAPA_xxx can express, with the difference that LINK_CAPA don't have any information about Pause/Asym Pause support. To prepare converting phylink to using the phy_caps, add the mapping between MAC capabilities and phy_caps. While doing so, we move the phylink_caps_params array up a bit to simplify future commits. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-12-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phy: drop phy_settings and the associated lookup helpers	Maxime Chevallier
	The phy_settings array is no longer relevant as it has now been replaced by the link_caps array and associated phy_caps helpers. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-11-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phylink: Use phy_caps_lookup for fixed-link configuration	Maxime Chevallier
	When phylink creates a fixed-link configuration, it finds a matching linkmode to set as the advertised, lp_advertising and supported modes based on the speed and duplex of the fixed link. Use the newly introduced phy_caps_lookup to get these modes instead of phy_lookup_settings(). This has the side effect that the matched settings and configured linkmodes may now contain several linkmodes (the intersection of supported linkmodes from the phylink settings and the linkmodes that match speed/duplex) instead of the one from phy_lookup_settings(). Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-10-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phy: phy_device: Use link_capabilities lookup for PHY aneg config	Maxime Chevallier
	When configuring PHY advertising with autoneg disabled, we lookd for an exact linkmode to advertise and configure for the requested Speed and Duplex, specially at or over 1G. Using phy_caps_lookup allows us to build a list of the supported linkmodes at that speed that we can advertise instead of the first mode that matches. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-9-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phy: phy_caps: Allow looking-up link caps based on speed and duplex	Maxime Chevallier
	As the link_caps array is efficient for <speed,duplex> lookups, implement a function for speed/duplex lookups that matches a given mask. This replicates to some extent the phy_lookup_settings() behaviour, matching full link_capabilities instead of a single linkmode. phy.c's phy_santize_settings() and phylink's phylink_ethtool_ksettings_set() performs such lookup using the phy_settings table, but are only interested in the actual speed/duplex that were matched, rathet than the individual linkmode. Similar to phy_lookup_settings(), the newly introduced phy_caps_lookup() will run through the link_caps[] array by descending speed/duplex order. If the link_capabilities for a given <speed/duplex> tuple intersects the passed linkmodes, we consider that a match. Similar to phy_lookup_settings(), we also allow passing an 'exact' boolean, allowing non-exact match. Here, we MUST always match the linkmodes mask, but we allow matching on lower speed settings. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-8-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phy: phy_caps: Implement link_capabilities lookup by linkmode	Maxime Chevallier
	In several occasions, phylib needs to lookup a set of matching speed and duplex against a given linkmode set. Instead of relying on the phy_settings array and thus iterate over the whole linkmodes list, use the link_capabilities array to lookup these matches, as we aren't interested in the actual link setting that matches but rather the speed and duplex for that setting. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-7-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phy: phy_caps: Introduce phy_caps_valid	Maxime Chevallier
	With the link_capabilities array, it's trivial to validate a given mask againts a <speed, duplex> tuple. Create a helper for that purpose, and use it to replace a phy_settings lookup in phy_check_valid(); Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-6-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phy: phy_caps: Move __set_linkmode_max_speed to phy_caps	Maxime Chevallier
	Convert the __set_linkmode_max_speed to use the link_capabilities array. This makes it easy to clamp the linkmodes to a given max speed. Introduce a new helper phy_caps_linkmode_max_speed to replace the previous one that used phy_settings. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-5-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phy: phy_caps: Move phy_speeds to phy_caps	Maxime Chevallier
	Use the newly introduced link_capabilities array to derive the list of possible speeds when given a combination of linkmodes. As link_capabilities is indexed by speed, we don't have to iterate the whole phy_settings array. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-4-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: phy: Use an internal, searchable storage for the linkmodes	Maxime Chevallier
	The canonical definition for all the link modes is in linux/ethtool.h, which is complemented by the link_mode_params array stored in net/ethtool/common.h . That array contains all the metadata about each of these modes, including the Speed and Duplex information. Phylib and phylink needs that information as well for internal management of the link, which was done by duplicating that information in locally-stored arrays and lookup functions. This makes it easy for developpers adding new modes to forget modifying phylib and phylink accordingly. However, the link_mode_params array in net/ethtool/common.c is fairly inefficient to search through, as it isn't sorted in any manner. Phylib and phylink perform a lot of lookup operations, mostly to filter modes by speed and/or duplex. We therefore introduce the link_caps private array in phy_caps.c, that indexes linkmodes in a more efficient manner. Each element associated a tuple <speed, duplex> to a bitfield of all the linkmodes runs at these speed/duplex. We end-up with an array that's fairly short, easily addressable and that it optimised for the typical use-cases of phylib/phylink. That array is initialized at the same time as phylib. As the link_mode_params array is part of the net stack, which phylink depends on, it should always be accessible from phylib. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-3-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	net: ethtool: Export the link_mode_params definitions	Maxime Chevallier
	link_mode_params contains a lookup table of all 802.3 link modes that are currently supported with structured data about each mode's speed, duplex, number of lanes and mediums. As a preparation for a port representation, export that table for the rest of the net stack to use. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250307173611.129125-2-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18	efivarfs: fix NULL dereference on resume	James Bottomley
	LSMs often inspect the path.mnt of files in the security hooks, and this causes a NULL deref in efivarfs_pm_notify() because the path is constructed with a NULL path.mnt. Fix by obtaining from vfs_kern_mount() instead, and being very careful to ensure that deactivate_super() (potentially triggered by a racing userspace umount) is not called directly from the notifier, because it would deadlock when efivarfs_kill_sb() tried to unregister the notifier chain. [ Al notes: Umm... That's probably safe, but not as a long-term solution - it's too intimately dependent upon fs/super.c internals. The reasons why you can't run into ->s_umount deadlock here are non-trivial... ] Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Link: https://lore.kernel.org/r/e54e6a2f-1178-4980-b771-4d9bafc2aa47@tnxip.de Link: https://lore.kernel.org/r/3e998bf87638a442cbc6864cdcd3d8d9e08ce3e3.camel@HansenPartnership.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-03-17	KVM: arm64: Tear down vGIC on failed vCPU creation	Will Deacon
	If kvm_arch_vcpu_create() fails to share the vCPU page with the hypervisor, we propagate the error back to the ioctl but leave the vGIC vCPU data initialised. Note only does this leak the corresponding memory when the vCPU is destroyed but it can also lead to use-after-free if the redistributor device handling tries to walk into the vCPU. Add the missing cleanup to kvm_arch_vcpu_create(), ensuring that the vGIC vCPU structures are destroyed on error. Cc: <stable@vger.kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Quentin Perret <qperret@google.com> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250314133409.9123-1-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-18	mtd: spi-nor: drop unused <linux/of_platform.h>	Tudor Ambarus
	There's nothing used in the SPI NOR core from <linux/of_platform.h>, drop the header inclusion. Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/r/20250307-spi-nor-headers-cleanup-v1-3-c186a9511c1e@linaro.org Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
2025-03-18	mtd: spi-nor: explicitly include <linux/of.h>	Tudor Ambarus
	The core driver is using of_property_read_bool() and relies on implicit inclusion of <linux/of.h>, which comes from <linux/mtd/mtd.h>. It is good practice to directly include all headers used, it avoids implicit dependencies and spurious breakage if someone rearranges headers and causes the implicit include to vanish. Include the missing header. Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/r/20250307-spi-nor-headers-cleanup-v1-1-c186a9511c1e@linaro.org Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
2025-03-17	Merge tag 'mm-hotfixes-stable-2025-03-17-20-09' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "15 hotfixes. 7 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 13 are for MM and the other two are for squashfs and procfs. All are singletons. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2025-03-17-20-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/page_alloc: fix memory accept before watermarks gets initialized mm: decline to manipulate the refcount on a slab page memcg: drain obj stock on cpu hotplug teardown mm/huge_memory: drop beyond-EOF folios with the right number of refs selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculation mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT mm: memcontrol: fix swap counter leak from offline cgroup mm/vma: do not register private-anon mappings with khugepaged during mmap squashfs: fix invalid pointer dereference in squashfs_cache_delete mm/migrate: fix shmem xarray update during migration mm/hugetlb: fix surplus pages in dissolve_free_huge_page() mm/damon/core: initialize damos->walk_completed in damon_new_scheme() mm/damon: respect core layer filters' allowance decision on ops layer filemap: move prefaulting out of hot write path proc: fix UAF in proc_get_inode()
2025-03-17	MAINTAINERS: Remove myself	Eric W. Biederman
	Unfortunately I no longer have time to meaningfully take part in the linux kernel development. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-17	perf test dso-data: Correctly free test file in read test	Ian Rogers
	The DSO data read test opens a file but as dsos__exit is used the test file isn't closed. This causes the subsequent subtests in don't fork (-F) mode to fail as one more than expected file descriptor is open. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250318043151.137973-4-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-03-17	perf dso: Use lock annotations to fix asan deadlock	Ian Rogers
	dso__list_del with address sanitizer and/or reference count checking will call dso__put that can call dso__data_close reentrantly trying to lock the dso__data_open_lock and deadlocking. Switch from pthread mutexes to perf's mutex so that lock checking is performed in debug builds. Add lock annotations that diagnosed the problem. Release the dso__data_open_lock around the dso__put to avoid the deadlock. Change the declaration of dso__data_get_fd to return a boolean, indicating the fd is valid and the lock is held, to make it compatible with the thread safety annotations as a try lock. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250318043151.137973-3-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-03-17	perf mutex: Add annotations for LOCKS_EXCLUDED and LOCKS_RETURNED	Ian Rogers
	Used to annotate when locks shouldn't be held for a function or if a function returns a lock that's used by later mutex lock unlock operations. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250318043151.137973-2-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-03-17	mm: page_alloc: defrag_mode kswapd/kcompactd watermarks	Johannes Weiner
	The previous patch added pageblock_order reclaim to kswapd/kcompactd, which helps, but produces only one block at a time. Allocation stalls and THP failure rates are still higher than they could be. To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change the watermarking for kswapd & kcompactd: instead of targeting the high watermark in order-0 pages and checking for one suitable block, simply require that the high watermark is entirely met in pageblocks. To this end, track the number of free pages within contiguous pageblocks, then change pgdat_balanced() and compact_finished() to check watermarks against this new value. This further reduces THP latencies and allocation stalls, and improves THP success rates against the previous patch: DEFRAGMODE-ASYNC DEFRAGMODE-ASYNC-WMARKS Hugealloc Time mean 34300.36 ( +0.00%) 28904.00 ( -15.73%) Hugealloc Time stddev 36390.42 ( +0.00%) 33464.37 ( -8.04%) Kbuild Real time 196.13 ( +0.00%) 196.59 ( +0.23%) Kbuild User time 1234.74 ( +0.00%) 1231.67 ( -0.25%) Kbuild System time 62.62 ( +0.00%) 59.10 ( -5.54%) THP fault alloc 57054.53 ( +0.00%) 63223.67 ( +10.81%) THP fault fallback 11581.40 ( +0.00%) 5412.47 ( -53.26%) Direct compact fail 107.80 ( +0.00%) 59.07 ( -44.79%) Direct compact success 4.53 ( +0.00%) 2.80 ( -31.33%) Direct compact success rate % 3.20 ( +0.00%) 3.99 ( +18.66%) Compact daemon scanned migrate 5461033.93 ( +0.00%) 2267500.33 ( -58.48%) Compact daemon scanned free 5824897.93 ( +0.00%) 2339773.00 ( -59.83%) Compact direct scanned migrate 58336.93 ( +0.00%) 47659.93 ( -18.30%) Compact direct scanned free 32791.87 ( +0.00%) 40729.67 ( +24.21%) Compact total migrate scanned 5519370.87 ( +0.00%) 2315160.27 ( -58.05%) Compact total free scanned 5857689.80 ( +0.00%) 2380502.67 ( -59.36%) Alloc stall 2424.60 ( +0.00%) 638.87 ( -73.62%) Pages kswapd scanned 2657018.33 ( +0.00%) 4002186.33 ( +50.63%) Pages kswapd reclaimed 559583.07 ( +0.00%) 718577.80 ( +28.41%) Pages direct scanned 722094.07 ( +0.00%) 355172.73 ( -50.81%) Pages direct reclaimed 107257.80 ( +0.00%) 31162.80 ( -70.95%) Pages total scanned 3379112.40 ( +0.00%) 4357359.07 ( +28.95%) Pages total reclaimed 666840.87 ( +0.00%) 749740.60 ( +12.43%) Swap out 77238.20 ( +0.00%) 110084.33 ( +42.53%) Swap in 11712.80 ( +0.00%) 24457.00 ( +108.80%) File refaults 143438.80 ( +0.00%) 188226.93 ( +31.22%) Also of note is that compaction work overall is reduced. The reason for this is that when free pageblocks are more readily available, allocations are also much more likely to get physically placed in LRU order, instead of being forced to scavenge free space here and there. This means that reclaim by itself has better chances of freeing up whole blocks, and the system relies less on compaction. Comparing all changes to the vanilla kernel: VANILLA DEFRAGMODE-ASYNC-WMARKS Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%) Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%) Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%) Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%) Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%) THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%) THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%) Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%) Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%) Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%) Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%) Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%) Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%) Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%) Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%) Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%) Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%) Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%) Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%) Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%) Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%) Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%) Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%) Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%) Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%) File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%) THP allocation latencies and %sys time are down dramatically. THP allocation failures are down from nearly 50% to 8.5%. And to recall previous data points, the success rates are steady and reliable without the cumulative deterioration of fragmentation events. Compaction work is down overall. Direct compaction work especially is drastically reduced. As an aside, its success rate of 4% indicates there is room for improvement. For now it's good to rely on it less. Reclaim work is up overall, however direct reclaim work is down. Part of the increase can be attributed to a higher use of THPs, which due to internal fragmentation increase the memory footprint. This is not necessarily an unexpected side-effect for users of THP. However, taken both points together, there may well be some opportunities for fine tuning in the reclaim/compaction coordination. [hannes@cmpxchg.org: fix squawks from rebasing] Link: https://lkml.kernel.org/r/20250314210558.GD1316033@cmpxchg.org Link: https://lkml.kernel.org/r/20250313210647.1314586-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: page_alloc: defrag_mode kswapd/kcompactd assistance	Johannes Weiner
	When defrag_mode is enabled, allocation fallbacks strongly prefer whole block conversions instead of polluting or stealing partially used blocks. This means there is a demand for pageblocks even from sub-block requests. Let kswapd/kcompactd help produce them. By the time kswapd gets woken up, normal rmqueue and block conversion fallbacks have been attempted and failed. So always wake kswapd with the block order; it will take care of producing a suitable compaction gap and then chain-wake kcompactd with the block order when its done. VANILLA DEFRAGMODE-ASYNC Hugealloc Time mean 52739.45 ( +0.00%) 34300.36 ( -34.96%) Hugealloc Time stddev 56541.26 ( +0.00%) 36390.42 ( -35.64%) Kbuild Real time 197.47 ( +0.00%) 196.13 ( -0.67%) Kbuild User time 1240.49 ( +0.00%) 1234.74 ( -0.46%) Kbuild System time 70.08 ( +0.00%) 62.62 ( -10.50%) THP fault alloc 46727.07 ( +0.00%) 57054.53 ( +22.10%) THP fault fallback 21910.60 ( +0.00%) 11581.40 ( -47.14%) Direct compact fail 195.80 ( +0.00%) 107.80 ( -44.72%) Direct compact success 7.93 ( +0.00%) 4.53 ( -38.06%) Direct compact success rate % 3.51 ( +0.00%) 3.20 ( -6.89%) Compact daemon scanned migrate 3369601.27 ( +0.00%) 5461033.93 ( +62.07%) Compact daemon scanned free 5075474.47 ( +0.00%) 5824897.93 ( +14.77%) Compact direct scanned migrate 161787.27 ( +0.00%) 58336.93 ( -63.94%) Compact direct scanned free 163467.53 ( +0.00%) 32791.87 ( -79.94%) Compact total migrate scanned 3531388.53 ( +0.00%) 5519370.87 ( +56.29%) Compact total free scanned 5238942.00 ( +0.00%) 5857689.80 ( +11.81%) Alloc stall 2371.07 ( +0.00%) 2424.60 ( +2.26%) Pages kswapd scanned 2160926.73 ( +0.00%) 2657018.33 ( +22.96%) Pages kswapd reclaimed 533191.07 ( +0.00%) 559583.07 ( +4.95%) Pages direct scanned 400450.33 ( +0.00%) 722094.07 ( +80.32%) Pages direct reclaimed 94441.73 ( +0.00%) 107257.80 ( +13.57%) Pages total scanned 2561377.07 ( +0.00%) 3379112.40 ( +31.93%) Pages total reclaimed 627632.80 ( +0.00%) 666840.87 ( +6.25%) Swap out 47959.53 ( +0.00%) 77238.20 ( +61.05%) Swap in 7276.00 ( +0.00%) 11712.80 ( +60.97%) File refaults 138043.00 ( +0.00%) 143438.80 ( +3.91%) With this patch, defrag_mode=1 beats the vanilla kernel in THP success rates and allocation latencies. The trend holds over time: thp_fault_alloc VANILLA DEFRAGMODE-ASYNC 61988 52066 56474 58844 57258 58233 50187 58476 52388 54516 55409 59938 52925 57204 47648 60238 43669 55733 40621 56211 36077 59861 41721 57771 36685 58579 34641 51868 33215 56280 DEFRAGMODE-ASYNC also wins on %sys as ~3/4 of the direct compaction work is shifted to kcompactd. Reclaim activity is higher. Part of that is simply due to the increased memory footprint from higher THP use. The other aspect is that direct reclaim/compaction are still going for requested orders rather than targeting the page blocks required for fallbacks, which is less efficient than it could be. However, this is already a useful tradeoff to make, as in many environments peak periods are short and retaining the ability to produce THP through them is more important. Link: https://lkml.kernel.org/r/20250313210647.1314586-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: page_alloc: defrag_mode	Johannes Weiner
	The page allocator groups requests by migratetype to stave off fragmentation. However, in practice this is routinely defeated by the fact that it gives up before invoking reclaim and compaction - which may well produce suitable pages. As a result, fragmentation of physical memory is a common ongoing process in many load scenarios. Fragmentation deteriorates compaction's ability to produce huge pages. Depending on the lifetime of the fragmenting allocations, those effects can be long-lasting or even permanent, requiring drastic measures like forcible idle states or even reboots as the only reliable ways to recover the address space for THP production. In a kernel build test with supplemental THP pressure, the THP allocation rate steadily declines over 15 runs: thp_fault_alloc 61988 56474 57258 50187 52388 55409 52925 47648 43669 40621 36077 41721 36685 34641 33215 This is a hurdle in adopting THP in any environment where hosts are shared between multiple overlapping workloads (cloud environments), and rarely experience true idle periods. To make THP a reliable and predictable optimization, there needs to be a stronger guarantee to avoid such fragmentation. Introduce defrag_mode. When enabled, reclaim/compaction is invoked to its full extent before falling back. Specifically, ALLOC_NOFRAGMENT is enforced on the allocator fastpath and the reclaiming slowpath. For now, fallbacks are permitted to avert OOMs. There is a plan to add defrag_mode=2 to prefer OOMs over fragmentation, but this requires additional prep work in compaction and the reserve management to make it ready for all possible allocation contexts. The following test results are from a kernel build with periodic bursts of THP allocations, over 15 runs: vanilla defrag_mode=1 @claimer[unmovable]: 189 103 @claimer[movable]: 92 103 @claimer[reclaimable]: 207 61 @pollute[unmovable from movable]: 25 0 @pollute[unmovable from reclaimable]: 28 0 @pollute[movable from unmovable]: 38835 0 @pollute[movable from reclaimable]: 147136 0 @pollute[reclaimable from unmovable]: 178 0 @pollute[reclaimable from movable]: 33 0 @steal[unmovable from movable]: 11 0 @steal[unmovable from reclaimable]: 5 0 @steal[reclaimable from unmovable]: 107 0 @steal[reclaimable from movable]: 90 0 @steal[movable from reclaimable]: 354 0 @steal[movable from unmovable]: 130 0 Both types of polluting fallbacks are eliminated in this workload. Interestingly, whole block conversions are reduced as well. This is because once a block is claimed for a type, its empty space remains available for future allocations, instead of being padded with fallbacks; this allows the native type to group up instead of spreading out to new blocks. The assumption in the allocator has been that pollution from movable allocations is less harmful than from other types, since they can be reclaimed or migrated out should the space be needed. However, since fallbacks occur before reclaim/compaction is invoked, movable pollution will still cause non-movable allocations to spread out and claim more blocks. Without fragmentation, THP rates hold steady with defrag_mode=1: thp_fault_alloc 32478 20725 45045 32130 14018 21711 40791 29134 34458 45381 28305 17265 22584 28454 30850 While the downward trend is eliminated, the keen reader will of course notice that the baseline rate is much smaller than the vanilla kernel's to begin with. This is due to deficiencies in how reclaim and compaction are currently driven: ALLOC_NOFRAGMENT increases the extent to which smaller allocations are competing with THPs for pageblocks, while making no effort themselves to reclaim or compact beyond their own request size. This effect already exists with the current usage of ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole block stealing much more strongly. Subsequent patches will address defrag_mode reclaim strategy to raise the THP success baseline above the vanilla kernel. Link: https://lkml.kernel.org/r/20250313210647.1314586-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: page_alloc: trace type pollution from compaction capturing	Johannes Weiner
	When the page allocator places pages of a certain migratetype into blocks of another type, it has lasting effects on the ability to compact and defragment down the line. For improving placement and compaction, visibility into such events is crucial. The most common case, allocator fallbacks, is already annotated, but compaction capturing is also allowed to grab pages of a different type. Extend the tracepoint to cover this case. Link: https://lkml.kernel.org/r/20250313210647.1314586-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: compaction: push watermark into compaction_suitable() callers	Johannes Weiner
	Patch series "mm: reliable huge page allocator". This series makes changes to the allocator and reclaim/compaction code to try harder to avoid fragmentation. As a result, this makes huge page allocations cheaper, more reliable and more sustainable. It's a subset of the huge page allocator RFC initially proposed here: https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/ The following results are from a kernel build test, with additional concurrent bursts of THP allocations on a memory-constrained system. Comparing before and after the changes over 15 runs: before after Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%) Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%) Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%) Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%) Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%) THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%) THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%) Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%) Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%) Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%) Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%) Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%) Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%) Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%) Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%) Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%) Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%) Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%) Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%) Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%) Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%) Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%) Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%) Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%) Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%) File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%) THP latencies are cut in half, and failure rates are cut by 75%. These metrics also hold up over time, while the vanilla kernel sees a steady downward trend in success rates with each subsequent run, owed to the cumulative effects of fragmentation. A more detailed discussion of results is in the patch changelogs. The patches first introduce a vm.defrag_mode sysctl, which enforces the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and compaction have run. They then change kswapd and kcompactd to target pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths. Patches #1 and #2 are somewhat unrelated cleanups, but touch the same code and so are included here to avoid conflicts from re-ordering. This patch (of 5): compaction_suitable() hardcodes the min watermark, with a boost to the low watermark for costly orders. However, compaction_ready() requires order-0 at the high watermark. It currently checks the marks twice. Make the watermark a parameter to compaction_suitable() and have the callers pass in what they require: - compaction_zonelist_suitable() is used by the direct reclaim path, so use the min watermark. - compact_suit_allocation_order() has a watermark in context derived from cc->alloc_flags. The only quirk is that kcompactd doesn't initialize cc->alloc_flags explicitly. There is a direct check in kcompactd_do_work() that passes ALLOC_WMARK_MIN, but there is another check downstack in compact_zone() that ends up passing the unset alloc_flags. Since they default to 0, and that coincides with ALLOC_WMARK_MIN, it is correct. But it's subtle. Set cc->alloc_flags explicitly. - should_continue_reclaim() is direct reclaim, use the min watermark. - Finally, consolidate the two checks in compaction_ready() to a single compaction_suitable() call passing the high watermark. There is a tiny change in behavior: before, compaction_suitable() would check order-0 against min or low, depending on costly order. Then there'd be another high watermark check. Now, the high watermark is passed to compaction_suitable(), and the costly order-boost (low - min) is added on top. This means compaction_ready() sets a marginally higher target for free pages. In a kernelbuild + THP pressure test, though, this didn't show any measurable negative effects on memory pressure or reclaim rates. As the comment above the check says, reclaim is usually stopped short on should_continue_reclaim(), and this just defines the worst-case reclaim cutoff in case compaction is not making any headway. [hughd@google.com: stop oops on out-of-range highest_zoneidx] Link: https://lkml.kernel.org/r/005ace8b-07fa-01d4-b54b-394a3e029c07@google.com Link: https://lkml.kernel.org/r/20250313210647.1314586-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20250313210647.1314586-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: convert lru_add_page_tail() to lru_add_split_folio()	Matthew Wilcox (Oracle)
	Remove three hidden calls to compound_head() and accesses to page->lru. Link: https://lkml.kernel.org/r/20250313151458.4145978-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm/cow: fix the incorrect error handling	Cyan Yang
	Error handling doesn't check the correct return value. This patch will fix it. Link: https://lkml.kernel.org/r/20250312043840.71799-1-cyan.yang@sifive.com Fixes: f4b5fd6946e2 ("selftests/vm: anon_cow: THP tests") Signed-off-by: Cyan Yang <cyan.yang@sifive.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm/debug: add line breaks	Liu Ye
	Missing a newline character at the end of the format string. Link: https://lkml.kernel.org/r/20250312093717.364031-1-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>