2023-06-24  spi: dt-bindings: atmel,at91rm9200-spi: fix broken sam9x7 compatible  (Krzysztof Kozlowski)
Commit a3eb95484f27 ("spi: dt-bindings: atmel,at91rm9200-spi: add sam9x7 compatible"), which added the sam9x7 compatible, did not make sense: it inserted the new compatible into the middle of an existing compatible list. The intention was probably to add a new set of compatibles with sam9x7 as the first one. Fixes: a3eb95484f27 ("spi: dt-bindings: atmel,at91rm9200-spi: add sam9x7 compatible") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/Message-Id: <20230624082054.37697-1-krzysztof.kozlowski@linaro.org> Signed-off-by: Mark Brown <broonie@kernel.org>
2023-06-23  Merge branch 'mlxsw-maintain-candidate-rifs'  (Jakub Kicinski)
Petr Machata says: ==================== mlxsw: Maintain candidate RIFs The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or an SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port and configure an IP address: it gets a RIF. Now enslave the port to a bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There are a number of similar situations, where changing the configuration there and back utterly breaks the offload. The situation is going to be made better by implementing a range of replays and post-hoc offloads. This patch set lays the groundwork for replay of next hops. The particular issue that it deals with is that currently, driver-specific bookkeeping for next hops is hooked off RIF objects, which come and go across the lifetime of a netdevice. We would rather keep these objects at an entity that mirrors the lifetime of the netdevice itself. That way they are at hand and can be offloaded when a RIF is eventually created. To that end, with this patchset, mlxsw keeps a hash table of CRIFs: candidate RIFs, persistent handles for netdevices that mlxsw deems potentially interesting. The lifetime of a CRIF matches that of the underlying netdevice, and thus a RIF can always assume a CRIF exists. A CRIF is where next hops are kept, and when a RIF is created, these next hops can be easily offloaded. (Previously only the next hops created after the RIF was created were offloaded.)
- Patches #1 and #2 are minor adjustments.
- In patches #3 and #4, add CRIF bookkeeping.
- In patch #5, link CRIFs to RIFs such that given a netdevice-backed RIF, the corresponding CRIF is easy to look up.
- Patch #6 is a clean-up allowed by the previous patches.
- Patches #7 and #8 move next hop tracking to CRIFs.
No observable effects are intended as of yet. This will be useful once there is support for RIF creation for netdevices that become mlxsw uppers, which will come in following patch sets. ==================== Link: https://lore.kernel.org/r/cover.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-23  mlxsw: spectrum_router: Track next hops at CRIFs  (Petr Machata)
Move the list of next hops from struct mlxsw_sp_rif to mlxsw_sp_crif. The reason is that eventually, next hops for mlxsw uppers should be offloaded and unoffloaded on demand as a netdevice becomes an upper, or stops being one. Currently, next hops are tracked at RIFs, but RIFs do not exist when a netdevice is not an mlxsw upper. CRIFs, on the other hand, are kept throughout the netdevice's lifetime. Correspondingly, track at each next hop not its RIF, but its CRIF (from which a RIF can always be deduced). Note that now that next hops are tracked at a CRIF, it is no longer necessary to move each one over to a new RIF when a RIF needs to be edited. Therefore drop mlxsw_sp_nexthop_rif_migrate() and have mlxsw_sp_rif_migrate_destroy() call mlxsw_sp_nexthop_rif_update() directly. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/e7c1c0a7dd13883b0f09aeda12c4fcf4d63a70e3.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-23  mlxsw: spectrum_router: Split nexthop finalization into two stages  (Petr Machata)
Nexthop finalization consists of two steps: the part where the offload is removed, because the backing RIF is now gone; and the part where the association to the RIF is severed. Extract from mlxsw_sp_nexthop_type_fini() a helper that covers the unoffloading part, mlxsw_sp_nexthop_type_rif_gone(), so that it can later be called independently. Note that this swaps around the ordering of mlxsw_sp_nexthop_ipip_fini() vs. mlxsw_sp_nexthop_rif_fini(). The current ordering is more of a historical happenstance than a conscious decision. The two cleanups do not depend on each other, and this change should have no observable effects. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/7134559534c5f5c4807c3a1569fae56f8887e763.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-23  mlxsw: spectrum_router: Use router.lb_crif instead of .lb_rif_index  (Petr Machata)
A previous patch added a pointer to the loopback CRIF to the router data structure. That makes the loopback RIF index redundant, as everything necessary can be derived from the CRIF. Drop the field and adjust the code accordingly. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/8637bf959bc5b6c9d5184b9bd8a0cd53c5132835.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-23  mlxsw: spectrum_router: Link CRIFs to RIFs  (Petr Machata)
When a RIF is about to be created, the registration of the netdevice that it should be associated with must have been seen in the past, and a CRIF created. Therefore make this a hard requirement by looking up the CRIF during RIF creation and complaining loudly when there isn't one. This then allows keeping a link between a RIF and its corresponding CRIF (and back, as the relationship is one-to-at-most-one), which this patch does. The CRIF will later be useful as the objects tracked there will be offloaded lazily as a result of RIF creation. CRIFs are created when an "interesting" netdevice is registered, and destroyed after such a device is unregistered. CRIFs are supposed to already exist when a RIF creation request arises, and to exist at least as long as that RIF exists. This makes for a simple invariant: it is always safe to dereference the CRIF pointer from "its" RIF. To guarantee this, CRIFs cannot be removed immediately when the UNREGISTER event is delivered. The reason is that if a RIF's netdevice has an IPv6 address, removal of this address is notified in an atomic block. To remove the RIF, the IPv6 removal handler schedules a work item. It must be safe for this work item to access the associated CRIF as well. Thus when a netdevice that backs the CRIF is removed, if it still has a RIF, do not actually free the CRIF, only set its can_destroy flag, which this patch adds. Later on, mlxsw_sp_rif_destroy() collects the CRIF. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/68c8e33afa6b8c03c431b435e1685ffdff752e63.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
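A minimal sketch of the lifetime rule described above (the structure and helper names here are hypothetical; the actual mlxsw code differs):

	#include <linux/slab.h>

	struct mlxsw_sp_rif;

	struct mlxsw_sp_crif {
		struct mlxsw_sp_rif *rif;	/* NULL while no RIF is instantiated */
		bool can_destroy;		/* set once the backing netdevice is gone */
	};

	/* On NETDEV_UNREGISTER: a deferred work item (e.g. the IPv6 address
	 * removal path) may still reach the CRIF through its RIF, so only
	 * mark the CRIF destroyable instead of freeing it. */
	static void crif_unregister(struct mlxsw_sp_crif *crif)
	{
		if (crif->rif)
			crif->can_destroy = true;
		else
			kfree(crif);
	}

	/* From RIF teardown (cf. mlxsw_sp_rif_destroy()): collect the CRIF
	 * if its netdevice already went away. */
	static void rif_detach_crif(struct mlxsw_sp_crif *crif)
	{
		crif->rif = NULL;
		if (crif->can_destroy)
			kfree(crif);
	}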
2023-06-23  mlxsw: spectrum_router: Maintain CRIF for fallback loopback RIF  (Petr Machata)
CRIFs are generally not maintained for loopback RIFs. However, the RIF for the default VRF is used for offloading of blackhole nexthops, and nexthops expect to have a valid CRIF. Therefore, in this patch, add code to maintain a CRIF for the loopback RIF as well. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/7f2b2fcc98770167ed1254a904c3f7f585ba43f0.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-23  mlxsw: spectrum_router: Maintain a hash table of CRIFs  (Petr Machata)
CRIFs are objects that mlxsw maintains for netdevices that may not have an associated RIF (i.e. they may not have been instantiated in the ASIC), but, if they do not, are quite likely to have one in the future. These netdevices are candidate RIFs, hence CRIFs. Netdevices for which CRIFs are created include e.g. bridges, LAGs, and front panel ports. The idea is that next hops would be kept at CRIFs, not RIFs, which makes it easier to offload and unoffload the entities that were added before the RIF was created. In this patch, add the code for low-level CRIF maintenance: creation and destruction, and a table keyed by the netdevice pointer for easy recall. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/186d44e399c475159da20689f2c540719f2d1ed0.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
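A hedged sketch of such a table, using the kernel's rhashtable keyed by the netdevice pointer (field and helper names are illustrative, not the driver's actual ones):

	#include <linux/rhashtable.h>
	#include <linux/netdevice.h>
	#include <linux/slab.h>

	struct crif_key {
		struct net_device *dev;
	};

	struct crif {
		struct crif_key key;
		struct rhash_head ht_node;
		struct list_head nexthop_list;	/* next hops parked here */
	};

	static const struct rhashtable_params crif_ht_params = {
		.key_offset = offsetof(struct crif, key),
		.key_len = sizeof(struct crif_key),
		.head_offset = offsetof(struct crif, ht_node),
	};

	static struct crif *crif_create(struct rhashtable *ht,
					struct net_device *dev)
	{
		struct crif *crif;
		int err;

		crif = kzalloc(sizeof(*crif), GFP_KERNEL);
		if (!crif)
			return ERR_PTR(-ENOMEM);
		crif->key.dev = dev;
		INIT_LIST_HEAD(&crif->nexthop_list);
		err = rhashtable_insert_fast(ht, &crif->ht_node, crif_ht_params);
		if (err) {
			kfree(crif);
			return ERR_PTR(err);
		}
		return crif;
	}

	/* Easy recall: look a CRIF up by the netdevice that backs it. */
	static struct crif *crif_lookup(struct rhashtable *ht,
					const struct net_device *dev)
	{
		struct crif_key key = { .dev = (struct net_device *)dev };

		return rhashtable_lookup_fast(ht, &key, crif_ht_params);
	}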
2023-06-23  mlxsw: spectrum_router: Use mlxsw_sp_ul_rif_get() to get main VRF LB RIF  (Petr Machata)
The current function, mlxsw_sp_router_ul_rif_get(), is a wrapper around the function mentioned in the subject. As such it forms an external interface of the router code. In future patches we will want to maintain a connection between RIFs and the CRIFs (introduced in the next patch) that back them. That will not hold for the VRF-based loopback netdevices, so the whole CRIF business can be kept hidden from the rest of mlxsw. But for the main VRF loopback RIF we do want to keep the RIF-CRIF connection, because that RIF is used for blackhole next hops, and the next hop code can be kept simpler by assuming rif->crif is valid. Hence, instead, call mlxsw_sp_ul_rif_get() to create the main VRF loopback RIF. Being an internal function, it will take the CRIF argument anyway. Furthermore, the function does not take the lock, which is not yet necessary at this point in the code. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/7a39a011a02a84164cd7f5da7985ec5b2ae01ba5.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-23  mlxsw: spectrum_router: Add extack argument to mlxsw_sp_lb_rif_init()  (Petr Machata)
The extack will be handy in later patches. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/e87ba300121010d580b80a281877573a7b1377ca.1687438411.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-23  bonding: do not assume skb mac_header is set  (Eric Dumazet)
Drivers must not assume in their ndo_start_xmit() that skbs have their mac_header set; skb->data is all that is needed. bonding seems to be one of the last offenders, as caught by syzbot:
WARNING: CPU: 1 PID: 12155 at include/linux/skbuff.h:2907 skb_mac_offset include/linux/skbuff.h:2913 [inline]
WARNING: CPU: 1 PID: 12155 at include/linux/skbuff.h:2907 bond_xmit_hash drivers/net/bonding/bond_main.c:4170 [inline]
WARNING: CPU: 1 PID: 12155 at include/linux/skbuff.h:2907 bond_xmit_3ad_xor_slave_get drivers/net/bonding/bond_main.c:5149 [inline]
WARNING: CPU: 1 PID: 12155 at include/linux/skbuff.h:2907 bond_3ad_xor_xmit drivers/net/bonding/bond_main.c:5186 [inline]
WARNING: CPU: 1 PID: 12155 at include/linux/skbuff.h:2907 __bond_start_xmit drivers/net/bonding/bond_main.c:5442 [inline]
WARNING: CPU: 1 PID: 12155 at include/linux/skbuff.h:2907 bond_start_xmit+0x14ab/0x19d0 drivers/net/bonding/bond_main.c:5470
Modules linked in:
CPU: 1 PID: 12155 Comm: syz-executor.3 Not tainted 6.1.30-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
RIP: 0010:skb_mac_header include/linux/skbuff.h:2907 [inline]
RIP: 0010:skb_mac_offset include/linux/skbuff.h:2913 [inline]
RIP: 0010:bond_xmit_hash drivers/net/bonding/bond_main.c:4170 [inline]
RIP: 0010:bond_xmit_3ad_xor_slave_get drivers/net/bonding/bond_main.c:5149 [inline]
RIP: 0010:bond_3ad_xor_xmit drivers/net/bonding/bond_main.c:5186 [inline]
RIP: 0010:__bond_start_xmit drivers/net/bonding/bond_main.c:5442 [inline]
RIP: 0010:bond_start_xmit+0x14ab/0x19d0 drivers/net/bonding/bond_main.c:5470
Code: 8b 7c 24 30 e8 76 dd 1a 01 48 85 c0 74 0d 48 89 c3 e8 29 67 2e fe e9 15 ef ff ff e8 1f 67 2e fe e9 10 ef ff ff e8 15 67 2e fe <0f> 0b e9 45 f8 ff ff e8 09 67 2e fe e9 dc fa ff ff e8 ff 66 2e fe
RSP: 0018:ffffc90002fff6e0 EFLAGS: 00010283
RAX: ffffffff835874db RBX: 000000000000ffff RCX: 0000000000040000
RDX: ffffc90004dcf000 RSI: 00000000000000b5 RDI: 00000000000000b6
RBP: ffffc90002fff8b8 R08: ffffffff83586d16 R09: ffffffff83586584
R10: 0000000000000007 R11: ffff8881599fc780 R12: ffff88811b6a7b7e
R13: 1ffff110236d4f6f R14: ffff88811b6a7ac0 R15: 1ffff110236d4f76
FS: 00007f2e9eb47700(0000) GS:ffff8881f6b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2e421000 CR3: 000000010e6d4000 CR4: 00000000003526e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
[<ffffffff8471a49f>] netdev_start_xmit include/linux/netdevice.h:4925 [inline]
[<ffffffff8471a49f>] __dev_direct_xmit+0x4ef/0x850 net/core/dev.c:4380
[<ffffffff851d845b>] dev_direct_xmit include/linux/netdevice.h:3043 [inline]
[<ffffffff851d845b>] packet_direct_xmit+0x18b/0x300 net/packet/af_packet.c:284
[<ffffffff851c7472>] packet_snd net/packet/af_packet.c:3112 [inline]
[<ffffffff851c7472>] packet_sendmsg+0x4a22/0x64d0 net/packet/af_packet.c:3143
[<ffffffff8467a4b2>] sock_sendmsg_nosec net/socket.c:716 [inline]
[<ffffffff8467a4b2>] sock_sendmsg net/socket.c:736 [inline]
[<ffffffff8467a4b2>] __sys_sendto+0x472/0x5f0 net/socket.c:2139
[<ffffffff8467a715>] __do_sys_sendto net/socket.c:2151 [inline]
[<ffffffff8467a715>] __se_sys_sendto net/socket.c:2147 [inline]
[<ffffffff8467a715>] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147
[<ffffffff8553071f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff8553071f>] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80
[<ffffffff85600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 7b8fc0103bb5 ("bonding: add a vlan+srcmac tx hashing option") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jarod Wilson <jarod@redhat.com> Cc: Moshe Tal <moshet@nvidia.com> Cc: Jussi Maki <joamaki@gmail.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230622152304.2137482-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
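The essence of the fix, reconstructed from the changelog rather than the verbatim diff (the real bonding helper also folds the VLAN tag into the hash): in ndo_start_xmit() the Ethernet header sits at skb->data, so hash over that instead of calling skb_mac_header().

	#include <linux/etherdevice.h>
	#include <linux/skbuff.h>

	static u32 bond_srcmac_hash_sketch(const struct sk_buff *skb)
	{
		/* skb->data, not skb_mac_header(skb): mac_header may be unset here */
		const struct ethhdr *mac_hdr = (const struct ethhdr *)skb->data;
		u32 hash = 0;
		int i;

		for (i = 0; i < ETH_ALEN; i++)
			hash = (hash << 4) ^ mac_hdr->h_source[i];
		return hash;
	}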
2023-06-23  net: bcmgenet: Ensure MDIO unregistration has clocks enabled  (Florian Fainelli)
With support for Ethernet PHY LEDs having been added, there may be "late" accesses to the MDIO bus while unregistering it and its child devices like PHYs. One typical use case is setting the PHY LED brightness to OFF. We need to ensure that the MDIO bus controller remains entirely functional since it runs off the main GENET adapter clock. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20230617155500.4005881-1-andrew@lunn.ch/ Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230622103107.1760280-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24  Add Renesas PMIC RAA215300 and built-in RTC  (Mark Brown)
Merge series from Biju Das <biju.das.jz@bp.renesas.com>: This patch series aims to add support for the Renesas PMIC RAA215300 and the built-in RTC found on this PMIC device. The details of the PMIC can be found here [1]. The Renesas PMIC RAA215300 exposes two separate i2c devices, one for the main device and another for the RTC device.
2023-06-23  kernel/time/posix-stubs.c: remove duplicated include  (Yang Li)
./kernel/time/posix-stubs.c: linux/syscalls.h is included more than once. Link: https://lkml.kernel.org/r/20230608082312.123939-1-yang.lee@linux.alibaba.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5463 Fixes: c1956519cd7e ("syscalls: add sys_ni_posix_timers prototype") Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  ocfs2: remove redundant assignment to variable bit_off  (Colin Ian King)
Variable bit_off is being assigned a value that is never read; it is re-assigned a new value in the following while loop. Remove the assignment. Cleans up clang scan build warning: fs/ocfs2/localalloc.c:976:18: warning: Although the value stored to 'bit_off' is used in the enclosing expression, the value is never actually read from 'bit_off' [deadcode.DeadStores] Link: https://lkml.kernel.org/r/20230622102736.2831126-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY  (Lukas Bulwahn)
Commit a5fcc2367e22 ("watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific") accidentally introduced a typo in one of the config dependencies of HARDLOCKUP_DETECTOR_PREFER_BUDDY. Fix this accidental typo. Link: https://lkml.kernel.org/r/20230623040717.8645-1-lukas.bulwahn@gmail.com Fixes: a5fcc2367e22 ("watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Douglas Anderson <dianders@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h  (Douglas Anderson)
The powerpc architecture was the only one that defined arch_trigger_cpumask_backtrace() in asm/nmi.h instead of asm/irq.h. Move it to be consistent. This fixes compile-time errors introduced by commit 7ca8fe94aa92 ("watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH"). That commit caused <asm/nmi.h> to stop being included if the hardlockup detector wasn't enabled. The specific errors were: error: implicit declaration of function `nmi_cpu_backtrace' error: implicit declaration of function `nmi_trigger_cpumask_backtrace' NOTE: when moving this into irq.h, we also change the guards from just checking if "CONFIG_NMI_IPI" is defined to also checking if "CONFIG_PPC_BOOK3S_64" is defined. This matches the code in arch/powerpc/kernel/stacktrace.c. Previously this worked because <asm/nmi.h> was included if "CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH" was defined. For powerpc that's only selected if "CONFIG_PPC_BOOK3S_64" is defined. [dianders@chromium.org: change the guards to include CONFIG_PPC_BOOK3S_64] Link: https://lkml.kernel.org/r/20230622202816.v2.1.Ice67126857506712559078e7de26d32d26e64631@changeid Link: https://lkml.kernel.org/r/20230621164809.1.Ice67126857506712559078e7de26d32d26e64631@changeid Fixes: 7ca8fe94aa92 ("watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH") Signed-off-by: Douglas Anderson <dianders@chromium.org> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Closes: https://lore.kernel.org/r/871qi5otdh.fsf@mail.lhotse Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Douglas Anderson <dianders@chromium.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
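A sketch of the resulting declaration in asm/irq.h with the widened guard, reconstructed from the description above:

	#if defined(CONFIG_NMI_IPI) && defined(CONFIG_PPC_BOOK3S_64)
	extern void arch_trigger_cpumask_backtrace(const cpumask_t *mask,
						   bool exclude_self);
	#define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace
	#endif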
2023-06-23  devres: show which resource was invalid in __devm_ioremap_resource()  (Ben Dooks)
The other error prints in this call show the resource which wasn't valid, so add this to the first print as well, where it checks for basic validity of the resource. Link: https://lkml.kernel.org/r/20230621163050.477668-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
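The change amounts to printing the resource with the %pR printk specifier in the validity check, along these lines (hedged reconstruction of the hunk inside __devm_ioremap_resource()):

	if (!res || resource_type(res) != IORESOURCE_MEM) {
		dev_err(dev, "invalid resource %pR\n", res);	/* was: "invalid resource\n" */
		return IOMEM_ERR_PTR(-EINVAL);
	}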
2023-06-23  mm/hugetlb: remove hugetlb_set_page_subpool()  (Sidhartha Kumar)
All users have been converted to hugetlb_set_folio_subpool() so we can safely remove this function. Link: https://lkml.kernel.org/r/20230623054948.280627-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tarun Sahu <tsahu@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()  (lipeifeng)
During the seq_printf(), the mmap_sem_read_lock protection is not required. Link: https://lkml.kernel.org/r/20230622040152.1173-1-lipeifeng@oppo.com Signed-off-by: lipeifeng <lipeifeng@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  hugetlb: revert use of page_cache_next_miss()  (Mike Kravetz)
Ackerley Tng reported an issue with hugetlbfs fallocate as noted in the Closes tag. The issue showed up after the conversion of hugetlb page cache lookup code to use page_cache_next_miss. User-visible effects are:
- hugetlbfs fallocate incorrectly returns -EEXIST if pages are present in the file.
- hugetlb pages will not be included in core dumps if they need to be brought in via GUP.
- userfaultfd UFFDIO_COPY will not notice pages already present in the cache. It may try to allocate a new page and potentially return ENOMEM as opposed to EEXIST.
Revert the use of page_cache_next_miss() in hugetlb code. IMPORTANT NOTE FOR STABLE BACKPORTS: This patch will apply cleanly to v6.3. However, due to the change of filemap_get_folio() return values, it will not function correctly. This patch must be modified for stable backports. [dan.carpenter@linaro.org: fix hugetlbfs_pagecache_present()] Link: https://lkml.kernel.org/r/efa86091-6a2c-4064-8f55-9b44e1313015@moroto.mountain Link: https://lkml.kernel.org/r/20230621212403.174710-2-mike.kravetz@oracle.com Fixes: d0ce0e47b323 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reported-by: Ackerley Tng <ackerleytng@google.com> Closes: https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Erdem Aktas <erdemaktas@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Annapurve <vannapurve@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23Revert "page cache: fix page_cache_next/prev_miss off by one"Mike Kravetz
This reverts commit 9425c591e06a9ab27a145ba655fb50532cf0bcc9 The reverted commit fixed up routines primarily used by readahead code such that they could also be used by hugetlb. Unfortunately, this caused a performance regression as pointed out by the Closes: tag. The hugetlb code which uses page_cache_next_miss will be addressed in a subsequent patch. Link: https://lkml.kernel.org/r/20230621212403.174710-1-mike.kravetz@oracle.com Fixes: 9425c591e06a ("page cache: fix page_cache_next/prev_miss off by one") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202306211346.1e9ff03e-oliver.sang@intel.com Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Erdem Aktas <erdemaktas@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Annapurve <vannapurve@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node  (Yosry Ahmed)
When memory.reclaim was introduced, it became the first case where cgroup_reclaim() is true for the root cgroup. Johannes concluded [1] that for most cases this is okay, except for one case. Historically, kswapd would throttle reclaim on a node if a lot of pages marked for reclaim are under writeback (aka the node is congested). This occurred by setting the LRUVEC_CONGESTED bit in lruvec->flags. The bit would be cleared when the node is balanced. Similarly, cgroup reclaim would set the same bit when an lruvec is congested, and clear it on the way out of reclaim (to throttle local reclaimers). Before the introduction of memory.reclaim, the root memcg was the only target of kswapd reclaim, and non-root memcgs were the only targets of cgroup reclaim, so they would never interfere. Using the same bit for both was fine. After memory.reclaim, it is possible for cgroup reclaim on the root cgroup to clear the bit set by kswapd. This would result in reclaim on the node being unthrottled before the node is balanced. Fix this by introducing separate bits for cgroup-level and node-level congestion. kswapd can unthrottle an lruvec that is marked as congested by cgroup reclaim (as the entire node should no longer be congested), but not vice versa (to prevent premature unthrottling before the entire node is balanced). [1] https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/ Link: https://lkml.kernel.org/r/20230621023101.432780-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/ Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
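The split described above boils down to two flag bits where there used to be one (sketch; the real enum lives in include/linux/mmzone.h):

	enum lruvec_flags {
		/* set by cgroup reclaim and cleared on the way out of reclaim;
		 * kswapd may also clear it once the whole node is balanced */
		LRUVEC_CGROUP_CONGESTED,
		/* set and cleared only by kswapd, so root-cgroup memory.reclaim
		 * can no longer unthrottle an unbalanced node */
		LRUVEC_NODE_CONGESTED,
	};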
2023-06-23  mm: memcg: rename and document global_reclaim()  (Yosry Ahmed)
Evidently, global_reclaim() can be a confusing name, especially since it used to exist before with a subtly different definition (removed by commit b5ead35e7e1d ("mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()")). It can be interpreted as non-cgroup reclaim, even though it returns true for cgroup reclaim on the root memcg (through memory.reclaim). Rename it to root_reclaim() in an attempt to make it less ambiguous, and add documentation to it as well as to cgroup_reclaim(). Link: https://lkml.kernel.org/r/20230621023053.432374-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/ Acked-by: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
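After the rename, the two predicates look roughly like this (sketch based on the changelog; see mm/vmscan.c for the documented versions):

	/* Reclaim targeted at a specific cgroup: memory.reclaim or limit
	 * reclaim, as opposed to kswapd/global direct reclaim. */
	static bool cgroup_reclaim(struct scan_control *sc)
	{
		return sc->target_mem_cgroup;
	}

	/* Reclaim on behalf of the root of the hierarchy: global reclaim
	 * plus memory.reclaim on the root cgroup. */
	static bool root_reclaim(struct scan_control *sc)
	{
		return !sc->target_mem_cgroup ||
		       mem_cgroup_is_root(sc->target_mem_cgroup);
	}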
2023-06-23  mm: kill [add|del]_page_to_lru_list()  (Kefeng Wang)
Now that no one calls [add|del]_page_to_lru_list(), drop the unused page interfaces. Link: https://lkml.kernel.org/r/20230619110718.65679-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: James Gowans <jgowans@amazon.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: compaction: convert to use a folio in isolate_migratepages_block()  (Kefeng Wang)
Directly use a folio instead of page_folio() when a page is successfully isolated (hugepage and movable page) and after folio_get_nontail_page(), which removes several calls to compound_head(). Link: https://lkml.kernel.org/r/20230619110718.65679-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: James Gowans <jgowans@amazon.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: zswap: fix double invalidate with exclusive loads  (Yosry Ahmed)
If exclusive loads are enabled for zswap, we invalidate the entry before returning from zswap_frontswap_load(), after dropping the local reference. However, the tree lock is dropped during decompression after the local reference is acquired, so the entry could be invalidated before we drop the local ref. If this happens, the entry is freed once we drop the local ref, and zswap_invalidate_entry() tries to invalidate an already freed entry. Fix this by: (a) Making sure zswap_invalidate_entry() is always called with a local ref held, to avoid being called on a freed entry. (b) Making sure zswap_invalidate_entry() only drops the ref if the entry was actually on the rbtree. Otherwise, another invalidation could have already happened, and the initial ref is already dropped. With these changes, there is no need for zswap_reclaim_entry() to make sure the entry still exists in the tree before invalidating it, as zswap_invalidate_entry() will make this check internally. Link: https://lkml.kernel.org/r/20230621093009.637544-1-yosryahmed@google.com Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
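Point (b) reduces to making the rbtree erase report whether it actually removed the entry, roughly as follows (hedged sketch of mm/zswap.c internals):

	/* Returns true if the entry was still on the tree and has been erased. */
	static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
	{
		if (RB_EMPTY_NODE(&entry->rbnode))
			return false;
		rb_erase(&entry->rbnode, root);
		RB_CLEAR_NODE(&entry->rbnode);
		return true;
	}

	static void zswap_invalidate_entry(struct zswap_tree *tree,
					   struct zswap_entry *entry)
	{
		/* drop the tree's initial reference only if we took the entry
		 * off the tree; another invalidation may have beaten us to it */
		if (zswap_rb_erase(&tree->rbroot, entry))
			zswap_entry_put(tree, entry);
	}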
2023-06-23  mm: remove unnecessary pagevec includes  (Matthew Wilcox (Oracle))
These files no longer need pagevec.h, mostly due to function declarations being moved out of it. Link: https://lkml.kernel.org/r/20230621164557.3510324-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: remove references to pagevec  (Matthew Wilcox (Oracle))
Most of these should just refer to the LRU cache rather than the data structure used to implement the LRU cache. Link: https://lkml.kernel.org/r/20230621164557.3510324-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate  (Matthew Wilcox (Oracle))
We don't use pagevecs for the LRU cache any more, and we don't know that the failed invalidations were due to the folio being in an LRU cache. So rename it to be more accurate. Link: https://lkml.kernel.org/r/20230621164557.3510324-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: remove struct pagevec  (Matthew Wilcox (Oracle))
All users are now converted to use the folio_batch so we can get rid of this data structure. Link: https://lkml.kernel.org/r/20230621164557.3510324-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  net: convert sunrpc from pagevec to folio_batch  (Matthew Wilcox (Oracle))
Remove the last usage of pagevecs. There is a slight change here; we now free the folio_batch as soon as it fills up instead of freeing the folio_batch when we try to add a page to a full batch. This should have no effect in practice. Link: https://lkml.kernel.org/r/20230621164557.3510324-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
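The release pattern described above, freeing as soon as the batch fills, looks like this (sketch; folio_batch_add() returns the number of slots still available):

	#include <linux/pagevec.h>	/* struct folio_batch lives here */

	static void stash_and_release(struct folio_batch *fbatch,
				      struct folio *folio)
	{
		/* release immediately when the batch becomes full, rather
		 * than when the next add would find it full */
		if (!folio_batch_add(fbatch, folio))
			folio_batch_release(fbatch);
	}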
2023-06-23  i915: convert i915_gpu_error to use a folio_batch  (Matthew Wilcox (Oracle))
Remove one of the last remaining users of pagevec. Link: https://lkml.kernel.org/r/20230621164557.3510324-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  pagevec: rename fbatch_count()  (Matthew Wilcox (Oracle))
This should always have been called folio_batch_count(). Link: https://lkml.kernel.org/r/20230621164557.3510324-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: remove check_move_unevictable_pages()  (Matthew Wilcox (Oracle))
All callers have now been converted to call check_move_unevictable_folios(). Link: https://lkml.kernel.org/r/20230621164557.3510324-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  drm: convert drm_gem_put_pages() to use a folio_batch  (Matthew Wilcox (Oracle))
Remove a few hidden compound_head() calls by converting the returned page to a folio once and using the folio APIs. Link: https://lkml.kernel.org/r/20230621164557.3510324-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  i915: convert shmem_sg_free_table() to use a folio_batch  (Matthew Wilcox (Oracle))
Remove a few hidden compound_head() calls by converting the returned page to a folio once and using the folio APIs. We also only increment the refcount on the folio once instead of once for each page. Ideally, we would have a for_each_sgt_folio macro, but until then this will do. Link: https://lkml.kernel.org/r/20230621164557.3510324-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  scatterlist: add sg_set_folio()  (Matthew Wilcox (Oracle))
This wrapper for sg_set_page() lets drivers add folios to a scatterlist more easily. We could, perhaps, do better by using a different page in the folio if offset is larger than UINT_MAX, but let's hope we get a better data structure than this before we need to care about such large folios. Link: https://lkml.kernel.org/r/20230621164557.3510324-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
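The wrapper is essentially a folio-typed front end to the sg_set_page() machinery, along these lines (reconstructed from the changelog; see include/linux/scatterlist.h for the authoritative version):

	static inline void sg_set_folio(struct scatterlist *sg, struct folio *folio,
					size_t len, size_t offset)
	{
		/* sg->length and sg->offset are unsigned int */
		WARN_ON_ONCE(len > UINT_MAX);
		WARN_ON_ONCE(offset > UINT_MAX);
		sg_assign_page(sg, &folio->page);
		sg->offset = offset;
		sg->length = len;
	}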
2023-06-23  mm: add __folio_batch_release()  (Matthew Wilcox (Oracle))
This performs the same role as __pagevec_release(), i.e. skipping the check for a batch length of 0. Link: https://lkml.kernel.org/r/20230621164557.3510324-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
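The assumed relationship between the two helpers, mirroring pagevec_release()/__pagevec_release() (sketch, not the verbatim patch):

	static inline void folio_batch_release(struct folio_batch *fbatch)
	{
		if (folio_batch_count(fbatch))
			__folio_batch_release(fbatch);	/* skips the zero-length check */
	}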
2023-06-23  afs: convert pagevec to folio_batch in afs_extend_writeback()  (Matthew Wilcox (Oracle))
Patch series "Remove pagevecs". Removes a folio->page->folio conversion for each folio that's involved. More importantly, removes one of the last few uses of a pagevec. Link: https://lkml.kernel.org/r/20230621164557.3510324-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230621164557.3510324-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: page_alloc: use the correct type of list for free pages  (Baolin Wang)
Commit bf75f200569d ("mm/page_alloc: add page->buddy_list and page->pcp_list") introduces page->buddy_list and page->pcp_list as a union with page->lru, but missed changing get_page_from_free_area() to use page->buddy_list to clarify the correct type of list for a free page. Link: https://lkml.kernel.org/r/7e7ab533247d40c0ea0373c18a6a48e5667f9e10.1687333557.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
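With the union in place, the accessor should name the buddy list rather than the lru field it aliases; a hedged reconstruction of the fix:

	static inline struct page *get_page_from_free_area(struct free_area *area,
							   int migratetype)
	{
		return list_first_entry_or_null(&area->free_list[migratetype],
						struct page, buddy_list);	/* was: lru */
	}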
2023-06-23  mm: backing-dev: make bdi_class a static const structure  (Ivan Orlov)
Now that the driver core allows for struct class to be in read-only memory, move the bdi_class structure to be declared at build time, placing it into read-only memory, instead of having to be dynamically allocated at load time. Link: https://lkml.kernel.org/r/20230620183314.682822-2-gregkh@linuxfoundation.org Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
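The pattern, assuming the newer driver core where class_register() accepts a const struct class * (sketch; bdi_dev_groups is the pre-existing attribute-group array in mm/backing-dev.c):

	static const struct class bdi_class = {
		.name		= "bdi",
		.dev_groups	= bdi_dev_groups,
	};

	static int __init bdi_class_init(void)
	{
		/* no more class_create() allocation at load time */
		return class_register(&bdi_class);
	}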
2023-06-23  mm/swapfile: delete outdated pte_offset_map() comment  (Hugh Dickins)
Delete a triply out-of-date comment from add_swap_count_continuation(): 1. vmalloc_to_page() changed from pte_offset_map() to pte_offset_kernel() 2. pte_offset_map() changed from using kmap_atomic() to kmap_local_page() 3. kmap_atomic() changed from using fixed FIX_KMAP addresses in 2.6.37. Link: https://lkml.kernel.org/r/9022632b-ba9d-8cb0-c25-4be9786481b5@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm: pass nid to reserve_bootmem_region()  (Yajun Deng)
early_pfn_to_nid() is called frequently in init_reserved_page(); it returns the node id of the PFN. These PFNs are probably from the same memory region, so they have the same node id. It's not necessary to call early_pfn_to_nid() for each PFN. Pass nid to reserve_bootmem_region() and drop the call to early_pfn_to_nid() in init_reserved_page(). Also, set nid on all reserved pages before doing this, as some reserved memory regions may not have the nid set. The function that benefits most is memmap_init_reserved_pages(), if CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled. The following data was tested on an x86 machine with 190GB of RAM. before: memmap_init_reserved_pages() 67ms after: memmap_init_reserved_pages() 20ms Link: https://lkml.kernel.org/r/20230619023406.424298-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm/gup: do not return 0 from pin_user_pages_fast() for bad args  (Jason Gunthorpe)
These routines are not intended to return zero; the callers cannot do anything sane with a 0 return. They should return an error which means future calls to GUP will not succeed, or they should return some non-zero number of pinned pages which means GUP should be called again. If start + nr_pages overflows it should return -EOVERFLOW to signal the arguments are invalid. Syzkaller keeps tripping on this when fuzzing GUP arguments. Link: https://lkml.kernel.org/r/0-v1-3d5ed1f20d50+104-gup_overflow_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reported-by: syzbot+353c7be4964c6253f24a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000094fdd05faa4d3a4@google.com Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
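A sketch of the argument policy described above (illustrative function; the real checks sit in mm/gup.c's internal helpers):

	static long pin_pages_checked(unsigned long start, unsigned long nr_pages)
	{
		/* start + nr_pages must not wrap the address space */
		if (start + (nr_pages << PAGE_SHIFT) < start)
			return -EOVERFLOW;
		/* bad arguments: fail hard instead of returning 0, which the
		 * caller cannot act on */
		if (!nr_pages)
			return -EINVAL;
		/* ... pin and return the positive number of pinned pages ... */
		return nr_pages;
	}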
2023-06-23  mm: fix shmem THP counters on migration  (Jan Glauber)
The per-node numa_stat values for shmem don't change on page migration for THP:
grep shmem /sys/fs/cgroup/machine.slice/.../memory.numa_stat:
shmem N0=1092616192 N1=10485760
shmem_thp N0=1092616192 N1=10485760
migratepages 9181 0 1:
shmem N0=0 N1=1103101952
shmem_thp N0=1092616192 N1=10485760
Fix that by updating the shmem_thp counters on page migration in the same way as the shmem counters. [jglauber@digitalocean.com: use folio_test_pmd_mappable instead of folio_test_transhuge] Link: https://lkml.kernel.org/r/20230622094720.510540-1-jglauber@digitalocean.com Link: https://lkml.kernel.org/r/20230619103351.234837-1-jglauber@digitalocean.com Signed-off-by: Jan Glauber <jglauber@digitalocean.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
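The fix, as described, mirrors the NR_SHMEM update with an NR_SHMEM_THPS update in the migration stats path. A hedged reconstruction, wrapped in a hypothetical helper for readability (the real hunk lives inline in mm/migrate.c's folio_migrate_mapping()):

	static void migrate_shmem_stats(struct lruvec *old_lruvec,
					struct lruvec *new_lruvec,
					struct folio *folio, long nr)
	{
		if (folio_test_swapbacked(folio) && !folio_test_swapcache(folio)) {
			__mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
			__mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
			/* the missing piece: move the THP counter too */
			if (folio_test_pmd_mappable(folio)) {
				__mod_lruvec_state(old_lruvec, NR_SHMEM_THPS, -nr);
				__mod_lruvec_state(new_lruvec, NR_SHMEM_THPS, nr);
			}
		}
	}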
2023-06-23  selftests: cgroup: fix unexpected failure on test_memcg_sock  (Haifeng Xu)
Before the server gets a client connection, there are some memory allocations in the test memcg, such as the user stack. So do not count allocations that are unrelated to sockets when checking socket memory accounting. Link: https://lkml.kernel.org/r/20230619124735.2124-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  mm/memcontrol: do not tweak node in mem_cgroup_init()  (Haifeng Xu)
mem_cgroup_init() requests allocations from each possible node, and this used to be a problem because NODE_DATA was not allocated for offline nodes. Things have already changed since commit 09f49dca570a9 ("mm: handle uninitialized numa nodes gracefully"), so it's unnecessary to check for !node_online nodes here. How to test?
qemu-system-x86_64 \
-kernel vmlinux \
-initrd full.rootfs.cpio.gz \
-append "console=ttyS0,115200 root=/dev/ram0 nokaslr earlyprintk=serial oops=panic panic_on_warn" \
-drive format=qcow2,file=vm_disk.qcow2,media=disk,if=ide \
-enable-kvm \
-cpu host \
-m 8G,slots=2,maxmem=16G \
-smp cores=4,threads=1,sockets=2 \
-object memory-backend-ram,id=mem0,size=4G \
-object memory-backend-ram,id=mem1,size=4G \
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \
-numa node,nodeid=2 \
-net nic,model=virtio,macaddr=52:54:00:12:34:58 \
-net user \
-nographic \
-rtc base=localtime \
-gdb tcp::6000
Guest state when booting:
[ 0.048881] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
[ 0.050489] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0x00000000-0x13fffffff]
[ 0.052173] NODE_DATA(0) allocated [mem 0x13fffc000-0x13fffffff]
[ 0.053164] NODE_DATA(1) allocated [mem 0x23fffa000-0x23fffdfff]
[ 0.054187] Zone ranges:
[ 0.054587] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.055551] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.056515] Normal [mem 0x0000000100000000-0x000000023fffffff]
[ 0.057484] Movable zone start for each node
[ 0.058149] Early memory node ranges
[ 0.058705] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.059679] node 0: [mem 0x0000000000100000-0x00000000bffdffff]
[ 0.060659] node 0: [mem 0x0000000100000000-0x000000013fffffff]
[ 0.061649] node 1: [mem 0x0000000140000000-0x000000023fffffff]
[ 0.062638] Initmem setup node 0 [mem 0x0000000000001000-0x000000013fffffff]
[ 0.063745] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff]
[ 0.064855] DMA zone: 158 reserved pages exceeds freesize 0
[ 0.065746] Initializing node 2 as memoryless
[ 0.066437] Initmem setup node 2 as memoryless
[ 0.067132] DMA zone: 158 reserved pages exceeds freesize 0
[ 0.068037] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.068265] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.124755] On node 0, zone Normal: 32 pages in unavailable ranges
cat /sys/devices/system/node/online
0-1
cat /sys/devices/system/node/possible
0-2
Link: https://lkml.kernel.org/r/20230619130442.2487-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23  kasan,kmsan: remove __GFP_KSWAPD_RECLAIM usage from kasan/kmsan  (Tetsuo Handa)
syzbot is reporting a lockdep warning in __stack_depot_save(), because the caller of __stack_depot_save() (i.e. __kasan_record_aux_stack() in this report) is responsible for masking the __GFP_KSWAPD_RECLAIM flag in order not to wake kswapd, which in turn wakes kcompactd. Since kasan/kmsan functions might be called with arbitrary locks held, mask the __GFP_KSWAPD_RECLAIM flag from all GFP_NOWAIT/GFP_ATOMIC allocations in kasan/kmsan. Note that kmsan_save_stack_with_flags() is changed to mask both the __GFP_DIRECT_RECLAIM flag and the __GFP_KSWAPD_RECLAIM flag, because wakeup_kswapd(), called from wake_all_kswapds() from __alloc_pages_slowpath(), calls wakeup_kcompactd() if the __GFP_KSWAPD_RECLAIM flag is set and the __GFP_DIRECT_RECLAIM flag is not set. Link: https://lkml.kernel.org/r/656cb4f5-998b-c8d7-3c61-c2d37aa90f9a@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+ece2915262061d6e0ac1@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=ece2915262061d6e0ac1 Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
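The masking itself is a one-liner applied to the allocation flags before any stack-depot or metadata allocation; a sketch with a hypothetical helper name:

	#include <linux/gfp.h>

	/* Safe under arbitrary locks: wakes neither kswapd nor, transitively,
	 * kcompactd. */
	static inline gfp_t kasan_gfp_nowake(gfp_t flags)
	{
		return flags & ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
	}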
2023-06-23  mm: compaction: skip memory hole rapidly when isolating migratable pages  (Baolin Wang)
On some machines, the normal zone can have a large memory hole, as in the memory layout below, where the range from 0x100000000 to 0x1800000000 is a hole. So when isolating migratable pages, the scanner can meet the hole and will take more time to skip over it. From my measurement, the isolation scanner takes 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. So add a new helper to quickly search for the next online memory section and skip the large hole, which helps to find the next suitable pageblock efficiently. With this patch, scanning the large hole takes < 1us.
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff]
[ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff]
[ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff]
[baolin.wang@linux.alibaba.com: limit next_ptn to not exceed cc->free_pfn] Link: https://lkml.kernel.org/r/a1d859c28af0c7e85e91795e7473f553eb180a9d.1686813379.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/75b4c8ca36bf44ad8c42bf0685ac19d272e426ec.1686705221.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
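A sketch of such a helper on SPARSEMEM, built from the existing section primitives in include/linux/mmzone.h (close to, but not necessarily identical with, the merged version):

	/* Return the first PFN of the next online section after start_pfn,
	 * or 0 if start_pfn's section is already online (no hole to skip). */
	static unsigned long skip_offline_sections(unsigned long start_pfn)
	{
		unsigned long start_nr = pfn_to_section_nr(start_pfn);

		if (online_section_nr(start_nr))
			return 0;

		while (++start_nr <= __highest_present_section_nr) {
			if (online_section_nr(start_nr))
				return section_nr_to_pfn(start_nr);
		}

		return 0;
	}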