2025-05-12crypto: null - merge CRYPTO_NULL2 into CRYPTO_NULLEric Biggers
There is no reason to have separate CRYPTO_NULL2 and CRYPTO_NULL options. Just merge them into CRYPTO_NULL. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: null - remove the default null skcipherEric Biggers
crypto_get_default_null_skcipher() and crypto_put_default_null_skcipher() are no longer used, so remove them. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: krb5enc - do not select CRYPTO_NULLEric Biggers
The krb5enc code does not use any of the so-called "null algorithms", so it does not need to select CRYPTO_NULL. Presumably this unused dependency got copied from one of the other kconfig options. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: geniv - use memcpy_sglist() instead of null skcipherEric Biggers
For copying data between two scatterlists, just use memcpy_sglist() instead of the so-called "null skcipher". This is much simpler. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: gcm - use memcpy_sglist() instead of null skcipherEric Biggers
For copying data between two scatterlists, just use memcpy_sglist() instead of the so-called "null skcipher". This is much simpler. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: authenc - use memcpy_sglist() instead of null skcipherEric Biggers
For copying data between two scatterlists, just use memcpy_sglist() instead of the so-called "null skcipher". This is much simpler. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: algif_aead - use memcpy_sglist() instead of null skcipherEric Biggers
For copying data between two scatterlists, just use memcpy_sglist() instead of the so-called "null skcipher". This is much simpler. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
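An illustrative sketch of the pattern shared by the five commits above: the old path allocated the default "null skcipher" and built a full skcipher_request just to move bytes between scatterlists, while the new path is a single call. The surrounding helper is hypothetical; the memcpy_sglist() signature is assumed from include/crypto/scatterwalk.h:

	#include <crypto/scatterwalk.h>

	/* Copy the associated data from src to dst when they differ.
	 * Previously this required crypto_skcipher_encrypt() on the
	 * "null skcipher" just to perform the copy. */
	static void copy_assoc(struct aead_request *req)
	{
		if (req->src != req->dst)
			memcpy_sglist(req->dst, req->src, req->assoclen);
	}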
2025-05-12crypto: lib/chacha - add array bounds to function prototypesEric Biggers
Add explicit array bounds to the function prototypes for the parameters that didn't already get handled by the conversion to use chacha_state:

- chacha_block_*(): Change 'u8 *out' or 'u8 *stream' to 'u8 out[CHACHA_BLOCK_SIZE]'.
- hchacha_block_*(): Change 'u32 *out' or 'u32 *stream' to 'u32 out[HCHACHA_OUT_WORDS]'.
- chacha_init(): Change 'const u32 *key' to 'const u32 key[CHACHA_KEY_WORDS]'. Change 'const u8 *iv' to 'const u8 iv[CHACHA_IV_SIZE]'.

No functional changes. This just makes it clear when fixed-size arrays are expected. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
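As a sketch, one of the hchacha prototypes before and after this conversion (exact declaration site approximate):

	/* before: the expected output size was only implied */
	void hchacha_block_arch(const struct chacha_state *state,
				u32 *out, int nrounds);

	/* after: the bound is part of the prototype and visible to tooling */
	void hchacha_block_arch(const struct chacha_state *state,
				u32 out[HCHACHA_OUT_WORDS], int nrounds);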
2025-05-12crypto: lib/chacha - add strongly-typed state zeroizationEric Biggers
Now that the ChaCha state matrix is strongly-typed, add a helper function chacha_zeroize_state() which zeroizes it. Then convert all applicable callers to use it instead of direct memzero_explicit. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
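The helper plausibly reduces to a typed wrapper around memzero_explicit() (a sketch; the real definition may differ):

	static inline void chacha_zeroize_state(struct chacha_state *state)
	{
		/* explicit zeroization the compiler cannot optimize away */
		memzero_explicit(state, sizeof(*state));
	}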
2025-05-12crypto: lib/chacha - use struct assignment to copy stateEric Biggers
Use struct assignment instead of memcpy() in lib/crypto/chacha.c where appropriate. No functional change. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: lib/chacha - strongly type the ChaCha stateEric Biggers
The ChaCha state matrix is 16 32-bit words. Currently it is represented in the code as a raw u32 array, or even just a pointer to u32. This weak typing is error-prone. Instead, introduce struct chacha_state:

	struct chacha_state {
		u32 x[16];
	};

Convert all ChaCha and HChaCha functions to use struct chacha_state. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
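One benefit of the strong typing, sketched with a hypothetical caller: a mismatched raw buffer now fails at compile time instead of silently corrupting state:

	void fill_block(struct chacha_state *state, u8 out[CHACHA_BLOCK_SIZE])
	{
		chacha_block_generic(state, out, 20);	/* OK: typed state */
	}

	/* passing a bare u32 array (the old habit) is now a type error:
	 *	u32 raw[16];
	 *	chacha_block_generic(raw, out, 20);  // incompatible pointer type
	 */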
2025-05-12crypto: crypto4xx - Remove ahash-related codeHerbert Xu
The hash implementation in crypto4xx has been disabled since 2009. As nobody has tried to fix it since, remove all the dead code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12powerpc/pseries/htmdump: Include header file to get is_kvm_guest() definitionAthira Rajeev
htmdump_init calls is_kvm_guest() to check for a guest environment. is_kvm_guest() is defined in the kvm_guest.h header file. Without including that header, the build fails because it cannot find the function:

arch/powerpc/platforms/pseries/htmdump.c: In function 'htmdump_init':
arch/powerpc/platforms/pseries/htmdump.c:469:6: error: implicit declaration of function 'is_kvm_guest'; did you mean '__key_get'? [-Werror=implicit-function-declaration]
  if (is_kvm_guest()) {
      ^~~~~~~~~~~~
      __key_get

This is observed in configs where CONFIG_KVM_GUEST is disabled. Include the header file explicitly to avoid the build error. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505061324.elUl4njU-lkp@intel.com/ Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250506135232.69014-1-atrajeev@linux.ibm.com
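The fix amounts to making the declaration visible (header path assumed from the commit text; on powerpc, is_kvm_guest() is declared in asm/kvm_guest.h and compiles to a constant false when CONFIG_KVM_GUEST is disabled):

	#include <asm/kvm_guest.h>	/* declares is_kvm_guest() */

	static int __init htmdump_init(void)
	{
		if (is_kvm_guest()) {
			/* guest-only handling; body elided, hypothetical */
		}
		return 0;
	}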
2025-05-12KVM: PPC: Book3S HV: Fix IRQ map warnings with XICS on pSeries KVM GuestAmit Machhiwal
The commit 9576730d0e6e ("KVM: PPC: select IRQ_BYPASS_MANAGER") enabled IRQ_BYPASS_MANAGER when CONFIG_KVM was set. Subsequently, commit c57875f5f9be ("KVM: PPC: Book3S HV: Enable IRQ bypass") enabled IRQ bypass and added the necessary callbacks to create/remove the mappings between host real IRQs and guest GSIs. The availability of IRQ bypass is determined by the arch-specific function kvm_arch_has_irq_bypass(), which invokes kvmppc_irq_bypass_add_producer_hv(). This function, in turn, calls kvmppc_set_passthru_irq_hv() to create a mapping in the passthrough IRQ map, associating a host IRQ with a guest GSI. However, when a pSeries KVM guest (L2) is booted within an LPAR (L1) with the kernel boot parameter `xive=off`, it defaults to using the emulated XICS controller. Attempts to establish host IRQ to guest GSI mappings via kvmppc_set_passthru_irq() on a PCI device hotplug (passthrough) operation then fail, returning -ENOENT. This failure occurs because only interrupts whose EOI operations are handled through OPAL calls (verified via is_pnv_opal_msi()) are currently supported. These mapping failures lead to the repeated warnings below in the L1 host:

[  509.220349] kvmppc_set_passthru_irq_hv: Could not assign IRQ map for (58,4970)
[  509.220368] kvmppc_set_passthru_irq (irq 58, gsi 4970) fails: -2
[  509.220376] vfio-pci 0015:01:00.0: irq bypass producer (token 0000000090bc635b) registration fails: -2
...
[  509.291781] vfio-pci 0015:01:00.0: irq bypass producer (token 000000003822eed8) registration fails: -2

Fix this by restricting IRQ bypass enablement on pSeries systems: make the IRQ bypass callbacks unavailable when running on the pSeries platform. Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250425185641.1611857-1-amachhiw@linux.ibm.com
2025-05-12powerpc/8xx: Reduce alignment constraint for kernel memoryChristophe Leroy
8xx has three large page sizes: 8M, 512k and 16k. Too large an alignment can waste memory; on a board with only 32 MBytes of RAM, every single byte counts, and a 512k alignment is sometimes too much. Allow mapping kernel memory with 16k pages and reduce the constraint on kernel memory alignment. 512k and 16k pages are handled the same way, so reverse the tests to make 8M pages the special case and the other ones (512k and 16k) the alternative. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/fa9927b70df13627cdf10b992ea71d6562c7760e.1746191262.git.christophe.leroy@csgroup.eu
2025-05-12Add more devm_ functions to fix PM imbalance inMark Brown
Merge series from Bence Csókás <csokas.bence@prolan.hu>: The probe() function of the atmel-quadspi driver got quite convoluted, especially since the addition of SAMA7G5 support, which was forward-ported from an older vendor kernel. During the port, a bug was introduced where the PM get() and put() calls were imbalanced. To alleviate this - and similar problems in the future - an effort was made to migrate as many functions as possible to their devm_-managed counterparts. The few functions that did not yet have a devm_ variant are added in patch 1 of this series. Patch 2 then uses these APIs to fix the probe() function.
2025-05-11Merge tag 'arm64_cbpf_mitigation_2025_05_08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 cBPF BHB mitigation from James Morse: "This adds the BHB mitigation into the code JITted for cBPF programs, as these can be loaded by unprivileged users via features like seccomp. The existing mechanisms to disable the BHB mitigation will also prevent the mitigation being JITted. In addition, cBPF programs loaded by processes with the SYS_ADMIN capability are not mitigated, as these could equally load an eBPF program that does the same thing. For good measure, the list of 'k' values for CPUs' local mitigations is updated from the version on arm's website"

* tag 'arm64_cbpf_mitigation_2025_05_08' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: proton-pack: Add new CPUs 'k' values for branch mitigation
  arm64: bpf: Only mitigate cBPF programs loaded by unprivileged users
  arm64: bpf: Add BHB mitigation to the epilogue for cBPF programs
  arm64: proton-pack: Expose whether the branchy loop k value
  arm64: proton-pack: Expose whether the platform is mitigated by firmware
  arm64: insn: Add support for encoding DSB
2025-05-11mm: userfaultfd: correct dirty flags set for both present and swap pteBarry Song
As David pointed out, what truly matters for mremap and userfaultfd move operations is the soft dirty bit. The current comment and implementation—which always sets the dirty bit for present PTEs and fails to set the soft dirty bit for swap PTEs—are incorrect. This could break features like Checkpoint-Restore in Userspace (CRIU). This patch updates the behavior to correctly set the soft dirty bit for both present and swap PTEs in accordance with mremap. Link: https://lkml.kernel.org/r/20250508220912.7275-1-21cnbao@gmail.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reported-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/02f14ee1-923f-47e3-a994-4950afb9afcc@redhat.com/ Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11zsmalloc: don't underflow size calculation in zs_obj_write()Sergey Senozhatsky
Do not mix class->size and object size during offset/size calculations in zs_obj_write(). Size classes can merge into clusters, based on objects-per-zspage and pages-per-zspage characteristics, so some size classes can store objects smaller than class->size. This becomes problematic when the object size is much smaller than class->size. zsmalloc can falsely decide that the object spans two physical pages, because the larger class->size value is used for that check, while the actual object is much smaller and fits in the free space of the first physical page, so there is nothing to write to the second page and the memcpy() size calculation underflows.

Unable to handle kernel paging request at virtual address ffffc00081ff4000
 pc : __memcpy+0x10/0x24
 lr : zs_obj_write+0x1b0/0x1d0 [zsmalloc]
Call trace:
 __memcpy+0x10/0x24 (P)
 zram_write_page+0x150/0x4fc [zram]
 zram_submit_bio+0x5e0/0x6a4 [zram]
 __submit_bio+0x168/0x220
 submit_bio_noacct_nocheck+0x128/0x2c8
 submit_bio_noacct+0x19c/0x2f8

This is mostly seen on systems with larger page sizes, because size class clusters on such systems hold wider size ranges than on 4K PAGE_SIZE systems. Assume a 16K PAGE_SIZE system and a write of an 820-byte object to an 864-byte size class at offset 15560. 15560 + 864 is more than 16384, so zsmalloc attempts to memcpy() it to two physical pages. However, 16384 - 15560 = 824, which is more than 820, so the object in fact doesn't span two physical pages, and there is no data to write to the second physical page. We always know the exact size in bytes of the object that we are about to write (store), so use it instead of class->size. Link: https://lkml.kernel.org/r/20250507054312.4135983-1-senozhatsky@chromium.org Fixes: 44f76413496e ("zsmalloc: introduce new object mapping API") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Igor Belousov <igor.b@beldev.am> Tested-by: Igor Belousov <igor.b@beldev.am> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
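The arithmetic from the example above, as a sketch (16K PAGE_SIZE; variable names hypothetical):

	unsigned int off      = 15560;	/* object offset within the zspage */
	unsigned int class_sz = 864;	/* merged cluster's class->size    */
	unsigned int obj_sz   = 820;	/* actual object size              */

	/* buggy check: class->size suggests a two-page span */
	bool spans_pages = off + class_sz > 16384;	/* 16424 > 16384: true */

	/* but the object fits the first page: 16384 - 15560 = 824 >= 820 */
	unsigned int first  = 16384 - off;	/* 824 bytes available      */
	unsigned int second = obj_sz - first;	/* 820 - 824 wraps to a huge
						   unsigned value: underflow */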
2025-05-11mm/page_alloc: fix race condition in unaccepted memory handlingKirill A. Shutemov
The page allocator tracks the number of zones that have unaccepted memory using static_branch_inc/dec() and uses that static branch in hot paths to determine if it needs to deal with unaccepted memory. Borislav and Thomas pointed out that the tracking is racy: operations on the static branch are not serialized against adding/removing unaccepted pages to/from the zone. Sanity checks inside the static_branch machinery detect it:

WARNING: CPU: 0 PID: 10 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x8e/0xa0

The comment around the WARN() explains the problem:

	/*
	 * Warn about the '-1' case though; since that means a
	 * decrement is concurrent with a first (0->1) increment. IOW
	 * people are trying to disable something that wasn't yet fully
	 * enabled. This suggests an ordering problem on the user side.
	 */

The effect of this static_branch optimization is only visible in microbenchmarks. Instead of adding more complexity around it, remove it altogether. Link: https://lkml.kernel.org/r/20250506133207.1009676-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Link: https://lore.kernel.org/all/20250506092445.GBaBnVXXyvnazly6iF@fat_crate.local Reported-by: Borislav Petkov <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Reported-by: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
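A sketch of the removed pattern and the race it allowed (the key and list names are assumptions based on mm/page_alloc.c):

	/* CPU A: adding the zone's first unaccepted page */
	if (list_empty(&zone->unaccepted_pages))
		static_branch_inc(&zones_with_unaccepted_pages);

	/* CPU B, concurrently: accepting the zone's last page */
	if (list_empty(&zone->unaccepted_pages))
		static_branch_dec(&zones_with_unaccepted_pages);

	/* neither check-then-inc/dec is atomic with the list update, so a
	 * dec can race with a first 0->1 inc, tripping the jump_label WARN */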
2025-05-11mm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memoryKirill A. Shutemov
try_alloc_pages() will not attempt to allocate memory if the system has *any* unaccepted memory. Memory is accepted as needed and can remain in the system indefinitely, causing the interface to always fail. Rather than immediately giving up, attempt to use already accepted memory on free lists. Pass 'alloc_flags' to cond_accept_memory() and do not accept new memory for ALLOC_TRYLOCK requests. Found via code inspection - only BPF uses this at present and the runtime effects are unclear. Link: https://lkml.kernel.org/r/20250506112509.905147-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Cc: Alexei Starovoitov <ast@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
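A sketch of the flag plumbing described above (the helper shape and callee are assumptions, not the exact patch):

	static bool cond_accept_memory(struct zone *zone, unsigned int order,
				       unsigned int alloc_flags)
	{
		/* trylock allocations must not block accepting memory;
		 * serve them from already-accepted free lists only */
		if (alloc_flags & ALLOC_TRYLOCK)
			return false;

		return try_to_accept_memory(zone, order); /* hypothetical callee */
	}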
2025-05-11MAINTAINERS: add mm GUP sectionLorenzo Stoakes
As part of the ongoing efforts to sub-divide memory management maintainership and reviewership, establish a section for GUP (Get User Pages) support and add appropriate maintainers and reviewers. Link: https://lkml.kernel.org/r/20250506173601.97562-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/codetag: move tag retrieval back upfront in __free_pages()David Wang
Commit 51ff4d7486f0 ("mm: avoid extra mem_alloc_profiling_enabled() checks") introduces a possible use-after-free scenario: when the page is non-compound, page[0] could be released by another thread right after put_page_testzero() fails in the current thread, and pgalloc_tag_sub_pages() afterwards would manipulate an invalid page while accounting the remaining pages:

[timeline]   [thread1]                     [thread2]
    |        alloc_page non-compound
    V
    |                                      get_page, ref counter inc
    V
    |        in ___free_pages,
    |        put_page_testzero fails
    V
    |                                      put_page, page released
    V
    |        in ___free_pages,
    |        pgalloc_tag_sub_pages
    |        manipulates an invalid page
    V

Restore __free_pages() to its previous behavior: retrieve the alloc tag beforehand. Link: https://lkml.kernel.org/r/20250505193034.91682-1-00107082@163.com Fixes: 51ff4d7486f0 ("mm: avoid extra mem_alloc_profiling_enabled() checks") Signed-off-by: David Wang <00107082@163.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
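A sketch of the restored ordering (function shape and the free-path callee are approximate; pgalloc_tag_get()/pgalloc_tag_sub_pages() signatures assumed):

	static void ___free_pages(struct page *page, unsigned int order)
	{
		bool head = PageHead(page);
		/* read the tag while our reference still pins page[0] */
		struct alloc_tag *tag = pgalloc_tag_get(page);

		if (put_page_testzero(page))
			free_the_page(page, order);	/* hypothetical free path */
		else if (!head)
			/* page[0] may already be freed by another thread here;
			 * use the tag saved above, never touch page[0] again */
			pgalloc_tag_sub_pages(tag, (1 << order) - 1);
	}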
2025-05-11mm/memory: fix mapcount / refcount sanity check for mTHP reuseKairui Song
The following WARNING was triggered during a swap stress test with mTHP enabled:

[ 6609.335758] ------------[ cut here ]------------
[ 6609.337758] WARNING: CPU: 82 PID: 755116 at mm/memory.c:3794 do_wp_page+0x1084/0x10e0
[ 6609.340922] Modules linked in: zram virtiofs
[ 6609.342699] CPU: 82 UID: 0 PID: 755116 Comm: sh Kdump: loaded Not tainted 6.15.0-rc1+ #1429 PREEMPT(voluntary)
[ 6609.347620] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 6609.349909] RIP: 0010:do_wp_page+0x1084/0x10e0
[ 6609.351532] Code: ff ff 48 c7 c6 80 ba 49 82 4c 89 ef e8 95 fd fe ff 0f 0b bd f5 ff ff ff e9 43 fb ff ff 41 83 a9 bc 12 00 00 01 e9 5c fb ff ff <0f> 0b e9 a6 fc ff ff 65 ff 00 f0 48 0f ba 6d 00 1f 0f 83 82 fc ff
[ 6609.357959] RSP: 0000:ffffc90002273d40 EFLAGS: 00010287
[ 6609.359915] RAX: 000000000000000f RBX: 0000000000000000 RCX: 000fffffffe00000
[ 6609.362606] RDX: 0000000000000010 RSI: 000055a119ac1000 RDI: ffffea000ae6ec00
[ 6609.365143] RBP: ffffea000ae6ec68 R08: 84000002b9bb1025 R09: 000055a119ab6000
[ 6609.367569] R10: ffff8881caa2ad80 R11: 0000000000000000 R12: ffff8881caa2ad80
[ 6609.370255] R13: ffffea000ae6ec00 R14: 000055a119ac1c9c R15: ffffc90002273dd8
[ 6609.373007] FS:  00007f08e467f740(0000) GS:ffff88a07c214000(0000) knlGS:0000000000000000
[ 6609.375999] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6609.377946] CR2: 000055a119ac1c9c CR3: 00000001adfd6005 CR4: 0000000000770eb0
[ 6609.380376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6609.382853] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6609.385216] PKRU: 55555554
[ 6609.386141] Call Trace:
[ 6609.387017]  <TASK>
[ 6609.387718]  ? ___pte_offset_map+0x1b/0x110
[ 6609.389056]  __handle_mm_fault+0xa51/0xf00
[ 6609.390363]  ? exc_page_fault+0x6a/0x140
[ 6609.391629]  handle_mm_fault+0x13d/0x360
[ 6609.392856]  do_user_addr_fault+0x2f2/0x7f0
[ 6609.394160]  ? sigprocmask+0x77/0xa0
[ 6609.395375]  exc_page_fault+0x6a/0x140
[ 6609.396735]  asm_exc_page_fault+0x26/0x30
[ 6609.398224] RIP: 0033:0x55a1050bc18b
[ 6609.399567] Code: 8b 3f 4d 85 ff 74 40 41 39 5f 18 75 f2 49 8b 7f 08 44 38 27 75 e9 4c 89 c6 4c 89 45 c8 e8 bd 83 fa ff 4c 8b 45 c8 85 c0 75 d5 <41> 83 47 1c 01 48 83 c4 28 4c 89 f8 5b 41 5c 41 5d 41 5e 41 5f 5d
[ 6609.405971] RSP: 002b:00007ffcf5f37d90 EFLAGS: 00010246
[ 6609.407737] RAX: 0000000000000000 RBX: 00000000182768fa RCX: 0000000000000000
[ 6609.410151] RDX: 00000000000000fa RSI: 000055a105175c7b RDI: 000055a119ac1c60
[ 6609.412606] RBP: 00007ffcf5f37de0 R08: 000055a105175c7b R09: 0000000000000000
[ 6609.414998] R10: 000000004d2dfb5a R11: 0000000000000246 R12: 0000000000000050
[ 6609.417193] R13: 00000000000000fa R14: 000055a119abaf60 R15: 000055a119ac1c80
[ 6609.419268]  </TASK>
[ 6609.419928] ---[ end trace 0000000000000000 ]---

The WARN_ON here is simply incorrect. The refcount must be at least the mapcount, not the opposite. Each mapcount must have a corresponding refcount, but the refcount may increase if other components grab the folio, which is acceptable. Meanwhile, having a mapcount larger than the refcount is a real problem. So fix the WARN_ON condition.
Link: https://lkml.kernel.org/r/20250425074325.61833-1-ryncsn@gmail.com Fixes: 1da190f4d0a6 ("mm: Copy-on-Write (COW) reuse support for PTE-mapped THP") Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: Kairui Song <kasong@tencent.com> Closes: https://lore.kernel.org/all/CAMgjq7D+ea3eg9gRCVvRnto3Sv3_H3WVhupX4e=k8T5QAfBHbw@mail.gmail.com/ Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
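A sketch of the corrected check (exact code approximate): the invariant is refcount >= mapcount, so the warning should fire only when the mapcount exceeds the refcount:

	/* before: fired on legal extra references held by other components */
	VM_WARN_ON_ONCE(folio_ref_count(folio) > folio_mapcount(folio));

	/* after: warn on the actual bug, a mapping without a reference */
	VM_WARN_ON_ONCE(folio_ref_count(folio) < folio_mapcount(folio));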
2025-05-11kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()David Hildenbrand
Not intuitive, but vm_area_dup() located in kernel/fork.c is not only used for duplicating VMAs during fork(), but also for duplicating VMAs when splitting VMAs or when mremap()'ing them. VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in size) and apparently also shrunk during mremap(), which implies duplicating the VMA in __split_vma() first. In the case of an ordinary mremap() (no change in size), we first duplicate the VMA in copy_vma_and_data()->copy_vma() and then call untrack_pfn_clear() on the old VMA: we effectively move the VM_PAT reservation. So the untrack_pfn_clear() call on the new VMA during duplication is wrong in that context. Splitting of VMAs seems problematic, because we don't duplicate/adjust the reservation when splitting the VMA. Instead, in memtype_erase() -- called during zapping/munmap -- we shrink a reservation in case only the end address matches: assume we split a VMA into A and B; both would share a reservation until B is unmapped. So when unmapping B, the reservation would be updated to cover only A. When unmapping A, we would properly remove the now-shrunk reservation. That scenario describes the mremap() shrinking (old_size > new_size), where we split + unmap B, and the untrack_pfn_clear() call on the new VMA is wrong there as well. What if we manage to split a VM_PFNMAP VMA into A and B and unmap A first? It would be broken, because we would never free the reservation. Likely, there are ways to trigger such a VMA split outside of mremap(). Affecting other VMA duplication was not intended; vm_area_dup() being used outside of kernel/fork.c was an oversight. So let's fix that for now; how to handle VMA splits better should be investigated separately. With a simple reproducer that uses mprotect() to split such a VMA I can trigger:

x86/PAT: pat_mremap:26448 freeing invalid memtype [mem 0x00000000-0x00000fff]

Link: https://lkml.kernel.org/r/20250422144942.2871395-1-david@redhat.com Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Rik van Riel <riel@surriel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
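A sketch of the intended split (placement approximate): vm_area_dup() stays a plain duplicator, and only the fork path clears PAT tracking on the child's copy:

	/* kernel/fork.c, dup_mmap() (sketch) */
	tmp = vm_area_dup(mpnt);
	if (!tmp)
		goto fail_nomem;
	/* the child must not inherit the parent's VM_PAT reservation */
	untrack_pfn_clear(tmp);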
2025-05-11mm: hugetlb: fix incorrect fallback for subpoolWupeng Ma
During our testing with the hugetlb subpool enabled, we observed that hstate->resv_huge_pages may underflow into negative values. Root cause analysis reveals a race condition in the subpool reservation fallback handling, as follows:

hugetlb_reserve_pages()
    /* Attempt subpool reservation */
    gbl_reserve = hugepage_subpool_get_pages(spool, chg);

    /* Global reservation may fail after subpool allocation */
    if (hugetlb_acct_memory(h, gbl_reserve) < 0)
        goto out_put_pages;

out_put_pages:
    /* This incorrectly restores reservation to subpool */
    hugepage_subpool_put_pages(spool, chg);

When hugetlb_acct_memory() fails after the subpool allocation, the current implementation over-commits subpool reservations by returning the full 'chg' value instead of the actual allocated 'gbl_reserve' amount. This discrepancy propagates to global reservations during subsequent releases, eventually causing the resv_huge_pages underflow. This problem can be triggered easily with the following steps:

1. reserve hugepages for hugetlb allocation
2. mount hugetlbfs with min_size to enable the hugetlb subpool
3. allocate hugepages with two tasks (make sure the second one fails due to an insufficient amount of hugepages)
4. wait for a few seconds and repeat step 3, which will make hstate->resv_huge_pages go below zero

To fix this problem, return the correct amount of pages to the subpool during the fallback after hugepage_subpool_get_pages() is called. Link: https://lkml.kernel.org/r/20250410062633.3102457-1-mawupeng1@huawei.com Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools") Signed-off-by: Wupeng Ma <mawupeng1@huawei.com> Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11Merge tag 'its-for-linus-20250509' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 ITS mitigation from Dave Hansen: "Mitigate the Indirect Target Selection (ITS) issue. I'd describe this one as a good old CPU bug where the behavior is _obviously_ wrong, but since it just results in bad predictions it wasn't wrong enough to notice. Well, the researchers noticed, and also realized that this bug undermined a bunch of existing indirect branch mitigations, hence the unusually wide impact of this one. Details:

ITS is a bug in some Intel CPUs that affects indirect branches, including RETs, in the first half of a cacheline. Due to ITS such branches may get wrongly predicted to a target of a (direct or indirect) branch that is located in the second half of a cacheline. Researchers at VUSec found this behavior and reported it to Intel.

Affected processors:
- Cascade Lake, Cooper Lake, Whiskey Lake V, Coffee Lake R, Comet Lake, Ice Lake, Tiger Lake and Rocket Lake.

Scope of impact:
- Guest/host isolation: When eIBRS is used for guest/host isolation, the indirect branches in the VMM may still be predicted with targets corresponding to direct branches in the guest.
- Intra-mode using cBPF: cBPF can be used to poison the branch history to exploit ITS. Realigning the indirect branches and RETs mitigates this attack vector.
- User/kernel: With eIBRS enabled, user/kernel isolation is *not* impacted by ITS.
- Indirect Branch Prediction Barrier (IBPB): Due to this bug, indirect branches may be predicted with targets corresponding to direct branches which were executed prior to IBPB. This will be fixed in the microcode.

Mitigation: As indirect branches in the first half of a cacheline are affected, the mitigation is to replace those indirect branches with a call to a thunk that is aligned to the second half of the cacheline. RETs that take their prediction from the RSB are not affected, but they may be affected by an RSB-underflow condition. So, RETs in the first half of a cacheline are also patched to a return thunk that executes the RET aligned to the second half of the cacheline"

* tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftest/x86/bugs: Add selftests for ITS
  x86/its: FineIBT-paranoid vs ITS
  x86/its: Use dynamic thunks for indirect branches
  x86/ibt: Keep IBT disabled during alternative patching
  mm/execmem: Unify early execmem_cache behaviour
  x86/its: Align RETs in BHB clear sequence to avoid thunking
  x86/its: Add support for RSB stuffing mitigation
  x86/its: Add "vmexit" option to skip mitigation on some CPUs
  x86/its: Enable Indirect Target Selection mitigation
  x86/its: Add support for ITS-safe return thunk
  x86/its: Add support for ITS-safe indirect thunk
  x86/its: Enumerate Indirect Target Selection (ITS) bug
  Documentation: x86/bugs/its: Add ITS documentation
2025-05-11Merge tag 'ibti-hisory-for-linus-2025-05-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 IBTI mitigation from Dave Hansen: "Mitigate Intra-mode Branch History Injection via classic BPF programs. This adds the branch history clearing mitigation to cBPF programs for x86. Intra-mode BHI attacks via cBPF, a.k.a. IBTI-History, were reported by researchers at VUSec. For hardware that doesn't support BHI_DIS_S, the recommended mitigation is to run the short software sequence followed by the IBHF instruction after cBPF execution. On hardware that does support BHI_DIS_S, enable BHI_DIS_S and execute the IBHF after cBPF execution. The Indirect Branch History Fence (IBHF) is a new instruction that prevents indirect branch target predictions after the barrier from using branch history from before the barrier while BHI_DIS_S is enabled. On older systems this will map to a NOP. It is recommended to add this fence at the end of the cBPF program to support VM migration. This instruction is required on newer parts with BHI_NO to fully mitigate against these attacks. The current code disables the mitigation for anything running with the SYS_ADMIN capability bit set. The intention was not to waste time mitigating a process that has access to anything it wants anyway"

* tag 'ibti-hisory-for-linus-2025-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bhi: Do not set BHI_DIS_S in 32-bit mode
  x86/bpf: Add IBHF call at end of classic BPF
  x86/bpf: Call branch history clearing sequence on exit
2025-05-11nfsd: remove legacy dprintks from GETATTR and STATFS codepathsJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: remove legacy READDIR dprintksJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: remove dprintks for v2/3 RENAME eventsJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: remove REMOVE/RMDIR dprintksJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: remove old LINK dprintksJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: remove old v2/3 SYMLINK dprintksJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: remove old v2/3 create path dprintksJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add tracepoint for getattr and statfs eventsJeff Layton
There isn't a common helper for getattrs, so add these into the protocol-specific helpers. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add tracepoint to nfsd_readdirJeff Layton
Observe the start of NFS READDIR operations. The NFS READDIR's count argument can be interesting when tuning a client's readdir behavior. However, the count argument is not passed to nfsd_readdir(). To properly capture the count argument, this tracepoint must appear in each proc function before the nfsd_readdir() call. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add tracepoint to nfsd_renameJeff Layton
Observe the start of RENAME operations for all NFS versions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add tracepoints for unlink eventsJeff Layton
Observe the start of UNLINK, REMOVE, and RMDIR operations for all NFS versions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add tracepoint to nfsd_link()Jeff Layton
Observe the start of NFS LINK operations. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add tracepoint to nfsd_symlinkJeff Layton
Observe the start of SYMLINK operations for all NFS versions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add nfsd_vfs_create tracepointsJeff Layton
Observe the start of file and directory creation for all NFS versions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add a tracepoint to nfsd_lookup_dentryJeff Layton
Replace the dprintk in nfsd_lookup_dentry() with a tracepoint. nfsd_lookup_dentry() is called frequently enough that enabling this dprintk call site would result in log floods and performance issues. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: add a tracepoint for nfsd_setattrJeff Layton
Turn Sargun's internal kprobe based implementation of this into a normal static tracepoint. Also, remove the dprintk's that got added recently with the fix for zero-length ACLs. Cc: Sargun Dillon <sargun@sargun.me> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11NFSD: Add a Call equivalent to the NFSD_TRACE_PROC_RES macrosChuck Lever
Introduce tracing helpers that can be used before the procedure status code is known. These macros are similar to the SVC_RQST_ENDPOINT helpers, but they can be modified to include NFS-specific fields if that is needed later. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11NFSD: Use sockaddr instead of a generic arrayChuck Lever
Record and emit presentation addresses using tracing helpers designed for the task. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11NFSD: Implement FATTR4_CLONE_BLKSIZE attributeChuck Lever
RFC 7862 states that if an NFS server implements a CLONE operation, it MUST also implement FATTR4_CLONE_BLKSIZE. NFSD implements CLONE, but does not implement FATTR4_CLONE_BLKSIZE. Note that in Section 12.2, RFC 7862 claims that FATTR4_CLONE_BLKSIZE is RECOMMENDED, not REQUIRED. Likely this is because a minor version is not permitted to add a REQUIRED attribute. Confusing. We assume this attribute reports a block size as a count of bytes, as RFC 7862 does not specify a unit. Reported-by: Roland Mainz <roland.mainz@nrubsig.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org> Cc: stable@vger.kernel.org # v6.7+ Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: use SHA-256 library API instead of crypto_shash APIEric Biggers
This user of SHA-256 does not support any other algorithm, so the crypto_shash abstraction provides no value. Just use the SHA-256 library API instead, which is much simpler and easier to use. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
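A sketch of the replacement, using the one-shot helper from <crypto/sha2.h> (the wrapper function is hypothetical):

	#include <crypto/sha2.h>

	static void nfsd_hash_blob(const u8 *buf, unsigned int len,
				   u8 digest[SHA256_DIGEST_SIZE])
	{
		/* one call; no crypto_shash tfm allocation or descriptor setup */
		sha256(buf, len, digest);
	}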
2025-05-11svcrdma: Unregister the device if svc_rdma_accept() failsChuck Lever
To handle device removal, svc_rdma_accept() requests removal notification for the underlying device when accepting a connection. However svc_rdma_free() is not invoked if svc_rdma_accept() fails. There needs to be a matching "unregister" in that case; otherwise the device cannot be removed. Fixes: c4de97f7c454 ("svcrdma: Handle device removal outside of the CM event handler") Cc: stable@vger.kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11sunrpc: allow SOMAXCONN backlogged TCP connectionsJeff Layton
The connection backlog passed to listen() denotes the number of connections that are fully established, but that have not yet been accept()ed. If the amount goes above that level, new connection requests will be dropped on the floor until the value goes down. If all the knfsd threads are bogged down in (e.g.) disk I/O, new connection attempts can stall because of this. For the same rationale that Trond points out in the userland patch [1], ensure that svc_xprt sockets created by the kernel allow SOMAXCONN (4096) backlogged connections instead of the 64 that they do today. [1]: https://lore.kernel.org/linux-nfs/20240308180223.2965601-1-trond.myklebust@hammerspace.com/ Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
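A sketch of the change (the exact call site is assumed; the 64 comes from the commit text, and SOMAXCONN is 4096):

	#include <linux/socket.h>	/* SOMAXCONN */
	#include <linux/net.h>		/* kernel_listen() */

	/* where the svc_xprt listener socket is set up: */
	ret = kernel_listen(sock, SOMAXCONN);	/* was: kernel_listen(sock, 64) */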