summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-03-08vfs: Remove invalidate_inodes()Jan Kara
The function can be replaced by evict_inodes. The only difference is that evict_inodes() skips the inodes with positive refcount without touching ->i_lock, but they are equivalent as evict_inodes() repeats the refcount check after having grabbed ->i_lock. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250307144318.28120-2-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08Merge patch series "initramfs: kunit tests and cleanups"Christian Brauner
David Disseldorp <ddiss@suse.de> says: This patchset adds basic kunit test coverage for initramfs unpacking and cleans up some minor buffer handling issues / inefficiencies. * patches from https://lore.kernel.org/r/20250304061020.9815-1-ddiss@suse.de: initramfs: avoid static buffer for error message initramfs: fix hardlink hash leak without TRAILER initramfs: reuse name_len for dir mtime tracking initramfs: allocate heap buffers together initramfs: avoid memcpy for hex header fields vsprintf: add simple_strntoul initramfs_test: kunit tests for initramfs unpacking init: add initramfs_internal.h Link: https://lore.kernel.org/r/20250304061020.9815-1-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08initramfs: avoid static buffer for error messageDavid Disseldorp
The dynamic error message printed if CONFIG_RD_$ALG compression support is missing needn't be propagated up to the caller via a static buffer. Print it immediately via pr_err() and set @message to a const string to flag error. Before: text data bss dec hex filename 8006 1118 8 9132 23ac init/initramfs.o After: text data bss dec hex filename 7938 1022 8 8968 2308 init/initramfs.o Signed-off-by: David Disseldorp <ddiss@suse.de> Link: https://lore.kernel.org/r/20250304061020.9815-9-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08initramfs: fix hardlink hash leak without TRAILERDavid Disseldorp
Covered in Documentation/driver-api/early-userspace/buffer-format.rst , initramfs archives can carry an optional "TRAILER!!!" entry which serves as a boundary for collecting and associating hardlinks with matching inode and major / minor device numbers. Although optional, if hardlinks are found in an archive without a subsequent "TRAILER!!!" entry then the hardlink state hash table is leaked, e.g. unfixed kernel, with initramfs_test.c hunk applied only: unreferenced object 0xffff9405408cc000 (size 8192): comm "kunit_try_catch", pid 53, jiffies 4294892519 hex dump (first 32 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 ff 81 00 00 ................ 00 00 00 00 00 00 00 00 69 6e 69 74 72 61 6d 66 ........initramf backtrace (crc a9fb0ee0): [<0000000066739faa>] __kmalloc_cache_noprof+0x11d/0x250 [<00000000fc755219>] maybe_link.part.5+0xbc/0x120 [<000000000526a128>] do_name+0xce/0x2f0 [<00000000145c1048>] write_buffer+0x22/0x40 [<000000003f0b4f32>] unpack_to_rootfs+0xf9/0x2a0 [<00000000d6f7e5af>] initramfs_test_hardlink+0xe3/0x3f0 [<0000000014fde8d6>] kunit_try_run_case+0x5f/0x130 [<00000000dc9dafc5>] kunit_generic_run_threadfn_adapter+0x18/0x30 [<000000001076c239>] kthread+0xc8/0x100 [<00000000d939f1c1>] ret_from_fork+0x2b/0x40 [<00000000f848ad1a>] ret_from_fork_asm+0x1a/0x30 Fix this by calling free_hash() after initramfs buffer processing in unpack_to_rootfs(). An extra hardlink_seen global is added as an optimization to avoid walking the 32 entry hash array unnecessarily. The expectation is that a "TRAILER!!!" entry will normally be present, and initramfs hardlinks are uncommon. There is one user facing side-effect of this fix: hardlinks can currently be associated across built-in and external initramfs archives, *if* the built-in initramfs archive lacks a "TRAILER!!!" terminator. I'd consider this cross-archive association broken, but perhaps it's used. Signed-off-by: David Disseldorp <ddiss@suse.de> Link: https://lore.kernel.org/r/20250304061020.9815-8-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08initramfs: reuse name_len for dir mtime trackingDavid Disseldorp
We already have a nulterm-inclusive, checked name_len for the directory name, so use that instead of calling strlen(). Signed-off-by: David Disseldorp <ddiss@suse.de> Link: https://lore.kernel.org/r/20250304061020.9815-7-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08initramfs: allocate heap buffers togetherDavid Disseldorp
header_buf, symlink_buf and name_buf all share the same lifecycle so needn't be allocated / freed separately. This change leads to a minor reduction in .text size: before: text data bss dec hex filename 7914 1110 8 9032 2348 init/initramfs.o after: text data bss dec hex filename 7854 1110 8 8972 230c init/initramfs.o A previous iteration of this patch reused a single buffer instead of three, given that buffer use is state-sequential (GotHeader, GotName, GotSymlink). However, the slight decrease in heap use during early boot isn't really worth the extra review complexity. Link: https://lore.kernel.org/all/20241107002044.16477-7-ddiss@suse.de/ Signed-off-by: David Disseldorp <ddiss@suse.de> Link: https://lore.kernel.org/r/20250304061020.9815-6-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08initramfs: avoid memcpy for hex header fieldsDavid Disseldorp
newc/crc cpio headers contain a bunch of 8-character hexadecimal fields which we convert via simple_strtoul(), following memcpy() into a zero-terminated stack buffer. The new simple_strntoul() helper allows us to pass in max_chars=8 to avoid zero-termination and memcpy(). Signed-off-by: David Disseldorp <ddiss@suse.de> Link: https://lore.kernel.org/r/20250304061020.9815-5-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08vsprintf: add simple_strntoulDavid Disseldorp
cpio extraction currently does a memcpy to ensure that the archive hex fields are null terminated for simple_strtoul(). simple_strntoul() will allow us to avoid the memcpy. Signed-off-by: David Disseldorp <ddiss@suse.de> Link: https://lore.kernel.org/r/20250304061020.9815-4-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08initramfs_test: kunit tests for initramfs unpackingDavid Disseldorp
Provide some basic initramfs unpack sanity tests covering: - simple file / dir extraction - filename field overrun, as reported and fixed separately via https://lore.kernel.org/r/20241030035509.20194-2-ddiss@suse.de - "070702" cpio data checksums - hardlinks Signed-off-by: David Disseldorp <ddiss@suse.de> Link: https://lore.kernel.org/r/20250304061020.9815-3-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-08LoongArch: KVM: Fix GPA size issue about VMBibo Mao
Physical address space is 48 bit on Loongson-3A5000 physical machine, however it is 47 bit for VM on Loongson-3A5000 system. Size of physical address space of VM is the same with the size of virtual user space (a half) of physical machine. Variable cpu_vabits represents user address space, kernel address space is not included (user space and kernel space are both a half of total). Here cpu_vabits, rather than cpu_vabits - 1, is to represent the size of guest physical address space. Also there is strict checking about page fault GPA address, inject error if it is larger than maximum GPA address of VM. Cc: stable@vger.kernel.org Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-08LoongArch: KVM: Reload guest CSR registers after sleepBibo Mao
On host, the HW guest CSR registers are lost after suspend and resume operation. Since last_vcpu of boot CPU still records latest vCPU pointer so that the guest CSR register skips to reload when boot CPU resumes and vCPU is scheduled. Here last_vcpu is cleared so that guest CSR registers will reload from scheduled vCPU context after suspend and resume. Cc: stable@vger.kernel.org Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-08LoongArch: KVM: Add interrupt checking for AVECBibo Mao
There is a newly added macro INT_AVEC with CSR ESTAT register, which is bit 14 used for LoongArch AVEC support. AVEC interrupt status bit 14 is supported with macro CSR_ESTAT_IS, so here replace the hard-coded value 0x1fff with macro CSR_ESTAT_IS so that the AVEC interrupt status is also supported by KVM. Cc: stable@vger.kernel.org Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-08LoongArch: Set hugetlb mmap base address aligned with pmd sizeBibo Mao
With ltp test case "testcases/bin/hugefork02", there is a dmesg error report message such as: kernel BUG at mm/hugetlb.c:5550! Oops - BUG[#1]: CPU: 0 UID: 0 PID: 1517 Comm: hugefork02 Not tainted 6.14.0-rc2+ #241 Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 2/2/2022 pc 90000000004eaf1c ra 9000000000485538 tp 900000010edbc000 sp 900000010edbf940 a0 900000010edbfb00 a1 9000000108d20280 a2 00007fffe9474000 a3 00007ffff3474000 a4 0000000000000000 a5 0000000000000003 a6 00000000003cadd3 a7 0000000000000000 t0 0000000001ffffff t1 0000000001474000 t2 900000010ecd7900 t3 00007fffe9474000 t4 00007fffe9474000 t5 0000000000000040 t6 900000010edbfb00 t7 0000000000000001 t8 0000000000000005 u0 90000000004849d0 s9 900000010edbfa00 s0 9000000108d20280 s1 00007fffe9474000 s2 0000000002000000 s3 9000000108d20280 s4 9000000002b38b10 s5 900000010edbfb00 s6 00007ffff3474000 s7 0000000000000406 s8 900000010edbfa08 ra: 9000000000485538 unmap_vmas+0x130/0x218 ERA: 90000000004eaf1c __unmap_hugepage_range+0x6f4/0x7d0 PRMD: 00000004 (PPLV0 +PIE -PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 000c0000 [BRK] (IS= ECode=12 EsubCode=0) PRID: 0014c010 (Loongson-64bit, Loongson-3A5000) Process hugefork02 (pid: 1517, threadinfo=00000000a670eaf4, task=000000007a95fc64) Call Trace: [<90000000004eaf1c>] __unmap_hugepage_range+0x6f4/0x7d0 [<9000000000485534>] unmap_vmas+0x12c/0x218 [<9000000000494068>] exit_mmap+0xe0/0x308 [<900000000025fdc4>] mmput+0x74/0x180 [<900000000026a284>] do_exit+0x294/0x898 [<900000000026aa30>] do_group_exit+0x30/0x98 [<900000000027bed4>] get_signal+0x83c/0x868 [<90000000002457b4>] arch_do_signal_or_restart+0x54/0xfa0 [<90000000015795e8>] irqentry_exit_to_user_mode+0xb8/0x138 [<90000000002572d0>] tlb_do_page_fault_1+0x114/0x1b4 The problem is that base address allocated from hugetlbfs is not aligned with pmd size. Here add a checking for hugetlbfs and align base address with pmd size. After this patch the test case "testcases/bin/hugefork02" passes to run. This is similar to the commit 7f24cbc9c4d42db8a3c8484d1 ("mm/mmap: teach generic_get_unmapped_area{_topdown} to handle hugetlb mappings"). Cc: stable@vger.kernel.org # 6.13+ Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-08LoongArch: Set max_pfn with the PFN of the last pageBibo Mao
The current max_pfn equals to zero. In this case, it causes user cannot get some page information through /proc filesystem such as kpagecount. The following message is displayed by stress-ng test suite with command "stress-ng --verbose --physpage 1 -t 1". # stress-ng --verbose --physpage 1 -t 1 stress-ng: error: [1691] physpage: cannot read page count for address 0x134ac000 in /proc/kpagecount, errno=22 (Invalid argument) stress-ng: error: [1691] physpage: cannot read page count for address 0x7ffff207c3a8 in /proc/kpagecount, errno=22 (Invalid argument) stress-ng: error: [1691] physpage: cannot read page count for address 0x134b0000 in /proc/kpagecount, errno=22 (Invalid argument) ... After applying this patch, the kernel can pass the test. # stress-ng --verbose --physpage 1 -t 1 stress-ng: debug: [1701] physpage: [1701] started (instance 0 on CPU 3) stress-ng: debug: [1701] physpage: [1701] exited (instance 0 on CPU 3) stress-ng: debug: [1700] physpage: [1701] terminated (success) Cc: stable@vger.kernel.org # 6.8+ Fixes: ff6c3d81f2e8 ("NUMA: optimize detection of memory with no node id assigned by firmware") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-08LoongArch: Use polling play_dead() when resuming from hibernationHuacai Chen
When CONFIG_RANDOM_KMALLOC_CACHES or other randomization infrastructrue enabled, the idle_task's stack may different between the booting kernel and target kernel. So when resuming from hibernation, an ACTION_BOOT_CPU IPI wakeup the idle instruction in arch_cpu_idle_dead() and jump to the interrupt handler. But since the stack pointer is changed, the interrupt handler cannot restore correct context. So rename the current arch_cpu_idle_dead() to idle_play_dead(), make it as the default version of play_dead(), and the new arch_cpu_idle_dead() call play_dead() directly. For hibernation, implement an arch-specific hibernate_resume_nonboot_cpu_disable() to use the polling version (idle instruction is replace by nop, and irq is disabled) of play_dead(), i.e. poll_play_dead(), to avoid IPI handler corrupting the idle_task's stack when resuming from hibernation. This solution is a little similar to commit 406f992e4a372dafbe3c ("x86 / hibernate: Use hlt_play_dead() when resuming from hibernation"). Cc: stable@vger.kernel.org Tested-by: Erpeng Xu <xuerpeng@uniontech.com> Tested-by: Yuli Wang <wangyuli@uniontech.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-08LoongArch: Eliminate superfluous get_numa_distances_cnt()Yuli Wang
In LoongArch, get_numa_distances_cnt() isn't in use, resulting in a compiler warning. Fix follow errors with clang-18 when W=1e: arch/loongarch/kernel/acpi.c:259:28: error: unused function 'get_numa_distances_cnt' [-Werror,-Wunused-function] 259 | static inline unsigned int get_numa_distances_cnt(struct acpi_table_slit *slit) | ^~~~~~~~~~~~~~~~~~~~~~ 1 error generated. Link: https://lore.kernel.org/all/Z7bHPVUH4lAezk0E@kernel.org/ Signed-off-by: Yuli Wang <wangyuli@uniontech.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-08LoongArch: Convert unreachable() to BUG()Tiezhu Yang
When compiling on LoongArch, there exists the following objtool warning in arch/loongarch/kernel/machine_kexec.o: kexec_reboot() falls through to next function crash_shutdown_secondary() Avoid using unreachable() as it can (and will in the absence of UBSAN) generate fall-through code. Use BUG() so we get a "break BRK_BUG" trap (with unreachable annotation). Cc: stable@vger.kernel.org # 6.12+ Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2025-03-07binfmt_elf_fdpic: fix variable set but not used warningsunliming
Fix below kernel warning: fs/binfmt_elf_fdpic.c:1024:52: warning: variable 'excess1' set but not used [-Wunused-but-set-variable] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: sunliming <sunliming@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250308022754.75013-1-sunliming@linux.dev Signed-off-by: Kees Cook <kees@kernel.org>
2025-03-07ubsan/overflow: Enable ignorelist parsing and add type filterKees Cook
Limit integer wrap-around mitigation to only the "size_t" type (for now). Notably this covers all special functions/builtins that return "size_t", like sizeof(). This remains an experimental feature and is likely to be replaced with type annotations. Reviewed-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20250307041914.937329-3-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2025-03-07ubsan/overflow: Enable pattern exclusionsKees Cook
To make integer wrap-around mitigation actually useful, the associated sanitizers must not instrument cases where the wrap-around is explicitly defined (e.g. "-2UL"), being tested for (e.g. "if (a + b < a)"), or where it has no impact on code flow (e.g. "while (var--)"). Enable pattern exclusions for the integer wrap sanitizers. Reviewed-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20250307041914.937329-2-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2025-03-07ubsan/overflow: Rework integer overflow sanitizer option to turn on everythingKees Cook
Since we're going to approach integer overflow mitigation a type at a time, we need to enable all of the associated sanitizers, and then opt into types one at a time. Rename the existing "signed wrap" sanitizer to just the entire topic area: "integer wrap". Enable the implicit integer truncation sanitizers, with required callbacks and tests. Notably, this requires features (currently) only available in Clang, so we can depend on the cc-option tests to determine availability instead of doing version tests. Link: https://lore.kernel.org/r/20250307041914.937329-1-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2025-03-07samples/check-exec: Fix script nameMickaël Salaün
run-script-ask.sh had an incorrect file extension. This helper script is not used by kselftests. Fixes: 2a69962be4a7 ("samples/check-exec: Add an enlighten "inc" interpreter and 28 tests") Signed-off-by: Mickaël Salaün <mic@digikod.net> Link: https://lore.kernel.org/r/20250306180559.1289243-1-mic@digikod.net Signed-off-by: Kees Cook <kees@kernel.org>
2025-03-07yama: don't abuse rcu_read_lock/get_task_struct in yama_task_prctl()Oleg Nesterov
current->group_leader is stable, no need to take rcu_read_lock() and do get/put_task_struct(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250219161417.GA20851@redhat.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-03-07netpoll: hold rcu read lock in __netpoll_send_skb()Breno Leitao
The function __netpoll_send_skb() is being invoked without holding the RCU read lock. This oversight triggers a warning message when CONFIG_PROVE_RCU_LIST is enabled: net/core/netpoll.c:330 suspicious rcu_dereference_check() usage! netpoll_send_skb netpoll_send_udp write_ext_msg console_flush_all console_unlock vprintk_emit To prevent npinfo from disappearing unexpectedly, ensure that __netpoll_send_skb() is protected with the RCU read lock. Fixes: 2899656b494dcd1 ("netpoll: take rcu_read_lock_bh() in netpoll_send_skb_on_dev()") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250306-netpoll_rcu_v2-v2-1-bc4f5c51742a@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07Merge branch 'net-phy-nxp-c45-tja11xx-add-errata-for-tja112xa-b'Jakub Kicinski
Andrei Botila says: ==================== net: phy: nxp-c45-tja11xx: add errata for TJA112XA/B This patch series implements two errata for TJA1120 and TJA1121. The first errata applicable to both RGMII and SGMII version of TJA1120 and TJA1121 deals with achieving full silicon performance. The workaround in this case is putting the PHY in managed mode and applying a series of PHY writes before the link gest established. The second errata applicable only to SGMII version of TJA1120 and TJA1121 deals with achieving a stable operation of SGMII after a startup event. The workaround puts the SGMII PCS into power down mode and back up after restart or wakeup from sleep. ==================== Link: https://patch.msgid.link/20250304160619.181046-1-andrei.botila@oss.nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07net: phy: nxp-c45-tja11xx: add TJA112XB SGMII PCS restart errataAndrei Botila
TJA1120B/TJA1121B can achieve a stable operation of SGMII after a startup event by putting the SGMII PCS into power down mode and restart afterwards. It is necessary to put the SGMII PCS into power down mode and back up. Cc: stable@vger.kernel.org Fixes: f1fe5dff2b8a ("net: phy: nxp-c45-tja11xx: add TJA1120 support") Signed-off-by: Andrei Botila <andrei.botila@oss.nxp.com> Link: https://patch.msgid.link/20250304160619.181046-3-andrei.botila@oss.nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07net: phy: nxp-c45-tja11xx: add TJA112X PHY configuration errataAndrei Botila
The most recent sillicon versions of TJA1120 and TJA1121 can achieve full silicon performance by putting the PHY in managed mode. It is necessary to apply these PHY writes before link gets established. Application of this fix is required after restart of device and wakeup from sleep. Cc: stable@vger.kernel.org Fixes: f1fe5dff2b8a ("net: phy: nxp-c45-tja11xx: add TJA1120 support") Signed-off-by: Andrei Botila <andrei.botila@oss.nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250304160619.181046-2-andrei.botila@oss.nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07net: mctp i2c: Copy headers if clonedMatt Johnston
Use skb_cow_head() prior to modifying the TX SKB. This is necessary when the SKB has been cloned, to avoid modifying other shared clones. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Fixes: f5b8abf9fc3d ("mctp i2c: MCTP I2C binding driver") Link: https://patch.msgid.link/20250306-matt-mctp-i2c-cow-v1-1-293827212681@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07net: mctp i3c: Copy headers if clonedMatt Johnston
Use skb_cow_head() prior to modifying the tx skb. This is necessary when the skb has been cloned, to avoid modifying other shared clones. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Fixes: c8755b29b58e ("mctp i3c: MCTP I3C driver") Link: https://patch.msgid.link/20250306-matt-i3c-cow-head-v1-1-d5e6a5495227@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07net: dsa: mv88e6xxx: Verify after ATU Load opsJoseph Huang
ATU Load operations could fail silently if there's not enough space on the device to hold the new entry. When this happens, the symptom depends on the unknown flood settings. If unknown multicast flood is disabled, the multicast packets are dropped when the ATU table is full. If unknown multicast flood is enabled, the multicast packets will be flooded to all ports. Either way, IGMP snooping is broken when the ATU Load operation fails silently. Do a Read-After-Write verification after each fdb/mdb add operation to make sure that the operation was really successful, and return -ENOSPC otherwise. Fixes: defb05b9b9b4 ("net: dsa: mv88e6xxx: Add support for fdb_add, fdb_del, and fdb_getnext") Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250306172306.3859214-1-Joseph.Huang@garmin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07net/mlx5: Fill out devlink dev info only for PFsJiri Pirko
Firmware version query is supported on the PFs. Due to this following kernel warning log is observed: [ 188.590344] mlx5_core 0000:08:00.2: mlx5_fw_version_query:816:(pid 1453): fw query isn't supported by the FW Fix it by restricting the query and devlink info to the PF. Fixes: 8338d9378895 ("net/mlx5: Added devlink info callback") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/20250306212529.429329-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07netmem: prevent TX of unreadable skbsMina Almasry
Currently on stable trees we have support for netmem/devmem RX but not TX. It is not safe to forward/redirect an RX unreadable netmem packet into the device's TX path, as the device may call dma-mapping APIs on dma addrs that should not be passed to it. Fix this by preventing the xmit of unreadable skbs. Tested by configuring tc redirect: sudo tc qdisc add dev eth1 ingress sudo tc filter add dev eth1 ingress protocol ip prio 1 flower ip_proto \ tcp src_ip 192.168.1.12 action mirred egress redirect dev eth1 Before, I see unreadable skbs in the driver's TX path passed to dma mapping APIs. After, I don't see unreadable skbs in the driver's TX path passed to dma mapping APIs. Fixes: 65249feb6b3d ("net: add support for skbs with unreadable frags") Suggested-by: Jakub Kicinski <kuba@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250306215520.1415465-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07Merge tag 'for-net-2025-03-07' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btusb: Configure altsetting for HCI_USER_CHANNEL - hci_event: Fix enabling passive scanning - revert: "hci_core: Fix sleeping function called from invalid context" - SCO: fix sco_conn refcounting on sco_conn_ready * tag 'for-net-2025-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Revert "Bluetooth: hci_core: Fix sleeping function called from invalid context" Bluetooth: hci_event: Fix enabling passive scanning Bluetooth: SCO: fix sco_conn refcounting on sco_conn_ready Bluetooth: btusb: Configure altsetting for HCI_USER_CHANNEL ==================== Link: https://patch.msgid.link/20250307181854.99433-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07Merge tag 's390-6.14-6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix return address recovery of traced function in ftrace to ensure reliable stack unwinding - Fix compiler warnings and runtime crashes of vDSO selftests on s390 by introducing a dedicated GNU hash bucket pointer with correct 32-bit entry size - Fix test_monitor_call() inline asm, which misses CC clobber, by switching to an instruction that doesn't modify CC * tag 's390-6.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/ftrace: Fix return address recovery of traced function selftests/vDSO: Fix GNU hash table entry size for s390x s390/traps: Fix test_monitor_call() inline assembly
2025-03-08rust: lockdep: Use Pin for all LockClassKey usagesMitchell Levy
Reintroduce dynamically-allocated LockClassKeys such that they are automatically (de)registered. Require that all usages of LockClassKeys ensure that they are Pin'd. Currently, only `'static` LockClassKeys are supported, so Pin is redundant. However, it is intended that dynamically-allocated LockClassKeys will eventually be supported, so using Pin from the outset will make that change simpler. Closes: https://github.com/Rust-for-Linux/linux/issues/1102 Suggested-by: Benno Lossin <benno.lossin@proton.me> Suggested-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Mitchell Levy <levymitchell0@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Link: https://lore.kernel.org/r/20250307232717.1759087-12-boqun.feng@gmail.com
2025-03-08rust: sync: condvar: Add wait_interruptible_freezable()Alice Ryhl
To support waiting for a `CondVar` as a freezable process, add a wait_interruptible_freezable() function. Binder needs this function in the appropriate places to freeze a process where some of its threads are blocked on the Binder driver. [ Boqun: Cleaned up the changelog and documentation. ] Signed-off-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250307232717.1759087-10-boqun.feng@gmail.com
2025-03-08rust: sync: lock: Add an example for Guard:: Lock_ref()Boqun Feng
To provide examples on usage of `Guard::lock_ref()` along with the unit test, an "assert a lock is held by a guard" example is added. (Also apply feedback from Benno.) Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20250223072114.3715-1-boqun.feng@gmail.com Link: https://lore.kernel.org/r/20250307232717.1759087-9-boqun.feng@gmail.com
2025-03-08rust: sync: Add accessor for the lock behind a given guardAlice Ryhl
In order to assert a particular `Guard` is associated with a particular `Lock`, add an accessor to obtain a reference to the underlying `Lock` of a `Guard`. Binder needs this assertion to ensure unsafe list operations are done with the correct lock held. [Boqun: Capitalize the title and reword the commit log] Signed-off-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Fiona Behrens <me@kloenk.dev> Link: https://lore.kernel.org/r/20250205-guard-get-lock-v2-1-ba32a8c1d5b7@google.com Link: https://lore.kernel.org/r/20250307232717.1759087-8-boqun.feng@gmail.com
2025-03-08locking/lockdep: Add kasan_check_byte() check in lock_acquire()Waiman Long
KASAN instrumentation of lockdep has been disabled, as we don't need KASAN to check the validity of lockdep internal data structures and incur unnecessary performance overhead. However, the lockdep_map pointer passed in externally may not be valid (e.g. use-after-free) and we run the risk of using garbage data resulting in false lockdep reports. Add kasan_check_byte() call in lock_acquire() for non kernel core data object to catch invalid lockdep_map and print out a KASAN report before any lockdep splat, if any. Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20250214195242.2480920-1-longman@redhat.com Link: https://lore.kernel.org/r/20250307232717.1759087-7-boqun.feng@gmail.com
2025-03-08locking/lockdep: Disable KASAN instrumentation of lockdep.cWaiman Long
Both KASAN and LOCKDEP are commonly enabled in building a debug kernel. Each of them can significantly slow down the speed of a debug kernel. Enabling KASAN instrumentation of the LOCKDEP code will further slow things down. Since LOCKDEP is a high overhead debugging tool, it will never get enabled in a production kernel. The LOCKDEP code is also pretty mature and is unlikely to get major changes. There is also a possibility of recursion similar to KCSAN. To evaluate the performance impact of disabling KASAN instrumentation of lockdep.c, the time to do a parallel build of the Linux defconfig kernel was used as the benchmark. Two x86-64 systems (Skylake & Zen 2) and an arm64 system were used as test beds. Two sets of non-RT and RT kernels with similar configurations except mainly CONFIG_PREEMPT_RT were used for evaluation. For the Skylake system: Kernel Run time Sys time ------ -------- -------- Non-debug kernel (baseline) 0m47.642s 4m19.811s [CONFIG_KASAN_INLINE=y] Debug kernel 2m11.108s (x2.8) 38m20.467s (x8.9) Debug kernel (patched) 1m49.602s (x2.3) 31m28.501s (x7.3) Debug kernel (patched + mitigations=off) 1m30.988s (x1.9) 26m41.993s (x6.2) RT kernel (baseline) 0m54.871s 7m15.340s [CONFIG_KASAN_INLINE=n] RT debug kernel 6m07.151s (x6.7) 135m47.428s (x18.7) RT debug kernel (patched) 3m42.434s (x4.1) 74m51.636s (x10.3) RT debug kernel (patched + mitigations=off) 2m40.383s (x2.9) 57m54.369s (x8.0) [CONFIG_KASAN_INLINE=y] RT debug kernel 3m22.155s (x3.7) 77m53.018s (x10.7) RT debug kernel (patched) 2m36.700s (x2.9) 54m31.195s (x7.5) RT debug kernel (patched + mitigations=off) 2m06.110s (x2.3) 45m49.493s (x6.3) For the Zen 2 system: Kernel Run time Sys time ------ -------- -------- Non-debug kernel (baseline) 1m42.806s 39m48.714s [CONFIG_KASAN_INLINE=y] Debug kernel 4m04.524s (x2.4) 125m35.904s (x3.2) Debug kernel (patched) 3m56.241s (x2.3) 127m22.378s (x3.2) Debug kernel (patched + mitigations=off) 2m38.157s (x1.5) 92m35.680s (x2.3) RT kernel (baseline) 1m51.500s 14m56.322s [CONFIG_KASAN_INLINE=n] RT debug kernel 16m04.962s (x8.7) 244m36.463s (x16.4) RT debug kernel (patched) 9m09.073s (x4.9) 129m28.439s (x8.7) RT debug kernel (patched + mitigations=off) 3m31.662s (x1.9) 51m01.391s (x3.4) For the arm64 system: Kernel Run time Sys time ------ -------- -------- Non-debug kernel (baseline) 1m56.844s 8m47.150s Debug kernel 3m54.774s (x2.0) 92m30.098s (x10.5) Debug kernel (patched) 3m32.429s (x1.8) 77m40.779s (x8.8) RT kernel (baseline) 4m01.641s 18m16.777s [CONFIG_KASAN_INLINE=n] RT debug kernel 19m32.977s (x4.9) 304m23.965s (x16.7) RT debug kernel (patched) 16m28.354s (x4.1) 234m18.149s (x12.8) Turning the mitigations off doesn't seems to have any noticeable impact on the performance of the arm64 system. So the mitigation=off entries aren't included. For the x86 CPUs, CPU mitigations has a much bigger impact on performance, especially the RT debug kernel with CONFIG_KASAN_INLINE=n. The SRSO mitigation in Zen 2 has an especially big impact on the debug kernel. It is also the majority of the slowdown with mitigations on. It is because the patched RET instruction slows down function returns. A lot of helper functions that are normally compiled out or inlined may become real function calls in the debug kernel. With !CONFIG_KASAN_INLINE, the KASAN instrumentation inserts a lot of __asan_loadX*() and __kasan_check_read() function calls to memory access portion of the code. The lockdep's __lock_acquire() function, for instance, has 66 __asan_loadX*() and 6 __kasan_check_read() calls added with KASAN instrumentation. Of course, the actual numbers may vary depending on the compiler used and the exact version of the lockdep code. With the Skylake test system, the parallel kernel build times reduction of the RT debug kernel with this patch are: CONFIG_KASAN_INLINE=n: -37% CONFIG_KASAN_INLINE=y: -22% The time reduction is less with CONFIG_KASAN_INLINE=y, but it is still significant. Setting CONFIG_KASAN_INLINE=y can result in a significant performance improvement. The major drawback is a significant increase in the size of kernel text. In the case of vmlinux, its text size increases from 45997948 to 67606807. That is a 47% size increase (about 21 Mbytes). The size increase of other kernel modules should be similar. With the newly added rtmutex and lockdep lock events, the relevant event counts for the test runs with the Skylake system were: Event type Debug kernel RT debug kernel ---------- ------------ --------------- lockdep_acquire 1,968,663,277 5,425,313,953 rtlock_slowlock - 401,701,156 rtmutex_slowlock - 139,672 The __lock_acquire() calls in the RT debug kernel are x2.8 times of the non-RT debug kernel with the same workload. Since the __lock_acquire() function is a big hitter in term of performance slowdown, this makes the RT debug kernel much slower than the non-RT one. The average lock nesting depth is likely to be higher in the RT debug kernel too leading to longer execution time in the __lock_acquire() function. As the small advantage of enabling KASAN instrumentation to catch potential memory access error in the lockdep debugging tool is probably not worth the drawback of further slowing down a debug kernel, disable KASAN instrumentation in the lockdep code to allow the debug kernels to regain some performance back, especially for the RT debug kernels. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250307232717.1759087-6-boqun.feng@gmail.com
2025-03-08locking/lock_events: Add locking events for lockdepWaiman Long
Add some lock events to lockdep to profile its behavior. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250307232717.1759087-5-boqun.feng@gmail.com
2025-03-08locking/lock_events: Add locking events for rtmutex slow pathsWaiman Long
Add locking events for rtlock_slowlock() and rt_mutex_slowlock() for profiling the slow path behavior of rt_spin_lock() and rt_mutex_lock(). Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250307232717.1759087-4-boqun.feng@gmail.com
2025-03-08Merge branch 'locking/urgent' into locking/core, to pick up locking fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-08locking/semaphore: Use wake_q to wake up processes outside lock critical sectionWaiman Long
A circular lock dependency splat has been seen involving down_trylock(): ====================================================== WARNING: possible circular locking dependency detected 6.12.0-41.el10.s390x+debug ------------------------------------------------------ dd/32479 is trying to acquire lock: 0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90 but task is already holding lock: 000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0 the existing dependency chain (in reverse order) is: -> #4 (&zone->lock){-.-.}-{2:2}: -> #3 (hrtimer_bases.lock){-.-.}-{2:2}: -> #2 (&rq->__lock){-.-.}-{2:2}: -> #1 (&p->pi_lock){-.-.}-{2:2}: -> #0 ((console_sem).lock){-.-.}-{2:2}: The console_sem -> pi_lock dependency is due to calling try_to_wake_up() while holding the console_sem raw_spinlock. This dependency can be broken by using wake_q to do the wakeup instead of calling try_to_wake_up() under the console_sem lock. This will also make the semaphore's raw_spinlock become a terminal lock without taking any further locks underneath it. The hrtimer_bases.lock is a raw_spinlock while zone->lock is a spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via the debug_objects_fill_pool() helper function in the debugobjects code. -> #4 (&zone->lock){-.-.}-{2:2}: __lock_acquire+0xe86/0x1cc0 lock_acquire.part.0+0x258/0x630 lock_acquire+0xb8/0xe0 _raw_spin_lock_irqsave+0xb4/0x120 rmqueue_bulk+0xac/0x8f0 __rmqueue_pcplist+0x580/0x830 rmqueue_pcplist+0xfc/0x470 rmqueue.isra.0+0xdec/0x11b0 get_page_from_freelist+0x2ee/0xeb0 __alloc_pages_noprof+0x2c2/0x520 alloc_pages_mpol_noprof+0x1fc/0x4d0 alloc_pages_noprof+0x8c/0xe0 allocate_slab+0x320/0x460 ___slab_alloc+0xa58/0x12b0 __slab_alloc.isra.0+0x42/0x60 kmem_cache_alloc_noprof+0x304/0x350 fill_pool+0xf6/0x450 debug_object_activate+0xfe/0x360 enqueue_hrtimer+0x34/0x190 __run_hrtimer+0x3c8/0x4c0 __hrtimer_run_queues+0x1b2/0x260 hrtimer_interrupt+0x316/0x760 do_IRQ+0x9a/0xe0 do_irq_async+0xf6/0x160 Normally a raw_spinlock to spinlock dependency is not legitimate and will be warned if CONFIG_PROVE_RAW_LOCK_NESTING is enabled, but debug_objects_fill_pool() is an exception as it explicitly allows this dependency for non-PREEMPT_RT kernel without causing PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is legitimate and not a bug. Anyway, semaphore is the only locking primitive left that is still using try_to_wake_up() to do wakeup inside critical section, all the other locking primitives had been migrated to use wake_q to do wakeup outside of the critical section. It is also possible that there are other circular locking dependencies involving printk/console_sem or other existing/new semaphores lurking somewhere which may show up in the future. Let just do the migration now to wake_q to avoid headache like this. Reported-by: yzbot+ed801a886dfdbfe7136d@syzkaller.appspotmail.com Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com
2025-03-08locking/rtmutex: Use the 'struct' keyword in kernel-doc commentRandy Dunlap
Add the "struct" keyword to prevent a kernel-doc warning: rtmutex_common.h:67: warning: cannot understand function prototype: 'struct rt_wake_q_head ' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20250307232717.1759087-2-boqun.feng@gmail.com
2025-03-08rust: lockdep: Remove support for dynamically allocated LockClassKeysMitchell Levy
Currently, dynamically allocated LockCLassKeys can be used from the Rust side without having them registered. This is a soundness issue, so remove them. Fixes: 6ea5aa08857a ("rust: sync: introduce `LockClassKey`") Suggested-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Mitchell Levy <levymitchell0@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250307232717.1759087-11-boqun.feng@gmail.com
2025-03-08x86/mm: Define PTRS_PER_PMD for assembly code tooIngo Molnar
Andy reported the following build warning from head_32.S: In file included from arch/x86/kernel/head_32.S:29: arch/x86/include/asm/pgtable_32.h:59:5: error: "PTRS_PER_PMD" is not defined, evaluates to 0 [-Werror=undef] 59 | #if PTRS_PER_PMD > 1 The reason is that on 2-level i386 paging the folded in PMD's PTRS_PER_PMD constant is not defined in assembly headers, only in generic MM C headers. Instead of trying to fish out the definition from the generic headers, just define it - it even has a comment for it already... Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/Z8oa8AUVyi2HWfo9@gmail.com
2025-03-07x86/boot: Drop CRC-32 checksum and the build tool that generates itArd Biesheuvel
Apart from some sanity checks on the size of setup.bin, the only remaining task carried out by the arch/x86/boot/tools/build.c build tool is generating the CRC-32 checksum of the bzImage. This feature was added in commit 7d6e737c8d2698b6 ("x86: add a crc32 checksum to the kernel image.") without any motivation (or any commit log text, for that matter). This checksum is not verified by any known bootloader, and given that a) the checksum of the entire bzImage is reported by most tools (zlib, rhash) as 0xffffffff and not 0x0 as documented, b) the checksum is corrupted when the image is signed for secure boot, which means that no distro ships x86 images with valid CRCs, it seems quite unlikely that this checksum is being used, so let's just drop it, along with the tool that generates it. Instead, use simple file concatenation and truncation to combine the two pieces into bzImage, and replace the checks on the size of the setup block with a couple of ASSERT()s in the linker script. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Campbell <ijc@hellion.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250307164801.885261-2-ardb+git@google.com
2025-03-07Merge tag 'slab-for-6.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Stable fix for kmem_cache_destroy() called from a WQ_MEM_RECLAIM workqueue causing a warning due to the new kvfree_rcu_barrier() (Uladzislau Rezki) * tag 'slab-for-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab/kvfree_rcu: Switch to WQ_MEM_RECLAIM wq
2025-03-07Merge tag 'acpi-6.14-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fix from Rafael Wysocki: "Restore the previous behavior of the ACPI platform_profile sysfs interface that has been changed recently in a way incompatible with the existing user space (Mario Limonciello)" * tag 'acpi-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: platform/x86/amd: pmf: Add balanced-performance to hidden choices platform/x86/amd: pmf: Add 'quiet' to hidden choices ACPI: platform_profile: Add support for hidden choices