summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-12-06ALSA: hda/realtek - Add support for ASUS Zen AIO 27 Z272SD_A272SD audioVasiliy Kovalev
Introduces necessary quirks to enable audio functionality on the ASUS Zen AIO 27 Z272SD_A272SD: - configures verbs to activate internal speakers and headphone jack. - implements adjustments for headset microphone functionality. The speaker and jack configurations were derived from a dump of the working Windows driver, while the headset microphone functionality was fine-tuned through experimental testing. Link: https://lore.kernel.org/all/CAGGMHBOGDUnMewBTrZgoBKFk_A4sNF4fEXwfH9Ay8SNTzjy0-g@mail.gmail.com/T/ Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org> Link: https://patch.msgid.link/20241205210306.977634-1-kovalev@altlinux.org Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-12-06ALSA: hda/realtek: Fix headset mic on Acer Nitro 5Hridesh MG
Add a PCI quirk to enable microphone input on the headphone jack on the Acer Nitro 5 AN515-58 laptop. Signed-off-by: Hridesh MG <hridesh699@gmail.com> Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20241205171843.7787-1-hridesh699@gmail.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-12-06ALSA: hda: cs35l56: Remove calls to ↵Simon Trimmer
cs35l56_force_sync_asp1_registers_from_cache() Commit 5d7e328e20b3 ("ASoC: cs35l56: Revert support for dual-ownership of ASP registers") replaced cs35l56_force_sync_asp1_registers_from_cache() with a dummy implementation so that the HDA driver would continue to build. Remove the calls from HDA and remove the stub function. Signed-off-by: Simon Trimmer <simont@opensource.cirrus.com> Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://patch.msgid.link/20241206105757.718750-1-rf@opensource.cirrus.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-12-06x86/cacheinfo: Delete global num_cache_leavesRicardo Neri
Linux remembers cpu_cachinfo::num_leaves per CPU, but x86 initializes all CPUs from the same global "num_cache_leaves". This is erroneous on systems such as Meteor Lake, where each CPU has a distinct num_leaves value. Delete the global "num_cache_leaves" and initialize num_leaves on each CPU. init_cache_level() no longer needs to set num_leaves. Also, it never had to set num_levels as it is unnecessary in x86. Keep checking for zero cache leaves. Such condition indicates a bug. [ bp: Cleanup. ] Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org # 6.3+ Link: https://lore.kernel.org/r/20241128002247.26726-3-ricardo.neri-calderon@linux.intel.com
2024-12-06cacheinfo: Allocate memory during CPU hotplug if not done from the primary CPURicardo Neri
Commit 5944ce092b97 ("arch_topology: Build cacheinfo from primary CPU") adds functionality that architectures can use to optionally allocate and build cacheinfo early during boot. Commit 6539cffa9495 ("cacheinfo: Add arch specific early level initializer") lets secondary CPUs correct (and reallocate memory) cacheinfo data if needed. If the early build functionality is not used and cacheinfo does not need correction, memory for cacheinfo is never allocated. x86 does not use the early build functionality. Consequently, during the cacheinfo CPU hotplug callback, last_level_cache_is_valid() attempts to dereference a NULL pointer: BUG: kernel NULL pointer dereference, address: 0000000000000100 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not present page PGD 0 P4D 0 Oops: 0000 [#1] PREEPMT SMP NOPTI CPU: 0 PID 19 Comm: cpuhp/0 Not tainted 6.4.0-rc2 #1 RIP: 0010: last_level_cache_is_valid+0x95/0xe0a Allocate memory for cacheinfo during the cacheinfo CPU hotplug callback if not done earlier. Moreover, before determining the validity of the last-level cache info, ensure that it has been allocated. Simply checking for non-zero cache_leaves() is not sufficient, as some architectures (e.g., Intel processors) have non-zero cache_leaves() before allocation. Dereferencing NULL cacheinfo can occur in update_per_cpu_data_slice_size(). This function iterates over all online CPUs. However, a CPU may have come online recently, but its cacheinfo may not have been allocated yet. While here, remove an unnecessary indentation in allocate_cache_info(). [ bp: Massage. ] Fixes: 6539cffa9495 ("cacheinfo: Add arch specific early level initializer") Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Radu Rendec <rrendec@redhat.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Cc: stable@vger.kernel.org # 6.3+ Link: https://lore.kernel.org/r/20241128002247.26726-2-ricardo.neri-calderon@linux.intel.com
2024-12-06wifi: cfg80211: sme: init n_channels before channels[] accessHaoyu Li
With the __counted_by annocation in cfg80211_scan_request struct, the "n_channels" struct member must be set before accessing the "channels" array. Failing to do so will trigger a runtime warning when enabling CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Fixes: e3eac9f32ec0 ("wifi: cfg80211: Annotate struct cfg80211_scan_request with __counted_by") Signed-off-by: Haoyu Li <lihaoyu499@gmail.com> Link: https://patch.msgid.link/20241203152049.348806-1-lihaoyu499@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-12-06x86/kexec: Restore GDT on return from ::preserve_context kexecDavid Woodhouse
The restore_processor_state() function explicitly states that "the asm code that gets us here will have restored a usable GDT". That wasn't true in the case of returning from a ::preserve_context kexec. Make it so. Without this, the kernel was depending on the called function to reload a GDT which is appropriate for the kernel before returning. Test program: #include <unistd.h> #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <linux/kexec.h> #include <linux/reboot.h> #include <sys/reboot.h> #include <sys/syscall.h> int main (void) { struct kexec_segment segment = {}; unsigned char purgatory[] = { 0x66, 0xba, 0xf8, 0x03, // mov $0x3f8, %dx 0xb0, 0x42, // mov $0x42, %al 0xee, // outb %al, (%dx) 0xc3, // ret }; int ret; segment.buf = &purgatory; segment.bufsz = sizeof(purgatory); segment.mem = (void *)0x400000; segment.memsz = 0x1000; ret = syscall(__NR_kexec_load, 0x400000, 1, &segment, KEXEC_PRESERVE_CONTEXT); if (ret) { perror("kexec_load"); exit(1); } ret = syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, LINUX_REBOOT_CMD_KEXEC); if (ret) { perror("kexec reboot"); exit(1); } printf("Success\n"); return 0; } Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20241205153343.3275139-2-dwmw2@infradead.org
2024-12-06tty: serial: Work around warning backtrace in serial8250_set_defaultsGuenter Roeck
Commit 7c7e6c8924e7 ("tty: serial: handle HAS_IOPORT dependencies") triggers warning backtraces on a number of platforms which don't support IO ports. WARNING: CPU: 0 PID: 0 at drivers/tty/serial/8250/8250_port.c:470 serial8250_set_defaults+0x148/0x1d8 Unsupported UART type 0 The problem is seen because serial8250_set_defaults() is called for all members of the serial8250_ports[] array even if that array is not initialized. Work around the problem by only displaying the warning if the port type is not 0 (UPIO_PORT) or if iobase is set for the port. Fixes: 7c7e6c8924e7 ("tty: serial: handle HAS_IOPORT dependencies") Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Stafford Horne <shorne@gmail.com> Link: https://lore.kernel.org/r/20241205143033.2695333-1-linux@roeck-us.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05iio: magnetometer: yas530: use signed integer type for clamp limitsJakob Hauser
In the function yas537_measure() there is a clamp_val() with limits of -BIT(13) and BIT(13) - 1. The input clamp value h[] is of type s32. The BIT() is of type unsigned long integer due to its define in include/vdso/bits.h. The lower limit -BIT(13) is recognized as -8192 but expressed as an unsigned long integer. The size of an unsigned long integer differs between 32-bit and 64-bit architectures. Converting this to type s32 may lead to undesired behavior. Additionally, in the calculation lines h[0], h[1] and h[2] the unsigned long integer divisor BIT(13) causes an unsigned division, shifting the left-hand side of the equation back and forth, possibly ending up in large positive values instead of negative values on 32-bit architectures. To solve those two issues, declare a signed integer with a value of BIT(13). There is another omission in the clamp line: clamp_val() returns a value and it's going nowhere here. Self-assign it to h[i] to make use of the clamp macro. Finally, replace clamp_val() macro by clamp() because after changing the limits from type unsigned long integer to signed integer it's fine that way. Link: https://lkml.kernel.org/r/11609b2243c295d65ab4d47e78c239d61ad6be75.1732914810.git.jahau@rocketmail.com Fixes: 65f79b501030 ("iio: magnetometer: yas530: Add YAS537 variant") Signed-off-by: Jakob Hauser <jahau@rocketmail.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202411230458.dhZwh3TT-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202411282222.oF0B4110-lkp@intel.com/ Reviewed-by: David Laight <david.laight@aculab.com> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05sched/numa: fix memory leak due to the overwritten vma->numab_stateAdrian Huang
[Problem Description] When running the hackbench program of LTP, the following memory leak is reported by kmemleak. # /opt/ltp/testcases/bin/hackbench 20 thread 1000 Running with 20*40 (== 800) tasks. # dmesg | grep kmemleak ... kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak) kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak) # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888cd8ca2c40 (size 64): comm "hackbench", pid 17142, jiffies 4299780315 hex dump (first 32 bytes): ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00 .tI.....L.I..... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc bff18fd4): [<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0 [<ffffffff8113f715>] task_numa_work+0x725/0xa00 [<ffffffff8110f878>] task_work_run+0x58/0x90 [<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0 [<ffffffff81dd78d5>] do_syscall_64+0x85/0x150 [<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e ... This issue can be consistently reproduced on three different servers: * a 448-core server * a 256-core server * a 192-core server [Root Cause] Since multiple threads are created by the hackbench program (along with the command argument 'thread'), a shared vma might be accessed by two or more cores simultaneously. When two or more cores observe that vma->numab_state is NULL at the same time, vma->numab_state will be overwritten. Although current code ensures that only one thread scans the VMAs in a single 'numa_scan_period', there might be a chance for another thread to enter in the next 'numa_scan_period' while we have not gotten till numab_state allocation [1]. Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000` cannot the reproduce the issue. It is verified with 200+ test runs. [Solution] Use the cmpxchg atomic operation to ensure that only one thread executes the vma->numab_state assignment. [1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/ Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com Fixes: ef6a22b70f6d ("sched/numa: apply the scan delay to every new vma") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reported-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/damon: fix order of arguments in damos_before_apply tracepointAkinobu Mita
Since the order of the scheme_idx and target_idx arguments in TP_ARGS is reversed, they are stored in the trace record in reverse. Link: https://lkml.kernel.org/r/20241115182023.43118-1-sj@kernel.org Link: https://patch.msgid.link/20241112154828.40307-1-akinobu.mita@gmail.com Fixes: c603c630b509 ("mm/damon/core: add a tracepoint for damos apply target regions") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05lib: stackinit: hide never-taken branch from compilerKees Cook
The never-taken branch leads to an invalid bounds condition, which is by design. To avoid the unwanted warning from the compiler, hide the variable from the optimizer. ../lib/stackinit_kunit.c: In function 'do_nothing_u16_zero': ../lib/stackinit_kunit.c:51:49: error: array subscript 1 is outside array bounds of 'u16[0]' {aka 'short unsigned int[]'} [-Werror=array-bounds=] 51 | #define DO_NOTHING_RETURN_SCALAR(ptr) *(ptr) | ^~~~~~ ../lib/stackinit_kunit.c:219:24: note: in expansion of macro 'DO_NOTHING_RETURN_SCALAR' 219 | return DO_NOTHING_RETURN_ ## which(ptr + 1); \ | ^~~~~~~~~~~~~~~~~~ Link: https://lkml.kernel.org/r/20241117113813.work.735-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/filemap: don't call folio_test_locked() without a reference in ↵David Hildenbrand
next_uptodate_folio() The folio can get freed + buddy-merged + reallocated in the meantime, resulting in us calling folio_test_locked() possibly on a tail page. This makes const_folio_flags VM_BUG_ON_PGFLAGS() when stumbling over the tail page. Could this result in other issues? Doesn't look like it. False positives and false negatives don't really matter, because this folio would get skipped either way when detecting that they have been reallocated in the meantime. Fix it by performing the folio_test_locked() checked after grabbing a reference. If this ever becomes a real problem, we could add a special helper that racily checks if the bit is set even on tail pages ... but let's hope that's not required so we can just handle it cleaner: work on the folio after we hold a reference. Do we really need the folio_test_locked() check if we are going to trylock briefly after? Well, we can at least avoid a xas_reload(). It's a bit unclear which exact change introduced that issue. Likely, ever since we made PG_locked obey to the PF_NO_TAIL policy it could have been triggered in some way. Link: https://lkml.kernel.org/r/20241129125303.4033164-1-david@redhat.com Fixes: 48c935ad88f5 ("page-flags: define PG_locked behavior on compound pages") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+9f9a7f73fb079b2387a6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/674184c9.050a0220.1cc393.0001.GAE@google.com/ Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05scatterlist: fix incorrect func name in kernel-docRandy Dunlap
Fix a kernel-doc warning by making the kernel-doc function description match the function name: include/linux/scatterlist.h:323: warning: expecting prototype for sg_unmark_bus_address(). Prototype was for sg_dma_unmark_bus_address() instead Link: https://lkml.kernel.org/r/20241130022406.537973-1-rdunlap@infradead.org Fixes: 42399301203e ("lib/scatterlist: add flag for indicating P2PDMA segments in an SGL") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: correct typo in MMAP_STATE() macroLorenzo Stoakes
We mistakenly refer to len rather than len_ here. The only existing caller passes len to the len_ parameter so this has no impact on the code, but it is obviously incorrect to do this, so fix it. Link: https://lkml.kernel.org/r/20241118175414.390827-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: respect mmap hint address when aligning for THPKalesh Singh
Commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") updated __get_unmapped_area() to align the start address for the VMA to a PMD boundary if CONFIG_TRANSPARENT_HUGEPAGE=y. It does this by effectively looking up a region that is of size, request_size + PMD_SIZE, and aligning up the start to a PMD boundary. Commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") opted out of this for 32bit due to regressions in mmap base randomization. Commit d4148aeab412 ("mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes") restricted this to only mmap sizes that are multiples of the PMD_SIZE due to reported regressions in some performance benchmarks -- which seemed mostly due to the reduced spatial locality of related mappings due to the forced PMD-alignment. Another unintended side effect has emerged: When a user specifies an mmap hint address, the THP alignment logic modifies the behavior, potentially ignoring the hint even if a sufficiently large gap exists at the requested hint location. Example Scenario: Consider the following simplified virtual address (VA) space: ... 0x200000-0x400000 --- VMA A 0x400000-0x600000 --- Hole 0x600000-0x800000 --- VMA B ... A call to mmap() with hint=0x400000 and len=0x200000 behaves differently: - Before THP alignment: The requested region (size 0x200000) fits into the gap at 0x400000, so the hint is respected. - After alignment: The logic searches for a region of size 0x400000 (len + PMD_SIZE) starting at 0x400000. This search fails due to the mapping at 0x600000 (VMA B), and the hint is ignored, falling back to arch_get_unmapped_area[_topdown](). In general the hint is effectively ignored, if there is any existing mapping in the below range: [mmap_hint + mmap_size, mmap_hint + mmap_size + PMD_SIZE) This changes the semantics of mmap hint; from ""Respect the hint if a sufficiently large gap exists at the requested location" to "Respect the hint only if an additional PMD-sized gap exists beyond the requested size". This has performance implications for allocators that allocate their heap using mmap but try to keep it "as contiguous as possible" by using the end of the exisiting heap as the address hint. With the new behavior it's more likely to get a much less contiguous heap, adding extra fragmentation and performance overhead. To restore the expected behavior; don't use thp_get_unmapped_area_vmflags() when the user provided a hint address, for anonymous mappings. Note: As Yang Shi pointed out: the issue still remains for filesystems which are using thp_get_unmapped_area() for their get_unmapped_area() op. It is unclear what worklaods will regress for if we ignore THP alignment when the hint address is provided for such file backed mappings -- so this fix will be handled separately. Link: https://lkml.kernel.org/r/20241118214650.3667577-1-kaleshsingh@google.com Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hans Boehm <hboehm@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: memcg: declare do_memsw_account inlineJohn Sperbeck
In commit 66d60c428b23 ("mm: memcg: move legacy memcg event code into memcontrol-v1.c"), the static do_memsw_account() function was moved from a .c file to a .h file. Unfortunately, the traditional inline keyword wasn't added. If a file (e.g., a unit test) includes the .h file, but doesn't refer to do_memsw_account(), it will get a warning like: mm/memcontrol-v1.h:41:13: warning: unused function 'do_memsw_account' [-Wunused-function] 41 | static bool do_memsw_account(void) | ^~~~~~~~~~~~~~~~ Link: https://lkml.kernel.org/r/20241128203959.726527-1-jsperbeck@google.com Fixes: 66d60c428b23 ("mm: memcg: move legacy memcg event code into memcontrol-v1.c") Signed-off-by: John Sperbeck <jsperbeck@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/codetag: swap tags when migrate pagesDavid Wang
Current solution to adjust codetag references during page migration is done in 3 steps: 1. sets the codetag reference of the old page as empty (not pointing to any codetag); 2. subtracts counters of the new page to compensate for its own allocation; 3. sets codetag reference of the new page to point to the codetag of the old page. This does not work if CONFIG_MEM_ALLOC_PROFILING_DEBUG=n because set_codetag_empty() becomes NOOP. Instead, let's simply swap codetag references so that the new page is referencing the old codetag and the old page is referencing the new codetag. This way accounting stays valid and the logic makes more sense. Link: https://lkml.kernel.org/r/20241129025213.34836-1-00107082@163.com Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/lkml/20241124074318.399027-1-00107082@163.com/ Acked-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05ocfs2: update seq_file index in ocfs2_dlm_seq_nextWengang Wang
The following INFO level message was seen: seq_file: buggy .next function ocfs2_dlm_seq_next [ocfs2] did not update position index Fix: Update *pos (so m->index) to make seq_read_iter happy though the index its self makes no sense to ocfs2_dlm_seq_next. Link: https://lkml.kernel.org/r/20241119174500.9198-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05stackdepot: fix stack_depot_save_flags() in NMI contextMarco Elver
Per documentation, stack_depot_save_flags() was meant to be usable from NMI context if STACK_DEPOT_FLAG_CAN_ALLOC is unset. However, it still would try to take the pool_lock in an attempt to save a stack trace in the current pool (if space is available). This could result in deadlock if an NMI is handled while pool_lock is already held. To avoid deadlock, only try to take the lock in NMI context and give up if unsuccessful. The documentation is fixed to clearly convey this. Link: https://lkml.kernel.org/r/Z0CcyfbPqmxJ9uJH@elver.google.com Link: https://lkml.kernel.org/r/20241122154051.3914732-1-elver@google.com Fixes: 4434a56ec209 ("stackdepot: make fast paths lock-less again") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: open-code page_folio() in dump_page()Matthew Wilcox (Oracle)
page_folio() calls page_fixed_fake_head() which will misidentify this page as being a fake head and load off the end of 'precise'. We may have a pointer to a fake head, but that's OK because it contains the right information for dump_page(). gcc-15 is smart enough to catch this with -Warray-bounds: In function 'page_fixed_fake_head', inlined from '_compound_head' at ../include/linux/page-flags.h:251:24, inlined from '__dump_page' at ../mm/debug.c:123:11: ../include/asm-generic/rwonce.h:44:26: warning: array subscript 9 is outside +array bounds of 'struct page[1]' [-Warray-bounds=] Link: https://lkml.kernel.org/r/20241125201721.2963278-2-willy@infradead.org Fixes: fae7d834c43c ("mm: add __dump_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: open-code PageTail in folio_flags() and const_folio_flags()Matthew Wilcox (Oracle)
It is unsafe to call PageTail() in dump_page() as page_is_fake_head() will almost certainly return true when called on a head page that is copied to the stack. That will cause the VM_BUG_ON_PGFLAGS() in const_folio_flags() to trigger when it shouldn't. Fortunately, we don't need to call PageTail() here; it's fine to have a pointer to a virtual alias of the page's flag word rather than the real page's flag word. Link: https://lkml.kernel.org/r/20241125201721.2963278-1-willy@infradead.org Fixes: fae7d834c43c ("mm: add __dump_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: fix vrealloc()'s KASAN poisoning logicAndrii Nakryiko
When vrealloc() reuses already allocated vmap_area, we need to re-annotate poisoned and unpoisoned portions of underlying memory according to the new size. This results in a KASAN splat recorded at [1]. A KASAN mis-reporting issue where there is none. Note, hard-coding KASAN_VMALLOC_PROT_NORMAL might not be exactly correct, but KASAN flag logic is pretty involved and spread out throughout __vmalloc_node_range_noprof(), so I'm using the bare minimum flag here and leaving the rest to mm people to refactor this logic and reuse it here. Link: https://lkml.kernel.org/r/20241126005206.3457974-1-andrii@kernel.org Link: https://lore.kernel.org/bpf/67450f9b.050a0220.21d33d.0004.GAE@google.com/ [1] Fixes: 3ddc2fefe6f3 ("mm: vmalloc: implement vrealloc()") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05Revert "readahead: properly shorten readahead when falling back to ↵Jan Kara
do_page_cache_ra()" This reverts commit 7c877586da3178974a8a94577b6045a48377ff25. Anders and Philippe have reported that recent kernels occasionally hang when used with NFS in readahead code. The problem has been bisected to 7c877586da3 ("readahead: properly shorten readahead when falling back to do_page_cache_ra()"). The cause of the problem is that ra->size can be shrunk by read_pages() call and subsequently we end up calling do_page_cache_ra() with negative (read huge positive) number of pages. Let's revert 7c877586da3 for now until we can find a proper way how the logic in read_pages() and page_cache_ra_order() can coexist. This can lead to reduced readahead throughput due to readahead window confusion but that's better than outright hangs. Link: https://lkml.kernel.org/r/20241126145208.985-1-jack@suse.cz Fixes: 7c877586da31 ("readahead: properly shorten readahead when falling back to do_page_cache_ra()") Reported-by: Anders Blomdell <anders.blomdell@gmail.com> Reported-by: Philippe Troin <phil@fifi.org> Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Philippe Troin <phil@fifi.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05selftests/damon: add _damon_sysfs.py to TEST_FILESMaximilian Heyne
When running selftests I encountered the following error message with some damon tests: # Traceback (most recent call last): # File "[...]/damon/./damos_quota.py", line 7, in <module> # import _damon_sysfs # ModuleNotFoundError: No module named '_damon_sysfs' Fix this by adding the _damon_sysfs.py file to TEST_FILES so that it will be available when running the respective damon selftests. Link: https://lkml.kernel.org/r/20241127-picks-visitor-7416685b-mheyne@amazon.de Fixes: 306abb63a8ca ("selftests/damon: implement a python module for test-purpose DAMON sysfs controls") Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05selftest: hugetlb_dio: fix test namingMark Brown
The string logged when a test passes or fails is used by the selftest framework to identify which test is being reported. The hugetlb_dio test not only uses the same strings for every test that is run but it also uses different strings for test passes and failures which means that test automation is unable to follow what the test is doing at all. Pull the existing duplicated logging of the number of free huge pages before and after the test out of the conditional and replace that and the logging of the result with a single ksft_print_result() which incorporates the parameters passed into the test into the output. Link: https://lkml.kernel.org/r/20241127-kselftest-mm-hugetlb-dio-names-v1-1-22aab01bf550@kernel.org Fixes: fae1980347bf ("selftests: hugetlb_dio: fixup check for initial conditions to skip in the start") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05ocfs2: free inode when ocfs2_get_init_inode() failsTetsuo Handa
syzbot is reporting busy inodes after unmount, for commit 9c89fe0af826 ("ocfs2: Handle error from dquot_initialize()") forgot to call iput() when new_inode() succeeded and dquot_initialize() failed. Link: https://lkml.kernel.org/r/e68c0224-b7c6-4784-b4fa-a9fc8c675525@I-love.SAKURA.ne.jp Fixes: 9c89fe0af826 ("ocfs2: Handle error from dquot_initialize()") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot+0af00f6a2cba2058b5db@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0af00f6a2cba2058b5db Tested-by: syzbot+0af00f6a2cba2058b5db@syzkaller.appspotmail.com Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry()Ryusuke Konishi
Syzbot reported that when searching for records in a directory where the inode's i_size is corrupted and has a large value, memory access outside the folio/page range may occur, or a use-after-free bug may be detected if KASAN is enabled. This is because nilfs_last_byte(), which is called by nilfs_find_entry() and others to calculate the number of valid bytes of directory data in a page from i_size and the page index, loses the upper 32 bits of the 64-bit size information due to an inappropriate type of local variable to which the i_size value is assigned. This caused a large byte offset value due to underflow in the end address calculation in the calling nilfs_find_entry(), resulting in memory access that exceeds the folio/page size. Fix this issue by changing the type of the local variable causing the bit loss from "unsigned int" to "u64". The return value of nilfs_last_byte() is also of type "unsigned int", but it is truncated so as not to exceed PAGE_SIZE and no bit loss occurs, so no change is required. Link: https://lkml.kernel.org/r/20241119172403.9292-1-konishi.ryusuke@gmail.com Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+96d5d14c47d97015c624@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=96d5d14c47d97015c624 Tested-by: syzbot+96d5d14c47d97015c624@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05kasan: make report_lock a raw spinlockJared Kangas
If PREEMPT_RT is enabled, report_lock is a sleeping spinlock and must not be locked when IRQs are disabled. However, KASAN reports may be triggered in such contexts. For example: char *s = kzalloc(1, GFP_KERNEL); kfree(s); local_irq_disable(); char c = *s; /* KASAN report here leads to spin_lock() */ local_irq_enable(); Make report_spinlock a raw spinlock to prevent rescheduling when PREEMPT_RT is enabled. Link: https://lkml.kernel.org/r/20241119210234.1602529-1-jkangas@redhat.com Fixes: 342a93247e08 ("locking/spinlock: Provide RT variant header: <linux/spinlock_rt.h>") Signed-off-by: Jared Kangas <jkangas@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/mempolicy: fix migrate_to_node() assuming there is at least one VMA in a MMDavid Hildenbrand
We currently assume that there is at least one VMA in a MM, which isn't true. So we might end up having find_vma() return NULL, to then de-reference NULL. So properly handle find_vma() returning NULL. This fixes the report: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 UID: 0 PID: 6021 Comm: syz-executor284 Not tainted 6.12.0-rc7-syzkaller-00187-gf868cd251776 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024 RIP: 0010:migrate_to_node mm/mempolicy.c:1090 [inline] RIP: 0010:do_migrate_pages+0x403/0x6f0 mm/mempolicy.c:1194 Code: ... RSP: 0018:ffffc9000375fd08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffc9000375fd78 RCX: 0000000000000000 RDX: ffff88807e171300 RSI: dffffc0000000000 RDI: ffff88803390c044 RBP: ffff88807e171428 R08: 0000000000000014 R09: fffffbfff2039ef1 R10: ffffffff901cf78f R11: 0000000000000000 R12: 0000000000000003 R13: ffffc9000375fe90 R14: ffffc9000375fe98 R15: ffffc9000375fdf8 FS: 00005555919e1380(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005555919e1ca8 CR3: 000000007f12a000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> kernel_migrate_pages+0x5b2/0x750 mm/mempolicy.c:1709 __do_sys_migrate_pages mm/mempolicy.c:1727 [inline] __se_sys_migrate_pages mm/mempolicy.c:1723 [inline] __x64_sys_migrate_pages+0x96/0x100 mm/mempolicy.c:1723 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [akpm@linux-foundation.org: add unlikely()] Link: https://lkml.kernel.org/r/20241120201151.9518-1-david@redhat.com Fixes: 39743889aaf7 ("[PATCH] Swap Migration V5: sys_migrate_pages interface") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+3511625422f7aa637f0d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/673d2696.050a0220.3c9d61.012f.GAE@google.com/T/ Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/gup: handle NULL pages in unpin_user_pages()John Hubbard
The recent addition of "pofs" (pages or folios) handling to gup has a flaw: it assumes that unpin_user_pages() handles NULL pages in the pages** array. That's not the case, as I discovered when I ran on a new configuration on my test machine. Fix this by skipping NULL pages in unpin_user_pages(), just like unpin_folios() already does. Details: when booting on x86 with "numa=fake=2 movablecore=4G" on Linux 6.12, and running this: tools/testing/selftests/mm/gup_longterm ...I get the following crash: BUG: kernel NULL pointer dereference, address: 0000000000000008 RIP: 0010:sanity_check_pinned_pages+0x3a/0x2d0 ... Call Trace: <TASK> ? __die_body+0x66/0xb0 ? page_fault_oops+0x30c/0x3b0 ? do_user_addr_fault+0x6c3/0x720 ? irqentry_enter+0x34/0x60 ? exc_page_fault+0x68/0x100 ? asm_exc_page_fault+0x22/0x30 ? sanity_check_pinned_pages+0x3a/0x2d0 unpin_user_pages+0x24/0xe0 check_and_migrate_movable_pages_or_folios+0x455/0x4b0 __gup_longterm_locked+0x3bf/0x820 ? mmap_read_lock_killable+0x12/0x50 ? __pfx_mmap_read_lock_killable+0x10/0x10 pin_user_pages+0x66/0xa0 gup_test_ioctl+0x358/0xb20 __se_sys_ioctl+0x6b/0xc0 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e Link: https://lkml.kernel.org/r/20241121034933.77502-1-jhubbard@nvidia.com Fixes: 94efde1d1539 ("mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05fs/proc/vmcore.c: fix warning when CONFIG_MMU=nAndrew Morton
>> fs/proc/vmcore.c:424:19: warning: 'mmap_vmcore_fault' defined but not used [-Wunused-function] 424 | static vm_fault_t mmap_vmcore_fault(struct vm_fault *vmf) Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202411140156.2o0nS4fl-lkp@intel.com/ Cc: Qi Xi <xiqi2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05net/mlx5: DR, prevent potential error pointer dereferenceDan Carpenter
The dr_domain_add_vport_cap() function generally returns NULL on error but sometimes we want it to return ERR_PTR(-EBUSY) so the caller can retry. The problem here is that "ret" can be either -EBUSY or -ENOMEM and if it's and -ENOMEM then the error pointer is propogated back and eventually dereferenced in dr_ste_v0_build_src_gvmi_qpn_tag(). Fixes: 11a45def2e19 ("net/mlx5: DR, Add support for SF vports") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/07477254-e179-43e2-b1b3-3b9db4674195@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05Merge branch 'xdp-a-fistful-of-generic-changes-pt-i'Jakub Kicinski
Alexander Lobakin says: ==================== xdp: a fistful of generic changes pt. I XDP for idpf is currently 6 chapters: * convert Rx to libeth; * convert Tx and stats to libeth; * generic XDP and XSk code changes (you are here); * generic XDP and XSk code additions; * actual XDP for idpf via new libeth_xdp; * XSk for idpf (via ^). Part III does the following: * improve &xdp_buff_xsk cacheline placement; * does some cleanups with marking read-only bpf_prog and xdp_buff arguments const for some generic functions; * allows attaching already registered XDP memory model to RxQ info; * makes system percpu page_pools valid XDP memory models; * starts using netmems in the XDP core code (1 function); * allows mixing pages from several page_pools within one XDP frame; * optimizes &xdp_frame layout and removes no-more-used field. Bullets 4-6 are the most important ones. All of them are prereqs to libeth_xdp. ==================== Link: https://patch.msgid.link/20241203173733.3181246-1-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05page_pool: make page_pool_put_page_bulk() handle array of netmemsAlexander Lobakin
Currently, page_pool_put_page_bulk() indeed takes an array of pointers to the data, not pages, despite the name. As one side effect, when you're freeing frags from &skb_shared_info, xdp_return_frame_bulk() converts page pointers to virtual addresses and then page_pool_put_page_bulk() converts them back. Moreover, data pointers assume every frag is placed in the host memory, making this function non-universal. Make page_pool_put_page_bulk() handle array of netmems. Pass frag netmems directly and use virt_to_netmem() when freeing xdpf->data, so that the PP core will then get the compound netmem and take care of the rest. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20241203173733.3181246-9-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05netmem: add a couple of page helper wrappersAlexander Lobakin
Add the following netmem counterparts: * virt_to_netmem() -- simple page_to_netmem(virt_to_page()) wrapper; * netmem_is_pfmemalloc() -- page_is_pfmemalloc() for page-backed netmems, false otherwise; and the following "unsafe" versions: * __netmem_to_page() * __netmem_get_pp() * __netmem_address() They do the same as their non-underscored buddies, but assume the netmem is always page-backed. When working with header &page_pools, you don't need to check whether netmem belongs to the host memory and you can never get NULL instead of &page. Checks for the LSB, clearing the LSB, branches take cycles and increase object code size, sometimes significantly. When you're sure your PP is always host, you can avoid this by using the underscored counterparts. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20241203173733.3181246-8-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05xdp: register system page pool as an XDP memory modelToke Høiland-Jørgensen
To make the system page pool usable as a source for allocating XDP frames, we need to register it with xdp_reg_mem_model(), so that page return works correctly. This is done in preparation for using the system page_pool to convert XDP_PASS XSk frames to skbs; for the same reason, make the per-cpu variable non-static so we can access it from other source files as well (but w/o exporting). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241203173733.3181246-7-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05xsk: allow attaching XSk pool via xdp_rxq_info_reg_mem_model()Alexander Lobakin
When you register an XSk pool as XDP Rxq info memory model, you then need to manually attach it after the registration. Let the user combine both actions into one by just passing a pointer to the pool directly to xdp_rxq_info_reg_mem_model(), which will take care of calling xsk_pool_set_rxq_info(). This looks similar to how a &page_pool gets registered and reduce repeating driver code. Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241203173733.3181246-6-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05xdp: allow attaching already registered memory model to xdp_rxq_infoAlexander Lobakin
One may need to register memory model separately from xdp_rxq_info. One simple example may be XDP test run code, but in general, it might be useful when memory model registering is managed by one layer and then XDP RxQ info by a different one. Allow such scenarios by adding a simple helper which "attaches" already registered memory model to the desired xdp_rxq_info. As this is mostly needed for Page Pool, add a special function to do that for a &page_pool pointer. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241203173733.3181246-5-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05xdp, xsk: constify read-only arguments of some static inline helpersAlexander Lobakin
Lots of read-only helpers for &xdp_buff and &xdp_frame, such as getting the frame length, skb_shared_info etc., don't have their arguments marked with `const` for no reason. Add the missing annotations to leave less place for mistakes and more for optimization. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241203173733.3181246-4-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05bpf, xdp: constify some bpf_prog * function argumentsAlexander Lobakin
In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Address at least some of them and add `const` attributes there. The object code didn't change, but that may prevent unwanted data modifications and also allow more helpers to have const arguments. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05xsk: align &xdp_buff_xsk harderAlexander Lobakin
After the series "XSk buff on a diet" by Maciej, the greatest pow-2 which &xdp_buff_xsk can be divided got reduced from 16 to 8 on x86_64. Also, sizeof(xdp_buff_xsk) now is 120 bytes, which, taking the previous sentence into account, leads to that it leaves 8 bytes at the end of cacheline, which means an array of buffs will have its elements messed between the cachelines chaotically. Use __aligned_largest for this struct. This alignment is usually 16 bytes, which makes it fill two full cachelines and align an array nicely. ___cacheline_aligned may be excessive here, especially on arches with 128-256 byte CLs, as well as 32-bit arches (76 -> 96 bytes on MIPS32R2), while not doing better than _largest. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20241203173733.3181246-2-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05Merge branch 'net_sched-sch_sfq-reject-limit-of-1'Jakub Kicinski
Octavian Purdila says: ==================== net_sched: sch_sfq: reject limit of 1 The implementation does not properly support limits of 1. Add an in-kernel check, in addition to existing iproute2 check, since other tools may be used for configuration. This patch set also adds a selfcheck to test that a limit of 1 is rejected. An alternative (or in addition) we could fix the implementation by setting q->tail to NULL in sfq_drop if this is the last slot we marked empty, e.g.: --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -317,8 +317,11 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free) /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */ x = q->tail->next; slot = &q->slots[x]; - q->tail->next = slot->next; q->ht[slot->hash] = SFQ_EMPTY_SLOT; + if (x == slot->next) + q->tail = NULL; /* no more active slots */ + else + q->tail->next = slot->next; goto drop; } ==================== Link: https://patch.msgid.link/20241204030520.2084663-1-tavip@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05selftests/tc-testing: sfq: test that kernel rejects limit of 1Octavian Purdila
Add test to check that the kernel rejects a configuration with the limit set to 1. Signed-off-by: Octavian Purdila <tavip@google.com> Link: https://patch.msgid.link/20241204030520.2084663-3-tavip@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05net_sched: sch_sfq: don't allow 1 packet limitOctavian Purdila
The current implementation does not work correctly with a limit of 1. iproute2 actually checks for this and this patch adds the check in kernel as well. This fixes the following syzkaller reported crash: UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:210:6 index 65535 is out of range for type 'struct sfq_head[128]' CPU: 0 PID: 2569 Comm: syz-executor101 Not tainted 5.10.0-smp-DEV #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x125/0x19f lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:148 [inline] __ubsan_handle_out_of_bounds+0xed/0x120 lib/ubsan.c:347 sfq_link net/sched/sch_sfq.c:210 [inline] sfq_dec+0x528/0x600 net/sched/sch_sfq.c:238 sfq_dequeue+0x39b/0x9d0 net/sched/sch_sfq.c:500 sfq_reset+0x13/0x50 net/sched/sch_sfq.c:525 qdisc_reset+0xfe/0x510 net/sched/sch_generic.c:1026 tbf_reset+0x3d/0x100 net/sched/sch_tbf.c:319 qdisc_reset+0xfe/0x510 net/sched/sch_generic.c:1026 dev_reset_queue+0x8c/0x140 net/sched/sch_generic.c:1296 netdev_for_each_tx_queue include/linux/netdevice.h:2350 [inline] dev_deactivate_many+0x6dc/0xc20 net/sched/sch_generic.c:1362 __dev_close_many+0x214/0x350 net/core/dev.c:1468 dev_close_many+0x207/0x510 net/core/dev.c:1506 unregister_netdevice_many+0x40f/0x16b0 net/core/dev.c:10738 unregister_netdevice_queue+0x2be/0x310 net/core/dev.c:10695 unregister_netdevice include/linux/netdevice.h:2893 [inline] __tun_detach+0x6b6/0x1600 drivers/net/tun.c:689 tun_detach drivers/net/tun.c:705 [inline] tun_chr_close+0x104/0x1b0 drivers/net/tun.c:3640 __fput+0x203/0x840 fs/file_table.c:280 task_work_run+0x129/0x1b0 kernel/task_work.c:185 exit_task_work include/linux/task_work.h:33 [inline] do_exit+0x5ce/0x2200 kernel/exit.c:931 do_group_exit+0x144/0x310 kernel/exit.c:1046 __do_sys_exit_group kernel/exit.c:1057 [inline] __se_sys_exit_group kernel/exit.c:1055 [inline] __x64_sys_exit_group+0x3b/0x40 kernel/exit.c:1055 do_syscall_64+0x6c/0xd0 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7fe5e7b52479 Code: Unable to access opcode bytes at RIP 0x7fe5e7b5244f. RSP: 002b:00007ffd3c800398 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe5e7b52479 RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000 RBP: 00007fe5e7bcd2d0 R08: ffffffffffffffb8 R09: 0000000000000014 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe5e7bcd2d0 R13: 0000000000000000 R14: 00007fe5e7bcdd20 R15: 00007fe5e7b24270 The crash can be also be reproduced with the following (with a tc recompiled to allow for sfq limits of 1): tc qdisc add dev dummy0 handle 1: root tbf rate 1Kbit burst 100b lat 1s ../iproute2-6.9.0/tc/tc qdisc add dev dummy0 handle 2: parent 1:10 sfq limit 1 ifconfig dummy0 up ping -I dummy0 -f -c2 -W0.1 8.8.8.8 sleep 1 Scenario that triggers the crash: * the first packet is sent and queued in TBF and SFQ; qdisc qlen is 1 * TBF dequeues: it peeks from SFQ which moves the packet to the gso_skb list and keeps qdisc qlen set to 1. TBF is out of tokens so it schedules itself for later. * the second packet is sent and TBF tries to queues it to SFQ. qdisc qlen is now 2 and because the SFQ limit is 1 the packet is dropped by SFQ. At this point qlen is 1, and all of the SFQ slots are empty, however q->tail is not NULL. At this point, assuming no more packets are queued, when sch_dequeue runs again it will decrement the qlen for the current empty slot causing an underflow and the subsequent out of bounds access. Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Octavian Purdila <tavip@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241204030520.2084663-2-tavip@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05net_sched: sch_fq: add three drop_reasonEric Dumazet
Add three new drop_reason, more precise than generic QDISC_DROP: "tc -s qd" show aggregate counters, it might be more useful to use drop_reason infrastructure for bug hunting. 1) SKB_DROP_REASON_FQ_BAND_LIMIT Whenever a packet is added while its band limit is hit. Corresponding value in "tc -s qd" is bandX_drops XXXX 2) SKB_DROP_REASON_FQ_HORIZON_LIMIT Whenever a packet has a timestamp too far in the future. Corresponding value in "tc -s qd" is horizon_drops XXXX 3) SKB_DROP_REASON_FQ_FLOW_LIMIT Whenever a flow has reached its limit. Corresponding value in "tc -s qd" is flows_plimit XXXX Tested: tc qd replace dev eth1 root fq flow_limit 10 limit 100000 perf record -a -e skb:kfree_skb sleep 1; perf script udp_stream 12329 [004] 216.929492: skb:kfree_skb: skbaddr=0xffff888eabe17e00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT udp_stream 12385 [006] 216.929593: skb:kfree_skb: skbaddr=0xffff888ef8827f00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT udp_stream 12389 [005] 216.929871: skb:kfree_skb: skbaddr=0xffff888ecb9ba500 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT udp_stream 12316 [009] 216.930398: skb:kfree_skb: skbaddr=0xffff888eca286b00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT udp_stream 12400 [008] 216.930490: skb:kfree_skb: skbaddr=0xffff888eabf93d00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_FLOW_LIMIT tc qd replace dev eth1 root fq flow_limit 100 limit 10000 perf record -a -e skb:kfree_skb sleep 1; perf script udp_stream 18074 [001] 1058.318040: skb:kfree_skb: skbaddr=0xffffa23c881fc000 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT udp_stream 18126 [005] 1058.320651: skb:kfree_skb: skbaddr=0xffffa23c6aad4000 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT udp_stream 18118 [006] 1058.321065: skb:kfree_skb: skbaddr=0xffffa23df0d48a00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT udp_stream 18074 [001] 1058.321126: skb:kfree_skb: skbaddr=0xffffa23c881ffa00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT udp_stream 15815 [003] 1058.321224: skb:kfree_skb: skbaddr=0xffffa23c9835db00 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x9d9 reason: FQ_BAND_LIMIT tc -s -d qd sh dev eth1 qdisc fq 8023: root refcnt 257 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 weights 589824 196608 65536 quantum 18Kb initial_quantum 92120b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop Sent 492439603330 bytes 336953991 pkt (dropped 61724094, overlimits 0 requeues 4463) backlog 14611228b 9995p requeues 4463 flows 2965 (inactive 1151 throttled 0) band0_pkts 0 band1_pkts 9993 band2_pkts 0 gc 6347 highprio 0 fastpath 30 throttled 5 latency 2.32us flows_plimit 7403693 band1_drops 54320401 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20241204171950.89829-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05tipc: fix NULL deref in cleanup_bearer()Eric Dumazet
syzbot found [1] that after blamed commit, ub->ubsock->sk was NULL when attempting the atomic_dec() : atomic_dec(&tipc_net(sock_net(ub->ubsock->sk))->wq_count); Fix this by caching the tipc_net pointer. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] CPU: 0 UID: 0 PID: 5896 Comm: kworker/0:3 Not tainted 6.13.0-rc1-next-20241203-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Workqueue: events cleanup_bearer RIP: 0010:read_pnet include/net/net_namespace.h:387 [inline] RIP: 0010:sock_net include/net/sock.h:655 [inline] RIP: 0010:cleanup_bearer+0x1f7/0x280 net/tipc/udp_media.c:820 Code: 18 48 89 d8 48 c1 e8 03 42 80 3c 28 00 74 08 48 89 df e8 3c f7 99 f6 48 8b 1b 48 83 c3 30 e8 f0 e4 60 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 74 08 48 89 df e8 1a f7 99 f6 49 83 c7 e8 48 8b 1b RSP: 0018:ffffc9000410fb70 EFLAGS: 00010206 RAX: 0000000000000006 RBX: 0000000000000030 RCX: ffff88802fe45a00 RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffc9000410f900 RBP: ffff88807e1f0908 R08: ffffc9000410f907 R09: 1ffff92000821f20 R10: dffffc0000000000 R11: fffff52000821f21 R12: ffff888031d19980 R13: dffffc0000000000 R14: dffffc0000000000 R15: ffff88807e1f0918 FS: 0000000000000000(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000556ca050b000 CR3: 0000000031c0c000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 6a2fa13312e5 ("tipc: Fix use-after-free of kernel socket in cleanup_bearer().") Reported-by: syzbot+46aa5474f179dacd1a3b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/67508b5f.050a0220.17bd51.0070.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241204170548.4152658-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05selftests/bpf: Enable Tx hwtstamp in xdp_hw_metadataSong Yoong Siang
Currently, user needs to manually enable transmit hardware timestamp feature of certain Ethernet drivers, e.g. stmmac and igc drivers, through following command after running the xdp_hw_metadata app. sudo hwstamp_ctl -i eth0 -t 1 To simplify the step test of xdp_hw_metadata, set tx_type to HWTSTAMP_TX_ON to enable hardware timestamping for all outgoing packets, so that user no longer need to execute hwstamp_ctl command. Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241205051936.3156307-1-yoong.siang.song@intel.com
2024-12-05selftests/bpf: Actuate tx_metadata_len in xdp_hw_metadataSong Yoong Siang
set XDP_UMEM_TX_METADATA_LEN flag to reserve tx_metadata_len bytes of per-chunk metadata. Fixes: d5e726d9143c ("xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len") Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241205044258.3155799-1-yoong.siang.song@intel.com
2024-12-05perf tools: Fix precise_ip fallback logicNamhyung Kim
Sometimes it returns other than EOPNOTSUPP for invalid precise_ip so it cannot check the error code. Let's move the fallback after the missing feature checks so that it can handle EINVAL as well. This also aligns well with the existing behavior which blindly turns off the precise_ip but we check the missing features correctly now. Fixes: af954f76eea56453 ("perf tools: Check fallback error and order") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Closes: https://lore.kernel.org/oe-lkp/202411301431.799e5531-lkp@intel.com Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/Z1DV0lN8qHSysX7f@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>