path: root/include/linux
Age    Commit message    Author
2024-12-21  Merge tag 'mm-hotfixes-stable-2024-12-21-12-09' of …  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM. The usual bunch of singletons and doubletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) mm: huge_memory: handle strsep not finding delimiter alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG alloc_tag: fix module allocation tags populated area calculation mm/codetag: clear tags before swap mm/vmstat: fix a W=1 clang compiler warning mm: convert partially_mapped set/clear operations to be atomic nilfs2: fix buffer head leaks in calls to truncate_inode_pages() vmalloc: fix accounting with i915 mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy() fork: avoid inappropriate uprobe access to invalid mm nilfs2: prevent use of deleted inode zram: fix uninitialized ZRAM not releasing backing device zram: refuse to use zero sized block device as backing device mm: use clear_user_(high)page() for arch with special user folio handling mm: introduce cpu_icache_is_aliasing() across all architectures mm: add RCU annotation to pte_offset_map(_lock) mm: correctly reference merged VMA mm: use aligned address in copy_user_gigantic_page() mm: use aligned address in clear_gigantic_page() mm: shmem: fix ShmemHugePages at swapout ...
2024-12-21  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Linus Torvalds)
Pull BPF fixes from Daniel Borkmann: - Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP systems (Andrea Righi) - Fix BPF USDT selftests helper code to use asm constraint "m" for LoongArch (Tiezhu Yang) - Fix BPF selftest compilation error in get_uprobe_offset when PROCMAP_QUERY is not defined (Jerome Marchand) - Fix BPF bpf_skb_change_tail helper when used in context of BPF sockmap to handle negative skb header offsets (Cong Wang) - Several fixes to BPF sockmap code, among others, in the area of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test bpf_skb_change_tail() in TC ingress selftests/bpf: Introduce socket_helpers.h for TC tests selftests/bpf: Add a BPF selftest for bpf_skb_change_tail() bpf: Check negative offsets in __bpf_skb_min_len() tcp_bpf: Fix copied value in tcp_bpf_sendmsg skmsg: Return copied bytes in sk_msg_memcopy_from_iter tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress() selftests/bpf: Fix compilation error in get_uprobe_offset() selftests/bpf: Use asm constraint "m" for LoongArch bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP
2024-12-20  Merge tag 'io_uring-6.13-20241220' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull io_uring fixes from Jens Axboe: - Fix for a file ref leak for registered ring fds - Turn the ->timeout_lock into a raw spinlock, as it nests under the io-wq lock which is a raw spinlock as it's called from the scheduler side - Limit ring resizing to DEFER_TASKRUN for now. We will broaden this in the future, but for now, ensure that it's only feasible on rings with a single user - Add sanity check for io-wq enqueuing * tag 'io_uring-6.13-20241220' of git://git.kernel.dk/linux: io_uring: check if iowq is killed before queuing io_uring/register: limit ring resizing to DEFER_TASKRUN io_uring: Fix registered ring file refcount leak io_uring: make ctx->timeout_lock a raw spinlock
2024-12-20  tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection  (Zijian Zhang)
When we do sk_psock_verdict_apply->sk_psock_skb_ingress, an sk_msg will be created out of the skb, and the rmem accounting of the sk_msg will be handled by the skb. For skmsgs in the __SK_REDIRECT case of tcp_bpf_send_verdict, when redirecting to the ingress of a socket, although we sk_rmem_schedule and add the sk_msg to the ingress_msg of sk_redir, we do not update sk_rmem_alloc. As a result, except for the global memory limit, the rmem of sk_redir is nearly unlimited. Thus, add sk_rmem_alloc related logic to limit the recv buffer. Since the functions sk_msg_recvmsg and __sk_psock_purge_ingress_msg are used in both paths, we use "msg->skb" to test whether the sk_msg is skb-backed; if it is not, we do the memory accounting explicitly. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20241210012039.1669389-3-zijianzhang@bytedance.com
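As a rough illustration of the accounting this changelog describes (the helper names below are hypothetical, not taken from the patch), the idea is that an sk_msg which is not skb-backed has to charge and uncharge sk_rmem_alloc by hand:

    /* Hypothetical helpers sketching the explicit accounting: only sk_msgs
     * without a backing skb (msg->skb == NULL) touch sk_rmem_alloc here. */
    static void sk_msg_charge_rmem(struct sock *sk, struct sk_msg *msg)
    {
            if (!msg->skb)
                    atomic_add(msg->sg.size, &sk->sk_rmem_alloc);
    }

    static void sk_msg_uncharge_rmem(struct sock *sk, struct sk_msg *msg)
    {
            if (!msg->skb)
                    atomic_sub(msg->sg.size, &sk->sk_rmem_alloc);
    }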
2024-12-18  alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG  (Suren Baghdasaryan)
It was recently noticed that set_codetag_empty() might be used not only to mark NULL alloctag references as empty to avoid warnings but also to reset valid tags (in clear_page_tag_ref()). Since set_codetag_empty() is defined as NOOP for CONFIG_MEM_ALLOC_PROFILING_DEBUG=n, such use of set_codetag_empty() leads to subtle bugs. Fix set_codetag_empty() for CONFIG_MEM_ALLOC_PROFILING_DEBUG=n to reset the tag reference. Link: https://lkml.kernel.org/r/20241130001423.1114965-2-surenb@google.com Fixes: a8fc28dad6d5 ("alloc_tag: introduce clear_page_tag_ref() helper function") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/lkml/20241124074318.399027-1-00107082@163.com/ Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18  mm/codetag: clear tags before swap  (David Wang)
When CONFIG_MEM_ALLOC_PROFILING_DEBUG is set, a kernel WARN is triggered when calling __alloc_tag_ref_set() during swap: alloc_tag was not cleared (got tag for mm/filemap.c:1951) WARNING: CPU: 0 PID: 816 at ./include/linux/alloc_tag.h... Clearing code tags before swap fixes the warning. This patch also fixes a potential invalid address dereference in alloc_tag_add_check() when CONFIG_MEM_ALLOC_PROFILING_DEBUG is set and ref->ct is CODETAG_EMPTY, which is defined as ((void *)1). Link: https://lkml.kernel.org/r/20241213013332.89910-1-00107082@163.com Fixes: 51f43d5d82ed ("mm/codetag: swap tags when migrate pages") Signed-off-by: David Wang <00107082@163.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202412112227.df61ebb-lkp@intel.com Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18  mm/vmstat: fix a W=1 clang compiler warning  (Bart Van Assche)
Fix the following clang compiler warning that is reported if the kernel is built with W=1: ./include/linux/vmstat.h:518:36: error: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Werror,-Wenum-enum-conversion] 518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_" | ~~~~~~~~~~~ ^ ~~~ Link: https://lkml.kernel.org/r/20241212213126.1269116-1-bvanassche@acm.org Fixes: 9d7ea9a297e6 ("mm/vmstat: add helpers to get vmstat item names for each enum type") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
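A minimal sketch of the kind of change that silences -Wenum-enum-conversion here (the exact form of the upstream fix is assumed, not quoted): make the conversion explicit so no implicit arithmetic between the two enumeration types remains.

    static inline const char *lru_list_name(enum lru_list lru)
    {
            /* Explicit conversion: the addition no longer mixes
             * enum node_stat_item with enum lru_list. */
            return node_stat_name(NR_LRU_BASE + (enum node_stat_item)lru) + 3; /* skip "nr_" */
    }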
2024-12-18  mm: convert partially_mapped set/clear operations to be atomic  (Usama Arif)
Other page flags in the 2nd page, like PG_hwpoison and PG_anon_exclusive can get modified concurrently. Changes to other page flags might be lost if they are happening at the same time as non-atomic partially_mapped operations. Hence, make partially_mapped operations atomic. Link: https://lkml.kernel.org/r/20241212183351.1345389-1-usamaarif642@gmail.com Fixes: 8422acdc97ed ("mm: introduce a pageflag for partially mapped folios") Reported-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/e53b04ad-1827-43a2-a1ab-864c7efecf6e@redhat.com/ Signed-off-by: Usama Arif <usamaarif642@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Barry Song <baohua@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
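The distinction the changelog relies on, shown in a self-contained illustrative form (this is not the upstream diff): the double-underscore bit helpers are plain read-modify-write sequences, while their atomic counterparts tolerate concurrent updates to neighbouring bits in the same word.

    static void flag_update_example(unsigned long *flags)
    {
            __set_bit(0, flags);  /* non-atomic RMW: a concurrent change to
                                   * another bit in the same word can be lost */
            set_bit(1, flags);    /* atomic RMW: safe alongside concurrent
                                   * modifications of other bits */
    }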
2024-12-18  mm: use clear_user_(high)page() for arch with special user folio handling  (Zi Yan)
Some architectures have special handling after clearing user folios: architectures which set cpu_dcache_is_aliasing() to true require flushing the dcache; arc, which sets cpu_icache_is_aliasing() to true, changes folio->flags to make the icache coherent with the dcache. So __GFP_ZERO using only clear_page() is not enough to zero user folios and clear_user_(high)page() must be used. Otherwise, user data will be corrupted. Fix it by always clearing user folios with clear_user_(high)page() when cpu_dcache_is_aliasing() is true or cpu_icache_is_aliasing() is true. Rename alloc_zeroed() to user_alloc_needs_zeroing() and invert the logic to clarify its intent. Link: https://lkml.kernel.org/r/20241209182326.2955963-2-ziy@nvidia.com Fixes: 5708d96da20b ("mm: avoid zeroing user movable page twice with init_on_alloc=1") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Closes: https://lore.kernel.org/linux-mm/CAMuHMdV1hRp_NtR5YnJo=HsfgKQeH91J537Gh4gKk3PFZhSkbA@mail.gmail.com/ Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Potapenko <glider@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18  mm: introduce cpu_icache_is_aliasing() across all architectures  (Zi Yan)
In commit eacd0e950dc2 ("ARC: [mm] Lazy D-cache flush (non aliasing VIPT)"), arc adds the need to flush dcache to make icache see the code page change. This also requires special handling for clear_user_(high)page(). Introduce cpu_icache_is_aliasing() to make MM code query special clear_user_(high)page() easier. This will be used by the following commit. Link: https://lkml.kernel.org/r/20241209182326.2955963-1-ziy@nvidia.com Fixes: 5708d96da20b ("mm: avoid zeroing user movable page twice with init_on_alloc=1") Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Potapenko <glider@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
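A sketch of the generic fallback this implies (its exact location and default are assumptions, not quoted from the patch): architectures that do not provide cpu_icache_is_aliasing() inherit the dcache answer, and arc overrides it to return true.

    /* Assumed generic default, with arc expected to override it. */
    #ifndef cpu_icache_is_aliasing
    #define cpu_icache_is_aliasing()        cpu_dcache_is_aliasing()
    #endif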
2024-12-18  mm: add RCU annotation to pte_offset_map(_lock)  (Petr Malat)
RCU lock is taken by ___pte_offset_map() unless it returns NULL. Add this information to its inline callers to avoid sparse warning about context imbalance in pte_unmap(). Link: https://lkml.kernel.org/r/20241210000604.700710-1-oss@malat.biz Signed-off-by: Petr Malat <oss@malat.biz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
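Illustrative only (hypothetical my_* names, unconditional acquisition for brevity — the real helper takes RCU only when it returns non-NULL): sparse context annotations of this kind are what tell the checker that the map helper enters an RCU read-side section which the matching unmap leaves, so pte_unmap() no longer looks imbalanced.

    static inline pte_t *my_pte_offset_map(pmd_t *pmd, unsigned long addr)
            __acquires(RCU)
    {
            rcu_read_lock();                /* sparse now knows RCU is held */
            return my_pte_lookup(pmd, addr); /* placeholder for the real lookup */
    }

    static inline void my_pte_unmap(pte_t *pte)
            __releases(RCU)
    {
            rcu_read_unlock();
    }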
2024-12-18  io_uring: Fix registered ring file refcount leak  (Jann Horn)
Currently, io_uring_unreg_ringfd() (which cleans up registered rings) is only called on exit, but __io_uring_free (which frees the tctx in which the registered ring pointers are stored) is also called on execve (via begin_new_exec -> io_uring_task_cancel -> __io_uring_cancel -> io_uring_cancel_generic -> __io_uring_free). This means: A process going through execve while having registered rings will leak references to the rings' `struct file`. Fix it by zapping registered rings on execve(). This is implemented by moving the io_uring_unreg_ringfd() from io_uring_files_cancel() into its callee __io_uring_cancel(), which is called from io_uring_task_cancel() on execve. This could probably be exploited *on 32-bit kernels* by leaking 2^32 references to the same ring, because the file refcount is stored in a pointer-sized field and get_file() doesn't have protection against refcount overflow, just a WARN_ONCE(); but on 64-bit it should have no impact beyond a memory leak. Cc: stable@vger.kernel.org Fixes: e7a6c00dc77a ("io_uring: add support for registering ring file descriptors") Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20241218-uring-reg-ring-cleanup-v1-1-8f63e999045b@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
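A sketch of the relocation described above (surrounding code omitted; shape assumed from the changelog): doing the unregistration in __io_uring_cancel() means it now runs on both the exit() and execve() cancellation paths.

    void __io_uring_cancel(bool cancel_all)
    {
            io_uring_unreg_ringfd();        /* previously reached only on exit() */
            io_uring_cancel_generic(cancel_all, NULL);
    }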
2024-12-18  Merge tag 'trace-v6.13-rc3' of …  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "Replace trace_check_vprintf() with test_event_printk() and ignore_event() The function test_event_printk() checks on boot up if the trace event printf() formats dereference any pointers, and if they do, it then looks at the arguments to make sure that the pointers they dereference will exist in the event on the ring buffer. If they do not, it issues a WARN_ON() as it is a likely bug. But this isn't the case for the strings that can be dereferenced with "%s", as some trace events (notably RCU and some IPI events) save a pointer to a static string in the ring buffer. As the string it points to lives as long as the kernel is running, it is not a bug to reference it, as it is guaranteed to be there when the event is read. But it is also possible (and a common bug) to point to some allocated string that could be freed before the trace event is read and the dereference is to bad memory. This case requires a run time check. The previous way to handle this was with trace_check_vprintf() that would process the printf format piece by piece and send what it didn't care about to vsnprintf() to handle arguments that were not strings. This kept it from having to reimplement vsnprintf(). But it relied on va_list implementation and for architectures that copied the va_list and did not pass it by reference, it wasn't even possible to do this check and it would be skipped. As 64bit x86 passed va_list by reference, most events were tested and this kept out bugs where strings would have been dereferenced after being freed. Instead of relying on the implementation of va_list, extend the boot up test_event_printk() function to validate all the "%s" strings that can be validated at boot, and for the few events that point to strings outside the ring buffer, flag both the event and the field that is dereferenced as "needs_test". Then before the event is printed, a call to ignore_event() is made, and if the event has the flag set, it iterates all its fields and for every field that is to be tested, it will read the pointer directly from the event in the ring buffer and make sure that it is valid. If the pointer is not valid, it will print a WARN_ON(), print out to the trace that the event has unsafe memory and ignore the print format. With this new update, the trace_check_vprintf() can be safely removed and now all events can be verified regardless of architecture" * tag 'trace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Check "%s" dereference via the field and not the TP_printk format tracing: Add "%s" check in test_event_printk() tracing: Add missing helper functions in event pointer dereference check tracing: Fix test_event_printk() to process entire print argument
2024-12-18  Merge tag 'hyperv-fixes-signed-20241217' of …  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Various fixes to Hyper-V tools in the kernel tree (Dexuan Cui, Olaf Hering, Vitaly Kuznetsov) - Fix a bug in the Hyper-V TSC page based sched_clock() (Naman Jain) - Two bug fixes in the Hyper-V utility functions (Michael Kelley) - Convert open-coded timeouts to secs_to_jiffies() in Hyper-V drivers (Easwar Hariharan) * tag 'hyperv-fixes-signed-20241217' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: tools/hv: reduce resource usage in hv_kvp_daemon tools/hv: add a .gitignore file tools/hv: reduce resouce usage in hv_get_dns_info helper hv/hv_kvp_daemon: Pass NIC name to hv_get_dns_info as well Drivers: hv: util: Avoid accessing a ringbuffer not initialized yet Drivers: hv: util: Don't force error code to ENODEV in util_probe() tools/hv: terminate fcopy daemon if read from uio fails drivers: hv: Convert open-coded timeouts to secs_to_jiffies() tools: hv: change permissions of NetworkManager configuration file x86/hyperv: Fix hv tsc page based sched_clock for hibernation tools: hv: Fix a complier warning in the fcopy uio daemon
2024-12-18  x86/static-call: fix 32-bit build  (Juergen Gross)
In 32-bit x86 builds CONFIG_STATIC_CALL_INLINE isn't set, leading to static_call_initialized not being available. Define it as "0" in that case. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
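The assumed shape of the fallback (not a quote of the patch): when the inline static-call machinery is not configured, provide a constant zero so code testing static_call_initialized still builds on 32-bit x86.

    #ifdef CONFIG_HAVE_STATIC_CALL_INLINE
    extern bool static_call_initialized;
    #else
    #define static_call_initialized 0
    #endif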
2024-12-17  Merge tag 'hardening-v6.13-rc4' of …  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fix from Kees Cook: "Silence a GCC value-range warning that is being ironically triggered by bounds checking" * tag 'hardening-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: fortify: Hide run-time copy size from value range tracking
2024-12-17  tracing: Check "%s" dereference via the field and not the TP_printk format  (Steven Rostedt)
The TP_printk() portion of a trace event is executed at the time a event is read from the trace. This can happen seconds, minutes, hours, days, months, years possibly later since the event was recorded. If the print format contains a dereference to a string via "%s", and that string was allocated, there's a chance that string could be freed before it is read by the trace file. To protect against such bugs, there are two functions that verify the event. The first one is test_event_printk(), which is called when the event is created. It reads the TP_printk() format as well as its arguments to make sure nothing may be dereferencing a pointer that was not copied into the ring buffer along with the event. If it is, it will trigger a WARN_ON(). For strings that use "%s", it is not so easy. The string may not reside in the ring buffer but may still be valid. Strings that are static and part of the kernel proper which will not be freed for the life of the running system, are safe to dereference. But to know if it is a pointer to a static string or to something on the heap can not be determined until the event is triggered. This brings us to the second function that tests for the bad dereferencing of strings, trace_check_vprintf(). It would walk through the printf format looking for "%s", and when it finds it, it would validate that the pointer is safe to read. If not, it would produces a WARN_ON() as well and write into the ring buffer "[UNSAFE-MEMORY]". The problem with this is how it used va_list to have vsnprintf() handle all the cases that it didn't need to check. Instead of re-implementing vsnprintf(), it would make a copy of the format up to the %s part, and call vsnprintf() with the current va_list ap variable, where the ap would then be ready to point at the string in question. For architectures that passed va_list by reference this was possible. For architectures that passed it by copy it was not. A test_can_verify() function was used to differentiate between the two, and if it wasn't possible, it would disable it. Even for architectures where this was feasible, it was a stretch to rely on such a method that is undocumented, and could cause issues later on with new optimizations of the compiler. Instead, the first function test_event_printk() was updated to look at "%s" as well. If the "%s" argument is a pointer outside the event in the ring buffer, it would find the field type of the event that is the problem and mark the structure with a new flag called "needs_test". The event itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that this event has a field that needs to be verified before the event can be printed using the printf format. When the event fields are created from the field type structure, the fields would copy the field type's "needs_test" value. Finally, before being printed, a new function ignore_event() is called which will check if the event has the TEST_STR flag set (if not, it returns false). If the flag is set, it then iterates through the events fields looking for the ones that have the "needs_test" flag set. Then it uses the offset field from the field structure to find the pointer in the ring buffer event. It runs the tests to make sure that pointer is safe to print and if not, it triggers the WARN_ON() and also adds to the trace output that the event in question has an unsafe memory access. The ignore_event() makes the trace_check_vprintf() obsolete so it is removed. 
Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9Y7hQ@mail.gmail.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241217024720.848621576@goodmis.org Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-17  Merge tag 'xsa465+xsa466-6.13-tag' of …  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Fix xen netfront crash (XSA-465) and avoid using the hypercall page that doesn't do speculation mitigations (XSA-466)" * tag 'xsa465+xsa466-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: remove hypercall page x86/xen: use new hypercall functions instead of hypercall page x86/xen: add central hypercall functions x86/xen: don't do PV iret hypercall through hypercall page x86/static-call: provide a way to do very early static-call updates objtool/x86: allow syscall instruction x86: make get_cpu_vendor() accessible from Xen code xen/netfront: fix crash when removing device
2024-12-17  io_uring: make ctx->timeout_lock a raw spinlock  (Jens Axboe)
Chase reports that their tester complaints about a locking context mismatch: ============================= [ BUG: Invalid wait context ] 6.13.0-rc1-gf137f14b7ccb-dirty #9 Not tainted ----------------------------- syz.1.25198/182604 is trying to lock: ffff88805e66a358 (&ctx->timeout_lock){-.-.}-{3:3}, at: spin_lock_irq include/linux/spinlock.h:376 [inline] ffff88805e66a358 (&ctx->timeout_lock){-.-.}-{3:3}, at: io_match_task_safe io_uring/io_uring.c:218 [inline] ffff88805e66a358 (&ctx->timeout_lock){-.-.}-{3:3}, at: io_match_task_safe+0x187/0x250 io_uring/io_uring.c:204 other info that might help us debug this: context-{5:5} 1 lock held by syz.1.25198/182604: #0: ffff88802b7d48c0 (&acct->lock){+.+.}-{2:2}, at: io_acct_cancel_pending_work+0x2d/0x6b0 io_uring/io-wq.c:1049 stack backtrace: CPU: 0 UID: 0 PID: 182604 Comm: syz.1.25198 Not tainted 6.13.0-rc1-gf137f14b7ccb-dirty #9 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x82/0xd0 lib/dump_stack.c:120 print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline] check_wait_context kernel/locking/lockdep.c:4898 [inline] __lock_acquire+0x883/0x3c80 kernel/locking/lockdep.c:5176 lock_acquire.part.0+0x11b/0x370 kernel/locking/lockdep.c:5849 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:119 [inline] _raw_spin_lock_irq+0x36/0x50 kernel/locking/spinlock.c:170 spin_lock_irq include/linux/spinlock.h:376 [inline] io_match_task_safe io_uring/io_uring.c:218 [inline] io_match_task_safe+0x187/0x250 io_uring/io_uring.c:204 io_acct_cancel_pending_work+0xb8/0x6b0 io_uring/io-wq.c:1052 io_wq_cancel_pending_work io_uring/io-wq.c:1074 [inline] io_wq_cancel_cb+0xb0/0x390 io_uring/io-wq.c:1112 io_uring_try_cancel_requests+0x15e/0xd70 io_uring/io_uring.c:3062 io_uring_cancel_generic+0x6ec/0x8c0 io_uring/io_uring.c:3140 io_uring_files_cancel include/linux/io_uring.h:20 [inline] do_exit+0x494/0x27a0 kernel/exit.c:894 do_group_exit+0xb3/0x250 kernel/exit.c:1087 get_signal+0x1d77/0x1ef0 kernel/signal.c:3017 arch_do_signal_or_restart+0x79/0x5b0 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x150/0x2a0 kernel/entry/common.c:218 do_syscall_64+0xd8/0x250 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x77/0x7f which is because io_uring has ctx->timeout_lock nesting inside the io-wq acct lock, the latter of which is used from inside the scheduler and hence is a raw spinlock, while the former is a "normal" spinlock and can hence be sleeping on PREEMPT_RT. Change ctx->timeout_lock to be a raw spinlock to solve this nesting dependency on PREEMPT_RT=y. Reported-by: chase xd <sl1589472800@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
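A simplified sketch of the type change (struct and call site condensed into a hypothetical fragment): a raw spinlock may sit under another raw spinlock and never sleeps on PREEMPT_RT.

    struct io_ring_ctx_fragment {                   /* hypothetical excerpt */
            raw_spinlock_t  timeout_lock;           /* was: spinlock_t */
    };

    static void walk_timeouts(struct io_ring_ctx_fragment *ctx)
    {
            raw_spin_lock_irq(&ctx->timeout_lock);  /* was: spin_lock_irq() */
            /* ... scan and expire timeouts ... */
            raw_spin_unlock_irq(&ctx->timeout_lock);
    }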
2024-12-16  fortify: Hide run-time copy size from value range tracking  (Kees Cook)
GCC performs value range tracking for variables as a way to provide better diagnostics. One place this is regularly seen is with warnings associated with bounds-checking, e.g. -Wstringop-overflow, -Wstringop-overread, -Warray-bounds, etc. In order to keep the signal-to-noise ratio high, warnings aren't emitted when a value range spans the entire value range representable by a given variable. For example: unsigned int len; char dst[8]; ... memcpy(dst, src, len); If len's value is unknown, it has the full "unsigned int" range of [0, UINT_MAX], and GCC's compile-time bounds checks against memcpy() will be ignored. However, when a code path has been able to narrow the range: if (len > 16) return; memcpy(dst, src, len); Then the range will be updated for the execution path. Above, len is now [0, 16] when reading memcpy(), so depending on other optimizations, we might see a -Wstringop-overflow warning like: error: '__builtin_memcpy' writing between 9 and 16 bytes into region of size 8 [-Werror=stringop-overflow] When building with CONFIG_FORTIFY_SOURCE, the fortified run-time bounds checking can appear to narrow value ranges of lengths for memcpy(), depending on how the compiler constructs the execution paths during optimization passes, due to the checks against the field sizes. For example: if (p_size_field != SIZE_MAX && p_size != p_size_field && p_size_field < size) As intentionally designed, these checks only affect the kernel warnings emitted at run-time and do not block the potentially overflowing memcpy(), so GCC thinks it needs to produce a warning about the resulting value range that might be reaching the memcpy(). We have seen this manifest a few times now, with the most recent being with cpumasks: In function ‘bitmap_copy’, inlined from ‘cpumask_copy’ at ./include/linux/cpumask.h:839:2, inlined from ‘__padata_set_cpumasks’ at kernel/padata.c:730:2: ./include/linux/fortify-string.h:114:33: error: ‘__builtin_memcpy’ reading between 257 and 536870904 bytes from a region of size 256 [-Werror=stringop-overread] 114 | #define __underlying_memcpy __builtin_memcpy | ^ ./include/linux/fortify-string.h:633:9: note: in expansion of macro ‘__underlying_memcpy’ 633 | __underlying_##op(p, q, __fortify_size); \ | ^~~~~~~~~~~~~ ./include/linux/fortify-string.h:678:26: note: in expansion of macro ‘__fortify_memcpy_chk’ 678 | #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s, \ | ^~~~~~~~~~~~~~~~~~~~ ./include/linux/bitmap.h:259:17: note: in expansion of macro ‘memcpy’ 259 | memcpy(dst, src, len); | ^~~~~~ kernel/padata.c: In function ‘__padata_set_cpumasks’: kernel/padata.c:713:48: note: source object ‘pcpumask’ of size [0, 256] 713 | cpumask_var_t pcpumask, | ~~~~~~~~~~~~~~^~~~~~~~ This warning is _not_ emitted when CONFIG_FORTIFY_SOURCE is disabled, and with the recent -fdiagnostics-details we can confirm the origin of the warning is due to FORTIFY's bounds checking: ../include/linux/bitmap.h:259:17: note: in expansion of macro 'memcpy' 259 | memcpy(dst, src, len); | ^~~~~~ '__padata_set_cpumasks': events 1-2 ../include/linux/fortify-string.h:613:36: 612 | if (p_size_field != SIZE_MAX && | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 613 | p_size != p_size_field && p_size_field < size) | ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~ | | | (1) when the condition is evaluated to false | (2) when the condition is evaluated to true '__padata_set_cpumasks': event 3 114 | #define __underlying_memcpy __builtin_memcpy | ^ | | | (3) out of array bounds here Note that the cpumask warning started appearing since bitmap 
functions were recently marked __always_inline in commit ed8cd2b3bd9f ("bitmap: Switch from inline to __always_inline"), which allowed GCC to gain visibility into the variables as they passed through the FORTIFY implementation. In order to silence these false positives but keep otherwise deterministic compile-time warnings intact, hide the length variable from GCC with OPTIMIZER_HIDE_VAR() before calling the builtin memcpy. Additionally add a comment about why all the macro args have copies with const storage. Reported-by: "Thomas Weißschuh" <linux@weissschuh.net> Closes: https://lore.kernel.org/all/db7190c8-d17f-4a0d-bc2f-5903c79f36c2@t-8ch.de/ Reported-by: Nilay Shroff <nilay@linux.ibm.com> Closes: https://lore.kernel.org/all/20241112124127.1666300-1-nilay@linux.ibm.com/ Tested-by: Nilay Shroff <nilay@linux.ibm.com> Acked-by: Yury Norov <yury.norov@gmail.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kees Cook <kees@kernel.org>
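A condensed sketch of the mitigation (this is not the upstream fortify macro, just its shape under the stated assumption): laundering the size through OPTIMIZER_HIDE_VAR() stops GCC's value-range tracking from feeding the run-time check back into a bogus compile-time warning.

    #define fortified_memcpy(p, q, size) ({                         \
            size_t __fortify_size = (size_t)(size);                 \
            /* ... run-time field-size checks happen here ... */    \
            OPTIMIZER_HIDE_VAR(__fortify_size);                     \
            __builtin_memcpy(p, q, __fortify_size);                 \
    })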
2024-12-16  Merge tag 'soc-fixes-6.13' of …  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC fixes from Arnd Bergmann: "Three small fixes for the soc tree: - devicetee fix for the Arm Juno reference machine, to allow more interesting PCI configurations - build fix for SCMI firmware on the NXP i.MX platform - fix for a race condition in Arm FF-A firmware" * tag 'soc-fixes-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: arm64: dts: fvp: Update PCIe bus-range property firmware: arm_ffa: Fix the race around setting ffa_dev->properties firmware: arm_scmi: Fix i.MX build dependency
2024-12-15  Merge tag 'sched_urgent_for_v6.13_rc3-p2' of …  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Prevent incorrect dequeueing of the deadline dlserver helper task and fix its time accounting - Properly track the CFS runqueue runnable stats - Check the total number of all queued tasks in a sched fair's runqueue hierarchy before deciding to stop the tick - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by preventing those from being delayed * tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/dlserver: Fix dlserver time accounting sched/dlserver: Fix dlserver double enqueue sched/eevdf: More PELT vs DELAYED_DEQUEUE sched/fair: Fix sched_can_stop_tick() for fair tasks sched/fair: Fix NEXT_BUDDY
2024-12-14  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Linus Torvalds)
Pull bpf fixes from Daniel Borkmann: - Fix a bug in the BPF verifier to track changes to packet data property for global functions (Eduard Zingerman) - Fix a theoretical BPF prog_array use-after-free in RCU handling of __uprobe_perf_func (Jann Horn) - Fix BPF tracing to have an explicit list of tracepoints and their arguments which need to be annotated as PTR_MAYBE_NULL (Kumar Kartikeya Dwivedi) - Fix a logic bug in the bpf_remove_insns code where a potential error would have been wrongly propagated (Anton Protopopov) - Avoid deadlock scenarios caused by nested kprobe and fentry BPF programs (Priya Bala Govindasamy) - Fix a bug in BPF verifier which was missing a size check for BTF-based context access (Kumar Kartikeya Dwivedi) - Fix a crash found by syzbot through an invalid BPF prog_array access in perf_event_detach_bpf_prog (Jiri Olsa) - Fix several BPF sockmap bugs including a race causing a refcount imbalance upon element replace (Michal Luczaj) - Fix a use-after-free from mismatching BPF program/attachment RCU flavors (Jann Horn) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits) bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs selftests/bpf: Add tests for raw_tp NULL args bpf: Augment raw_tp arguments with PTR_MAYBE_NULL bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL" selftests/bpf: Add test for narrow ctx load for pointer args bpf: Check size for BTF-based ctx access of pointer members selftests/bpf: extend changes_pkt_data with cases w/o subprograms bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs bpf: Fix theoretical prog_array UAF in __uprobe_perf_func() bpf: fix potential error return selftests/bpf: validate that tail call invalidates packet pointers bpf: consider that tail calls invalidate packet pointers selftests/bpf: freplace tests for tracking of changes_packet_data bpf: check changes_pkt_data property for extension programs selftests/bpf: test for changing packet data from global functions bpf: track changes_pkt_data property for global functions bpf: refactor bpf_helper_changes_pkt_data to use helper number bpf: add find_containing_subprog() utility function bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors ...
2024-12-13  bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"  (Kumar Kartikeya Dwivedi)
This patch reverts commit cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The patch was well-intended and meant to be a stop-gap fix for branch prediction when the pointer may actually be NULL at runtime. Eventually, it was supposed to be replaced by an automated script or compiler pass detecting possibly NULL arguments and marking them accordingly. However, it caused two main issues observed for production programs and failed to preserve backwards compatibility. First, programs relied on the verifier not exploring the == NULL branch when the pointer is not NULL, thus they started failing with a 'dereference of scalar' error. Next, allowing raw_tp arguments to be modified surfaced the warning in the verifier that warns against reg->off when PTR_MAYBE_NULL is set. More information, context, and discussion on both problems are available in [0]. Overall, this approach had several shortcomings, and the fixes would further complicate the verifier's logic, and the entire masking scheme would have to be removed eventually anyway. Hence, revert the patch in preparation for a better fix avoiding these issues to replace this commit. [0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com Reported-by: Manu Bretelle <chantra@meta.com> Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13  Merge tag 'block-6.13-20241213' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block fixes from Jens Axboe: - Series from Damien fixing issues with the zoned write plugging - Fix for a potential UAF in block cgroups - Fix deadlock around queue freezing and the sysfs lock - Various little cleanups and fixes * tag 'block-6.13-20241213' of git://git.kernel.dk/linux: block: Fix potential deadlock while freezing queue and acquiring sysfs_lock block: Fix queue_iostats_passthrough_show() blk-mq: Clean up blk_mq_requeue_work() mq-deadline: Remove a local variable blk-iocost: Avoid using clamp() on inuse in __propagate_weights() block: Make bio_iov_bvec_set() accept pointer to const iov_iter block: get wp_offset by bdev_offset_from_zone_start blk-cgroup: Fix UAF in blkcg_unpin_online() MAINTAINERS: update Coly Li's email address block: Prevent potential deadlocks in zone write plug error recovery dm: Fix dm-zoned-reclaim zone write pointer alignment block: Ignore REQ_NOWAIT for zone reset and zone finish operations block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
2024-12-13  Merge tag 'ffa-fix-6.13' of …  (Arnd Bergmann)
https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes Arm FF-A fix for v6.13 A single fix to address a possible race around setting ffa_dev->properties in ffa_device_register() by updating ffa_device_register() to take all the partition information received from the firmware and updating the struct ffa_device accordingly before registering the device to the bus/driver model in the kernel. * tag 'ffa-fix-6.13' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux: firmware: arm_ffa: Fix the race around setting ffa_dev->properties Link: https://lore.kernel.org/r/20241210101113.3232602-1-sudeep.holla@arm.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-12-13  sched/dlserver: Fix dlserver double enqueue  (Vineeth Pillai (Google))
dlserver can get dequeued during a dlserver pick_task due to the delayed dequeue feature and this can lead to issues with dlserver logic as it still thinks that dlserver is on the runqueue. The dlserver throttling and replenish logic gets confused and can lead to double enqueue of dlserver. Double enqueue of dlserver could happen due to a couple of reasons: Case 1 ------ Delayed dequeue feature[1] can cause dlserver being stopped during a pick initiated by dlserver: __pick_next_task pick_task_dl -> server_pick_task pick_task_fair pick_next_entity (if (sched_delayed)) dequeue_entities dl_server_stop server_pick_task goes ahead with update_curr_dl_se without knowing that dlserver is dequeued and this confuses the logic and may lead to unintended enqueue while the server is stopped. Case 2 ------ A race condition between a task dequeue on one cpu and same task's enqueue on this cpu by a remote cpu while the lock is released causing dlserver double enqueue. One cpu would be in the schedule() and releasing RQ-lock: current->state = TASK_INTERRUPTIBLE(); schedule(); deactivate_task() dl_stop_server(); pick_next_task() pick_next_task_fair() sched_balance_newidle() rq_unlock(this_rq) at which point another CPU can take our RQ-lock and do: try_to_wake_up() ttwu_queue() rq_lock() ... activate_task() dl_server_start() --> first enqueue wakeup_preempt() := check_preempt_wakeup_fair() update_curr() update_curr_task() if (current->dl_server) dl_server_update() enqueue_dl_entity() --> second enqueue This bug was not apparent as the enqueue in dl_server_start doesn't usually happen because of the defer logic. But as a side effect of the first case (dequeue during dlserver pick), dl_throttled and dl_yield will be set and this causes the time accounting of dlserver to mess up, leading to an enqueue in dl_server_start. Have an explicit flag representing the status of dlserver to avoid the confusion. This is set in dl_server_start and reset in dlserver_stop. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
2024-12-13  x86/static-call: provide a way to do very early static-call updates  (Juergen Gross)
Add static_call_update_early() for updating static-call targets in very early boot. This will be needed for support of Xen guest type specific hypercall functions. This is part of XSA-466 / CVE-2024-53241. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Co-developed-by: Peter Zijlstra <peterz@infradead.org> Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
2024-12-12  Merge tag 'net-6.13-rc3' of …  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth, netfilter and wireless. Current release - fix to a fix: - rtnetlink: fix error code in rtnl_newlink() - tipc: fix NULL deref in cleanup_bearer() Current release - regressions: - ip: fix warning about invalid return from in ip_route_input_rcu() Current release - new code bugs: - udp: fix L4 hash after reconnect - eth: lan969x: fix cyclic dependency between modules - eth: bnxt_en: fix potential crash when dumping FW log coredump Previous releases - regressions: - wifi: mac80211: - fix a queue stall in certain cases of channel switch - wake the queues in case of failure in resume - splice: do not checksum AF_UNIX sockets - virtio_net: fix BUG()s in BQL support due to incorrect accounting of purged packets during interface stop - eth: - stmmac: fix TSO DMA API mis-usage causing oops - bnxt_en: fixes for HW GRO: GSO type on 5750X chips and oops due to incorrect aggregation ID mask on 5760X chips Previous releases - always broken: - Bluetooth: improve setsockopt() handling of malformed user input - eth: ocelot: fix PTP timestamping in presence of packet loss - ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply not supported" * tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) net: dsa: tag_ocelot_8021q: fix broken reception net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries net: renesas: rswitch: fix initial MPIC register setting Bluetooth: btmtk: avoid UAF in btmtk_process_coredump Bluetooth: iso: Fix circular lock in iso_conn_big_sync Bluetooth: iso: Fix circular lock in iso_listen_bis Bluetooth: SCO: Add support for 16 bits transparent voice setting Bluetooth: iso: Fix recursive locking warning Bluetooth: iso: Always release hdev at the end of iso_listen_bis Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating Bluetooth: hci_core: Fix sleeping function called from invalid context team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL team: Fix initial vlan_feature set in __team_compute_features bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features net, team, bonding: Add netdev_base_features helper net/sched: netem: account for backlog updates from child qdisc net: dsa: felix: fix stuck CPU-injected packets with short taprio windows splice: do not checksum AF_UNIX sockets net: usb: qmi_wwan: add Telit FE910C04 compositions ...
2024-12-12  block: Make bio_iov_bvec_set() accept pointer to const iov_iter  (John Garry)
Make bio_iov_bvec_set() accept a pointer to const iov_iter, which means that we can drop the undesirable casting to struct iov_iter pointer in blk_rq_map_user_bvec(). Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241202115727.2320401-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-12  net, team, bonding: Add netdev_base_features helper  (Daniel Borkmann)
Both bonding and team driver have logic to derive the base feature flags before iterating over their slave devices to refine the set via netdev_increment_features(). Add a small helper netdev_base_features() so this can be reused instead of having it open-coded multiple times. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Ido Schimmel <idosch@idosch.org> Cc: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241210141245.327886-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
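A sketch of the shared helper (the body is assumed from the bonding/team derivation it replaces, not quoted from the patch):

    static inline netdev_features_t netdev_base_features(netdev_features_t features)
    {
            /* Drop the one-for-all flags and keep the all-for-all ones,
             * before the per-slave netdev_increment_features() refinement. */
            features &= ~NETIF_F_ONE_FOR_ALL;
            features |= NETIF_F_ALL_FOR_ALL;
            return features;
    }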
2024-12-10  bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()  (Jann Horn)
Currently, the pointer stored in call->prog_array is loaded in __uprobe_perf_func(), with no RCU annotation and no immediately visible RCU protection, so it looks as if the loaded pointer can immediately be dangling. Later, bpf_prog_run_array_uprobe() starts a RCU-trace read-side critical section, but this is too late. It then uses rcu_dereference_check(), but this use of rcu_dereference_check() does not actually dereference anything. Fix it by aligning the semantics to bpf_prog_run_array(): Let the caller provide rcu_read_lock_trace() protection and then load call->prog_array with rcu_dereference_check(). This issue seems to be theoretical: I don't know of any way to reach this code without having handle_swbp() further up the stack, which is already holding a rcu_read_lock_trace() lock, so where we take rcu_read_lock_trace() in __uprobe_perf_func()/bpf_prog_run_array_uprobe() doesn't actually have any effect. Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241210-bpf-fix-uprobe-uaf-v4-1-5fc8959b2b74@google.com
2024-12-10  bpf: check changes_pkt_data property for extension programs  (Eduard Zingerman)
When processing calls to global sub-programs, verifier decides whether to invalidate all packet pointers in current state depending on the changes_pkt_data property of the global sub-program. Because of this, an extension program replacing a global sub-program must be compatible with changes_pkt_data property of the sub-program being replaced. This commit: - adds changes_pkt_data flag to struct bpf_prog_aux: - this flag is set in check_cfg() for main sub-program; - in jit_subprogs() for other sub-programs; - modifies bpf_check_attach_btf_id() to check changes_pkt_data flag; - moves call to check_attach_btf_id() after the call to check_cfg(), because it needs changes_pkt_data flag to be set: bpf_check: ... ... - check_attach_btf_id resolve_pseudo_ldimm64 resolve_pseudo_ldimm64 --> bpf_prog_is_offloaded bpf_prog_is_offloaded check_cfg check_cfg + check_attach_btf_id ... ... The following fields are set by check_attach_btf_id(): - env->ops - prog->aux->attach_btf_trace - prog->aux->attach_func_name - prog->aux->attach_func_proto - prog->aux->dst_trampoline - prog->aux->mod - prog->aux->saved_dst_attach_type - prog->aux->saved_dst_prog_type - prog->expected_attach_type Neither of these fields are used by resolve_pseudo_ldimm64() or bpf_prog_offload_verifier_prep() (for netronome and netdevsim drivers), so the reordering is safe. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10  bpf: track changes_pkt_data property for global functions  (Eduard Zingerman)
When processing calls to certain helpers, verifier invalidates all packet pointers in a current state. For example, consider the following program: __attribute__((__noinline__)) long skb_pull_data(struct __sk_buff *sk, __u32 len) { return bpf_skb_pull_data(sk, len); } SEC("tc") int test_invalidate_checks(struct __sk_buff *sk) { int *p = (void *)(long)sk->data; if ((void *)(p + 1) > (void *)(long)sk->data_end) return TCX_DROP; skb_pull_data(sk, 0); *p = 42; return TCX_PASS; } After a call to bpf_skb_pull_data() the pointer 'p' can't be used safely. See function filter.c:bpf_helper_changes_pkt_data() for a list of such helpers. At the moment verifier invalidates packet pointers when processing helper function calls, and does not traverse global sub-programs when processing calls to global sub-programs. This means that calls to helpers done from global sub-programs do not invalidate pointers in the caller state. E.g. the program above is unsafe, but is not rejected by verifier. This commit fixes the omission by computing field bpf_subprog_info->changes_pkt_data for each sub-program before main verification pass. changes_pkt_data should be set if: - subprogram calls helper for which bpf_helper_changes_pkt_data returns true; - subprogram calls a global function, for which bpf_subprog_info->changes_pkt_data should be set. The verifier.c:check_cfg() pass is modified to compute this information. The commit relies on depth first instruction traversal done by check_cfg() and absence of recursive function calls: - check_cfg() would eventually visit every call to subprogram S in a state when S is fully explored; - when S is fully explored: - every direct helper call within S is explored (and thus changes_pkt_data is set if needed); - every call to subprogram S1 called by S was visited with S1 fully explored (and thus S inherits changes_pkt_data from S1). The downside of such approach is that dead code elimination is not taken into account: if a helper call inside global function is dead because of current configuration, verifier would conservatively assume that the call occurs for the purpose of the changes_pkt_data computation. Reported-by: Nick Zavaritsky <mejedi@gmail.com> Closes: https://lore.kernel.org/bpf/0498CA22-5779-4767-9C0C-A9515CEA711F@gmail.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10  bpf: refactor bpf_helper_changes_pkt_data to use helper number  (Eduard Zingerman)
Use BPF helper number instead of function pointer in bpf_helper_changes_pkt_data(). This would simplify usage of this function in verifier.c:check_cfg() (in a follow-up patch), where only helper number is easily available and there is no real need to lookup helper proto. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
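A sketch of the refactored interface (the helper list is abbreviated and assumed): keying on the helper number lets check_cfg() ask the question without having the helper's proto at hand.

    bool bpf_helper_changes_pkt_data(enum bpf_func_id func_id)
    {
            switch (func_id) {
            case BPF_FUNC_skb_pull_data:
            case BPF_FUNC_skb_change_proto:
            case BPF_FUNC_skb_change_tail:
            case BPF_FUNC_skb_adjust_room:
                    /* ... remaining data-modifying helpers elided ... */
                    return true;
            default:
                    return false;
            }
    }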
2024-12-10  block: Prevent potential deadlocks in zone write plug error recovery  (Damien Le Moal)
Zone write plugging for handling writes to zones of a zoned block device always execute a zone report whenever a write BIO to a zone fails. The intent of this is to ensure that the tracking of a zone write pointer is always correct to ensure that the alignment to a zone write pointer of write BIOs can be checked on submission and that we can always correctly emulate zone append operations using regular write BIOs. However, this error recovery scheme introduces a potential deadlock if a device queue freeze is initiated while BIOs are still plugged in a zone write plug and one of these write operation fails. In such case, the disk zone write plug error recovery work is scheduled and executes a report zone. This in turn can result in a request allocation in the underlying driver to issue the report zones command to the device. But with the device queue freeze already started, this allocation will block, preventing the report zone execution and the continuation of the processing of the plugged BIOs. As plugged BIOs hold a queue usage reference, the queue freeze itself will never complete, resulting in a deadlock. Avoid this problem by completely removing from the zone write plugging code the use of report zones operations after a failed write operation, instead relying on the device user to either execute a report zones, reset the zone, finish the zone, or give up writing to the device (which is a fairly common pattern for file systems which degrade to read-only after write failures). This is not an unreasonnable requirement as all well-behaved applications, FSes and device mapper already use report zones to recover from write errors whenever possible by comparing the current position of a zone write pointer with what their assumption about the position is. The changes to remove the automatic error recovery are as follows: - Completely remove the error recovery work and its associated resources (zone write plug list head, disk error list, and disk zone_wplugs_work work struct). This also removes the functions disk_zone_wplug_set_error() and disk_zone_wplug_clear_error(). - Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write plug whenever a write opration targetting the zone of the zone write plug fails. This flag indicates that the zone write pointer offset is not reliable and that it must be updated when the next report zone, reset zone, finish zone or disk revalidation is executed. - Modify blk_zone_write_plug_bio_endio() to set the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed write BIO. - Modify the function disk_zone_wplug_set_wp_offset() to clear this new flag, thus implementing recovery of a correct write pointer offset with the reset (all) zone and finish zone operations. - Modify blkdev_report_zones() to always use the disk_report_zones_cb() callback so that disk_zone_wplug_sync_wp_offset() can be called for any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. This implements recovery of a correct write pointer offset for zone write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within the range of the report zones operation executed by the user. - Modify blk_revalidate_seq_zone() to call disk_zone_wplug_sync_wp_offset() for all sequential write required zones when a zoned block device is revalidated, thus always resolving any inconsistency between the write pointer offset of zone write plugs and the actual write pointer position of sequential zones. 
Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-10  dm: Fix dm-zoned-reclaim zone write pointer alignment  (Damien Le Moal)
The zone reclaim processing of the dm-zoned device mapper uses blkdev_issue_zeroout() to align the write pointer of a zone being used for reclaiming another zone, to write the valid data blocks from the zone being reclaimed at the same position relative to the zone start in the reclaim target zone. The first call to blkdev_issue_zeroout() will try to use hardware offload using a REQ_OP_WRITE_ZEROES operation if the device reports a non-zero max_write_zeroes_sectors queue limit. If this operation fails because of the lack of hardware support, blkdev_issue_zeroout() falls back to using a regular write operation with the zero-page as buffer. Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by the block layer zone write plugging code which will execute a report zones operation to ensure that the write pointer of the target zone of the failed operation has not changed and to "rewind" the zone write pointer offset of the target zone as it was advanced when the write zero operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not cause any issue and blkdev_issue_zeroout() works as expected. However, since the automatic recovery of zone write pointers by the zone write plugging code can potentially cause deadlocks with queue freeze operations, a different recovery must be implemented in preparation for the removal of zone write plugging report zones based recovery. Do this by introducing the new function blk_zone_issue_zeroout(). This function first calls blkdev_issue_zeroout() with the flag BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution which attempt to use the device hardware offload with the REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone operation is issued to restore the zone write pointer offset of the target zone to the correct position and blkdev_issue_zeroout() is called again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones operation performing this recovery is implemented using the helper function disk_zone_sync_wp_offset() which calls the gendisk report_zones file operation with the callback disk_report_zones_cb(). This callback updates the target write pointer offset of the target zone using the new function disk_zone_wplug_sync_wp_offset(). dmz_reclaim_align_wp() is modified to change its call to blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any other change needed as the two functions are functionnally equivalent. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
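A simplified sketch of the recovery wrapper described above (error handling condensed, exact upstream body assumed): try the hardware offload first, and only after resyncing the zone write pointer offset fall back to zeroing with regular writes.

    int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector,
                               sector_t nr_sects, gfp_t gfp_mask)
    {
            int ret;

            ret = blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
                                       BLKDEV_ZERO_NOFALLBACK);
            if (ret != -EOPNOTSUPP)
                    return ret;

            /* The REQ_OP_WRITE_ZEROES offload failed: resync the target
             * zone's write pointer offset, then retry with the fallback. */
            ret = disk_zone_sync_wp_offset(bdev->bd_disk, sector);
            if (ret != 1)
                    return ret < 0 ? ret : -EIO;

            return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, 0);
    }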
2024-12-10virtio_ring: add a func argument 'recycle_done' to virtqueue_reset()Koichiro Den
When virtqueue_reset() has actually recycled all unused buffers, additional work may be required in some cases. Relying solely on its return status is fragile, so introduce a new function argument, 'recycle_done', which is invoked when the recycling has actually occurred. Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-10virtio_ring: add a func argument 'recycle_done' to virtqueue_resize()Koichiro Den
When virtqueue_resize() has actually recycled all unused buffers, additional work may be required in some cases. Relying solely on its return status is fragile, so introduce a new function argument, 'recycle_done', which is invoked when the recycling has actually occurred. Cc: <stable@vger.kernel.org> # v6.11+ Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
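Both this change and the preceding virtqueue_reset() one add the same style of callback. Below is a sketch of how a driver might use it; the prototypes follow the description above and the driver-side names are made up for illustration.

    #include <linux/virtio.h>

    /* Return an unused buffer to the driver's pool (driver-specific). */
    static void my_drv_recycle_buf(struct virtqueue *vq, void *buf)
    {
    }

    /* Runs only when all unused buffers really were recycled. */
    static void my_drv_recycle_done(struct virtqueue *vq)
    {
            /* e.g. release per-queue DMA or page-pool resources exactly once */
    }

    static int my_drv_resize_rx(struct virtqueue *vq, u32 new_len)
    {
            /* 'recycle_done' fires only if the recycle pass actually happened. */
            return virtqueue_resize(vq, new_len, my_drv_recycle_buf,
                                    my_drv_recycle_done);
    }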
2024-12-09Drivers: hv: util: Avoid accessing a ringbuffer not initialized yetMichael Kelley
If the KVP (or VSS) daemon starts before the VMBus channel's ringbuffer is fully initialized, we can hit the panic below:

hv_utils: Registering HyperV Utility Driver
hv_vmbus: registering driver hv_utils
...
BUG: kernel NULL pointer dereference, address: 0000000000000000
CPU: 44 UID: 0 PID: 2552 Comm: hv_kvp_daemon Tainted: G E 6.11.0-rc3+ #1
RIP: 0010:hv_pkt_iter_first+0x12/0xd0
Call Trace:
...
 vmbus_recvpacket
 hv_kvp_onchannelcallback
 vmbus_on_event
 tasklet_action_common
 tasklet_action
 handle_softirqs
 irq_exit_rcu
 sysvec_hyperv_stimer0
 </IRQ>
 <TASK>
 asm_sysvec_hyperv_stimer0
...
 kvp_register_done
 hvt_op_read
 vfs_read
 ksys_read
 __x64_sys_read

This can happen because the KVP/VSS channel callback can be invoked even before the channel is fully opened:

1) As soon as hv_kvp_init() -> hvutil_transport_init() creates /dev/vmbus/hv_kvp, the kvp daemon can open the device file immediately and register itself with the driver by writing a KVP_OP_REGISTER1 message to the file (handled by kvp_on_msg() -> kvp_handle_handshake()) and reading the file for the driver's response, which is handled by hvt_op_read(), which calls hvt->on_read(), i.e. kvp_register_done().

2) The problem with kvp_register_done() is that it can cause the channel callback to be called even before the channel is fully opened, and when the channel callback starts to run, util_probe() -> vmbus_open() may not have initialized the ringbuffer yet, so the callback can hit the NULL pointer dereference panic.

To reproduce the panic consistently, we can add an "ssleep(10)" for KVP in __vmbus_open(), just before the first hv_ringbuffer_init(), then unload and reload the hv_utils driver, and run the daemon manually within the 10 seconds.

Fix the panic by reordering the steps in util_probe() so the char dev entry used by the KVP or VSS daemon is not created until after vmbus_open() has completed. This reordering prevents the race condition from happening.

Reported-by: Dexuan Cui <decui@microsoft.com> Fixes: e0fa3e5e7df6 ("Drivers: hv: utils: fix a race on userspace daemons registration") Cc: stable@vger.kernel.org Signed-off-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20241106154247.2271-3-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20241106154247.2271-3-mhklinux@outlook.com>
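A schematic of the reordering described in the fix. vmbus_open() and hv_kvp_onchannelcallback() are real names taken from the trace above, but kvp_create_chardev() is a hypothetical stand-in for the hvutil_transport_init()-based char device setup, and error handling is trimmed.

    #include <linux/hyperv.h>

    void hv_kvp_onchannelcallback(void *context);   /* from hv_utils */
    int kvp_create_chardev(void);                   /* hypothetical: creates /dev/vmbus/hv_kvp */

    static int util_probe_sketch(struct hv_device *dev)
    {
            int ret;

            /* 1) The ringbuffer is valid only after vmbus_open() returns. */
            ret = vmbus_open(dev->channel, 4 * PAGE_SIZE, 4 * PAGE_SIZE,
                             NULL, 0, hv_kvp_onchannelcallback, dev->channel);
            if (ret)
                    return ret;

            /*
             * 2) Only now expose the char device. A daemon that opens it and
             *    sends KVP_OP_REGISTER1 can no longer trigger the channel
             *    callback against an uninitialized ringbuffer.
             */
            ret = kvp_create_chardev();
            if (ret)
                    vmbus_close(dev->channel);

            return ret;
    }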
2024-12-09Merge tag 'locking_urgent_for_v6.13_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Borislav Petkov:
- Remove if_not_guard() as it is generating incorrect code
- Fix the initialization of the fake lockdep_map for the first locked ww_mutex
* tag 'locking_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
headers/cleanup.h: Remove the if_not_guard() facility
locking/ww_mutex: Fix ww_mutex dummy lockdep map selftest warnings
2024-12-08Merge tag 'timers_urgent_for_v6.13_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Handle the case where clocksources with small counter width can, in conjunction with overly long idle sleeps, falsely trigger the negative motion detection of clocksources
* tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Make negative motion detection more robust
2024-12-08Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"24 hotfixes. 17 are cc:stable. 15 are MM and 9 are non-MM. The usual bunch of singletons - please see the relevant changelogs for details"
* tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
iio: magnetometer: yas530: use signed integer type for clamp limits
sched/numa: fix memory leak due to the overwritten vma->numab_state
mm/damon: fix order of arguments in damos_before_apply tracepoint
lib: stackinit: hide never-taken branch from compiler
mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
scatterlist: fix incorrect func name in kernel-doc
mm: correct typo in MMAP_STATE() macro
mm: respect mmap hint address when aligning for THP
mm: memcg: declare do_memsw_account inline
mm/codetag: swap tags when migrate pages
ocfs2: update seq_file index in ocfs2_dlm_seq_next
stackdepot: fix stack_depot_save_flags() in NMI context
mm: open-code page_folio() in dump_page()
mm: open-code PageTail in folio_flags() and const_folio_flags()
mm: fix vrealloc()'s KASAN poisoning logic
Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()"
selftests/damon: add _damon_sysfs.py to TEST_FILES
selftest: hugetlb_dio: fix test naming
ocfs2: free inode when ocfs2_get_init_inode() fails
nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry()
...
2024-12-07net: mscc: ocelot: be resilient to loss of PTP packets during transmissionVladimir Oltean
The Felix DSA driver presents unique challenges that make the simplistic ocelot PTP TX timestamping procedure unreliable: any transmitted packet may be lost in hardware before it ever leaves our local system. This may happen because there is congestion on the DSA conduit, the switch CPU port or even a user port (Qdiscs like taprio may delay packets indefinitely by design).

The technical problem is that the kernel, i.e. ocelot_port_add_txtstamp_skb(), eventually runs out of timestamp IDs, because it never detects that packets are lost and keeps the IDs of the lost packets on hold indefinitely. The manifestation of the issue, once the entire timestamp ID range becomes busy, looks like this in dmesg:

mscc_felix 0000:00:00.5: port 0 delivering skb without TX timestamp
mscc_felix 0000:00:00.5: port 1 delivering skb without TX timestamp

At the surface level, we need a timeout timer so that the kernel knows a timestamp ID is available again. But there is a deeper problem with the implementation, which is the monotonically increasing ocelot_port->ts_id. In the presence of packet loss, it will be impossible to detect that and reuse one of the holes created in the range of free timestamp IDs. What we actually need is a bitmap of 63 timestamp IDs tracking which ones are available. That is able to use up holes caused by packet loss, but also gives us a unique opportunity to not implement an actual timer_list for the timeout timer (very complicated in terms of locking). We can instead declare a timestamp ID stale on demand (lazily), i.e. only when there is no other timestamp ID available.

There are pros and cons to this approach: the implementation is much simpler than per-packet timers would be, but most of the stale packets would be quasi-leaked - not really leaked, but blocked in driver memory, since this algorithm sees no reason to free them. An improved technique would be to check for stale timestamp IDs every time we allocate a new one. Assuming a constant flux of PTP packets, this avoids stale packets being blocked in memory, but of course, packets lost at the end of the flux are still blocked until the flux resumes (nobody left to kick them out). Since implementing per-packet timers is way too complicated, this should be good enough.
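A simplified model of the allocator described above: a bitmap of OCELOT_MAX_PTP_ID timestamp IDs plus lazy invalidation of in-flight skbs that look lost, performed only when no free ID remains. The structure layout, skb->cb fields and the staleness timeout are illustrative assumptions, not the driver's actual symbols.

    #include <linux/bitmap.h>
    #include <linux/errno.h>
    #include <linux/jiffies.h>
    #include <linux/skbuff.h>

    #define OCELOT_MAX_PTP_ID       63
    #define PTP_TS_STALE_AFTER      (2 * HZ)        /* assumed staleness timeout */

    struct ocelot_ts_cb_sketch {            /* stored in skb->cb, illustrative */
            unsigned long tx_jiffies;
            u8 ts_id;
    };
    #define TS_CB(skb)      ((struct ocelot_ts_cb_sketch *)(skb)->cb)

    struct ocelot_port_ts_sketch {
            DECLARE_BITMAP(ts_id_in_use, OCELOT_MAX_PTP_ID);
            struct sk_buff_head tx_skbs;    /* two-step skbs awaiting a timestamp */
    };

    /* Caller holds the switch-wide ts_id_lock, so unlocked skb list ops are fine. */
    static int ocelot_ts_id_get_sketch(struct ocelot_port_ts_sketch *p)
    {
            struct sk_buff *skb, *tmp;
            int id;

            id = find_first_zero_bit(p->ts_id_in_use, OCELOT_MAX_PTP_ID);
            if (id < OCELOT_MAX_PTP_ID)
                    goto found;

            /* No free ID: lazily drop in-flight skbs that look lost. */
            skb_queue_walk_safe(&p->tx_skbs, skb, tmp) {
                    if (time_before(jiffies,
                                    TS_CB(skb)->tx_jiffies + PTP_TS_STALE_AFTER))
                            continue;
                    __skb_unlink(skb, &p->tx_skbs);
                    clear_bit(TS_CB(skb)->ts_id, p->ts_id_in_use);
                    kfree_skb(skb);
            }

            id = find_first_zero_bit(p->ts_id_in_use, OCELOT_MAX_PTP_ID);
            if (id >= OCELOT_MAX_PTP_ID)
                    return -EBUSY;  /* every ID is genuinely still in flight */

    found:
            set_bit(id, p->ts_id_in_use);
            return id;
    }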
Testing procedure:

Persistently block traffic class 5 and try to run PTP on it:

$ tc qdisc replace dev swp3 parent root taprio num_tc 8 \
	map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
	base-time 0 sched-entry S 0xdf 100000 flags 0x2
[ 126.948141] mscc_felix 0000:00:00.5: port 3 tc 5 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS

$ ptp4l -i swp3 -2 -P -m --socket_priority 5 --fault_reset_interval ASAP --logSyncInterval -3
ptp4l[70.351]: port 1 (swp3): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[70.354]: port 0 (/var/run/ptp4l): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[70.358]: port 0 (/var/run/ptp4lro): INITIALIZING to LISTENING on INIT_COMPLETE
[ 70.394583] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[70.406]: timed out while polling for tx timestamp
ptp4l[70.406]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[70.406]: port 1 (swp3): send peer delay response failed
ptp4l[70.407]: port 1 (swp3): clearing fault immediately
ptp4l[70.952]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 71.394858] mscc_felix 0000:00:00.5: port 3 timestamp id 1
ptp4l[71.400]: timed out while polling for tx timestamp
ptp4l[71.400]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[71.401]: port 1 (swp3): send peer delay response failed
ptp4l[71.401]: port 1 (swp3): clearing fault immediately
[ 72.393616] mscc_felix 0000:00:00.5: port 3 timestamp id 2
ptp4l[72.401]: timed out while polling for tx timestamp
ptp4l[72.402]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[72.402]: port 1 (swp3): send peer delay response failed
ptp4l[72.402]: port 1 (swp3): clearing fault immediately
ptp4l[72.952]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 73.395291] mscc_felix 0000:00:00.5: port 3 timestamp id 3
ptp4l[73.400]: timed out while polling for tx timestamp
ptp4l[73.400]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[73.400]: port 1 (swp3): send peer delay response failed
ptp4l[73.400]: port 1 (swp3): clearing fault immediately
[ 74.394282] mscc_felix 0000:00:00.5: port 3 timestamp id 4
ptp4l[74.400]: timed out while polling for tx timestamp
ptp4l[74.401]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[74.401]: port 1 (swp3): send peer delay response failed
ptp4l[74.401]: port 1 (swp3): clearing fault immediately
ptp4l[74.953]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 75.396830] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 0 which seems lost
[ 75.405760] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[75.410]: timed out while polling for tx timestamp
ptp4l[75.411]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[75.411]: port 1 (swp3): send peer delay response failed
ptp4l[75.411]: port 1 (swp3): clearing fault immediately
(...)
Remove the blocking condition and see that the port recovers:

$ same tc command as above, but use "sched-entry S 0xff" instead
$ same ptp4l command as above
ptp4l[99.489]: port 1 (swp3): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[99.490]: port 0 (/var/run/ptp4l): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[99.492]: port 0 (/var/run/ptp4lro): INITIALIZING to LISTENING on INIT_COMPLETE
[ 100.403768] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 0 which seems lost
[ 100.412545] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 1 which seems lost
[ 100.421283] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 2 which seems lost
[ 100.430015] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 3 which seems lost
[ 100.438744] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 4 which seems lost
[ 100.447470] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 100.505919] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[100.963]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 101.405077] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 101.507953] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 102.405405] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 102.509391] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 103.406003] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 103.510011] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 104.405601] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 104.510624] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[104.965]: selected best master clock d858d7.fffe.00ca6d
ptp4l[104.966]: port 1 (swp3): assuming the grand master role
ptp4l[104.967]: port 1 (swp3): LISTENING to GRAND_MASTER on RS_GRAND_MASTER
[ 105.106201] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.232420] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.359001] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.405500] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.485356] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.511220] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.610938] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.737237] mscc_felix 0000:00:00.5: port 3 timestamp id 0
(...)

Notice that in this new usage pattern, a non-congested port should basically use timestamp ID 0 all the time, progressing to higher numbers only if there are unacknowledged timestamps in flight. Compare this to the old usage, where the timestamp ID used to monotonically increase modulo OCELOT_MAX_PTP_ID.

In terms of implementation, this simplifies the bookkeeping of the ocelot_port :: ts_id and ptp_skbs_in_flight. Since we need to traverse the list of two-step timestampable skbs for each new packet anyway, the information can already be computed and does not need to be stored. Also, ocelot_port->tx_skbs is always accessed under the switch-wide ocelot->ts_id_lock IRQ-unsafe spinlock, so we don't need the skb queue's lock and can use the unlocked primitives safely.

This problem was actually detected using the tc-taprio offload, and is causing trouble in TSN scenarios, which Felix (NXP LS1028A / VSC9959) supports but Ocelot (VSC7514) does not. Thus, I've selected the commit to blame as the one adding initial timestamping support for the Felix switch.
Fixes: c0bcf537667c ("net: dsa: ocelot: add hardware timestamping support for Felix") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20241205145519.1236778-5-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-07Merge tag 'io_uring-6.13-20241207' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe:
"A single fix for a parameter type which affects 32-bit"
* tag 'io_uring-6.13-20241207' of git://git.kernel.dk/linux:
io_uring: Change res2 parameter type in io_uring_cmd_done
2024-12-07headers/cleanup.h: Remove the if_not_guard() facilityIngo Molnar
Linus noticed that the new if_not_guard() definition is fragile:

"This macro generates actively wrong code if it happens to be inside an if-statement or a loop without a block. IOW, code like this:

	for (iterate-over-something)
		if_not_guard(a)
			return -BUSY;

looks like will build fine, but will generate completely incorrect code."

The reason is that the __if_not_guard() macro is multi-statement: most kernel developers expect macros to be simple statements, or at least compound statements, but __if_not_guard() is neither:

	#define __if_not_guard(_name, _id, args...)		\
		BUILD_BUG_ON(!__is_cond_ptr(_name));		\
		CLASS(_name, _id)(args);			\
		if (!__guard_ptr(_name)(&_id))

To add insult to injury, the placement of the BUILD_BUG_ON() line makes the macro appear to compile fine, but it will generate incorrect code as Linus reported, for example when used within iteration or conditional statements that take the first statement of the macro as their loop or conditional body.

[ I'd also like to note that the original submission by David Lechner did not contain the BUILD_BUG_ON() line, so it was safer than what we ended up committing. Mea culpa. ]

It doesn't appear to be possible to turn this macro into a robust single or compound statement usable in single-statement contexts, due to the necessity of defining an auto-scope variable with an open scope and of expanding to a partial 'if' statement with no body.

Instead of trying to work around this fragility, just remove the construct before it gets used.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: David Lechner <dlechner@baylibre.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/Z1LBnX9TpZLR5Dkf@gmail.com
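To make the hazard concrete, here is a standalone toy in plain C (not kernel code) with a fake multi-statement guard macro: it compiles cleanly, but, exactly as described above, only the first expanded statement becomes the loop body, so the "guarded" branch runs once after the loop instead of once per item.

    #include <stdio.h>

    /* Fake stand-in for __if_not_guard(): three statements, no enclosing block. */
    #define FAKE_IF_NOT_GUARD(p)                                        \
            (void)sizeof(p);        /* plays the role of BUILD_BUG_ON() */ \
            int _gd = ((p) != 0);   /* plays the role of CLASS(...)     */ \
            if (!_gd)

    int main(void)
    {
            int items[4] = { 1, 0, 2, 0 };
            int i;

            /* Intended: print a line for every item equal to 0 (i.e. item 1). */
            for (i = 0; i < 3; i++)
                    FAKE_IF_NOT_GUARD(items[i])
                            printf("item %d is not guarded\n", i);

            /*
             * Actual expansion: the loop body is only "(void)sizeof(items[i]);".
             * The declaration and the if run once, after the loop, so this
             * prints "item 3 is not guarded" - silently wrong code, which is
             * why the facility was removed.
             */
            return 0;
    }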
2024-12-05scatterlist: fix incorrect func name in kernel-docRandy Dunlap
Fix a kernel-doc warning by making the kernel-doc function description match the function name: include/linux/scatterlist.h:323: warning: expecting prototype for sg_unmark_bus_address(). Prototype was for sg_dma_unmark_bus_address() instead Link: https://lkml.kernel.org/r/20241130022406.537973-1-rdunlap@infradead.org Fixes: 42399301203e ("lib/scatterlist: add flag for indicating P2PDMA segments in an SGL") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/codetag: swap tags when migrate pagesDavid Wang
The current solution for adjusting codetag references during page migration is done in 3 steps:

1. set the codetag reference of the old page as empty (not pointing to any codetag);
2. subtract the counters of the new page to compensate for its own allocation;
3. set the codetag reference of the new page to point to the codetag of the old page.

This does not work if CONFIG_MEM_ALLOC_PROFILING_DEBUG=n because set_codetag_empty() becomes a NOOP. Instead, let's simply swap the codetag references so that the new page references the old codetag and the old page references the new codetag. This way accounting stays valid and the logic makes more sense.

Link: https://lkml.kernel.org/r/20241129025213.34836-1-00107082@163.com Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/lkml/20241124074318.399027-1-00107082@163.com/ Acked-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
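A minimal sketch of the swap idea using a hypothetical codetag reference type; the real code obtains the references from the pages' extensions and goes through the alloc_tag helpers, which is omitted here.

    /* Illustrative only: exchange the two codetag references. */
    struct alloc_tag;                       /* opaque here */

    struct codetag_ref_sketch {
            struct alloc_tag *ct;
    };

    static void pgalloc_tag_swap_sketch(struct codetag_ref_sketch *new_ref,
                                        struct codetag_ref_sketch *old_ref)
    {
            struct alloc_tag *tmp = old_ref->ct;

            /*
             * After the swap, the new page carries the old page's allocation
             * tag (the original allocation site stays charged), and the old
             * page carries the new page's tag, released when the old page is
             * freed. Accounting stays balanced even when
             * CONFIG_MEM_ALLOC_PROFILING_DEBUG=n turns set_codetag_empty()
             * into a NOOP.
             */
            old_ref->ct = new_ref->ct;
            new_ref->ct = tmp;
    }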
2024-12-05stackdepot: fix stack_depot_save_flags() in NMI contextMarco Elver
Per documentation, stack_depot_save_flags() was meant to be usable from NMI context if STACK_DEPOT_FLAG_CAN_ALLOC is unset. However, it would still try to take the pool_lock in an attempt to save a stack trace in the current pool (if space is available). This could result in deadlock if an NMI is handled while pool_lock is already held. To avoid deadlock, in NMI context only try to take the lock (trylock) and give up if unsuccessful. The documentation is fixed to clearly convey this. Link: https://lkml.kernel.org/r/Z0CcyfbPqmxJ9uJH@elver.google.com Link: https://lkml.kernel.org/r/20241122154051.3914732-1-elver@google.com Fixes: 4434a56ec209 ("stackdepot: make fast paths lock-less again") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
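In schematic form, the locking rule the fix describes; the pool bookkeeping around the lock is elided, so treat this only as an illustration of the in_nmi() trylock pattern.

    #include <linux/preempt.h>      /* in_nmi() */
    #include <linux/spinlock.h>

    static DEFINE_RAW_SPINLOCK(pool_lock_sketch);

    /* Returns false if the trace could not be saved (NMI hit lock contention). */
    static bool depot_save_in_pool_sketch(void)
    {
            unsigned long flags;

            if (in_nmi()) {
                    /* The NMI may have interrupted a holder of pool_lock. */
                    if (!raw_spin_trylock_irqsave(&pool_lock_sketch, flags))
                            return false;
            } else {
                    raw_spin_lock_irqsave(&pool_lock_sketch, flags);
            }

            /* ... store the stack trace in the current pool ... */

            raw_spin_unlock_irqrestore(&pool_lock_sketch, flags);
            return true;
    }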
2024-12-05mm: open-code PageTail in folio_flags() and const_folio_flags()Matthew Wilcox (Oracle)
It is unsafe to call PageTail() in dump_page() as page_is_fake_head() will almost certainly return true when called on a head page that is copied to the stack. That will cause the VM_BUG_ON_PGFLAGS() in const_folio_flags() to trigger when it shouldn't. Fortunately, we don't need to call PageTail() here; it's fine to have a pointer to a virtual alias of the page's flag word rather than the real page's flag word. Link: https://lkml.kernel.org/r/20241125201721.2963278-1-willy@infradead.org Fixes: fae7d834c43c ("mm: add __dump_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
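A sketch of what "open-coding PageTail" means here: test the tail bit of compound_head directly instead of calling PageTail(), which can go through the fake-head logic. The body is abbreviated and the upstream helper may differ in detail.

    #include <linux/mm_types.h>
    #include <linux/mmdebug.h>
    #include <linux/page-flags.h>

    static const unsigned long *const_folio_flags_sketch(const struct folio *folio,
                                                         unsigned int n)
    {
            const struct page *page = &folio->page;

            /*
             * A tail page has bit 0 of compound_head set; testing the bit
             * directly avoids page_is_fake_head(), which misfires on the
             * stack copy used by __dump_folio().
             */
            VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
            VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags), page);
            return &page[n].flags;
    }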