path: root/kernel
Age  Commit message  Author
2023-01-24  module: Don't wait for GOING modules  [Petr Pavlu]
During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded, which causes the init function of the new module to return an error.

Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel waits for modules in MODULE_STATE_GOING state to finish unloading before making another attempt to load the same module. This creates unnecessary work in the described scenario and delays the boot. In the worst case, it can prevent udev from loading drivers for other devices and might cause timeouts of services waiting on them and subsequently a failed boot.

This patch attempts a different solution for the problem 6e6de3dee51a was trying to solve. Rather than waiting for the unloading to complete, it returns a different error code (-EBUSY) for modules in the GOING state. This should avoid the error situation that was described in 6e6de3dee51a (user space attempting to load a dependent module because the -EEXIST error code would suggest to user space that the first module had been loaded successfully), while avoiding the delay situation too.

This has been tested on linux-next since December 2022 and passes all kmod selftests except test 0009 with module compression enabled, but it has been confirmed that this issue has existed and gone unnoticed since prior to this commit and can also be reproduced without module compression with a simple usleep(5000000) in tools/modprobe.c [0]. These failures are caused by hitting the kernel mod_concurrent_max limit and can happen either due to a self-inflicted kernel module auto-load DoS somehow, or on a system with a large CPU count where each CPU incorrectly triggers many module auto-loads. Both of those issues need to be fixed in-kernel.

[0] https://lore.kernel.org/all/Y9A4fiobL6IHp%2F%2FP@bombadil.infradead.org/

Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
Co-developed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: stable@vger.kernel.org
Reviewed-by: Petr Mladek <pmladek@suse.com>
[mcgrof: enhance commit log with testing and kmod test result interpretation]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
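A minimal sketch of the resulting check, assuming the add_unformed_module()/find_module_all() internals of kernel/module/main.c (the exact control flow in the patch may differ):

    /* Sketch, not the verbatim patch: fail fast with -EBUSY when a
     * same-named module is still on its way out, instead of waiting
     * for its unload to complete.
     */
    mutex_lock(&module_mutex);
    old = find_module_all(mod->name, strlen(mod->name), true);
    if (old != NULL) {
            if (old->state == MODULE_STATE_GOING)
                    err = -EBUSY;   /* was: wait for the unload to finish */
            else
                    err = -EEXIST;
    }
    mutex_unlock(&module_mutex);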
2023-01-24  tracing: Kconfig: Fix spelling/grammar/punctuation  [Randy Dunlap]
Fix some editorial nits in trace Kconfig. Link: https://lkml.kernel.org/r/20230124181647.15902-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-24  tracing: Make sure trace_printk() can output as soon as it can be used  [Steven Rostedt (Google)]
Currently trace_printk() can be used as soon as early_trace_init() is called from start_kernel(). But if a crash happens, and "ftrace_dump_on_oops" is set on the kernel command line, all you get will be:

  [ 0.456075] <idle>-0 0dN.2. 347519us : Unknown type 6
  [ 0.456075] <idle>-0 0dN.2. 353141us : Unknown type 6
  [ 0.456075] <idle>-0 0dN.2. 358684us : Unknown type 6

This is because the trace_printk() event (type 6) hasn't been registered yet. That gets done via an early_initcall(), which may be early, but not early enough.

Instead of registering the trace_printk() event (and other ftrace events, which are not trace events) via an early_initcall(), have them registered at the same time that trace_printk() can be used. This way, if there is a crash before early_initcall(), then the trace_printk()s will actually be useful.

Link: https://lkml.kernel.org/r/20230104161412.019f6c55@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: e725c731e3bb1 ("tracing: Split tracing initialization into two for early initialization")
Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-24  ftrace: Export ftrace_free_filter() to modules  [Mark Rutland]
Setting filters on an ftrace ops results in some memory being allocated for the filter hashes, which must be freed before the ops can be freed. This can be done by removing every individual element of the hash by calling ftrace_set_filter_ip() or ftrace_set_filter_ips() with `remove` set, but this is somewhat error prone as it's easy to forget to remove an element. Make it easier to clean this up by exporting ftrace_free_filter(), which can be used to clean up all of the filter hashes after an ftrace_ops has been unregistered. Using this, fix the ftrace-direct* samples to free hashes prior to being unloaded. All other code either removes individual filters explicitly or is built-in and already calls ftrace_free_filter(). Link: https://lkml.kernel.org/r/20230103124912.2948963-3-mark.rutland@arm.com Cc: stable@vger.kernel.org Cc: Florent Revest <revest@chromium.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: e1067a07cfbc ("ftrace/samples: Add module to test multi direct modify interface") Fixes: 5fae941b9a6f ("ftrace/samples: Add multi direct interface test module") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
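A hedged usage sketch for a module that sets filters on its own ftrace_ops (my_ops and my_callback are placeholders, not names from the patch):

    #include <linux/ftrace.h>
    #include <linux/module.h>

    static struct ftrace_ops my_ops = {
            .func = my_callback,    /* placeholder tracing callback */
    };

    static void __exit my_exit(void)
    {
            unregister_ftrace_function(&my_ops);
            /* Free the filter hashes allocated by ftrace_set_filter_ip()
             * and friends; skipping this leaks them on module unload. */
            ftrace_free_filter(&my_ops);
    }
    module_exit(my_exit);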
2023-01-24  Compiler attributes: GCC cold function alignment workarounds  [Mark Rutland]
Contemporary versions of GCC (e.g. GCC 12.2.0) drop the alignment specified by '-falign-functions=N' for functions marked with the __cold__ attribute, and potentially for callees of __cold__ functions as these may be implicitly marked as __cold__ by the compiler. LLVM appears to respect '-falign-functions=N' in such cases.

This has been reported to GCC in bug 88345:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

... which also covers alignment being dropped when '-Os' is used, which will be addressed in a separate patch.

Currently, use of '-falign-functions=N' is limited to CONFIG_FUNCTION_ALIGNMENT, which is largely used for performance and/or analysis reasons (e.g. with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B), but isn't necessary for correct functionality. However, this dropped alignment isn't great for the performance and/or analysis cases.

Subsequent patches will use CONFIG_FUNCTION_ALIGNMENT as part of arm64's ftrace implementation, which will require all instrumented functions to be aligned to at least 8 bytes.

This patch works around the dropped alignment by avoiding the use of the __cold__ attribute when CONFIG_FUNCTION_ALIGNMENT is non-zero, and by specifically aligning abort(), which GCC implicitly marks as __cold__. As the __cold macro is now dependent upon config options (which is against the policy described at the top of compiler_attributes.h), it is moved into compiler_types.h.

I've tested this by building and booting a kernel configured with defconfig + CONFIG_EXPERT=y + CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y, and looking for misaligned text symbols in /proc/kallsyms:

* arm64:

  Before:
    # uname -rm
    6.2.0-rc3 aarch64
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    5009

  After:
    # uname -rm
    6.2.0-rc3-00001-g2a2bedf8bfa9 aarch64
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    919

* x86_64:

  Before:
    # uname -rm
    6.2.0-rc3 x86_64
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    11537

  After:
    # uname -rm
    6.2.0-rc3-00001-g2a2bedf8bfa9 x86_64
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    2805

There's clearly a substantial reduction in the number of misaligned symbols. From manual inspection, the remaining unaligned text labels are a combination of ACPICA functions (due to the use of '-Os'), static call trampolines, and non-function labels in assembly, which will be dealt with in subsequent patches.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230123134603.1064407-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-01-24  ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS  [Mark Rutland]
Architectures without dynamic ftrace trampolines incur an overhead when multiple ftrace_ops are enabled with distinct filters. In these cases, each call site calls a common trampoline which uses ftrace_ops_list_func() to iterate over all enabled ftrace functions, and so incurs an overhead relative to the size of this list (including RCU protection overhead).

Architectures with dynamic ftrace trampolines avoid this overhead for call sites which have a single associated ftrace_ops. In these cases, the dynamic trampoline is customized to branch directly to the relevant ftrace function, avoiding the list overhead.

On some architectures it's impractical and/or undesirable to implement dynamic ftrace trampolines. For example, arm64 has limited branch ranges and cannot always directly branch from a call site to an arbitrary address (e.g. from a kernel text address to an arbitrary module address). Calls from modules to core kernel text can be indirected via PLTs (allocated at module load time) to address this, but the same is not possible for calls from core kernel text.

Using an indirect branch from a call site to an arbitrary trampoline is possible, but requires several more instructions in the function prologue (or immediately before it), and/or comes with far more complex requirements for patching.

Instead, this patch adds a new option, where an architecture can associate each call site with a pointer to an ftrace_ops, placed at a fixed offset from the call site. A shared trampoline can recover this pointer and call ftrace_ops::func() without needing to go via ftrace_ops_list_func(), avoiding the associated overhead.

This avoids issues with branch range limitations, and avoids the need to allocate and manipulate dynamic trampolines, making it far simpler to implement and maintain, while having similar performance characteristics.

Note that this allows for dynamic ftrace_ops to be invoked directly from an architecture's ftrace_caller trampoline, whereas existing code forces the use of ftrace_ops_get_list_func(), which is in part necessary to permit the ftrace_ops to be freed once unregistered *and* to avoid branch/address-generation range limitations on some architectures (e.g. where ops->func is a module address, and may be outside of the direct branch range for callsites within the main kernel image). The CALL_OPS approach avoids these problems and is safe as:

* The existing synchronization in ftrace_shutdown() using synchronize_rcu_tasks_rude() (and synchronize_rcu_tasks()) ensures that no tasks hold a stale reference to an ftrace_ops (e.g. in the middle of the ftrace_caller trampoline, or while invoking ftrace_ops::func), when that ftrace_ops is unregistered.

  Arguably this could also be relied upon for the existing scheme, permitting dynamic ftrace_ops to be invoked directly when ops->func is in range, but this will require additional logic to handle branch range limitations, and is not handled by this patch.

* Each callsite's ftrace_ops pointer literal can hold any valid kernel address, and is updated atomically. As an architecture's ftrace_caller trampoline will atomically load the ops pointer then dereference ops->func, there is no risk of invoking ops->func with a mismatched ops pointer, and updates to the ops pointer do not require special care.

A subsequent patch will implement architecture support for arm64. There should be no functional change as a result of this patch alone.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Florent Revest <revest@chromium.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230123134603.1064407-2-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
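Conceptually, the per-callsite layout looks something like the sketch below (hedged; the real arm64 code generation in the follow-up patches differs in detail):

    /*
     * With DYNAMIC_FTRACE_WITH_CALL_OPS, a pointer-sized literal is
     * reserved at a fixed offset before each patch-site:
     *
     *         .quad  <ftrace_ops pointer>   // updated atomically
     * func:
     *         mov    x9, lr                 // patched NOP <-> save/branch
     *         bl     ftrace_caller
     *
     * ftrace_caller computes the literal's address from its return
     * address, loads the ops pointer, and calls ops->func directly,
     * bypassing ftrace_ops_list_func() for single-ops callsites.
     */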
2023-01-23  rcu: Disable laziness if lazy-tracking says so  [Joel Fernandes (Google)]
During suspend, we see failures to suspend 1 in 300-500 suspends. Looking closer, it appears that asynchronous RCU callbacks are being queued as lazy even though synchronous callbacks are expedited. These delays appear to not be very welcome by the suspend/resume code as evidenced by these occasional suspend failures. This commit modifies call_rcu() to check if rcu_async_should_hurry(), which will return true if we are in suspend or in-kernel boot. [ paulmck: Alphabetize local variables. ] Ignoring the lazy hint makes the 3000 suspend/resume cycles pass reliably on a 12th gen 12-core Intel CPU, and there is some evidence that it also slightly speeds up boot performance. Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
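A minimal sketch of the resulting logic in kernel/rcu/tree.c (assuming the helper names introduced by this series; simplified):

    void call_rcu(struct rcu_head *head, rcu_callback_t func)
    {
            /* Bypass laziness while booting or suspending, where delayed
             * callback invocation is unwelcome. */
            bool lazy = IS_ENABLED(CONFIG_RCU_LAZY) &&
                        !rcu_async_should_hurry();

            __call_rcu_common(head, func, lazy);
    }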
2023-01-23  bpf: Support consuming XDP HW metadata from fext programs  [Toke Høiland-Jørgensen]
Instead of rejecting the attaching of PROG_TYPE_EXT programs to XDP programs that consume HW metadata, implement support for propagating the offload information. The extension program doesn't need to set a flag or ifindex; these will just be propagated from the target by the verifier. We need to create a separate offload object for the extension program, though, since it can be reattached to a different program later (which means we can't just inherit the offload information from the target). An additional check is added on attach that the new target is compatible with the offload information in the extension prog. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-9-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23  bpf: XDP metadata RX kfuncs  [Stanislav Fomichev]
Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible XDP metadata kfuncs. Not all devices have to implement them. If a kfunc is not supported by the target device, the default implementation is called instead. The verifier, at load time, replaces a call to the generic kfunc with a call to the per-device one. Per-device kfunc pointers are stored in a separate struct xdp_metadata_ops. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
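BPF-program-side usage might look like the sketch below (hedged; the kfunc name and signature are as of this series and availability depends on driver support; requires bpf/bpf_helpers.h for SEC and bpf_printk):

    /* Declared as a kfunc; the verifier resolves it per-device. */
    extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
                                             __u64 *timestamp) __ksym;

    SEC("xdp")
    int rx(struct xdp_md *ctx)
    {
            __u64 ts = 0;

            /* Returns an error (the default implementation) when the
             * netdev does not provide the matching xdp_metadata_ops hook. */
            if (!bpf_xdp_metadata_rx_timestamp(ctx, &ts) && ts)
                    bpf_printk("hw rx timestamp: %llu", ts);
            return XDP_PASS;
    }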
2023-01-23  bpf: Introduce device-bound XDP programs  [Stanislav Fomichev]
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way to associate a netdev with a BPF program at load time. netdevsim checks are dropped in favor of generic check in dev_xdp_attach. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23  bpf: Reshuffle some parts of bpf/offload.c  [Stanislav Fomichev]
To avoid adding forward declarations in the main patch, shuffle some code around. No functional changes. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-5-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23  bpf: Move offload initialization into late_initcall  [Stanislav Fomichev]
So we don't have to initialize it manually from several paths. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-4-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23  bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded  [Stanislav Fomichev]
BPF offloading infra will be reused to implement bound-but-not-offloaded bpf programs. Rename existing helpers for clarity. No functional changes. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-22  Merge tag 'sched_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull scheduler fixes from Borislav Petkov:

 - Make sure the scheduler doesn't use stale frequency scaling values when the latter get disabled due to a value error

 - Fix a NULL pointer access on UP configs

 - Use the proper locking when updating CPU capacity

* tag 'sched_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/aperfmperf: Erase stale arch_freq_scale values when disabling frequency invariance readings
  sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
  sched/fair: Fixes for capacity inversion detection
  sched/uclamp: Fix a uninitialized variable warnings
2023-01-22  Merge 6.2-rc5 into driver-core-next  [Greg Kroah-Hartman]
We need the driver core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-21  Merge tag 'driver-core-6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core  [Linus Torvalds]
Pull driver core fixes from Greg KH:
 "Here are three small driver and kernel core fixes for 6.2-rc5. They include:

   - potential gadget fixup in do_prlimit

   - device property refcount leak fix

   - test_async_probe bugfix for reported problem"

* tag 'driver-core-6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  prlimit: do_prlimit needs to have a speculation check
  driver core: Fix test_async_probe_init saves device in wrong array
  device property: fix of node refcount leak in fwnode_graph_get_next_endpoint()
2023-01-21  Merge tag 'kbuild-fixes-v6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild  [Linus Torvalds]
Pull Kbuild fixes from Masahiro Yamada:

 - Hide LDFLAGS_vmlinux from decompressor Makefiles to fix error messages when GNU Make 4.4 is used.

 - Fix 'make modules' build error when CONFIG_DEBUG_INFO_BTF_MODULES=y.

 - Fix warnings emitted by GNU Make 4.4 in scripts/kconfig/Makefile.

 - Support GNU Make 4.4 for scripts/jobserver-exec.

 - Show clearer error message when kernel/gen_kheaders.sh fails due to missing cpio.

* tag 'kbuild-fixes-v6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kheaders: explicitly validate existence of cpio command
  scripts: support GNU make 4.4 in jobserver-exec
  kconfig: Update all declared targets
  scripts: rpm: make clear that mkspec script contains 4.13 feature
  init/Kconfig: fix LOCALVERSION_AUTO help text
  kbuild: fix 'make modules' error when CONFIG_DEBUG_INFO_BTF_MODULES=y
  kbuild: export top-level LDFLAGS_vmlinux only to scripts/Makefile.vmlinux
  init/version-timestamp.c: remove unneeded #include <linux/version.h>
  docs: kbuild: remove mention to dropped $(objtree) feature
2023-01-21  prlimit: do_prlimit needs to have a speculation check  [Greg Kroah-Hartman]
do_prlimit() adds the user-controlled resource value to a pointer that will subsequently be dereferenced. In order to help prevent this codepath from being used as a spectre "gadget" a barrier needs to be added after checking the range. Reported-by: Jordy Zomer <jordyzomer@google.com> Tested-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
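The shape of such a fix (a sketch, consistent with how other syscalls clamp user-controlled indices; not claimed to be the verbatim patch):

    #include <linux/nospec.h>

    /* In do_prlimit(): clamp the index under speculation after the
     * bounds check and before it is used to index signal->rlim. */
    if (resource >= RLIM_NLIMITS)
            return -EINVAL;
    resource = array_index_nospec(resource, RLIM_NLIMITS);
    rlim = tsk->signal->rlim + resource;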
2023-01-20  bpf: Avoid recomputing spi in process_dynptr_func  [Kumar Kartikeya Dwivedi]
Currently, process_dynptr_func first calls dynptr_get_spi and then is_dynptr_reg_valid_init and is_dynptr_reg_valid_uninit have to call it again to obtain the spi value. Instead of doing this twice, reuse the already obtained value (which is by default 0, is only set for PTR_TO_STACK, and is only used in that case in the aforementioned functions). The input value for these two functions will either be -ERANGE or >= 1, and can either be permitted or rejected based on the respective check. Suggested-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230121002241.2113993-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20  bpf: Combine dynptr_get_spi and is_spi_bounds_valid  [Kumar Kartikeya Dwivedi]
Currently, a check on spi resides in dynptr_get_spi, while others checking its validity for being within the allocated stack slots happens in is_spi_bounds_valid. Almost always barring a couple of cases (where being beyond allocated stack slots is not an error as stack slots need to be populated), both are used together to make checks. Hence, subsume the is_spi_bounds_valid check in dynptr_get_spi, and return -ERANGE to specially distinguish the case where spi is valid but not within allocated slots in the stack state. The is_spi_bounds_valid function is still kept around as it is a generic helper that will be useful for other objects on stack similar to dynptr in the future. Suggested-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230121002241.2113993-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20  bpf: Allow reinitializing unreferenced dynptr stack slots  [Kumar Kartikeya Dwivedi]
Consider a program like below:

    void prog(void)
    {
            {
                    struct bpf_dynptr ptr;
                    bpf_dynptr_from_mem(...);
            }
            ...
            {
                    struct bpf_dynptr ptr;
                    bpf_dynptr_from_mem(...);
            }
    }

Here, the C compiler, based on lifetime rules in the C standard, would be well within its rights to share stack storage for dynptr 'ptr' as their lifetimes do not overlap in the two distinct scopes. Currently, such an example would be rejected by the verifier, but this is too strict.

Instead, we should allow reinitializing over dynptr stack slots and forget information about the old dynptr object.

The destroy_if_dynptr_stack_slot function already makes necessary checks to avoid overwriting referenced dynptr slots. This is done to present a better error message instead of forgetting dynptr information on stack and preserving reference state, leading to an inevitable but undecipherable error at the end about an unreleased reference which has to be associated back to its allocating call instruction to make any sense to the user.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20  bpf: Invalidate slices on destruction of dynptrs on stack  [Kumar Kartikeya Dwivedi]
The previous commit implemented destroy_if_dynptr_stack_slot. It destroys the dynptr which the given spi belongs to, but still doesn't invalidate the slices that belong to such a dynptr. While we disallow overwriting referenced dynptrs and return an error early, we still allow overwriting unreferenced dynptrs and destroy them.

To be able to enable precise and scoped invalidation of dynptr slices in this case, we must be able to associate the source dynptr of slices that have been obtained using bpf_dynptr_data. When doing destruction, only slices belonging to the dynptr being destructed should be invalidated, and nothing else. Currently, dynptr slices belonging to different dynptrs are indistinguishable.

Hence, allocate a unique id to each dynptr (CONST_PTR_TO_DYNPTR and those on stack). This will be stored as part of reg->id. Whenever using bpf_dynptr_data, transfer this unique dynptr id to the returned PTR_TO_MEM_OR_NULL slice pointer, and store it in a new per-PTR_TO_MEM dynptr_id register state member.

Finally, after establishing such a relationship between dynptrs and their slices, implement precise invalidation logic that only invalidates slices belonging to the destroyed dynptr in destroy_if_dynptr_stack_slot.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20  bpf: Fix partial dynptr stack slot reads/writes  [Kumar Kartikeya Dwivedi]
Currently, while reads are disallowed for dynptr stack slots, writes are not. Reads don't work from both direct access and helpers, while writes do work in both cases, but have the effect of overwriting the slot_type. While this is fine, handling for a few edge cases is missing.

Firstly, a user can overwrite the stack slots of a dynptr partially. Consider the following layout:

  spi: [d][d][?]
        2  1  0

First slot is at spi 2, second at spi 1. Now, do a write of 1 to 8 bytes for spi 1. This will essentially either write STACK_MISC for all slot_types or STACK_MISC and STACK_ZERO (in case of size < BPF_REG_SIZE partial write of zeroes). The end result is that the slot is scrubbed. Now, the layout is:

  spi: [d][m][?]
        2  1  0

Suppose the user initializes spi = 1 as a dynptr. We get:

  spi: [d][d][d]
        2  1  0

But this time, both spi 2 and spi 1 have first_slot = true. Now, when passing spi 2 to a dynptr helper, it will be considered initialized as the helper does not check whether the second slot has first_slot == false. And spi 1 should already work as normal. This effectively replaces the size + offset of the first dynptr, hence allowing invalid OOB reads and writes.

Make a few changes to protect against this:

When writing to PTR_TO_STACK using BPF insns, when we touch the spi of a STACK_DYNPTR type, mark both first and second slot (regardless of which slot we touch) as STACK_INVALID. Reads are already prevented.

Second, prevent writing to stack memory from helpers if the range may contain any STACK_DYNPTR slots. Reads are already prevented. For helpers, we cannot allow them to destroy dynptrs from the writes as, depending on arguments, a helper may take uninit_mem and dynptr both at the same time. This would mean that the helper may write to uninit_mem before it reads the dynptr, which would be bad.

  PTR_TO_MEM: [?????dd]

Depending on the code inside the helper, it may end up overwriting the dynptr contents first and then read those as the dynptr argument. The verifier would only simulate destruction when it does byte-by-byte access simulation in check_helper_call for meta.access_size, and fail to catch this case, as it happens after argument checks. The same would need to be done for any other non-trivial objects created on the stack in the future, such as bpf_list_head on stack, or bpf_rb_root on stack.

A common misunderstanding in the current code is that MEM_UNINIT means writes, but note that writes may also be performed even without MEM_UNINIT in case of helpers; in that case the code after handling meta && meta->raw_mode will complain when it sees STACK_DYNPTR. So that invalid read case also covers writes to potential STACK_DYNPTR slots. The only loophole was in case of meta->raw_mode which simulated writes through instructions which could overwrite them.

A future series sequenced after this will focus on the clean up of helper access checks and bugs around that.

Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20  bpf: Fix missing var_off check for ARG_PTR_TO_DYNPTR  [Kumar Kartikeya Dwivedi]
Currently, the dynptr function is not checking the variable offset part of PTR_TO_STACK that it needs to check. The fixed offset is considered when computing the stack pointer index, but if the variable offset was not a constant (such that it could not be accumulated in reg->off), we will end up with a discrepancy where the runtime pointer does not point to the actual stack slot we mark as STACK_DYNPTR.

It is impossible to precisely track dynptr state when the variable offset is not constant, hence, just like bpf_timer, kptr, bpf_spin_lock, etc., simply reject the case where reg->var_off is not constant. Then, consider both reg->off and reg->var_off.value when computing the stack pointer index.

A new helper dynptr_get_spi is introduced to hide these details, since the dynptr needs to be located in multiple places outside the process_dynptr_func checks; hence once we know it's a PTR_TO_STACK, we need to enforce these checks in all places.

Note that it is disallowed for unprivileged users to have a non-constant var_off, so this problem should only be possible to trigger from programs having CAP_PERFMON. However, its effects can vary. Without the fix, it is possible to replace the contents of the dynptr arbitrarily by making the verifier mark different stack slots than the actual location and then doing writes to the actual stack address of the dynptr at runtime.

Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs")
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
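A trimmed sketch of the helper described above (hedged; verifier internals abbreviated, and __get_spi is an assumed slot-index helper):

    static int dynptr_get_spi(struct bpf_verifier_env *env,
                              struct bpf_reg_state *reg)
    {
            int off, spi;

            /* Like bpf_timer/kptr/bpf_spin_lock: reject non-constant
             * variable offsets outright. */
            if (!tnum_is_const(reg->var_off)) {
                    verbose(env, "dynptr has to be at a constant stack offset\n");
                    return -EINVAL;
            }

            off = reg->off + reg->var_off.value;    /* both parts considered */
            spi = __get_spi(off);                   /* assumed helper */
            if (spi < 1)
                    return -ERANGE;
            return spi;
    }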
2023-01-20  bpf: Fix state pruning for STACK_DYNPTR stack slots  [Kumar Kartikeya Dwivedi]
The root of the problem is missing liveness marking for STACK_DYNPTR slots. This leads to all kinds of problems inside stacksafe.

The verifier by default inside stacksafe ignores spilled_ptr in stack slots which do not have REG_LIVE_READ marks. Since this is being checked in the 'old' explored state, it must have already done clean_live_states for this old bpf_func_state. Hence, it won't be receiving any more liveness marks from to-be-explored insns (it has received REG_LIVE_DONE marking from liveness point of view).

What this means is that the verifier considers it safe to not compare the stack slot if it was never read by children states. While liveness marks are usually propagated correctly following the parentage chain for spilled registers (SCALAR_VALUE and PTR_* types), the same is not the case for STACK_DYNPTR. clean_live_states hence simply rewrites these stack slots to the type STACK_INVALID since it sees no REG_LIVE_READ marks.

The end result is that we will never see STACK_DYNPTR slots in the explored state. Even if the verifier was conservatively matching !REG_LIVE_READ slots, the very next check continuing the stacksafe loop on seeing STACK_INVALID would again prevent further checks.

Now as long as the verifier stores an explored state which we can compare to when reaching a pruning point, we can abuse this bug to make the verifier prune search for obviously unsafe paths using STACK_DYNPTR slots, thinking they are never used hence safe.

Doing this in unprivileged mode is a bit challenging. add_new_state is only set when seeing BPF_F_TEST_STATE_FREQ (which requires privileges) or when the jmps_processed difference is >= 2 and the insn_processed difference is >= 8. So coming up with the unprivileged case requires a little more work, but it is still totally possible. The test case being discussed below triggers the heuristic even in unprivileged mode. However, it no longer works since commit 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF").

Let's try to study the test step by step. Consider the following program (C style BPF ASM):

   0 r0 = 0;
   1 r6 = &ringbuf_map;
   3 r1 = r6;
   4 r2 = 8;
   5 r3 = 0;
   6 r4 = r10;
   7 r4 -= -16;
   8 call bpf_ringbuf_reserve_dynptr;
   9 if r0 == 0 goto pc+1;
  10 goto pc+1;
  11 *(r10 - 16) = 0xeB9F;
  12 r1 = r10;
  13 r1 -= -16;
  14 r2 = 0;
  15 call bpf_ringbuf_discard_dynptr;
  16 r0 = 0;
  17 exit;

We know that insn 12 will be a pruning point, hence if we force add_new_state for it, it will first verify the following path as safe in straight line exploration:

  0 1 3 4 5 6 7 8 9 -> 10 -> (12) 13 14 15 16 17

Then, when we arrive at insn 12 from the following path:

  0 1 3 4 5 6 7 8 9 -> 11 (12)

We will find a state that has been verified as safe already at insn 12. Since the register state is the same at this point, regsafe will pass. Next, in stacksafe, the slots at spi = 0 and spi = 1 (the location of our dynptr) are skipped, seeing !REG_LIVE_READ. The rest matches, so stacksafe returns true. Next, refsafe is also true as the reference state is unchanged in both states.

The states are considered equivalent and search is pruned. Hence, we are able to construct a dynptr with arbitrary contents and use the dynptr API to operate on this arbitrary pointer and arbitrary size + offset.

To fix this, first define a mark_dynptr_read function that propagates liveness marks whenever a valid initialized dynptr is accessed by dynptr helpers. REG_LIVE_WRITTEN is marked whenever we initialize an uninitialized dynptr. This is done in mark_stack_slots_dynptr.
It allows screening off mark_reg_read and not propagating marks upwards from that point. This ensures that we either set REG_LIVE_READ64 on both dynptr slots, or none, so clean_live_states either sets both slots to STACK_INVALID or none of them. This is the invariant the checks inside stacksafe rely on. Next, do a complete comparison of both stack slots whenever they have STACK_DYNPTR. Compare the dynptr type stored in the spilled_ptr, and also whether both form the same first_slot. Only then is the later path safe. Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230121002241.2113993-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-21  exit: Detect and fix irq disabled state in oops  [Nicholas Piggin]
If a task oopses with irqs disabled, this can cause various cascading problems in the oops path such as sleep-from-invalid warnings, and potentially worse. Since commit 0258b5fd7c712 ("coredump: Limit coredumps to a single thread group"), the unconditional irq enable in coredump_task_exit() will "fix" the irq state to be enabled early in do_exit(), so currently this may not be triggerable, but that is coincidental and fragile. Detect and fix the irqs_disabled() condition in the oops path before calling do_exit(), similarly to the way in_atomic() is handled. Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://lore.kernel.org/lkml/20221004094401.708299-1-npiggin@gmail.com/
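A sketch of the added handling in the oops path (hedged; the exact message text is approximate):

    /* In make_task_dead(), alongside the existing in_atomic() fixup: */
    if (unlikely(irqs_disabled())) {
            pr_info("note: %s[%d] exited with irqs disabled\n",
                    current->comm, task_pid_nr(current));
            local_irq_enable();
    }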
2023-01-20  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [Jakub Kicinski]
drivers/net/ipa/ipa_interrupt.c
drivers/net/ipa/ipa_interrupt.h
  9ec9b2a30853 ("net: ipa: disable ipa interrupt during suspend")
  8e461e1f092b ("net: ipa: introduce ipa_interrupt_enable()")
  d50ed3558719 ("net: ipa: enable IPA interrupt handlers separate from registration")
https://lore.kernel.org/all/20230119114125.5182c7ab@canb.auug.org.au/
https://lore.kernel.org/all/79e46152-8043-a512-79d9-c3b905462774@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20  Merge tag 'net-6.2-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [Linus Torvalds]
Pull networking fixes from Jakub Kicinski:
 "Including fixes from wireless, bluetooth, bpf and netfilter.

  Current release - regressions:

   - Revert "net: team: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf", fix nsna_ping mode of team

   - wifi: mt76: fix bugs in Rx queue handling and DMA mapping

   - eth: mlx5:
      - add missing mutex_unlock in error reporter
      - protect global IPsec ASO with a lock

  Current release - new code bugs:

   - rxrpc: fix wrong error return in rxrpc_connect_call()

  Previous releases - regressions:

   - bluetooth: hci_sync: fix use of HCI_OP_LE_READ_BUFFER_SIZE_V2

   - wifi:
      - mac80211: fix crashes on Rx due to incorrect initialization of rx->link and rx->link_sta
      - mac80211: fix bugs in iTXQ conversion - Tx stalls, incorrect aggregation handling, crashes
      - brcmfmac: fix regression for Broadcom PCIe wifi devices
      - rndis_wlan: prevent buffer overflow in rndis_query_oid

   - netfilter: conntrack: handle tcp challenge acks during connection reuse

   - sched: avoid grafting on htb_destroy_class_offload when destroying

   - virtio-net: correctly enable callback during start_xmit, fix stalls

   - tcp: avoid the lookup process failing to get sk in ehash table

   - ipa: disable ipa interrupt during suspend

   - eth: stmmac: enable all safety features by default

  Previous releases - always broken:

   - bpf:
      - fix pointer-leak due to insufficient speculative store bypass mitigation (Spectre v4)
      - skip task with pid=1 in send_signal_common() to avoid a splat
      - fix BPF program ID information in BPF_AUDIT_UNLOAD as well as PERF_BPF_EVENT_PROG_UNLOAD events
      - fix potential deadlock in htab_lock_bucket from same bucket index but different map_locked index

   - bluetooth:
      - fix a buffer overflow in mgmt_mesh_add()
      - hci_qca: fix driver shutdown on closed serdev
      - ISO: fix possible circular locking dependency
      - CIS: hci_event: fix invalid wait context

   - wifi: brcmfmac: fixes for survey dump handling

   - mptcp: explicitly specify sock family at subflow creation time

   - netfilter: nft_payload: incorrect arithmetics when fetching VLAN header bits

   - tcp: fix rate_app_limited to default to 1

   - l2tp: close all race conditions in l2tp_tunnel_register()

   - eth: mlx5: fixes for QoS config and eswitch configuration

   - eth: enetc: avoid deadlock in enetc_tx_onestep_tstamp()

   - eth: stmmac: fix invalid call to mdiobus_get_phy()

  Misc:

   - ethtool: add netlink attr in rss get reply only if the value is not empty"

* tag 'net-6.2-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  Revert "Merge branch 'octeontx2-af-CPT'"
  tcp: fix rate_app_limited to default to 1
  bnxt: Do not read past the end of test names
  net: stmmac: enable all safety features by default
  octeontx2-af: add mbox to return CPT_AF_FLT_INT info
  octeontx2-af: update cpt lf alloc mailbox
  octeontx2-af: restore rxc conf after teardown sequence
  octeontx2-af: optimize cpt pf identification
  octeontx2-af: modify FLR sequence for CPT
  octeontx2-af: add mbox for CPT LF reset
  octeontx2-af: recover CPT engine when it gets fault
  net: dsa: microchip: ksz9477: port map correction in ALU table entry register
  selftests/net: toeplitz: fix race on tpacket_v3 block close
  net/ulp: use consistent error code when blocking ULP
  octeontx2-pf: Fix the use of GFP_KERNEL in atomic context on rt
  tcp: avoid the lookup process failing to get sk in ehash table
  Revert "net: team: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf"
  MAINTAINERS: add networking entries for Willem
  net: sched: gred: prevent races when adding offloads to stats
  l2tp: prevent lockdep issue in l2tp_tunnel_register()
  ...
2023-01-20  PM: sleep: Remove "select SRCU"  [Paul E. McKenney]
Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-01-20  bpf: Add missing btf_put to register_btf_id_dtor_kfuncs  [Jiri Olsa]
We take the BTF reference before we register dtors and we need to put it back when it's done. We probably won't see a problem with kernel BTF, but module BTF would stay loaded (because of the extra ref) even when its module is removed. Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: 5ce937d613a4 ("bpf: Populate pairs of btf_id and destructor kfunc in btf") Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230120122148.1522359-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
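The shape of the fix (a sketch; btf_add_dtor_kfuncs is a placeholder for the real registration work):

    int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors,
                                    u32 add_cnt, struct module *owner)
    {
            struct btf *btf;
            int ret;

            btf = btf_get_module_btf(owner);        /* takes a BTF reference */
            if (IS_ERR_OR_NULL(btf))
                    return PTR_ERR_OR_ZERO(btf);

            ret = btf_add_dtor_kfuncs(btf, dtors, add_cnt); /* placeholder */
            btf_put(btf);                           /* previously missing */
            return ret;
    }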
2023-01-20  PM: hibernate: swap: don't use /** for non-kernel-doc comments  [Randy Dunlap]
kernel-doc complains about multiple occurrences of "/**" being used for something that is not a kernel-doc comment, so change all of these to just use "/*" comment style. The warning message for all of these is:

  FILE:LINE: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst

kernel/power/swap.c:585: warning: ... Structure used for CRC32.
kernel/power/swap.c:600: warning: ... * CRC32 update function that runs in its own thread.
kernel/power/swap.c:627: warning: ... * Structure used for LZO data compression.
kernel/power/swap.c:644: warning: ... * Compression function that runs in its own thread.
kernel/power/swap.c:952: warning: ... * The following functions allow us to read data using a swap map
kernel/power/swap.c:1111: warning: ... * Structure used for LZO data decompression.
kernel/power/swap.c:1127: warning: ... * Decompression function that runs in its own thread.

Also correct one spello/typo.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-01-20  kernel/ksysfs.c: export kernel address bits  [Thomas Weißschuh]
This can be used by userspace to determine the address size of the running kernel. It frees userspace from having to interpret this information from the UTS machine field. Userspace implementation: https://github.com/util-linux/util-linux/pull/1966 Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20221221-address-bits-v1-1-8446b13244ac@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
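A sketch of the kernel side in kernel/ksysfs.c (hedged; assumes the merged attribute is /sys/kernel/address_bits and that it derives the value from the pointer width):

    static ssize_t address_bits_show(struct kobject *kobj,
                                     struct kobj_attribute *attr, char *buf)
    {
            /* 32 on 32-bit kernels, 64 on 64-bit kernels. */
            return sysfs_emit(buf, "%zu\n", sizeof(void *) * 8);
    }
    KERNEL_ATTR_RO(address_bits);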
2023-01-19  bpf: Change modules resolving for kprobe multi link  [Jiri Olsa]
We currently use module_kallsyms_on_each_symbol, which iterates all modules/symbols, and we try to look up each such address in the user-provided symbols/addresses to get the list of used modules.

This fix instead only iterates the provided kprobe addresses and calls __module_address on each to get the list of used modules. This turned out to be simpler and also a bit faster.

On my setup with workload (executed 10 times):

  # test_progs -t kprobe_multi_bench_attach/modules

Current code:

  Performance counter stats for './test.sh' (5 runs):

    76,081,161,596 cycles:k ( +- 0.47% )

    18.3867 +- 0.0992 seconds time elapsed ( +- 0.54% )

With the fix:

  Performance counter stats for './test.sh' (5 runs):

    74,079,889,063 cycles:k ( +- 0.04% )

    17.8514 +- 0.0218 seconds time elapsed ( +- 0.12% )

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230116101009.23694-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
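A sketch of the new direction (hedged; collect_module is a placeholder for the dedup-and-store helper):

    u32 i;

    for (i = 0; i < addrs_cnt; i++) {
            struct module *mod;

            preempt_disable();
            mod = __module_address(addrs[i]);   /* NULL for core kernel text */
            if (mod && !try_module_get(mod))
                    mod = NULL;
            preempt_enable();
            if (mod)
                    collect_module(mods, mod);  /* placeholder */
    }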
2023-01-19  livepatch: Improve the search performance of module_kallsyms_on_each_symbol()  [Zhen Lei]
Currently we traverse all symbols of all modules to find the specified function for the specified module. But in reality, we just need to find the given module and then traverse all the symbols in it.

Let's add a new parameter 'const char *modname' to function module_kallsyms_on_each_symbol(), then we can compare the module names directly in this function and call hook 'fn' after matching. If 'modname' is NULL, the symbols of all modules are still traversed for compatibility with other usage cases.

  Phase1: mod1-->mod2..(subsequent modules do not need to be compared)
              |
  Phase2:     -->f1-->f2-->f3

Assuming that there are m modules, each module has n symbols on average, then the time complexity is reduced from O(m * n) to O(m) + O(n).

Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230116101009.23694-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
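Callers then look something like this sketch (the callback signature is as of this series and the module/args names are illustrative):

    /* Scan only the named module's symbols; NULL scans all modules. */
    int module_kallsyms_on_each_symbol(const char *modname,
                                       int (*fn)(void *, const char *,
                                                 unsigned long),
                                       void *data);

    ret = module_kallsyms_on_each_symbol("target_mod", klp_find_callback,
                                         &args);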
2023-01-19  bpf: Fix to preserve reg parent/live fields when copying range info  [Eduard Zingerman]
Register range information is copied in several places. The intent is to transfer range/id information from one register/stack spill to another. Currently this is done using direct register assignment, e.g.:

    static void find_equal_scalars(..., struct bpf_reg_state *known_reg)
    {
            ...
            struct bpf_reg_state *reg;
            ...
            *reg = *known_reg;
            ...
    }

However, such assignments also copy the following bpf_reg_state fields:

    struct bpf_reg_state {
            ...
            struct bpf_reg_state *parent;
            ...
            enum bpf_reg_liveness live;
            ...
    };

Copying of these fields is accidental and incorrect, as could be demonstrated by the following example:

     0: call ktime_get_ns()
     1: r6 = r0
     2: call ktime_get_ns()
     3: r7 = r0
     4: if r0 > r6 goto +1             ; r0 & r6 are unbound thus generated
                                       ; branch states are identical
     5: *(u64 *)(r10 - 8) = 0xdeadbeef ; 64-bit write to fp[-8]
     --- checkpoint ---
     6: r1 = 42                        ; r1 marked as written
     7: *(u8 *)(r10 - 8) = r1          ; 8-bit write, fp[-8] parent & live
                                       ; overwritten
     8: r2 = *(u64 *)(r10 - 8)
     9: r0 = 0
    10: exit

This example is unsafe because the 64-bit write to fp[-8] at (5) is conditional, thus not all bytes of fp[-8] are guaranteed to be set when it is read at (8). However, currently the example passes verification.

First, the execution path 1-10 is examined by the verifier. Suppose that a new checkpoint is created by is_state_visited() at (6). After checkpoint creation:
- r1.parent points to checkpoint.r1,
- fp[-8].parent points to checkpoint.fp[-8].

At (6) the r1.live is set to REG_LIVE_WRITTEN. At (7) the fp[-8].parent is set to r1.parent and fp[-8].live is set to REG_LIVE_WRITTEN, because of the following code called in check_stack_write_fixed_off():

    static void save_register_state(struct bpf_func_state *state,
                                    int spi, struct bpf_reg_state *reg,
                                    int size)
    {
            ...
            state->stack[spi].spilled_ptr = *reg;  // <--- parent & live copied
            if (size == BPF_REG_SIZE)
                    state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN;
            ...
    }

Note the intent to mark a stack spill as written only if 8 bytes are spilled to a slot; however this intent is spoiled by a 'live' field copy.

At (8) the checkpoint.fp[-8] should be marked as REG_LIVE_READ but this does not happen:
- fp[-8] in the current state is already marked as REG_LIVE_WRITTEN;
- fp[-8].parent points to checkpoint.r1; the parentage chain is used by mark_reg_read() to mark checkpoint states.

At (10) the verification is finished for path 1-10 and jump 4-6 is examined. The checkpoint.fp[-8] never gets the REG_LIVE_READ mark and this spill is pruned from the cached states by clean_live_states(). Hence the verifier state obtained via path 1-4,6 is deemed identical to the one obtained via path 1-6 and the program is marked as safe.

Note: the example should be executed with the BPF_F_TEST_STATE_FREQ flag set to force creation of intermediate verifier states.

This commit revisits the locations where bpf_reg_state instances are copied and replaces the direct copies with a call to a function copy_register_state(dst, src) that preserves the 'parent' and 'live' fields of the 'dst'.

Fixes: 679c782de14b ("bpf/verifier: per-register parent pointers")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230106142214.1040390-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
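The replacement helper, per the description above (essentially as described; minor details may differ from the merged code):

    static void copy_register_state(struct bpf_reg_state *dst,
                                    const struct bpf_reg_state *src)
    {
            struct bpf_reg_state *parent = dst->parent;
            enum bpf_reg_liveness live = dst->live;

            *dst = *src;
            /* Keep the destination's liveness-tracking links intact. */
            dst->parent = parent;
            dst->live = live;
    }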
2023-01-19  Merge tag 'printk-for-6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux  [Linus Torvalds]
Pull printk fixes from Petr Mladek:

 - Prevent a potential deadlock when configuring kgdb console

 - Fix a kernel doc warning

* tag 'printk-for-6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  kernel/printk/printk.c: Fix W=1 kernel-doc warning
  tty: serial: kgdboc: fix mutex locking order for configure_kgdboc()
2023-01-19  Merge branch 'rework/console-list-lock' into for-linus  [Petr Mladek]
2023-01-19  kernel: events: Export perf_report_aux_output_id()  [Mike Leach]
CoreSight trace is being updated to use perf_report_aux_output_id() in a similar way to intel-pt. This function needs export visibility to allow it to be called from kernel loadable modules, which CoreSight may be configured to be built as. Signed-off-by: Mike Leach <mike.leach@linaro.org> Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20230116124928.5440-12-mike.leach@linaro.org
2023-01-19  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap  [Christian Brauner]
Convert to struct mnt_idmap. Remove legacy file_mnt_user_ns() and mnt_user_ns(). Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19  fs: port privilege checking helpers to mnt_idmap  [Christian Brauner]
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19  fs: port inode_init_owner() to mnt_idmap  [Christian Brauner]
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19  fs: port xattr to mnt_idmap  [Christian Brauner]
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19  fs: port ->permission() to pass mnt_idmap  [Christian Brauner]
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
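The shape of the conversion, using ->permission() in struct inode_operations as the example (before/after prototypes per this series):

    /* Before: */
    int (*permission)(struct user_namespace *mnt_userns,
                      struct inode *inode, int mask);

    /* After: */
    int (*permission)(struct mnt_idmap *idmap,
                      struct inode *inode, int mask);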
2023-01-19  fs: port ->mkdir() to pass mnt_idmap  [Christian Brauner]
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19  fs: port ->symlink() to pass mnt_idmap  [Christian Brauner]
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-18  bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers  [Yonghong Song]
In current bpf_send_signal() and bpf_send_signal_thread() helper implementation, irq_work is used to handle nmi context. Hao Sun reported in [1] that the current task at the entry of the helper might be gone during irq_work callback processing. To fix the issue, a reference is acquired for the current task before enqueuing into the irq_work so that the queued task is still available during irq_work callback processing. [1] https://lore.kernel.org/bpf/20230109074425.12556-1-sunhao.th@gmail.com/ Fixes: 8b401f9ed244 ("bpf: implement bpf_send_signal() helper") Tested-by: Hao Sun <sunhao.th@gmail.com> Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230118204815.3331855-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
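A sketch of the fix (struct and field names per kernel/trace/bpf_trace.c; slightly simplified):

    /* When queuing from NMI context: pin the task first. */
    work->task = get_task_struct(current);
    irq_work_queue(&work->irq_work);

    static void do_bpf_send_signal(struct irq_work *entry)
    {
            struct send_signal_irq_work *work;

            work = container_of(entry, struct send_signal_irq_work, irq_work);
            group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task,
                                work->type);
            put_task_struct(work->task);    /* drop the pinned reference */
    }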
2023-01-18  bpf: Fix off-by-one error in bpf_mem_cache_idx()  [Hou Tao]
According to the definition of sizes[NUM_CACHES], when the size passed to bpf_mem_cache_size() is 256, it should return 6 instead of 7. Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230118084630.3750680-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
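The fixed index computation (a sketch matching the description; sizes[] holds 96, 192, then powers of two from 16 up to 4096, and size_index maps the small sizes):

    static int bpf_mem_cache_idx(size_t size)
    {
            if (!size || size > 4096)
                    return -1;

            if (size <= 192)
                    return size_index[(size - 1) / 8] - 1;

            return fls(size - 1) - 2;   /* was "- 1": mapped 256 to 7, not 6 */
    }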
2023-01-18  mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC  [Jeff Xu]
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allow an application to set the executable bit at creation time (memfd_create).

When MFD_NOEXEC_SEAL is set, the memfd is created without the executable bit (mode: 0666), and sealed with F_SEAL_EXEC, so it can't be chmod'ed to be executable (mode: 0777) after creation.

When the MFD_EXEC flag is set, the memfd is created with the executable bit (mode: 0777); this is the same as the old behavior of memfd_create.

The new pid-namespaced sysctl vm.memfd_noexec has 3 values:

  0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_EXEC was set.
  1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_NOEXEC_SEAL was set.
  2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create for old software that doesn't set the executable bit; for example, a container with vm.memfd_noexec=1 means the old software will create non-executable memfds by default. Also, the value of memfd_noexec is passed to the child namespace at creation time. For example, if the init namespace has vm.memfd_noexec=2, all its children namespaces will be created with 2.

[akpm@linux-foundation.org: add stub functions to fix build]
[akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
[akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review]
[akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning]
Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@google.com>
Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
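A userspace usage sketch (the flag constant comes from the uapi header; kernels without this feature reject the flag with EINVAL):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <linux/memfd.h>        /* MFD_NOEXEC_SEAL */
    #include <sys/stat.h>
    #include <stdio.h>

    int main(void)
    {
            /* Created 0666 and sealed with F_SEAL_EXEC. */
            int fd = memfd_create("demo", MFD_CLOEXEC | MFD_NOEXEC_SEAL);

            if (fd < 0) {
                    perror("memfd_create");
                    return 1;
            }
            if (fchmod(fd, 0777) != 0)      /* expected to fail: EPERM */
                    perror("fchmod");
            return 0;
    }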
2023-01-18  perf/core: Call perf_prepare_sample() before running BPF  [Namhyung Kim]
As BPF can access sample data, it needs to populate the data. Also remove the logic to get the callchain specifically as it's covered by the perf_prepare_sample() now. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230118060559.615653-9-namhyung@kernel.org
2023-01-18  perf/core: Introduce perf_prepare_header()  [Namhyung Kim]
Factor out perf_prepare_header() so that it can call perf_prepare_sample() without a header if not needed. Also it checks the filtered_sample_type to avoid duplicate work when perf_prepare_sample() is called twice (or more). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230118060559.615653-8-namhyung@kernel.org