2025-03-04x86/cpu: Remove unnecessary headers and reorder the restAhmed S. Darwish
Remove the headers at intel.c that are no longer required. Alphabetically reorder what remains since more headers will be included in further commits. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250304085152.51092-6-darwi@linutronix.de
2025-03-04x86/cpuid: Include <linux/build_bug.h> in <asm/cpuid.h>Ahmed S. Darwish
<asm/cpuid.h> uses static_assert() at multiple locations but it does not include the CPP macro's definition at linux/build_bug.h. Include the needed header to make <asm/cpuid.h> self-sufficient. This gets triggered when cpuid.h is included in new C files, which is to be done in further commits. Fixes: 43d86e3cd9a7 ("x86/cpu: Provide cpuid_read() et al.") Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250304085152.51092-5-darwi@linutronix.de
2025-03-04Merge branch 'x86/urgent' into x86/cpu, to pick up dependent commitsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-04x86/cpu: Log CPU flag cmdline hacks more verboselyBrendan Jackman
Since using these options is very dangerous, make details as visible as possible: - Instead of a single message for each of the cmdline options, print a separate pr_warn() for each individual flag. - Say explicitly whether the flag is a "feature" or a "bug". Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-3-8d255032cb4c@google.com
2025-03-04x86/cpu: Warn louder about the {set,clear}cpuid boot parametersBrendan Jackman
Commit 814165e9fd1f6 ("x86/cpu: Add the 'setcpuid=' boot parameter") recently expanded the user's ability to break their system horribly by overriding effective CPU flags. This was reflected with updates to the documentation to try and make people aware that this is dangerous. To further reduce the risk of users mistaking this for a "real feature", and try to help them figure out why their kernel is tainted if they do use it: - Upgrade the existing printk to pr_warn, to help ensure kernel logs reflect what changes are in effect. - Print an extra warning that tries to be as dramatic as possible, while also highlighting the fact that it tainted the kernel. Suggested-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-2-8d255032cb4c@google.com
2025-03-04x86/cpu: Remove unnecessary macro indirection related to CPU feature namesBrendan Jackman
These macros used to abstract over CONFIG_X86_FEATURE_NAMES, but that was removed in: 7583e8fbdc49 ("x86/cpu: Remove X86_FEATURE_NAMES") Now they are just an unnecessary indirection, remove them. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-1-8d255032cb4c@google.com
2025-03-04x86/speculation: Add a conditional CS prefix to CALL_NOSPECPawan Gupta
Retpoline mitigation for spectre-v2 uses thunks for indirect branches. To support this mitigation, compilers add a CS prefix with -mindirect-branch-cs-prefix. For an indirect branch in asm, this needs to be added manually. CS prefix is already being added to indirect branches in asm files, but not in inline asm. Add CS prefix to CALL_NOSPEC for inline asm as well. There is no JMP_NOSPEC for inline asm. Reported-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250228-call-nospec-v3-2-96599fed0f33@linux.intel.com
2025-03-04x86/speculation: Simplify and make CALL_NOSPEC consistentPawan Gupta
The CALL_NOSPEC macro is used to generate Spectre-v2 mitigation friendly indirect branches. At compile time the macro defaults to an indirect branch, and at runtime those can be patched to thunk based mitigations. This approach is the opposite of what is done for the rest of the kernel, where the compile time default is to replace indirect calls with retpoline thunk calls. Make CALL_NOSPEC consistent with the rest of the kernel, default to retpoline thunk at compile time when CONFIG_MITIGATION_RETPOLINE is enabled. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250228-call-nospec-v3-1-96599fed0f33@linux.intel.com
2025-03-04x86/smp: Fix mwait_play_dead() and acpi_processor_ffh_play_dead() noreturn behaviorJosh Poimboeuf
Fix some related issues (done in a single patch to avoid introducing intermediate bisect warnings):

  1) The SMP version of mwait_play_dead() doesn't return, but its !SMP counterpart does. Make its calling behavior consistent by resolving the !SMP version to a BUG(). It should never be called anyway, this just enforces that at runtime and enables its callers to be marked as __noreturn.

  2) While the SMP definition of mwait_play_dead() is annotated as __noreturn, the declaration isn't. Nor is it listed in tools/objtool/noreturns.h. Fix that.

  3) Similar to #1, the SMP version of acpi_processor_ffh_play_dead() doesn't return but its !SMP counterpart does. Make the !SMP version a BUG(). It should never be called.

  4) acpi_processor_ffh_play_dead() doesn't return, but is lacking any __noreturn annotations. Fix that.

This fixes the following objtool warnings:

  vmlinux.o: warning: objtool: acpi_processor_ffh_play_dead+0x67: mwait_play_dead() is missing a __noreturn annotation
  vmlinux.o: warning: objtool: acpi_idle_play_dead+0x3c: acpi_processor_ffh_play_dead() is missing a __noreturn annotation

Fixes: a7dd183f0b38 ("x86/smp: Allow calling mwait_play_dead with an arbitrary hint") Fixes: 541ddf31e300 ("ACPI/processor_idle: Add FFH state handling") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/e885c6fa9e96a61471b33e48c2162d28b15b14c5.1740962711.git.jpoimboe@kernel.org
2025-03-04be2net: fix sleeping while atomic bugs in be_ndo_bridge_getlinkNikolay Aleksandrov
Partially revert commit b71724147e73 ("be2net: replace polling with sleeping in the FW completion path") w.r.t mcc mutex it introduces and the use of usleep_range. The be2net be_ndo_bridge_getlink() callback is called with rcu_read_lock, so this code has been broken for a long time. Both the mutex_lock and the usleep_range can cause the issue Ian Kumlien reported[1]. The call path is: be_ndo_bridge_getlink -> be_cmd_get_hsw_config -> be_mcc_notify_wait -> be_mcc_wait_compl -> usleep_range() [1] https://lore.kernel.org/netdev/CAA85sZveppNgEVa_FD+qhOMtG_AavK9_mFiU+jWrMtXmwqefGA@mail.gmail.com/ Tested-by: Ian Kumlien <ian.kumlien@gmail.com> Fixes: b71724147e73 ("be2net: replace polling with sleeping in the FW completion path") Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250227164129.1201164-1-razor@blackwall.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-04xen: Kconfig: Drop reference to obsolete configs MCORE2 and MK8Lukas Bulwahn
Commit f388f60ca904 ("x86/cpu: Drop configuration options for early 64-bit CPUs") removes the config symbols MCORE2 and MK8. With that, the references to those two config symbols in xen's x86 Kconfig are obsolete. Drop them. Fixes: f388f60ca904 ("x86/cpu: Drop configuration options for early 64-bit CPUs") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20250303093759.371445-1-lukas.bulwahn@redhat.com
2025-03-04x86/cpu: Properly parse CPUID leaf 0x2 TLB descriptor 0x63Ahmed S. Darwish
CPUID leaf 0x2's one-byte TLB descriptors report the number of entries for specific TLB types, among other properties. Typically, each emitted descriptor implies the same number of entries for its respective TLB type(s). An emitted 0x63 descriptor is an exception: it implies 4 data TLB entries for 1GB pages and 32 data TLB entries for 2MB or 4MB pages. For the TLB descriptors parsing code, the entry count for 1GB pages is encoded at the intel_tlb_table[] mapping, but the 2MB/4MB entry count is totally ignored. Update leaf 0x2's parsing logic to account for the 32 data TLB entries for 2MB/4MB pages implied by the 0x63 descriptor. Fixes: e0ba94f14f74 ("x86/tlb_info: get last level TLB entry number of CPU") Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250304085152.51092-4-darwi@linutronix.de
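The special case described above can be illustrated with a minimal userspace sketch. It is not the kernel's parsing code: the structure and function names are hypothetical, only descriptor 0x63 is handled, and keeping the larger of the old and new counts is an assumption made for the illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified bookkeeping; not the kernel's real structures. */
struct tlb_counts_sketch {
	unsigned int dtlb_2m_4m; /* data TLB entries for 2MB/4MB pages */
	unsigned int dtlb_1g;    /* data TLB entries for 1GB pages */
};

/*
 * Record the entry counts implied by one leaf 0x2 descriptor byte.
 * Only 0x63 is modelled: per the commit message it implies 4 entries
 * for 1GB pages AND 32 entries for 2MB/4MB pages, so both counters
 * must be updated from a single descriptor.
 */
static void tlb_apply_descriptor_sketch(struct tlb_counts_sketch *t,
					uint8_t desc)
{
	if (desc != 0x63)
		return; /* other descriptors are out of scope for this sketch */

	/* Assumption: keep the largest count seen so far. */
	if (t->dtlb_1g < 4)
		t->dtlb_1g = 4;
	if (t->dtlb_2m_4m < 32)
		t->dtlb_2m_4m = 32;
}
```

A table that maps each descriptor to a single (type, count) pair cannot express this descriptor on its own, which is why an explicit special case is sketched here.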
2025-03-04x86/cpu: Validate CPUID leaf 0x2 EDX outputAhmed S. Darwish
CPUID leaf 0x2 emits one-byte descriptors in its four output registers EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most significant bit (MSB) of each register must be clear. Leaf 0x2 parsing at intel.c only validated the MSBs of EAX, EBX, and ECX, but left EDX unchecked. Validate EDX's most-significant bit as well. Fixes: e0ba94f14f74 ("x86/tlb_info: get last level TLB entry number of CPU") Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250304085152.51092-3-darwi@linutronix.de
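As a rough illustration of the validity rule, the following userspace sketch walks the four output registers and skips any register whose most significant bit is set, EDX included. The function name and the output buffer layout are hypothetical. Skipping the low byte of EAX and the null descriptor 0x00 follows the architectural definition of leaf 0x2 rather than anything stated in this commit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect descriptor bytes from CPUID leaf 0x2 output.
 * regs[0..3] hold EAX, EBX, ECX, EDX. At most 15 descriptors exist
 * (16 bytes minus the low byte of EAX). Returns the number of
 * descriptors written to out[].
 */
static size_t leaf2_descriptors_sketch(const uint32_t regs[4],
				       uint8_t out[15])
{
	size_t n = 0;

	for (int r = 0; r < 4; r++) {
		/* MSB set: this register carries no valid descriptors. */
		if (regs[r] & (UINT32_C(1) << 31))
			continue;

		for (int b = 0; b < 4; b++) {
			uint8_t desc = (regs[r] >> (8 * b)) & 0xff;

			if (r == 0 && b == 0)
				continue; /* AL is not a descriptor */
			if (desc == 0x00)
				continue; /* null descriptor */
			out[n++] = desc;
		}
	}
	return n;
}
```

Treating all four registers in one loop makes it hard to forget one of them, which is the class of omission this commit fixes.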
2025-03-04x86/cacheinfo: Validate CPUID leaf 0x2 EDX outputAhmed S. Darwish
CPUID leaf 0x2 emits one-byte descriptors in its four output registers EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most significant bit (MSB) of each register must be clear. The historical Git commit: 019361a20f016 ("- pre6: Intel: start to add Pentium IV specific stuff (128-byte cacheline etc)...") introduced leaf 0x2 output parsing. It only validated the MSBs of EAX, EBX, and ECX, but left EDX unchecked. Validate EDX's most-significant bit. Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250304085152.51092-2-darwi@linutronix.de
2025-03-04init: add initramfs_internal.hDavid Disseldorp
The new header only exports a single unpack function and a CPIO_HDRLEN constant for future test use. Signed-off-by: David Disseldorp <ddiss@suse.de> Link: https://lore.kernel.org/r/20250304061020.9815-2-ddiss@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04Merge patch series "some pipe + wait stuff"Christian Brauner
Mateusz Guzik <mjguzik@gmail.com> says:

As a side effect of looking at the pipe hang I came up with 3 changes to consider for -next. The first one is a trivial clean up which I won't mind if it merely gets folded into someone else's change for pipes. The second one reduces page alloc/free calls for the backing area (60% less during a kernel build in my testing). I already posted this, but the cc list was not proper. The last one concerns the wait/wakeup mechanism and drops one lock trip in the common case after waking up.

* patches from https://lore.kernel.org/r/20250303230409.452687-1-mjguzik@gmail.com:
  wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
  pipe: cache 2 pages instead of 1
  pipe: drop an always true check in anon_pipe_write()

Link: https://lore.kernel.org/r/20250303230409.452687-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()Mateusz Guzik
In the vast majority of cases the condition determining whether the thread can proceed is true after the first wake up. However, even in that case the thread ends up calling into prepare_to_wait_event() again, suffering a spurious irq + lock trip. Then it calls into finish_wait() to unlink itself. Note that in case of a pending signal the work done by prepare_to_wait_event() gets ignored even without the change. Pre-check the condition after waking up instead.

Stats gathered during a kernel build:

    bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait \
                 { @[probe] = count(); }'

    @[kprobe:finish_wait]: 392483
    @[kprobe:prepare_to_wait_event]: 778690

That is, calls to prepare_to_wait_event() almost double the calls to finish_wait(). This evens out with the patch. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250303230409.452687-4-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
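The effect of the pre-check can be modelled with a small userspace simulation. This is only a sketch of the loop shape, not the ___wait_event() macro itself: all names are hypothetical, sleeping is reduced to a counter, and the condition is assumed to become true after the first wake up, which is the common case the commit message describes.

```c
#include <stdbool.h>

/* Hypothetical counters standing in for the real wait-queue calls. */
struct wait_stats_sketch {
	unsigned int prepare_calls; /* models prepare_to_wait_event() */
	unsigned int finish_calls;  /* models finish_wait() */
	unsigned int sleeps;        /* models schedule() */
};

/* Assumption: the condition holds once the thread has slept once. */
static bool cond_sketch(const struct wait_stats_sketch *s)
{
	return s->sleeps >= 1;
}

/* Loop shape before the change: re-prepare after every wake up. */
static void wait_old_sketch(struct wait_stats_sketch *s)
{
	for (;;) {
		s->prepare_calls++;
		if (cond_sketch(s))
			break;
		s->sleeps++;
	}
	s->finish_calls++;
}

/* Loop shape after the change: check the condition right after waking. */
static void wait_new_sketch(struct wait_stats_sketch *s)
{
	for (;;) {
		s->prepare_calls++;
		if (cond_sketch(s))
			break;
		s->sleeps++;
		if (cond_sketch(s))
			break; /* skips the extra prepare call */
	}
	s->finish_calls++;
}
```

In this model the old loop makes two prepare calls per finish call and the new loop makes one, which matches the roughly 2:1 ratio in the bpftrace counts above.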
2025-03-04pipe: cache 2 pages instead of 1Mateusz Guzik
User data is kept in a circular buffer backed by pages allocated as needed. Only having space for one spare is still prone to having to resort to allocation / freeing. In my testing this decreases page allocs by 60% during a kernel build. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250303230409.452687-3-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
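The idea of keeping more than one spare page can be sketched in userspace as a tiny fixed-size cache in front of the allocator. This is an illustration under stated assumptions, not the pipe code: the names are hypothetical, malloc()/free() stand in for page allocation, and the 4096-byte size is a placeholder.

```c
#include <stddef.h>
#include <stdlib.h>

#define SPARE_SLOTS_SKETCH 2    /* the commit moves from 1 spare to 2 */
#define PAGE_SIZE_SKETCH   4096 /* placeholder page size */

struct page_cache_sketch {
	void *spare[SPARE_SLOTS_SKETCH];
	unsigned int allocs; /* trips to the allocator */
	unsigned int frees;  /* trips back to the allocator */
};

/* Prefer a cached spare page; fall back to the allocator. */
static void *cache_get_page_sketch(struct page_cache_sketch *c)
{
	for (int i = 0; i < SPARE_SLOTS_SKETCH; i++) {
		if (c->spare[i]) {
			void *page = c->spare[i];

			c->spare[i] = NULL;
			return page;
		}
	}
	c->allocs++;
	return malloc(PAGE_SIZE_SKETCH);
}

/* Park the page in a free slot; release it only if the cache is full. */
static void cache_put_page_sketch(struct page_cache_sketch *c, void *page)
{
	for (int i = 0; i < SPARE_SLOTS_SKETCH; i++) {
		if (!c->spare[i]) {
			c->spare[i] = page;
			return;
		}
	}
	c->frees++;
	free(page);
}
```

With two slots, a pattern that alternately releases and reacquires two pages never reaches the allocator after the initial two allocations, whereas a single slot would force one free and one allocation per round.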
2025-03-04pipe: drop an always true check in anon_pipe_write()Mateusz Guzik
The check operates on the stale value of 'head' and always loops back. Just do it unconditionally. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250303230409.452687-2-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04perf/core: Fix perf_mmap() failure pathPeter Zijlstra
When f_ops->mmap() returns failure, m_ops->close() is *not* called. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.248358497@infradead.org
2025-03-04perf/core: Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimesPeter Zijlstra
In preparation for being able to unregister a PMU with existing events, it becomes important to detach struct perf_cpu_pmu_context lifetimes from that of struct pmu. Notably struct perf_cpu_pmu_context embeds a struct perf_event_pmu_context that can stay referenced until the last event goes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.760214287@infradead.org
2025-03-04perf/core: Lift event->mmap_mutex in perf_mmap()Peter Zijlstra
This puts 'all' of perf_mmap() under single event->mmap_mutex. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.582252957@infradead.org
2025-03-04perf/core: Remove retry loop from perf_mmap()Peter Zijlstra
AFAICT there is no actual benefit from the mutex drop on re-try. The 'worst' case scenario is that we instantly re-gain the mutex without perf_mmap_close() getting it. So might as well make that the normal case. Reflow the code to make the ring buffer detach case naturally flow into the no ring buffer case. [ mingo: Forward ported it ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.463607258@infradead.org
2025-03-04perf/core: Further simplify perf_mmap()Peter Zijlstra
Perform CSE and such. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.354909594@infradead.org
2025-03-04perf/core: Simplify the perf_mmap() control flowPeter Zijlstra
Identity-transform:

    if (c) {
        X1;
    } else {
        Y;
        goto l;
    }

    X2;
    l:

into the simpler:

    if (c) {
        X1;
        X2;
    } else {
        Y;
    }

[ mingo: Forward ported it ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.095904637@infradead.org
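That the two shapes are equivalent can be checked with a small standalone sketch. The functions below are hypothetical and unrelated to perf_mmap(): X1, X2 and Y are replaced by appending a digit to a trace value, so both variants can be compared for each branch.

```c
/* Append one decimal digit to the trace, recording execution order. */
static int step_sketch(int trace, int digit)
{
	return trace * 10 + digit;
}

/* Original shape: the else branch jumps over X2 with a goto. */
static int flow_goto_sketch(int c)
{
	int trace = 0;

	if (c) {
		trace = step_sketch(trace, 1); /* X1 */
	} else {
		trace = step_sketch(trace, 3); /* Y */
		goto out;
	}

	trace = step_sketch(trace, 2);         /* X2 */
out:
	return trace;
}

/* Transformed shape: X2 moves into the branch that reaches it. */
static int flow_nested_sketch(int c)
{
	int trace = 0;

	if (c) {
		trace = step_sketch(trace, 1); /* X1 */
		trace = step_sketch(trace, 2); /* X2 */
	} else {
		trace = step_sketch(trace, 3); /* Y */
	}
	return trace;
}
```

Both functions yield the same trace for a true and for a false condition, which is what makes the rewrite an identity transform.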
2025-03-04perf/bpf: Robustify perf_event_free_bpf_prog()Peter Zijlstra
Ensure perf_event_free_bpf_prog() is safe to call a second time; notably without making any references to event->pmu when there is no prog left. Note: perf_event_detach_bpf_prog() might leave a stale event->prog. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241104135518.978956692@infradead.org
2025-03-04perf/core: Introduce perf_free_addr_filters()Peter Zijlstra
Replace _free_event()'s use of perf_addr_filters_splice() with an explicit perf_free_addr_filters() that has the explicit property of being callable a second time without ill effect. Most notably, referencing event->pmu must be avoided when there are no filters left (e.g., from a previous call). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.868460518@infradead.org
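The "safe to call a second time" property can be sketched in userspace as follows. The structure and function are hypothetical stand-ins, not the perf code: a counter models every access that would go through event->pmu, so the early return on an empty filter list is observable.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an event with attached address filters. */
struct event_sketch {
	int *filters;            /* NULL once the filters are gone */
	unsigned int pmu_derefs; /* counts accesses that would touch the pmu */
};

/*
 * Free the filters in a way that tolerates repeated calls: when nothing
 * is left, return before doing anything that would touch the pmu.
 */
static void free_filters_sketch(struct event_sketch *e)
{
	if (!e->filters)
		return; /* second call: nothing to do, pmu untouched */

	e->pmu_derefs++; /* models the pmu access needed for teardown */
	free(e->filters);
	e->filters = NULL; /* makes the next call a no-op */
}
```

Clearing the pointer after freeing is what turns the function into an idempotent operation, so an error path and a regular teardown path can both call it without coordination.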
2025-03-04perf/core: Add this_cpc() helperPeter Zijlstra
As a preparation for adding yet another indirection. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.650051565@infradead.org
2025-03-04perf/core: Merge struct pmu::pmu_disable_count into struct perf_cpu_pmu_context::pmu_disable_countPeter Zijlstra
Because it makes no sense to have two per-cpu allocations per pmu. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.518730578@infradead.org
2025-03-04perf/core: Simplify perf_event_alloc()Peter Zijlstra
Using the previous simplifications, transition perf_event_alloc() to the cleanup way of things -- reducing error path magic. [ mingo: Ported it to recent kernels. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.410755241@infradead.org
2025-03-04perf/core: Simplify perf_init_event()Peter Zijlstra
Use the <linux/cleanup.h> guard() and scoped_guard() infrastructure to simplify the control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20241104135518.302444446@infradead.org
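For readers unfamiliar with scope-based guards, here is a minimal userspace sketch of the underlying mechanism, the GCC/Clang cleanup variable attribute. It is not the <linux/cleanup.h> API: GUARD_SKETCH and the lock type are hypothetical, the "lock" is just a counter, and the macro supports only one guard per scope.

```c
/* Hypothetical lock: 'held' counts outstanding acquisitions. */
struct lock_sketch {
	int held;
};

static void lock_acquire_sketch(struct lock_sketch *l)
{
	l->held++;
}

/* Cleanup handlers receive a pointer to the guarded variable. */
static void lock_release_sketch(struct lock_sketch **lp)
{
	(*lp)->held--;
}

/*
 * Acquire the lock now and release it automatically when the enclosing
 * scope is left, on every exit path. Requires GCC or Clang.
 */
#define GUARD_SKETCH(lockp)						\
	struct lock_sketch *guard_var_sketch				\
		__attribute__((cleanup(lock_release_sketch), unused)) =	\
		(lock_acquire_sketch(lockp), (lockp))

/* Early returns need no explicit unlock or goto-based unwinding. */
static int lookup_sketch(struct lock_sketch *l, int key)
{
	GUARD_SKETCH(l);

	if (key < 0)
		return -1; /* lock released here by the cleanup handler */

	return key * 2;    /* and here as well */
}
```

Because the release is tied to scope exit, error paths collapse into plain return statements, which is the kind of control-flow simplification the commit message refers to.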
2025-03-04perf/core: Simplify perf_pmu_register()Peter Zijlstra
Using the previously introduced perf_pmu_free() and a new IDR helper, simplify the perf_pmu_register error paths. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.198937277@infradead.org
2025-03-04perf/core: Simplify the perf_pmu_register() error pathPeter Zijlstra
The error path of perf_pmu_register() is of course very similar to a subset of perf_pmu_unregister(). Extract this common part in perf_pmu_free() and simplify things. [ mingo: Forward ported it ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.090915501@infradead.org
2025-03-04perf/core: Simplify the perf_event_alloc() error pathPeter Zijlstra
The error cleanup sequence in perf_event_alloc() is a subset of the existing _free_event() function (it must of course be). Split this out into __free_event() and simplify the error path. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135517.967889521@infradead.org
2025-03-04perf/hw_breakpoint: Return EOPNOTSUPP for unsupported breakpoint typeSaket Kumar Bhaskar
Currently, __reserve_bp_slot() returns -ENOSPC for unsupported breakpoint types on the architecture. For example, powerpc does not support hardware instruction breakpoints. This causes the perf_skip BPF selftest to fail, as neither ENOENT nor EOPNOTSUPP is returned by perf_event_open for unsupported breakpoint types. As a result, the test that should be skipped for this arch is not correctly identified. To resolve this, hw_breakpoint_event_init() should exit early by checking for unsupported breakpoint types using hw_breakpoint_slots_cached() and return the appropriate error (-EOPNOTSUPP). Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: https://lore.kernel.org/r/20250303092451.1862862-1-skb99@linux.ibm.com
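The distinction between "no slots of this type exist" and "all slots are in use" can be sketched as below. This is a simplified userspace model with hypothetical names, not hw_breakpoint_event_init(): per-type slot counts are passed in as an array, and the two error codes are separated by an early check.

```c
#include <errno.h>

/* Hypothetical breakpoint types for the sketch. */
enum bp_type_sketch {
	BP_INST_SKETCH,
	BP_DATA_SKETCH,
	BP_TYPE_MAX_SKETCH
};

/*
 * slots[type]: how many hardware slots the architecture has for a type.
 * used[type]:  how many of them are currently reserved.
 */
static int bp_event_init_sketch(const int slots[BP_TYPE_MAX_SKETCH],
				int used[BP_TYPE_MAX_SKETCH],
				enum bp_type_sketch type)
{
	/* Unsupported type: exit early with a distinct error. */
	if (slots[type] == 0)
		return -EOPNOTSUPP;

	/* Supported, but every slot is taken. */
	if (used[type] >= slots[type])
		return -ENOSPC;

	used[type]++;
	return 0;
}
```

A caller such as a selftest can then treat -EOPNOTSUPP as "skip on this architecture" while still reporting -ENOSPC as a real resource shortage.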
2025-03-04Merge patch series "mount: handle mount propagation for detached mount trees"Christian Brauner
Christian Brauner <brauner@kernel.org> says:

In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto targets that reside on shared mounts") I fixed a bug where propagating the source mount tree of an anonymous mount namespace into a target mount tree of a non-anonymous mount namespace could be used to trigger an integer overflow in the non-anonymous mount namespace causing any new mounts to fail.

The cause of this was that the propagation algorithm was unable to recognize mounts from the source mount tree that were already propagated into the target mount tree and then reappeared as propagation targets when walking the destination propagation mount tree. When fixing this I disabled mount propagation into anonymous mount namespaces.

Make it possible for anonymous mount namespaces to receive mount propagation events correctly. This is now also a correctness issue now that we allow mounting detached mount trees onto detached mount trees. Mark the source anonymous mount namespace with MNTNS_PROPAGATING indicating that all mounts belonging to this mount namespace are currently in the process of being propagated and make the propagation algorithm discard those if they appear as propagation targets.

* patches from https://lore.kernel.org/r/20250225-work-mount-propagation-v1-0-e6e3724500eb@kernel.org:
  selftests: test subdirectory mounting
  selftests: add test for detached mount tree propagation
  mount: handle mount propagation for detached mount trees

Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-0-e6e3724500eb@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: test subdirectory mountingChristian Brauner
This tests mounting a subdirectory without ever having to expose the filesystem to a non-anonymous mount namespace. Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-3-e6e3724500eb@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: add test for detached mount tree propagationChristian Brauner
Test that detached mount trees receive propagation events. Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-2-e6e3724500eb@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: namespace: fix uninitialized variable useArnd Bergmann
clang correctly notices that the 'uflags' variable initialization only happens in some cases:

    fs/namespace.c:4622:6: error: variable 'uflags' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
     4622 |         if (flags & MOVE_MOUNT_F_EMPTY_PATH) uflags = AT_EMPTY_PATH;
          |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fs/namespace.c:4623:48: note: uninitialized use occurs here
     4623 |         from_name = getname_maybe_null(from_pathname, uflags);
          |                                                       ^~~~~~
    fs/namespace.c:4622:2: note: remove the 'if' if its condition is always true
     4622 |         if (flags & MOVE_MOUNT_F_EMPTY_PATH) uflags = AT_EMPTY_PATH;
          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fixes: b1e9423d65e3 ("fs: support getname_maybe_null() in move_mount()") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20250226081201.1876195-1-arnd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
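A common way to resolve this class of warning is to give the variable a defined value on every path. The sketch below shows that pattern in isolation; it is not claimed to be the exact change made in fs/namespace.c, and the flag constants are local placeholders rather than the real MOVE_MOUNT_F_EMPTY_PATH and AT_EMPTY_PATH definitions.

```c
/* Placeholder flag values for the sketch; not taken from kernel headers. */
#define MOVE_F_EMPTY_PATH_SKETCH 0x4u
#define AT_EMPTY_PATH_SKETCH     0x1000u

/*
 * Translate a request flag into lookup flags. Initializing 'uflags'
 * at its declaration guarantees a defined value when the condition
 * is false, which is the path the compiler complained about.
 */
static unsigned int lookup_flags_sketch(unsigned int flags)
{
	unsigned int uflags = 0;

	if (flags & MOVE_F_EMPTY_PATH_SKETCH)
		uflags = AT_EMPTY_PATH_SKETCH;

	return uflags;
}
```

The initializer costs nothing at runtime in practice and keeps the later use independent of which branch was taken.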
2025-03-04mount: handle mount propagation for detached mount treesChristian Brauner
In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto targets that reside on shared mounts") I fixed a bug where propagating the source mount tree of an anonymous mount namespace into a target mount tree of a non-anonymous mount namespace could be used to trigger an integer overflow in the non-anonymous mount namespace causing any new mounts to fail. The cause of this was that the propagation algorithm was unable to recognize mounts from the source mount tree that were already propagated into the target mount tree and then reappeared as propagation targets when walking the destination propagation mount tree. When fixing this I disabled mount propagation into anonymous mount namespaces. Make it possible for anonymous mount namespaces to receive mount propagation events correctly. This is now also a correctness issue now that we allow mounting detached mount trees onto detached mount trees. Mark the source anonymous mount namespace with MNTNS_PROPAGATING indicating that all mounts belonging to this mount namespace are currently in the process of being propagated and make the propagation algorithm discard those if they appear as propagation targets. Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-1-e6e3724500eb@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: allow creating detached mounts from fsmount() file descriptorsChristian Brauner
The previous patch series only enabled the creation of detached mounts from detached mounts that were created via open_tree(). In such cases we know that the origin sequence number for the newly created anonymous mount namespace will be set to the sequence number of the mount namespace the source mount belonged to. But fsmount() creates an anonymous mount namespace that does not have an origin mount namespace as the anonymous mount namespace was derived from a filesystem context created via fsopen(). Account for this case and allow the creation of detached mounts from mounts created via fsmount(). Consequently, any such detached mount created from an fsmount() mount will also have a zero origin sequence number.

This allows mounting subdirectories without ever having to expose the filesystem to a non-anonymous mount namespace:

    fd_context = sys_fsopen("tmpfs", 0);
    sys_fsconfig(fd_context, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
    fd_tmpfs = sys_fsmount(fd_context, 0, 0);
    mkdirat(fd_tmpfs, "subdir", 0755);
    fd_tree = sys_open_tree(fd_tmpfs, "subdir", OPEN_TREE_CLONE);
    sys_move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04Merge patch series "fs: expand abilities of anonymous mount namespaces"Christian Brauner
Christian Brauner <brauner@kernel.org> says:

This series expands the abilities of anonymous mount namespaces.

Terminology
===========

detached mount:
  A detached mount is a mount belonging to an anonymous mount namespace.

anonymous mount namespace:
  An anonymous mount namespace is a mount namespace that does not appear in nsfs. This means neither can it be setns()ed into nor can it be persisted through bind-mounts.

attached mount:
  An attached mount is a mount belonging to a non-anonymous mount namespace.

non-anonymous mount namespace:
  A non-anonymous mount namespace is a mount namespace that does appear in nsfs. This means it can be setns()ed into and can be persisted through bind-mounts.

mount namespace sequence number:
  Each non-anonymous mount namespace has a unique 64bit sequence number that is assigned when the mount namespace is created. The sequence number uniquely identifies a non-anonymous mount namespace. One of the purposes of the sequence number is to prevent mount namespace loops. These can occur when an nsfs mount namespace file is bind mounted into a mount namespace that was created after it. Such loops are prevented by verifying that the sequence number of the target mount namespace is smaller than the sequence number of the nsfs mount namespace file. In other words, the target mount namespace must have been created before the nsfs file mount namespace.

  In contrast, anonymous mount namespaces don't have a sequence number assigned. Anonymous mount namespaces do not appear in any nsfs instances and can thus not be pinned by bind-mounting them anywhere. They can thus not be used to form cycles.

Creating detached mounts from detached mounts
=============================================

Currently, detached mounts can only be created from attached mounts. This limitation prevents various use-cases. For example, the ability to mount a subdirectory without ever having to make the whole filesystem visible first.
The current permission model for the OPEN_TREE_CLONE flag of the open_tree() system call is:

(1) Check that the caller is privileged over the owning user namespace of its current mount namespace.

(2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of.

While it is not strictly necessary to do it this way it is consistently applied in the new mount api. This model will also be used when allowing the creation of a detached mount from another detached mount.

The (1) requirement can simply be met by performing the same check as for the non-detached case, i.e., verify that the caller is privileged over its current mount namespace.

To meet the (2) requirement it must be possible to infer the origin mount namespace that the anonymous mount namespace of the detached mount was created from.

The origin mount namespace of an anonymous mount is the mount namespace that the mounts that were copied into the anonymous mount namespace originate from.

In order to check the origin mount namespace of an anonymous mount namespace two methods come to mind:

(i) stash a reference to the original mount namespace in the anonymous mount namespace

(ii) record the sequence number of the original mount namespace in the anonymous mount namespace

The (i) option has more complicated consequences and implications than (ii). For example, it would pin the origin mount namespace. Even with a passive reference it would pointlessly pin memory as access to the origin mount namespace isn't required.

With (ii) in place it is possible to perform an equivalent check (2') to (2). The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared. The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into.
The caller's mount namespace will thus always have a valid sequence number.

The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace.

If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well.

Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to copy the mount tree.

Mounting detached mounts onto detached mounts
=============================================

Currently, detached mounts can only be mounted onto attached mounts. This limitation makes it impossible to assemble a new private rootfs and move it into place. Instead, a detached tree must be created, attached, then mounted upon, and then either moved or detached again. Lift this restriction.

In order to allow mounting detached mounts onto other detached mounts the same permission model used for creating detached mounts from detached mounts can be used (cf. above).

Allowing detached mounts to be mounted onto detached mounts leaves three cases to consider:

(1) The source mount is an attached mount and the target mount is a detached mount. This would be equivalent to moving a mount between different mount namespaces. A caller could move an attached mount to a detached mount. The detached mount could then be freely attached to any mount namespace. This changes the current delegation model significantly for no good reason. So this will fail.

(2) Anonymous mount namespaces are always attached fully, i.e., it is not possible to attach only a subtree of an anonymous mount namespace. This simplifies the implementation and reasoning. Consequently, if the anonymous mount namespaces of the source detached mount and the target detached mount are identical the mount request will fail.
(3) The source mount's anonymous mount namespace is different from the target mount's anonymous mount namespace. In this case the anonymous mount namespace of the source mount tree must be freed after its mounts have been moved to the target anonymous mount namespace. The source anonymous mount namespace must be empty afterwards.

By allowing to mount detached mounts onto detached mounts a caller may do the following:

  fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE)
  fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE)

fd_tree1 and fd_tree2 refer to two different detached mount trees that belong to two different anonymous mount namespaces.

It is important to note that fd_tree1 and fd_tree2 both refer to the root of their respective anonymous mount namespaces.

By allowing to mount detached mounts onto detached mounts the caller may now do:

  move_mount(fd_tree1, "", fd_tree2, "",
             MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)

This will cause the detached mount referred to by fd_tree1 to be mounted on top of the detached mount referred to by fd_tree2.

Thus, the detached mount fd_tree1 is moved from its separate anonymous mount namespace into fd_tree2's anonymous mount namespace.

It also means that while fd_tree2 continues to refer to the root of its respective anonymous mount namespace, fd_tree1 doesn't anymore.

This has the consequence that only fd_tree2 can be moved to another anonymous or non-anonymous mount namespace. Moving fd_tree1 will now fail as fd_tree1 doesn't refer to the root of an anonymous mount namespace anymore.

Now fd_tree1 and fd_tree2 refer to separate detached mount trees referring to the same anonymous mount namespace. This is conceptually fine.
The new mount api does allow for this to happen already via:

  mount -t tmpfs tmpfs /mnt
  mkdir -p /mnt/A
  mount -t tmpfs tmpfs /mnt/A

  fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE)
  fd_tree4 = open_tree(-EBADF, "/mnt/A", 0)

fd_tree3 and fd_tree4 refer to two different detached mount trees but both detached mount trees refer to the same anonymous mount namespace. And as with fd_tree1 and fd_tree2, only fd_tree3 may be moved to another mount namespace as fd_tree3 refers to the root of the anonymous mount namespace while fd_tree4 doesn't.

However, there's an important difference between the fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example.

Closing fd_tree4 and releasing the respective struct file will have no further effect on fd_tree3's detached mount tree.

However, closing fd_tree3 will cause the mount tree and the respective anonymous mount namespace to be destroyed, causing the detached mount tree of fd_tree4 to be invalid for further mounting.

By allowing to mount detached mounts on detached mounts as in the fd_tree1/fd_tree2 example both struct files will affect each other.

Both fd_tree1 and fd_tree2 refer to struct files that have FMODE_NEED_UNMOUNT set. When either one of them is closed it ends up unmounting the mount tree. The problem is that both will unconditionally free the mount namespace and may end up causing UAFs for each other.

Another problem stems from the fact that fd_tree1 doesn't refer to the root of the anonymous mount namespace. So ignoring the UAF issue, if fd_tree2 were to be closed after fd_tree1, then fd_tree1 would free only a part of the mount tree while leaking the rest of the mount tree.

Multiple solutions for this problem come to mind:

(1) Reference Counting Anonymous Mount Namespaces

    A solution to this problem would be reference counting anonymous mount namespaces. The source detached mount tree acquires a reference when it is moved into the anonymous mount namespace of the target mount tree.
    When fd_tree1 is closed the mount tree isn't unmounted and the anonymous mount namespace shared between the detached mount trees at fd_tree1 and fd_tree2 isn't freed.

    However, this has another problem. When fd_tree2 is closed before fd_tree1 then closing fd_tree1 will cause the mount tree to be unmounted and the anonymous mount namespace to be destroyed.

    However, fd_tree1 only refers to a part of the mounts that the shared anonymous mount namespace has collected. So this would leak mounts.

(2) Removing FMODE_NEED_UNMOUNT from the struct file of the source detached mount tree

    In the current state of the mount api the creation of two file descriptors that refer to different detached mount trees but to the same anonymous mount namespace is already possible. See the fd_tree3/fd_tree4 examples above.

    In those cases only one of the two file descriptors will actually end up unmounting and destroying the detached mount tree.

    Whether or not a struct file needs to unmount and destroy an anonymous mount namespace is governed by the FMODE_NEED_UNMOUNT flag. In the fd_tree3/fd_tree4 example above only fd_tree3 will refer to a struct file that has FMODE_NEED_UNMOUNT set.

    A similar solution would work for mounting detached mounts onto detached mounts. When the source detached mount tree is moved to the target detached mount tree and thus from the source anonymous mount namespace to the target anonymous mount namespace the FMODE_NEED_UNMOUNT flag will be removed from the struct file of the source detached mount tree.

    In the above example FMODE_NEED_UNMOUNT would be removed from the struct file that fd_tree1 refers to.

    This requires that the source file descriptor fd_tree1 be kept open until move_mount() is finished so that FMODE_NEED_UNMOUNT can be removed:

      move_mount(fd_tree1, "", fd_tree2, "",
                 MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)

      /*
       * Remove FMODE_NEED_UNMOUNT so closing fd_tree1 will leave the
       * mount tree alone.
       */
      close(fd_tree1);

      /*
       * Remove the whole mount tree including all the mounts that
       * were moved from fd_tree1 into fd_tree2.
       */
      close(fd_tree2);

    Since the source detached mount tree fd_tree1 has now become an attached mount tree, i.e., fd_tree1_mnt->mnt_parent == fd_tree2_mnt, it is ineligible for attaching again as move_mount() requires that a detached mount tree can only be attached if it is the root of an anonymous mount namespace.

    Removing FMODE_NEED_UNMOUNT doesn't require holding @namespace_sem. Attaching @fd_tree1 to @fd_tree2 requires holding @namespace_sem and so does dissolve_on_fput() should @fd_tree2 have been closed concurrently.

    While removing FMODE_NEED_UNMOUNT can be done it would require some hacking similar to what's done for splice to remove FMODE_NOWAIT. That's ugly.

(3) Use the fact that @fd_tree1 will have a parent mount once it has been attached to @fd_tree2.

    When dissolve_on_fput() is called the mount that has been passed in will refer to the root of the anonymous mount namespace. If it doesn't it would mean that mounts are leaked. So before allowing to mount detached mounts onto detached mounts this would be a bug.

    Now that detached mounts can be mounted onto detached mounts it just means that the mount has been attached to another anonymous mount namespace and thus dissolve_on_fput() must not unmount the mount tree or free the anonymous mount namespace as the file referring to the root of the namespace hasn't been closed yet.

    If it had already been closed it would be obvious because the mount namespace would be NULL, i.e., @fd_tree1 would have already been unmounted. If @fd_tree1 hasn't been unmounted yet and has a parent mount it is safe to skip any cleanup as closing @fd_tree2 will take care of all cleanup operations.

Imho, (3) is the cleanest solution and thus has been chosen.
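The decision that solution (3) asks of the close path can be sketched as a small user-space model. Everything here is an assumption made for illustration: the struct layout, the names dissolve_model_mount and dissolve_model_on_close(), and the use of a NULL parent to model "root of its anonymous mount namespace" are invented for this example and do not reproduce the kernel's dissolve_on_fput().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an anonymous mount namespace. */
struct dissolve_model_ns {
	int id;
};

/* Hypothetical stand-in for a mount. */
struct dissolve_model_mount {
	struct dissolve_model_mount *parent; /* NULL models "namespace root" */
	struct dissolve_model_ns *ns;        /* NULL models "already unmounted" */
};

enum dissolve_model_action {
	DISSOLVE_MODEL_SKIP,    /* leave tree and namespace alone */
	DISSOLVE_MODEL_UNMOUNT, /* unmount tree and free namespace */
};

/* Decide what closing the file that refers to mount m should do. */
static enum dissolve_model_action
dissolve_model_on_close(const struct dissolve_model_mount *m)
{
	assert(m != NULL);

	/* The tree was already unmounted: nothing left to clean up. */
	if (m->ns == NULL)
		return DISSOLVE_MODEL_SKIP;

	/*
	 * The mount has a parent, so it was attached to another detached
	 * tree; closing that tree's root will perform the cleanup.
	 */
	if (m->parent != NULL)
		return DISSOLVE_MODEL_SKIP;

	/* Root of its anonymous namespace: this close owns the cleanup. */
	return DISSOLVE_MODEL_UNMOUNT;
}
```

In terms of the example above, fd_tree1 corresponds to a mount with a parent after move_mount() and is skipped, while fd_tree2 corresponds to the namespace root and triggers the full cleanup.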
* patches from https://lore.kernel.org/r/20250221-brauner-open_tree-v1-0-dbcfcb98c676@kernel.org:
  selftests: seventh test for mounting detached mounts onto detached mounts
  selftests: sixth test for mounting detached mounts onto detached mounts
  selftests: fifth test for mounting detached mounts onto detached mounts
  selftests: fourth test for mounting detached mounts onto detached mounts
  selftests: third test for mounting detached mounts onto detached mounts
  selftests: second test for mounting detached mounts onto detached mounts
  selftests: first test for mounting detached mounts onto detached mounts
  fs: mount detached mounts onto detached mounts
  fs: support getname_maybe_null() in move_mount()
  selftests: create detached mounts from detached mounts
  fs: create detached mounts from detached mounts
  fs: add may_copy_tree()
  fs: add fastpath for dissolve_on_fput()
  fs: add assert for move_mount()
  fs: add mnt_ns_empty() helper
  fs: record sequence number of origin mount namespace

Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-0-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: seventh test for mounting detached mounts onto detached mountsChristian Brauner
Add a test to verify that detached mounts behave correctly. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-16-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: sixth test for mounting detached mounts onto detached mountsChristian Brauner
Add a test to verify that detached mounts behave correctly. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-15-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: fifth test for mounting detached mounts onto detached mountsChristian Brauner
Add a test to verify that detached mounts behave correctly. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-14-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: fourth test for mounting detached mounts onto detached mountsChristian Brauner
Add a test to verify that detached mounts behave correctly. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-13-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: third test for mounting detached mounts onto detached mountsChristian Brauner
Add a test to verify that detached mounts behave correctly. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-12-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: second test for mounting detached mounts onto detached mountsChristian Brauner
Add a test to verify that detached mounts behave correctly. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-11-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04selftests: first test for mounting detached mounts onto detached mountsChristian Brauner
Add a test to verify that detached mounts behave correctly. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-10-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: mount detached mounts onto detached mountsChristian Brauner
Currently, detached mounts can only be mounted onto attached mounts. This limitation makes it impossible to assemble a new private rootfs and move it into place. That's an extremely powerful concept for container and service workloads that we should support.

Right now, a detached tree must be created and attached before it can gain additional mounts; after that it can either be moved (if it doesn't reside under a shared mount) or a detached mount can be created from it again.

Lift this restriction.

In order to allow mounting detached mounts onto other detached mounts the same permission model used for creating detached mounts from detached mounts can be used:

(1) Check that the caller is privileged over the owning user namespace of its current mount namespace.

(2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of.

The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared.

The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number.

The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace.

If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well.

Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to attach the mount tree.

Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-9-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
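The two-part permission check described in this commit message can be modelled in user space as follows. This is a minimal sketch, not the kernel implementation: origin_model_ns, its fields and origin_model_may_attach() are hypothetical names chosen for the example, the capability check is reduced to a boolean supplied by the caller, and a sequence number of 0 is assumed to mark an anonymous mount namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a mount namespace for this example only. */
struct origin_model_ns {
	uint64_t seq;        /* 0 models an anonymous mount namespace */
	uint64_t seq_origin; /* anonymous only: seq of the origin namespace */
};

/*
 * Model of checks (1) and (2'): the caller must be privileged over the
 * owning user namespace of its mount namespace, and the detached tree
 * must originate from the caller's mount namespace.
 */
static bool origin_model_may_attach(const struct origin_model_ns *caller_ns,
				    bool caller_privileged,
				    const struct origin_model_ns *anon_ns)
{
	/* Callers always live in a non-anonymous mount namespace. */
	assert(caller_ns->seq != 0);
	/* The tree being checked lives in an anonymous mount namespace. */
	assert(anon_ns->seq == 0);

	if (!caller_privileged) /* check (1) */
		return false;

	/* check (2'): compare the recorded origin with the caller's seq */
	return anon_ns->seq_origin == caller_ns->seq;
}
```

Comparing plain numbers rather than holding a pointer to the origin namespace mirrors the reasoning in the series: the origin namespace is never pinned, yet a match still implies that the owning namespaces match.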