Age | Commit message (Collapse) | Author |
|
Commit 2a86f6612164 ("kbuild: use KBUILD_DEFCONFIG as the fallback
for DEFCONFIG_LIST") removed arch/$ARCH/defconfig; however,
the document has not been updated to reflect this change yet.
Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
The headercheck tries to call clang with a mix of compiler arguments
that don't include the target architecture. When building e.g. x86
headers on arm64, this produces a warning like
clang: warning: unknown platform, assuming -mfloat-abi=soft
Add in the KBUILD_CPPFLAGS, which contain the target, in order to make it
build properly.
See also 1b71c2fb04e7 ("kbuild: userprogs: fix bitsize and target
detection on clang").
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Fixes: feb843a469fb ("kbuild: add $(CLANG_FLAGS) to KBUILD_CPPFLAGS")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fix from Rob Herring:
- Revert reserved-memory 'alignment' property to use '#address-cells'
instead of '#size-cells'. What's in use trumps the spec.
* tag 'devicetree-fixes-for-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
Revert "of: reserved-memory: Fix using wrong number of cells to get property 'alignment'"
|
|
The userprog infrastructure links objects files through $(CC).
Either explicitly by manually calling $(CC) on multiple object files or
implicitly by directly compiling a source file to an executable.
The documentation at Documentation/kbuild/llvm.rst indicates that ld.lld
would be used for linking if LLVM=1 is specified.
However clang instead will use either a globally installed cross linker
from $PATH called ${target}-ld or fall back to the system linker, which
probably does not support crosslinking.
For the normal kernel build this is not an issue because the linker is
always executed directly, without the compiler being involved.
Explicitly pass --ld-path to clang so $(LD) is respected.
As clang 13.0.1 is required to build the kernel, this option is available.
Fixes: 7f3a59db274c ("kbuild: add infrastructure to build userspace programs")
Cc: stable@vger.kernel.org # needs wrapping in $(cc-option) for < 6.9
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head"
and "pipe->tail" outside of "pipe->mutex" critical section. When the
head and the tail are read individually in that order, there is a window
for interruption between the two reads in which both the head and the
tail can be updated by concurrent readers and writers.
One of the problematic scenarios observed with hackbench running
multiple groups on a large server on a particular pipe inode is as
follows:
pipe->head = 36
pipe->tail = 36
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wakes up: pipe not full*
hackbench-118762 [057] ..... 1029.550548: pipe_write: head: 36 -> 37 [tail: 36]
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next reader 118740*
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next writer 118768*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: *writer wakes up*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37]
... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)
hackbench-118740 [057] ..... 1029.550558: pipe_read: *reader wakes up: pipe is not empty*
hackbench-118740 [057] ..... 1029.550558: pipe_read: tail: 36 -> 37 [head = 37]
hackbench-118740 [057] ..... 1029.550559: pipe_read: *pipe is empty; wakeup writer 118768*
hackbench-118740 [057] ..... 1029.550559: pipe_read: *sleeps*
hackbench-118766 [185] ..... 1029.550592: pipe_write: *New writer comes in*
hackbench-118766 [185] ..... 1029.550592: pipe_write: head: 37 -> 38 [tail: 37]
hackbench-118766 [185] ..... 1029.550592: pipe_write: *wakes up reader 118766*
hackbench-118740 [185] ..... 1029.550598: pipe_read: *reader wakes up; pipe not empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: tail: 37 -> 38 [head: 38]
hackbench-118740 [185] ..... 1029.550599: pipe_read: *pipe is empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: *reader sleeps; wakeup writer 118768*
... CPU 206 switches back to writer
hackbench-118768 [206] ..... 1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38]
hackbench-118768 [206] ..... 1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes
hackbench-118768 [206] ..... 1029.550601: pipe_write: *writer goes back to sleep*
[ Tasks 118740 and 118768 can then indefinitely wait on each other. ]
The unsigned arithmetic in pipe_occupancy() wraps around when
"pipe->tail > pipe->head" leading to pipe_full() returning true despite
the pipe being empty.
The case of genuine wraparound of "pipe->head" is handled since pipe
buffer has data allowing readers to make progress until the pipe->tail
wraps too after which the reader will wakeup a sleeping writer, however,
mistaking the pipe to be full when it is in fact empty can lead to
readers and writers waiting on each other indefinitely.
This issue became more problematic and surfaced as a hang in hackbench
after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up
the writer if the pipe is still full") significantly reduced the number
of spurious wakeups of writers that had previously helped mask the
issue.
To avoid missing any updates between the reads of "pipe->head" and
"pipe->write", unionize the two with a single unsigned long
"pipe->head_tail" member that can be loaded atomically.
Using "pipe->head_tail" to read the head and the tail ensures the
lockless checks do not miss any updates to the head or the tail and
since those two are only updated under "pipe->mutex", it ensures that
the head is always ahead of, or equal to the tail resulting in correct
calculations.
[ prateek: commit log, testing on x86 platforms. ]
Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/
Reported-by: Alexey Gladkov <legion@kernel.org>
Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/
Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add tracing support to track sched_ext core events
(/sched_ext/sched_ext_event). This may be useful for debugging sched_ext
schedulers that trigger a particular event.
The trace point can be used as other trace points, so it can be used in,
for example, `perf trace` and BPF programs, as follows:
======
$> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"'
======
======
struct tp_sched_ext_event {
struct trace_entry ent;
u32 __data_loc_name;
s64 delta;
};
SEC("tracepoint/sched_ext/sched_ext_event")
int rtp_add_event(struct tp_sched_ext_event *ctx)
{
char event_name[128];
unsigned short offset = ctx->__data_loc_name & 0xFFFF;
bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset);
bpf_printk("name %s delta %lld", event_name, ctx->delta);
return 0;
}
======
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The event count could be negative in the future,
so change the event type from u64 to s64.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The UM builds distinguish i386 from x86_64 via SUBARCH, but we don't
support building i386 directly with Clang. To make SUBARCH work for
i386 UM, we need to explicitly test for it.
This lets me run i386 KUnit tests with Clang:
$ ./tools/testing/kunit/kunit.py run \
--make_options LLVM=1 \
--make_options SUBARCH=i386
...
Fixes: c7500c1b53bf ("um: Allow builds with Clang")
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20250304162124.it.785-kees@kernel.org
Tested-by: David Gow <davidgow@google.com>
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Fix a goof where KVM sets CPUID.0x80000022.EAX to CPUID.0x80000022.EBX
instead of zeroing both when PERFMON_V2 isn't supported by KVM. In
practice, barring a buggy CPU (or vCPU model when running nested) only the
!enable_pmu case is affected, as KVM always supports PERFMON_V2 if it's
available in hardware, i.e. CPUID.0x80000022.EBX will be '0' if PERFMON_V2
is unsupported.
For the !enable_pmu case, the bug is relatively benign as KVM will refuse
to enable PMU capabilities, but a VMM that reflects KVM's supported CPUID
into the guest could inadvertently induce #GPs in the guest due to
advertising support for MSRs that KVM refuses to emulate.
Fixes: 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250304082314.472202-3-xiaoyao.li@intel.com
[sean: massage shortlog and changelog, tag for stable]
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
bugfixes for 6.14:
* regressions from this cycle:
- mac80211: fix sparse warning for monitor
- nl80211: disable multi-link reconfiguration (needs fixing)
* older issues:
- cfg80211: reject badly combined cooked monitor,
fix regulatory hint validity checks
- mac80211: handle TXQ flush w/o driver per-sta flush,
fix debugfs for monitor, fix element inheritance
- iwlwifi: fix rfkill, dead firmware handling, rate API
version, free A-MSDU handling, avoid large
allocations, fix string format
- brcmfmac: fix power handling on some boards
* tag 'wireless-2025-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: nl80211: disable multi-link reconfiguration
wifi: cfg80211: regulatory: improve invalid hints checking
wifi: brcmfmac: keep power during suspend if board requires it
wifi: mac80211: Fix sparse warning for monitor_sdata
wifi: mac80211: fix vendor-specific inheritance
wifi: mac80211: fix MLE non-inheritance parsing
wifi: iwlwifi: Fix A-MSDU TSO preparation
wifi: iwlwifi: Free pages allocated when failing to build A-MSDU
wifi: iwlwifi: limit printed string from FW file
wifi: iwlwifi: mvm: use the right version of the rate API
wifi: iwlwifi: mvm: don't try to talk to a dead firmware
wifi: iwlwifi: mvm: don't dump the firmware state upon RFKILL while suspend
wifi: iwlwifi: mvm: clean up ROC on failure
wifi: iwlwifi: fw: avoid using an uninitialized variable
wifi: iwlwifi: fw: allocate chained SG tables for dump
wifi: mac80211: remove debugfs dir for virtual monitor
wifi: mac80211: Cleanup sta TXQs on flush
wifi: nl80211: reject cooked mode if it is set along with other flags
====================
Link: https://patch.msgid.link/20250304124435.126272-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When fgraph is enabled the traced function return address is replaced with
trampoline return_to_handler(). The original return address of the traced
function is saved in per task return stack along with a stack pointer for
reliable stack unwinding via function_graph_enter_regs().
During stack unwinding e.g. for livepatching, ftrace_graph_ret_addr()
identifies the original return address of the traced function with the
saved stack pointer.
With a recent change, the stack pointers passed to ftrace_graph_ret_addr()
and function_graph_enter_regs() do not match anymore, and therefore the
original return address is not found.
Pass the correct stack pointer to function_graph_enter_regs() to fix this.
Fixes: 7495e179b478 ("s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Commit 14be4e6f3522 ("selftests: vDSO: fix ELF hash table entry size for s390x")
changed the type of the ELF hash table entries to 64bit on s390x.
However the *GNU* hash tables entries are always 32bit.
The "bucket" pointer is shared between both hash algorithms.
On s390, this caused the GNU hash algorithm to access its 32-bit entries as if they
were 64-bit, triggering compiler warnings (assignment between "Elf64_Xword *" and
"Elf64_Word *") and runtime crashes.
Introduce a new dedicated "gnu_bucket" pointer which is used by the GNU hash.
Fixes: e0746bde6f82 ("selftests/vDSO: support DT_GNU_HASH")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/r/20250217-selftests-vdso-s390-gnu-hash-v2-1-f6c2532ffe2a@linutronix.de
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
The test_monitor_call() inline assembly uses the xgr instruction, which
also modifies the condition code, to clear a register. However the clobber
list of the inline assembly does not specify that the condition code is
modified, which may lead to incorrect code generation.
Use the lhi instruction instead to clear the register without that the
condition code is modified. Furthermore this limits clearing to the lower
32 bits of val, since its type is int.
Fixes: 17248ea03674 ("s390: fix __EMIT_BUG() macro")
Cc: stable@vger.kernel.org
Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
The alternative path leads to a build error after a recent change:
sound/pci/hda/patch_realtek.c: In function 'alc233_fixup_lenovo_low_en_micmute_led':
include/linux/stddef.h:9:14: error: called object is not a function or function pointer
9 | #define NULL ((void *)0)
| ^
sound/pci/hda/patch_realtek.c:5041:49: note: in expansion of macro 'NULL'
5041 | #define alc233_fixup_lenovo_line2_mic_hotkey NULL
| ^~~~
sound/pci/hda/patch_realtek.c:5063:9: note: in expansion of macro 'alc233_fixup_lenovo_line2_mic_hotkey'
5063 | alc233_fixup_lenovo_line2_mic_hotkey(codec, fix, action);
Using IS_REACHABLE() is somewhat questionable here anyway since it
leads to the input code not working when the HDA driver is builtin
but input is in a loadable module. Replace this with a hard compile-time
dependency on CONFIG_INPUT. In practice this won't chance much
other than solve the compiler error because it is rare to require
sound output but no input support.
Fixes: f603b159231b ("ALSA: hda/realtek - add supported Mic Mute LED for Lenovo platform")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20250304142620.582191-1-arnd@kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
Hardware reports jack absent after reset/suspension regardless of jack
state, so introduce an additional delay only in suspension case to allow
proper detection to take place after a short delay.
Signed-off-by: Maciej Strozek <mstrozek@opensource.cirrus.com>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Link: https://patch.msgid.link/20250304140504.139245-1-mstrozek@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
A recent cleanup went a bit too far and dropped clearing the cycle bit
of link TRBs, so it stays different from the rest of the ring half of
the time. Then a race occurs: if the xHC reaches such link TRB before
more commands are queued, the link's cycle bit unintentionally matches
the xHC's cycle so it follows the link and waits for further commands.
If more commands are queued before the xHC gets there, inc_enq() flips
the bit so the xHC later sees a mismatch and stops executing commands.
This function is called before suspend and 50% of times after resuming
the xHC is doomed to get stuck sooner or later. Then some Stop Endpoint
command fails to complete in 5 seconds and this shows up
xhci_hcd 0000:00:10.0: xHCI host not responding to stop endpoint command
xhci_hcd 0000:00:10.0: xHCI host controller not responding, assume dead
xhci_hcd 0000:00:10.0: HC died; cleaning up
followed by loss of all USB decives on the affected bus. That's if you
are lucky, because if Set Deq gets stuck instead, the failure is silent.
Likely responsible for kernel bug 219824. I found this while searching
for possible causes of that regression and reproduced it locally before
hearing back from the reporter. To repro, simply wait for link cycle to
become set (debugfs), then suspend, resume and wait. To accelerate the
failure I used a script which repeatedly starts and stops a UVC camera.
Some HCs get fully reinitialized on resume and they are not affected.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219824
Fixes: 36b972d4b7ce ("usb: xhci: improve xhci_clear_command_ring()")
Cc: stable@vger.kernel.org
Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Link: https://lore.kernel.org/r/20250304113147.3322584-2-mathias.nyman@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
hclge_ptp_get_cycle returns an error
During the initialization of ptp, hclge_ptp_get_cycle might return an error
and returned directly without unregister clock and free it. To avoid that,
call hclge_ptp_destroy_clock to unregist and free clock if
hclge_ptp_get_cycle failed.
Fixes: 8373cd38a888 ("net: hns3: change the method of obtaining default ptp cycle")
Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250228105258.1243461-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Both the APIs in cfg80211 and the implementation in mac80211
aren't really ready yet, we have a large number of fixes. In
addition, it's not possible right now to discover support for
this feature from userspace. Disable it for now, there's no
rush.
Link: https://patch.msgid.link/20250303110538.fbeef42a5687.Iab122c22137e5675ebd99f5c031e30c0e5c7af2e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add missing "vpcie0v9-supply" and "vpcie1v8-supply" properties to the "pcie0"
node in the Pine64 RockPro64 board dtsi file. This eliminates the following
warnings from the kernel log:
rockchip-pcie f8000000.pcie: supply vpcie1v8 not found, using dummy regulator
rockchip-pcie f8000000.pcie: supply vpcie0v9 not found, using dummy regulator
These additions improve the accuracy of hardware description of the RockPro64
and, in theory, they should result in no functional changes to the way board
works after the changes, because the "vcca_0v9" and "vcca_1v8" regulators are
always enabled. [1][2] However, extended reliability testing, performed by
Chris, [3] has proven that the age-old issues with some PCI Express cards,
when used with a Pine64 RockPro64, are also resolved.
Those issues were already mentioned in the commit 43853e843aa6 (arm64: dts:
rockchip: Remove unsupported node from the Pinebook Pro dts, 2024-04-01),
together with a brief description of the out-of-tree enumeration delay patch
that reportedly resolves those issues. In a nutshell, booting a RockPro64
with some PCI Express cards attached to it caused a kernel oops. [4]
Symptomatically enough, to the commit author's best knowledge, only the Pine64
RockPro64, out of all RK3399-based boards and devices supported upstream, has
been reported to suffer from those PCI Express issues, and only the RockPro64
had some of the PCI Express supplies missing in its DT. Thus, perhaps some
weird timing issues exist that caused the "vcca_1v8" always-on regulator,
which is part of the RK808 PMIC, to actually not be enabled before the PCI
Express is initialized and enumerated on the RockPro64, causing oopses with
some PCIe cards, and the aforementioned enumeration delay patch [4] probably
acted as just a workaround for the underlying timing issue.
Admittedly, the Pine64 RockPro64 is a bit specific board by having a standard
PCI Express slot, allowing use of various standard cards, but pretty much
standard PCI Express cards have been attached to other RK3399 boards as well,
and the commit author is unaware ot such issues reported for them.
It's quite hard to be sure that the PCI Express issues are fully resolved by
these additions to the DT, without some really extensive and time-consuming
testing. However, these additions to the DT can result in good things and
improvements anyway, making them perfectly safe from the standpoint of being
unable to do any harm or cause some unforeseen regressions.
These changes apply to the both supported hardware revisions of the Pine64
RockPro64, i.e. to the production-run revisions 2.0 and 2.1. [1][2]
[1] https://files.pine64.org/doc/rockpro64/rockpro64_v21-SCH.pdf
[2] https://files.pine64.org/doc/rockpro64/rockpro64_v20-SCH.pdf
[3] https://z9.de/hedgedoc/s/nF4d5G7rg#reboot-tests-for-PCIe-improvements
[4] https://lore.kernel.org/lkml/20230509153912.515218-1-vincenzopalazzodev@gmail.com/T/#u
Fixes: bba821f5479e ("arm64: dts: rockchip: add PCIe nodes on rk3399-rockpro64")
Cc: stable@vger.kernel.org
Cc: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
Cc: Peter Geis <pgwipeout@gmail.com>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Reported-by: Diederik de Haas <didi.debian@cknow.org>
Tested-by: Chris Vogel <chris@z9.de>
Signed-off-by: Dragan Simic <dsimic@manjaro.org>
Tested-by: Diederik de Haas <didi.debian@cknow.org>
Link: https://lore.kernel.org/r/b39cfd7490d8194f053bf3971f13a43472d1769e.1740941097.git.dsimic@manjaro.org
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
|
|
Make NET_DSA_REALTEK_RTL8366RB_LEDS a hidden symbol.
It seems very unlikely user would want to intentionally
disable it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250228004534.3428681-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Partially revert commit b71724147e73 ("be2net: replace polling with
sleeping in the FW completion path") w.r.t mcc mutex it introduces and the
use of usleep_range. The be2net be_ndo_bridge_getlink() callback is
called with rcu_read_lock, so this code has been broken for a long time.
Both the mutex_lock and the usleep_range can cause the issue Ian Kumlien
reported[1]. The call path is:
be_ndo_bridge_getlink -> be_cmd_get_hsw_config -> be_mcc_notify_wait ->
be_mcc_wait_compl -> usleep_range()
[1] https://lore.kernel.org/netdev/CAA85sZveppNgEVa_FD+qhOMtG_AavK9_mFiU+jWrMtXmwqefGA@mail.gmail.com/
Tested-by: Ian Kumlien <ian.kumlien@gmail.com>
Fixes: b71724147e73 ("be2net: replace polling with sleeping in the FW completion path")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250227164129.1201164-1-razor@blackwall.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
CPUID leaf 0x2's one-byte TLB descriptors report the number of entries
for specific TLB types, among other properties.
Typically, each emitted descriptor implies the same number of entries
for its respective TLB type(s). An emitted 0x63 descriptor is an
exception: it implies 4 data TLB entries for 1GB pages and 32 data TLB
entries for 2MB or 4MB pages.
For the TLB descriptors parsing code, the entry count for 1GB pages is
encoded at the intel_tlb_table[] mapping, but the 2MB/4MB entry count is
totally ignored.
Update leaf 0x2's parsing logic 0x2 to account for 32 data TLB entries
for 2MB/4MB pages implied by the 0x63 descriptor.
Fixes: e0ba94f14f74 ("x86/tlb_info: get last level TLB entry number of CPU")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-4-darwi@linutronix.de
|
|
CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.
Leaf 0x2 parsing at intel.c only validated the MSBs of EAX, EBX, and
ECX, but left EDX unchecked.
Validate EDX's most-significant bit as well.
Fixes: e0ba94f14f74 ("x86/tlb_info: get last level TLB entry number of CPU")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-3-darwi@linutronix.de
|
|
CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.
The historical Git commit:
019361a20f016 ("- pre6: Intel: start to add Pentium IV specific stuff (128-byte cacheline etc)...")
introduced leaf 0x2 output parsing. It only validated the MSBs of EAX,
EBX, and ECX, but left EDX unchecked.
Validate EDX's most-significant bit.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-2-darwi@linutronix.de
|
|
The new header only exports a single unpack function and a CPIO_HDRLEN
constant for future test use.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Link: https://lore.kernel.org/r/20250304061020.9815-2-ddiss@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Mateusz Guzik <mjguzik@gmail.com> says:
As a side effect of looking at the pipe hang I came up with 3 changes to
consider for -next.
The first one is a trivial clean up which I wont mind if it merely gets
folded into someone else's change for pipes.
The second one reduces page alloc/free calls for the backing area (60%
less during a kernel build in my testing). I already posted this, but
the cc list was not proper.
The last one concerns the wait/wakeup mechanism and drops one lock trip
in the common case after waking up.
* patches from https://lore.kernel.org/r/20250303230409.452687-1-mjguzik@gmail.com:
wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
pipe: cache 2 pages instead of 1
pipe: drop an always true check in anon_pipe_write()
Link: https://lore.kernel.org/r/20250303230409.452687-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In vast majority of cases the condition determining whether the thread
can proceed is true after the first wake up.
However, even in that case the thread ends up calling into
prepare_to_wait_event() again, suffering a spurious irq + lock trip.
Then it calls into finish_wait() to unlink itself.
Note that in case of a pending signal the work done by
prepare_to_wait_event() gets ignored even without the change.
pre-check the condition after waking up instead.
Stats gathared during a kernel build:
bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait \
{ @[probe] = count(); }'
@[kprobe:finish_wait]: 392483
@[kprobe:prepare_to_wait_event]: 778690
As in calls to prepare_to_wait_event() almost double calls to
finish_wait(). This evens out with the patch.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250303230409.452687-4-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
User data is kept in a circular buffer backed by pages allocated as
needed. Only having space for one spare is still prone to having to
resort to allocation / freeing.
In my testing this decreases page allocs by 60% during a kernel build.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250303230409.452687-3-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The check operates on the stale value of 'head' and always loops back.
Just do it unconditionally. No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250303230409.452687-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Christian Brauner <brauner@kernel.org> says:
In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto
targets that reside on shared mounts") I fixed a bug where propagating
the source mount tree of an anonymous mount namespace into a target
mount tree of a non-anonymous mount namespace could be used to trigger
an integer overflow in the non-anonymous mount namespace causing any new
mounts to fail.
The cause of this was that the propagation algorithm was unable to
recognize mounts from the source mount tree that were already propagated
into the target mount tree and then reappeared as propagation targets
when walking the destination propagation mount tree.
When fixing this I disabled mount propagation into anonymous mount
namespaces. Make it possible for anonymous mount namespace to receive
mount propagation events correctly. This is no also a correctness issue
now that we allow mounting detached mount trees onto detached mount
trees.
Mark the source anonymous mount namespace with MNTNS_PROPAGATING
indicating that all mounts belonging to this mount namespace are
currently in the process of being propagated and make the propagation
algorithm discard those if they appear as propagation targets.
* patches from https://lore.kernel.org/r/20250225-work-mount-propagation-v1-0-e6e3724500eb@kernel.org:
selftests: test subdirectory mounting
selftests: add test for detached mount tree propagation
mount: handle mount propagation for detached mount trees
Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-0-e6e3724500eb@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This tests mounting a subdirectory without ever having to expose the
filesystem to a non-anonymous mount namespace.
Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-3-e6e3724500eb@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Test that detached mount trees receive propagation events.
Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-2-e6e3724500eb@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
clang correctly notices that the 'uflags' variable initialization
only happens in some cases:
fs/namespace.c:4622:6: error: variable 'uflags' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
4622 | if (flags & MOVE_MOUNT_F_EMPTY_PATH) uflags = AT_EMPTY_PATH;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/namespace.c:4623:48: note: uninitialized use occurs here
4623 | from_name = getname_maybe_null(from_pathname, uflags);
| ^~~~~~
fs/namespace.c:4622:2: note: remove the 'if' if its condition is always true
4622 | if (flags & MOVE_MOUNT_F_EMPTY_PATH) uflags = AT_EMPTY_PATH;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: b1e9423d65e3 ("fs: support getname_maybe_null() in move_mount()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20250226081201.1876195-1-arnd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto
targets that reside on shared mounts") I fixed a bug where propagating
the source mount tree of an anonymous mount namespace into a target
mount tree of a non-anonymous mount namespace could be used to trigger
an integer overflow in the non-anonymous mount namespace causing any new
mounts to fail.
The cause of this was that the propagation algorithm was unable to
recognize mounts from the source mount tree that were already propagated
into the target mount tree and then reappeared as propagation targets
when walking the destination propagation mount tree.
When fixing this I disabled mount propagation into anonymous mount
namespaces. Make it possible for anonymous mount namespace to receive
mount propagation events correctly. This is no also a correctness issue
now that we allow mounting detached mount trees onto detached mount
trees.
Mark the source anonymous mount namespace with MNTNS_PROPAGATING
indicating that all mounts belonging to this mount namespace are
currently in the process of being propagated and make the propagation
algorithm discard those if they appear as propagation targets.
Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-1-e6e3724500eb@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The previous patch series only enabled the creation of detached mounts
from detached mounts that were created via open_tree(). In such cases we
know that the origin sequence number for the newly created anonymous
mount namespace will be set to the sequence number of the mount
namespace the source mount belonged to.
But fsmount() creates an anonymous mount namespace that does not have an
origin mount namespace as the anonymous mount namespace was derived from
a filesystem context created via fsopen().
Account for this case and allow the creation of detached mounts from
mounts created via fsmount(). Consequently, any such detached mount
created from an fsmount() mount will also have a zero origin sequence
number.
This allows to mount subdirectories without ever having to expose the
filesystem to a a non-anonymous mount namespace:
fd_context = sys_fsopen("tmpfs", 0);
sys_fsconfig(fd_context, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fd_tmpfs = sys_fsmount(fd_context, 0, 0);
mkdirat(fd_tmpfs, "subdir", 0755);
fd_tree = sys_open_tree(fd_tmpfs, "subdir", OPEN_TREE_CLONE);
sys_move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Christian Brauner <brauner@kernel.org> says:
This series expands the abilities of anonymous mount namespaces.
Terminology
===========
detached mount:
A detached mount is a mount belonging to an anonymous mount namespace.
anonymous mount namespace:
An anonymous mount namespace is a mount namespace that does not appear
in nsfs. This means neither can it be setns()ed into nor can it be
persisted through bind-mounts.
attached mount:
An attached mount is a mount belonging to a non-anonymous mount
namespace.
non-anonymous mount namespace:
A non-anonymous mount namespace is a mount namespace that does appear
in nsfs. This means it can be setns()ed into and can be persisted
through bind-mounts.
mount namespace sequence number:
Each non-anonymous mount namespace has a unique 64bit sequence number
that is assigned when the mount namespace is created. The sequence
number uniquely identifies a non-anonymous mount namespace.
One of the purposes of the sequence number is to prevent mount
namespace loops. These can occur when an nsfs mount namespace file is
bind mounted into a mount namespace that was created after it.
Such loops are prevented by verifying that the sequence number of the
target mount namespace is smaller than the sequence number of the nsfs
mount namespace file. In other words, the target mount namespace must
have been created before the nsfs file mount namespace.
In contrast, anonymous mount namespaces don't have a sequence number
assigned. Anonymous mount namespaces do not appear in any nsfs
instances and can thus not be pinned by bind-mounting them anywhere.
They can thus not be used to form cycles.
Creating detached mounts from detached mounts
=============================================
Currently, detached mounts can only be created from attached mounts.
This limitaton prevents various use-cases. For example, the ability to
mount a subdirectory without ever having to make the whole filesystem
visible first.
The current permission model for OPEN_TREE_CLONE flag of the open_tree()
system call is:
(1) Check that the caller is privileged over the owning user namespace
of it's current mount namespace.
(2) Check that the caller is located in the mount namespace of the mount
it wants to create a detached copy of.
While it is not strictly necessary to do it this way it is consistently
applied in the new mount api. This model will also be used when allowing
the creation of detached mount from another detached mount.
The (1) requirement can simply be met by performing the same check as
for the non-detached case, i.e., verify that the caller is privileged
over its current mount namespace.
To meet the (2) requirement it must be possible to infer the origin
mount namespace that the anonymous mount namespace of the detached mount
was created from.
The origin mount namespace of an anonymous mount is the mount namespace
that the mounts that were copied into the anonymous mount namespace
originate from.
In order to check the origin mount namespace of an anonymous mount
namespace two methods come to mind:
(i) stash a reference to the original mount namespace in the anonymous
mount namespace
(ii) record the sequence number of the original mount namespace in the
anonymous mount namespace
The (i) option has more complicated consequences and implications than
(ii). For example, it would pin the origin mount namespace. Even with a
passive reference it would pointlessly pin memory as access to the
origin mount namespace isn't required.
With (ii) in place it is possible to perform an equivalent check (2') to
(2). The origin mount namespace of the anonymous mount namespace must be
the same as the caller's mount namespace. To establish this the sequence
number of the caller's mount namespace and the origin sequence number of
the anonymous mount namespace are compared.
The caller is always located in a non-anonymous mount namespace since
anonymous mount namespaces cannot be setns()ed into. The caller's mount
namespace will thus always have a valid sequence number.
The owning namespace of any mount namespace, anonymous or non-anonymous,
can never change. A mount attached to a non-anonymous mount namespace
can never change mount namespace.
If the sequence number of the non-anonymous mount namespace and the
origin sequence number of the anonymous mount namespace match, the
owning namespaces must match as well.
Hence, the capability check on the owning namespace of the caller's
mount namespace ensures that the caller has the ability to copy the
mount tree.
Mounting detached mounts onto detached mounts
=============================================
Currently, detached mounts can only be mounted onto attached mounts.
This limitation makes it impossible to assemble a new private rootfs and
move it into place. Instead, a detached tree must be created, attached,
then mounted open and then either moved or detached again. Lift this
restriction.
In order to allow mounting detached mounts onto other detached mounts
the same permission model used for creating detached mounts from
detached mounts can be used (cf. above).
Allowing to mount detached mounts onto detached mounts leaves three
cases to consider:
(1) The source mount is an attached mount and the target mount is a
detached mount. This would be equivalent to moving a mount between
different mount namespaces. A caller could move an attached mount to
a detached mount. The detached mount can now be freely attached to
any mount namespace. This changes the current delegatioh model
significantly for no good reason. So this will fail.
(2) Anonymous mount namespaces are always attached fully, i.e., it is
not possible to only attach a subtree of an anoymous mount
namespace. This simplifies the implementation and reasoning.
Consequently, if the anonymous mount namespace of the source
detached mount and the target detached mount are the identical the
mount request will fail.
(3) The source mount's anonymous mount namespace is different from the
target mount's anonymous mount namespace.
In this case the source anonymous mount namespace of the source
mount tree must be freed after its mounts have been moved to the
target anonymous mount namespace. The source anonymous mount
namespace must be empty afterwards.
By allowing to mount detached mounts onto detached mounts a caller may
do the following:
fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE)
fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE)
fd_tree1 and fd_tree2 refer to two different detached mount trees that
belong to two different anonymous mount namespace.
It is important to note that fd_tree1 and fd_tree2 both refer to the
root of their respective anonymous mount namespaces.
By allowing to mount detached mounts onto detached mounts the caller
may now do:
move_mount(fd_tree1, "", fd_tree2, "",
MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)
This will cause the detached mount referred to by fd_tree1 to be
mounted on top of the detached mount referred to by fd_tree2.
Thus, the detached mount fd_tree1 is moved from its separate anonymous
mount namespace into fd_tree2's anonymous mount namespace.
It also means that while fd_tree2 continues to refer to the root of
its respective anonymous mount namespace fd_tree1 doesn't anymore.
This has the consequence that only fd_tree2 can be moved to another
anonymous or non-anonymous mount namespace. Moving fd_tree1 will now
fail as fd_tree1 doesn't refer to the root of an anoymous mount
namespace anymore.
Now fd_tree1 and fd_tree2 refer to separate detached mount trees
referring to the same anonymous mount namespace.
This is conceptually fine. The new mount api does allow for this to
happen already via:
mount -t tmpfs tmpfs /mnt
mkdir -p /mnt/A
mount -t tmpfs tmpfs /mnt/A
fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE)
fd_tree4 = open_tree(-EBADF, "/mnt/A", 0)
Both fd_tree3 and fd_tree4 refer to two different detached mount trees
but both detached mount trees refer to the same anonymous mount
namespace. An as with fd_tree1 and fd_tree2, only fd_tree3 may be
moved another mount namespace as fd_tree3 refers to the root of the
anonymous mount namespace just while fd_tree4 doesn't.
However, there's an important difference between the fd_tree3/fd_tree4
and the fd_tree1/fd_tree2 example.
Closing fd_tree4 and releasing the respective struct file will have no
further effect on fd_tree3's detached mount tree.
However, closing fd_tree3 will cause the mount tree and the respective
anonymous mount namespace to be destroyed causing the detached mount
tree of fd_tree4 to be invalid for further mounting.
By allowing to mount detached mounts on detached mounts as in the
fd_tree1/fd_tree2 example both struct files will affect each other.
Both fd_tree1 and fd_tree2 refer to struct files that have
FMODE_NEED_UNMOUNT set.
When either one of them is closed it ends up unmounting the mount
tree. The problem is that both will unconditionally free the mount
namespace and may end up causing UAFs for each other.
Another problem stems from the fact that fd_tree1 doesn't refer to the
root of the anonymous mount namespace. So ignoring the UAF issue, if
fd_tree2 were to be closed after fd_tree1, then fd_tree1 would free
only a part of the mount tree while leaking the rest of the mount
tree.
Multiple solutions for this problem come to mind:
(1) Reference Counting Anonymous Mount Namespaces
A solution to this problem would be reference counting anonymous
mount namespaces. The source detached mount tree acquires a
reference when it is moved into the anonymous mount namespace of
the target mount tree. When fd_tree1 is closed the mount tree
isn't unmounted and the anonymous mount namespace shared between
the detached mount tree at fd_tree1 and fd_tree2 isn't freed.
However, this has another problem. When fd_tree2 is closed before
fd_tree1 then closing fd_tree1 will cause the mount tree to be
unmounted and the anonymous mount namespace to be destroyed.
However, fd_tree1 only refers to a part of the mounts that the
shared anonymous mount namespace has collected. So this would leak
mounts.
(2) Removing FMODE_NEED_UNMOUNT from the struct file of the source
detached mount tree
In the current state of the mount api the creation of two file
descriptors that refer to different detached mount trees but to
the same anonymous mount namespace is already possible. See the
fd_tree3/fd_tree4 examples above.
In those cases only one of the two file descriptors will actually
end up unmounting and destroying the detached mount tree.
Whether or not a struct file needs to unmount and destroy an
anonymous mount namespace is governed by the FMODE_NEED_UNMOUNT
flag. In the fd_tree3/fd_tree4 example above only fd_tree3 will
refer to a struct file that has FMODE_NEED_UNMOUNT set.
A similar solution would work for mounting detached mounts onto
detached mounts. When the source detached mount tree is moved to
the target detached mount tree and thus from the source anonymous
mount namespace to the target anonymous mount namespace the
FMODE_NEED_UNMOUNT flag will be removed from the struct file of
the source detached mount tree.
In the above example the FMODE_NEED_UNMOUNT would be removed from
the struct file that fd_tree1 refers to.
This requires that the source file descriptor fd_tree1 needs to be
kept open until move_mount() is finished so that FMODE_NEED_UNMOUNT
can be removed:
move_mount(fd_tree1, "", fd_tree2, "",
MOVE_MOUNT_F_EMPTY_PATH |
MOVE_MOUNT_T_EMPTY_PATH)
/*
* Remove FMODE_NEED_UNMOUNT so closing fd_tree1 will leave the
* mount tree alone.
*/
close(fd_tree1);
/*
* Remove the whole mount tree including all the mounts that
* were moved from fd_tree1 into fd_tree2.
*/
close(fd_tree2);
Since the source detached mount tree fd_tree1 has now become an
attached mount tree, i.e., fd_tree1_mnt->mnt_parent == fd_tree2_mnt
is is ineligible for attaching again as move_mount() requires that a
detached mount tree can only be attached if it is the root of an
anonymous mount namespace.
Removing FMODE_NEED_UNMOUNT doesn't require to hold @namespace_sem.
Attaching @fd_tree1 to @fd_tree2 requires holding @namespace_sem and
so does dissolve_on_fput() should @fd_tree2 have been closed
concurrently.
While removing FMODE_NEED_UNMOUNT can be done it would require some
ugly hacking similar to what's done for splice to remove
FMODE_NOWAIT. That's ugly.
(3) Use the fact that @fd_tree1 will have a parent mount once it has
been attached to @fd_tree2.
When dissolve_on_fput() is called the mount that has been passed in
will refer to the root of the anonymous mount namespace. If it
doesn't it would mean that mounts are leaked. So before allowing to
mount detached mounts onto detached mounts this would be a bug.
Now that detached mounts can be mounted onto detached mounts it just
means that the mount has been attached to another anonymous mount
namespace and thus dissolve_on_fput() must not unmount the mount
tree or free the anonymous mount namespace as the file referring to
the root of the namespace hasn't been closed yet.
If it had been closed yet it would be obvious because the mount
namespace would be NULL, i.e., the @fd_tree1 would have already been
unmounted. If @fd_tree1 hasn't been unmounted yet and has a parent
mount it is safe to skip any cleanup as closing @fd_tree2 will take
care of all cleanup operations.
Imho, (3) is the cleanest solution and thus has been chosen.
* patches from https://lore.kernel.org/r/20250221-brauner-open_tree-v1-0-dbcfcb98c676@kernel.org:
selftests: seventh test for mounting detached mounts onto detached mounts
selftests: sixth test for mounting detached mounts onto detached mounts
selftests: fifth test for mounting detached mounts onto detached mounts
selftests: fourth test for mounting detached mounts onto detached mounts
selftests: third test for mounting detached mounts onto detached mounts
selftests: second test for mounting detached mounts onto detached mounts
selftests: first test for mounting detached mounts onto detached mounts
fs: mount detached mounts onto detached mounts
fs: support getname_maybe_null() in move_mount()
selftests: create detached mounts from detached mounts
fs: create detached mounts from detached mounts
fs: add may_copy_tree()
fs: add fastpath for dissolve_on_fput()
fs: add assert for move_mount()
fs: add mnt_ns_empty() helper
fs: record sequence number of origin mount namespace
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-0-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a test to verify that detached mounts behave correctly.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-16-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a test to verify that detached mounts behave correctly.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-15-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a test to verify that detached mounts behave correctly.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-14-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a test to verify that detached mounts behave correctly.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-13-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a test to verify that detached mounts behave correctly.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-12-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a test to verify that detached mounts behave correctly.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-11-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a test to verify that detached mounts behave correctly.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-10-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently, detached mounts can only be mounted onto attached mounts.
This limitation makes it impossible to assemble a new private rootfs and
move it into place. That's an extremely powerful concept for container
and service workloads that we should support.
Right now, a detached tree must be created, attached, then it can gain
additional mounts and then it can either be moved (if it doesn't reside
under a shared mount) or a detached mount created again. Lift this
restriction.
In order to allow mounting detached mounts onto other detached mounts
the same permission model used for creating detached mounts from
detached mounts can be used:
(1) Check that the caller is privileged over the owning user namespace
of it's current mount namespace.
(2) Check that the caller is located in the mount namespace of the mount
it wants to create a detached copy of.
The origin mount namespace of the anonymous mount namespace must be the
same as the caller's mount namespace. To establish this the sequence
number of the caller's mount namespace and the origin sequence number of
the anonymous mount namespace are compared.
The caller is always located in a non-anonymous mount namespace since
anonymous mount namespaces cannot be setns()ed into. The caller's mount
namespace will thus always have a valid sequence number.
The owning namespace of any mount namespace, anonymous or non-anonymous,
can never change. A mount attached to a non-anonymous mount namespace
can never change mount namespace.
If the sequence number of the non-anonymous mount namespace and the
origin sequence number of the anonymous mount namespace match, the
owning namespaces must match as well.
Hence, the capability check on the owning namespace of the caller's
mount namespace ensures that the caller has the ability to attach the
mount tree.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-9-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Allow move_mount() to work with NULL path arguments.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-8-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-7-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add the ability to create detached mounts from detached mounts.
Currently, detached mounts can only be created from attached mounts.
This limitaton prevents various use-cases. For example, the ability to
mount a subdirectory without ever having to make the whole filesystem
visible first.
The current permission model for the OPEN_TREE_CLONE flag of the
open_tree() system call is:
(1) Check that the caller is privileged over the owning user namespace
of it's current mount namespace.
(2) Check that the caller is located in the mount namespace of the mount
it wants to create a detached copy of.
While it is not strictly necessary to do it this way it is consistently
applied in the new mount api. This model will also be used when allowing
the creation of detached mount from another detached mount.
The (1) requirement can simply be met by performing the same check as
for the non-detached case, i.e., verify that the caller is privileged
over its current mount namespace.
To meet the (2) requirement it must be possible to infer the origin
mount namespace that the anonymous mount namespace of the detached mount
was created from.
The origin mount namespace of an anonymous mount is the mount namespace
that the mounts that were copied into the anonymous mount namespace
originate from.
The origin mount namespace of the anonymous mount namespace must be the
same as the caller's mount namespace. To establish this the sequence
number of the caller's mount namespace and the origin sequence number of
the anonymous mount namespace are compared.
The caller is always located in a non-anonymous mount namespace since
anonymous mount namespaces cannot be setns()ed into. The caller's mount
namespace will thus always have a valid sequence number.
The owning namespace of any mount namespace, anonymous or non-anonymous,
can never change. A mount attached to a non-anonymous mount namespace
can never change mount namespace.
If the sequence number of the non-anonymous mount namespace and the
origin sequence number of the anonymous mount namespace match, the
owning namespaces must match as well.
Hence, the capability check on the owning namespace of the caller's
mount namespace ensures that the caller has the ability to copy the
mount tree.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-6-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a helper that verifies whether a caller may copy a given mount tree.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-5-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Instead of acquiring the namespace semaphore and the mount lock
everytime we close a file with FMODE_NEED_UNMOUNT set add a fastpath
that checks whether we need to at all. Most of the time the caller will
have attached the mount to the filesystem hierarchy and there's nothing
to do.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-4-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
After we've attached a detached mount tree the anonymous mount namespace
must be empty. Add an assert and make this assumption explicit.
Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-3-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|