Age | Commit message (Collapse) | Author |
|
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in rpl_output() is not good enough,
because rpl_output() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter rpl_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Apply a similar change in rpl_input().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Aring <aahringo@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As explained in commit 1378817486d6 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in ioam6_output() is not good enough,
because ioam6_output() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter ioam6_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Fixes: 8cb3bf8bff3c ("ipv6: ioam: Add support for the ip6ip6 encapsulation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Justin Iurman <justin.iurman@uliege.be>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When vmxnet3_rq_create() fails to allocate memory for rq->data_ring.base,
the subsequent call to vmxnet3_rq_destroy_all_rxdataring does not reset
rq->data_ring.desc_size for the data ring that failed, which presumably
causes the hypervisor to reference it on packet reception.
To fix this bug, rq->data_ring.desc_size needs to be set to 0 to tell
the hypervisor to disable this feature.
[ 95.436876] kernel BUG at net/core/skbuff.c:207!
[ 95.439074] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 95.440411] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.9.3-dirty #1
[ 95.441558] Hardware name: VMware, Inc. VMware Virtual
Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
[ 95.443481] RIP: 0010:skb_panic+0x4d/0x4f
[ 95.444404] Code: 4f 70 50 8b 87 c0 00 00 00 50 8b 87 bc 00 00 00 50
ff b7 d0 00 00 00 4c 8b 8f c8 00 00 00 48 c7 c7 68 e8 be 9f e8 63 58 f9
ff <0f> 0b 48 8b 14 24 48 c7 c1 d0 73 65 9f e8 a1 ff ff ff 48 8b 14 24
[ 95.447684] RSP: 0018:ffffa13340274dd0 EFLAGS: 00010246
[ 95.448762] RAX: 0000000000000089 RBX: ffff8fbbc72b02d0 RCX: 000000000000083f
[ 95.450148] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[ 95.451520] RBP: 000000000000002d R08: 0000000000000000 R09: ffffa13340274c60
[ 95.452886] R10: ffffffffa04ed468 R11: 0000000000000002 R12: 0000000000000000
[ 95.454293] R13: ffff8fbbdab3c2d0 R14: ffff8fbbdbd829e0 R15: ffff8fbbdbd809e0
[ 95.455682] FS: 0000000000000000(0000) GS:ffff8fbeefd80000(0000) knlGS:0000000000000000
[ 95.457178] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 95.458340] CR2: 00007fd0d1f650c8 CR3: 0000000115f28000 CR4: 00000000000406f0
[ 95.459791] Call Trace:
[ 95.460515] <IRQ>
[ 95.461180] ? __die_body.cold+0x19/0x27
[ 95.462150] ? die+0x2e/0x50
[ 95.462976] ? do_trap+0xca/0x110
[ 95.463973] ? do_error_trap+0x6a/0x90
[ 95.464966] ? skb_panic+0x4d/0x4f
[ 95.465901] ? exc_invalid_op+0x50/0x70
[ 95.466849] ? skb_panic+0x4d/0x4f
[ 95.467718] ? asm_exc_invalid_op+0x1a/0x20
[ 95.468758] ? skb_panic+0x4d/0x4f
[ 95.469655] skb_put.cold+0x10/0x10
[ 95.470573] vmxnet3_rq_rx_complete+0x862/0x11e0 [vmxnet3]
[ 95.471853] vmxnet3_poll_rx_only+0x36/0xb0 [vmxnet3]
[ 95.473185] __napi_poll+0x2b/0x160
[ 95.474145] net_rx_action+0x2c6/0x3b0
[ 95.475115] handle_softirqs+0xe7/0x2a0
[ 95.476122] __irq_exit_rcu+0x97/0xb0
[ 95.477109] common_interrupt+0x85/0xa0
[ 95.478102] </IRQ>
[ 95.478846] <TASK>
[ 95.479603] asm_common_interrupt+0x26/0x40
[ 95.480657] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 95.481801] Code: 22 d7 e9 54 87 01 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 93 ba 3b 00 fb f4 <e9> 2c 87 01 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[ 95.485563] RSP: 0018:ffffa133400ffe58 EFLAGS: 00000246
[ 95.486882] RAX: 0000000000004000 RBX: ffff8fbbc1d14064 RCX: 0000000000000000
[ 95.488477] RDX: ffff8fbeefd80000 RSI: ffff8fbbc1d14000 RDI: 0000000000000001
[ 95.490067] RBP: ffff8fbbc1d14064 R08: ffffffffa0652260 R09: 00000000000010d3
[ 95.491683] R10: 0000000000000018 R11: ffff8fbeefdb4764 R12: ffffffffa0652260
[ 95.493389] R13: ffffffffa06522e0 R14: 0000000000000001 R15: 0000000000000000
[ 95.495035] acpi_safe_halt+0x14/0x20
[ 95.496127] acpi_idle_do_entry+0x2f/0x50
[ 95.497221] acpi_idle_enter+0x7f/0xd0
[ 95.498272] cpuidle_enter_state+0x81/0x420
[ 95.499375] cpuidle_enter+0x2d/0x40
[ 95.500400] do_idle+0x1e5/0x240
[ 95.501385] cpu_startup_entry+0x29/0x30
[ 95.502422] start_secondary+0x11c/0x140
[ 95.503454] common_startup_64+0x13e/0x141
[ 95.504466] </TASK>
[ 95.505197] Modules linked in: nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 rfkill ip_set nf_tables vsock_loopback
vmw_vsock_virtio_transport_common qrtr vmw_vsock_vmci_transport vsock
sunrpc binfmt_misc pktcdvd vmw_balloon pcspkr vmw_vmci i2c_piix4 joydev
loop dm_multipath nfnetlink zram crct10dif_pclmul crc32_pclmul vmwgfx
crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel
sha512_ssse3 sha256_ssse3 vmxnet3 sha1_ssse3 drm_ttm_helper vmw_pvscsi
ttm ata_generic pata_acpi serio_raw scsi_dh_rdac scsi_dh_emc
scsi_dh_alua ip6_tables ip_tables fuse
[ 95.516536] ---[ end trace 0000000000000000 ]---
Fixes: 6f4833383e85 ("net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete()")
Signed-off-by: Matthias Stocker <mstocker@barracuda.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Ronak Doshi <ronak.doshi@broadcom.com>
Link: https://lore.kernel.org/r/20240531103711.101961-1-mstocker@barracuda.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This commit allows rcutorture to test double-call_srcu() when the
CONFIG_DEBUG_OBJECTS_RCU_HEAD Kconfig option is enabled. The non-raw
sdp structure's ->spinlock will be acquired in call_srcu(), hence this
commit also removes the current IRQ and preemption disabling so as to
avoid lockdep complaints.
Link: https://lore.kernel.org/all/20240407112714.24460-1-qiang.zhang1211@gmail.com/
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This reverts commit 28319d6dc5e2ffefa452c2377dd0f71621b5bff0. The race
it fixed was subject to conditions that don't exist anymore since:
1612160b9127 ("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks")
This latter commit removes the use of SRCU that used to cover the
RCU-tasks blind spot on exit between the tasklist's removal and the
final preemption disabling. The task is now placed instead into a
temporary list inside which voluntary sleeps are accounted as RCU-tasks
quiescent states. This would disarm the deadlock initially reported
against PID namespace exit.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The bypass lock contention mitigation assumes there can be at most
2 contenders on the bypass lock, following this scheme:
1) One kthread takes the bypass lock
2) Another one spins on it and increment the contended counter
3) A third one (a bypass enqueuer) sees the contended counter on and
busy loops waiting on it to decrement.
However this assumption is wrong. There can be only one CPU to find the
lock contended because call_rcu() (the bypass enqueuer) is the only
bypass lock acquire site that may not already hold the NOCB lock
beforehand, all the other sites must first contend on the NOCB lock.
Therefore step 2) is impossible.
The other problem is that the mitigation assumes that contenders all
belong to the same rdp CPU, which is also impossible for a raw spinlock.
In theory the warning could trigger if the enqueuer holds the bypass
lock and another CPU flushes the bypass queue concurrently but this is
prevented from all flush users:
1) NOCB kthreads only flush if they successfully _tried_ to lock the
bypass lock. So no contention management here.
2) Flush on callbacks migration happen remotely when the CPU is offline.
No concurrency against bypass enqueue.
3) Flush on deoffloading happen either locally with IRQs disabled or
remotely when the CPU is not yet online. No concurrency against
bypass enqueue.
4) Flush on barrier entrain happen either locally with IRQs disabled or
remotely when the CPU is offline. No concurrency against
bypass enqueue.
For those reasons, the bypass lock contention mitigation isn't needed
and is even wrong. Remove it but keep the warning reporting a contended
bypass lock on a remote CPU, to keep unexpected contention awareness.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Upon NOCB deoffloading, the rcuo kthread must be forced to sleep
until the corresponding rdp is ever offloaded again. The deoffloader
clears the SEGCBLIST_OFFLOADED flag, wakes up the rcuo kthread which
then notices that change and clears in turn its SEGCBLIST_KTHREAD_CB
flag before going to sleep, until it ever sees the SEGCBLIST_OFFLOADED
flag again, should a re-offloading happen.
Upon NOCB offloading, the rcuo kthread must be forced to wake up and
handle callbacks until the corresponding rdp is ever deoffloaded again.
The offloader sets the SEGCBLIST_OFFLOADED flag, wakes up the rcuo
kthread which then notices that change and sets in turn its
SEGCBLIST_KTHREAD_CB flag before going to check callbacks, until it
ever sees the SEGCBLIST_OFFLOADED flag cleared again, should a
de-offloading happen again.
This is all a crude ad-hoc and error-prone kthread (un-)parking
re-implementation.
Consolidate the behaviour with the appropriate API instead.
[ paulmck: Apply Qiang Zhang feedback provided in Link: below. ]
Link: https://lore.kernel.org/all/20240509074046.15629-1-qiang.zhang1211@gmail.com/
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The (de-)offloading process used to take care about the NOCB timer when
it depended on the per-rdp NOCB locking. However this isn't the case
anymore for a long while. It can now safely be armed and run during the
(de-)offloading process, which doesn't care about it anymore.
Update the comments accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The parts explaining the bypass lifecycle in (de-)offloading are out
of date and/or wrong. Bypass is simply enabled whenever SEGCBLIST_RCU_CORE
flag is off.
Fix the comments accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
There is no direct RCU counterpart to lockdep_assert_irqs_disabled()
and friends. Although it is possible to construct them, it would
be more convenient to have the following lockdep assertions:
lockdep_assert_in_rcu_read_lock()
lockdep_assert_in_rcu_read_lock_bh()
lockdep_assert_in_rcu_read_lock_sched()
lockdep_assert_in_rcu_reader()
This commit therefore creates them.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers into fixes
pinctrl: renesas: Fixes for v6.10
- Fix PREEMPT_RT build failure on RZ/G2L.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dave Jiang:
- Compile fix for cxl-test from missing linux/vmalloc.h
- Fix for memregion leaks in devm_cxl_add_region()
* tag 'cxl-fixes-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Fix memregion leaks in devm_cxl_add_region()
cxl/test: Add missing vmalloc.h for tools/testing/cxl/test/mem.c
|
|
In case of error during ima_fs_init() all the dentry already created
are removed. {ascii, binary}_securityfs_measurement_lists are freed
calling for each array the remove_securityfs_measurement_lists(). This
function, at the end, assigns to zero the securityfs_measurement_list_count.
This causes during the second call of remove_securityfs_measurement_lists()
to leave the dentry of the array pending, not removing them correctly,
because the securityfs_measurement_list_count is already zero.
Move the securityfs_measurement_list_count = 0 after the two
remove_securityfs_measurement_lists() calls to correctly remove all the
dentry already allocated.
Fixes: 9fa8e7625008 ("ima: add crypto agility support for template-hash algorithm")
Signed-off-by: Enrico Bravi <enrico.bravi@polito.it>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
|
|
If only isolated partitions are being created underneath the cgroup root,
there will only be one sched domain with top_cpuset.effective_cpus. We can
skip the unnecessary sched domains scanning code and save some cycles.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
A recent "cleanup" broke IIO channel read outs and thereby thermal
mitigation on the Lenovo ThinkPad X13s by returning zero instead of the
expected IIO value type in iio_read_channel_processed_scale():
thermal thermal_zone12: failed to read out thermal zone (-22)
Fixes: 3092bde731ca ("iio: inkern: move to the cleanup.h magic")
Cc: Nuno Sa <nuno.sa@analog.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Link: https://lore.kernel.org/r/20240530074416.13697-1-johan+linaro@kernel.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
|
|
Use IRQ ONESHOT flag to ensure the timestamp is not updated in the
hard handler during the thread handler. And use a fixed value of 1
sample that correspond to this first timestamp.
This way we can ensure the timestamp is always corresponding to the
value used by the timestamping mechanism. Otherwise, it is possible
that between FIFO count read and FIFO processing the timestamp is
overwritten in the hard handler.
Fixes: 111e1abd0045 ("iio: imu: inv_mpu6050: use the common inv_sensors timestamp module")
Cc: stable@vger.kernel.org
Signed-off-by: Jean-Baptiste Maneyrol <jean-baptiste.maneyrol@tdk.com>
Link: https://lore.kernel.org/r/20240527150117.608792-1-inv.git-commit@tdk.com
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
|
|
This patch fixes two issues regarding the sampling frequency setting:
-The attribute was set as per device, not per channel. As such, when
setting the sampling frequency, the configuration was always done for
the slot 0, and the correct configuration was applied on the next
channel configuration call by the LRU mechanism.
-The LRU implementation does not take into account external settings of
the slot registers. When setting the sampling frequency directly to a
slot register in write_raw(), there is no guarantee that other channels
were not also using that slot and now incorrectly retain their config
as live.
Set the sampling frequency attribute as separate in the channel templates.
Do not set the sampling directly to the slot register in write_raw(),
just mark the config as not live and let the LRU mechanism handle it.
As the reg variable is no longer used, remove it.
Fixes: 76a1e6a42802 ("iio: adc: ad7173: add AD7173 driver")
Signed-off-by: Dumitru Ceclan <dumitru.ceclan@analog.com>
Link: https://lore.kernel.org/r/20240530-ad7173-fixes-v3-5-b85f33079e18@analog.com
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
|
|
The previous value of the append status bit was not cleared before
setting the new value. This caused the bit to remain set after enabling
buffered mode for multiple channels and not permit further buffered
reads from a single channel after the fact.
Fixes: 76a1e6a42802 ("iio: adc: ad7173: add AD7173 driver")
Signed-off-by: Dumitru Ceclan <dumitru.ceclan@analog.com>
Link: https://lore.kernel.org/r/20240530-ad7173-fixes-v3-4-b85f33079e18@analog.com
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
|
|
STATX_SUBVOL
To pick the changes from:
2a82bb02941fb53d ("statx: stx_subvol")
This silences this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/stat.h include/uapi/linux/stat.h
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZlnK2Fmx_gahzwZI@x1
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
into HEAD
KVM/riscv fixes for 6.10, take #1
- No need to use mask when hart-index-bits is 0
- Fix incorrect reg_subtype labels in kvm_riscv_vcpu_set_reg_isa_ext()
|
|
* Fixes and debugging help for the #VE sanity check. Also disable
it by default, even for CONFIG_DEBUG_KERNEL, because it was found
to trigger spuriously (most likely a processor erratum as the
exact symptoms vary by generation).
* Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled
situation (GIF=0 or interrupt shadow) when the processor supports
virtual NMI. While generally KVM will not request an NMI window
when virtual NMIs are supported, in this case it *does* have to
single-step over the interrupt shadow or enable the STGI intercept,
in order to deliver the latched second NMI.
* Drop support for hand tuning APIC timer advancement from userspace.
Since we have adaptive tuning, and it has proved to work well,
drop the module parameter for manual configuration and with it a
few stupid bugs that it had.
|
|
Remove support for specifying a static local APIC timer advancement value,
and instead present a read-only boolean parameter to let userspace enable
or disable KVM's dynamic APIC timer advancement. Realistically, it's all
but impossible for userspace to specify an advancement that is more
precise than what KVM's adaptive tuning can provide. E.g. a static value
needs to be tuned for the exact hardware and kernel, and if KVM is using
hrtimers, likely requires additional tuning for the exact configuration of
the entire system.
Dropping support for a userspace provided value also fixes several flaws
in the interface. E.g. KVM interprets a negative value other than -1 as a
large advancement, toggling between a negative and positive value yields
unpredictable behavior as vCPUs will switch from dynamic to static
advancement, changing the advancement in the middle of VM creation can
result in different values for vCPUs within a VM, etc. Those flaws are
mostly fixable, but there's almost no justification for taking on yet more
complexity (it's minimal complexity, but still non-zero).
The only arguments against using KVM's adaptive tuning is if a setup needs
a higher maximum, or if the adjustments are too reactive, but those are
arguments for letting userspace control the absolute max advancement and
the granularity of each adjustment, e.g. similar to how KVM provides knobs
for halt polling.
Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com
Cc: Shuling Zhou <zhoushuling@huawei.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240522010304.1650603-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES
guests. Although KVM currently enforces LBRV for SEV-ES guests, there
are multiple issues with it:
o MSR_IA32_DEBUGCTLMSR is still intercepted. Since MSR_IA32_DEBUGCTLMSR
interception is used to dynamically toggle LBRV for performance reasons,
this can be fatal for SEV-ES guests. For ex SEV-ES guest on Zen3:
[guest ~]# wrmsr 0x1d9 0x4
KVM: entry failed, hardware error 0xffffffff
EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000
Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests.
No additional save/restore logic is required since MSR_IA32_DEBUGCTLMSR
is of swap type A.
o KVM will disable LBRV if userspace sets MSR_IA32_DEBUGCTLMSR before the
VMSA is encrypted. Fix this by moving LBRV enablement code post VMSA
encryption.
[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
2023, Vol 2, 15.35.2 Enabling SEV-ES.
https://bugzilla.kernel.org/attachment.cgi?id=304653
Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Co-developed-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240531044644.768-4-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES
guests. So, prevent SEV-ES guests when LBRV support is missing.
[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
2023, Vol 2, 15.35.2 Enabling SEV-ES.
https://bugzilla.kernel.org/attachment.cgi?id=304653
Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240531044644.768-3-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM currently allows userspace to read/write MSRs even after the VMSA is
encrypted. This can cause unintentional issues if MSR access has side-
effects. For ex, while migrating a guest, userspace could attempt to
migrate MSR_IA32_DEBUGCTLMSR and end up unintentionally disabling LBRV on
the target. Fix this by preventing access to those MSRs which are context
switched via the VMSA, once the VMSA is encrypted.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240531044644.768-2-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Some bootloader interface fixes, a dts fix, and a trivial cleanup"
* tag 'loongarch-fixes-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: Fix GMAC's phy-mode definitions in dts
LoongArch: Override higher address bits in JUMP_VIRT_ADDR
LoongArch: Fix entry point in kernel image header
LoongArch: Add all CPUs enabled by fdt to NUMA node 0
LoongArch: Fix built-in DTB detection
LoongArch: Remove CONFIG_ACPI_TABLE_UPGRADE in platform_init()
|
|
its_vlpi_prop_update() calls lpi_write_config() which obtains the
mapping information for a VLPI without lock held. So it could race
with its_vlpi_unmap().
Since all calls from its_irq_set_vcpu_affinity() require the same
lock to be held, hoist the locking there instead of sprinkling the
locking all over the place.
This bug was discovered using Coverity Static Analysis Security Testing
(SAST) by Synopsys, Inc.
[ tglx: Use guard() instead of goto ]
Fixes: 015ec0386ab6 ("irqchip/gic-v3-its: Add VLPI configuration handling")
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Hagar Hemdan <hagarhem@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240531162144.28650-1-hagarhem@amazon.com
|
|
After commit 1a80dbcb2dba, bpf_link can be freed by
link->ops->dealloc_deferred, but the code still tests and uses
link->ops->dealloc afterward, which leads to a use-after-free as
reported by syzbot. Actually, one of them should be sufficient, so
just call one of them instead of both. Also add a WARN_ON() in case
of any problematic implementation.
Fixes: 1a80dbcb2dba ("bpf: support deferring bpf_link dealloc to after RCU grace period")
Reported-by: syzbot+1989ee16d94720836244@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240602182703.207276-1-xiyou.wangcong@gmail.com
|
|
Fix unchecked MSR access error for processors with no HWP support. On
such processors, maximum frequency can be changed by the system firmware
using ACPI event ACPI_PROCESSOR_NOTIFY_HIGEST_PERF_CHANGED. This results
in accessing HWP MSR 0x771.
Call Trace:
<TASK>
generic_exec_single+0x58/0x120
smp_call_function_single+0xbf/0x110
rdmsrl_on_cpu+0x46/0x60
intel_pstate_get_hwp_cap+0x1b/0x70
intel_pstate_update_limits+0x2a/0x60
acpi_processor_notify+0xb7/0x140
acpi_ev_notify_dispatch+0x3b/0x60
HWP MSR 0x771 can be only read on a CPU which supports HWP and enabled.
Hence intel_pstate_get_hwp_cap() can only be called when hwp_active is
true.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lore.kernel.org/linux-pm/20240529155740.Hq2Hw7be@linutronix.de/
Fixes: e8217b4bece3 ("cpufreq: intel_pstate: Update the maximum CPU frequency consistently")
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The call to cc_platform_has() triggers a fault and system crash if call depth
tracking is active because the GS segment has been reset by load_segments() and
GS_BASE is now 0 but call depth tracking uses per-CPU variables to operate.
Call cc_platform_has() earlier in the function when GS is still valid.
[ bp: Massage. ]
Fixes: 5d8213864ade ("x86/retbleed: Add SKL return thunk")
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240603083036.637-1-bp@kernel.org
|
|
The iterator variable dst cannot be NULL and the if check can be removed.
Remove it and fix the following Coccinelle/coccicheck warning reported
by itnull.cocci:
ERROR: iterator variable bound on line 762 cannot be NULL
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240529101900.103913-2-thorsten.blum@toblux.com
|
|
Including silence detection register as volatile.
Signed-off-by: Jack Yu <jack.yu@realtek.com>
Link: https://msgid.link/r/c66a6bd6d220426793096b42baf85437@realtek.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in sound/soc/fsl/imx-pcm-dma.o
Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://msgid.link/r/20240602-md-snd-fsl-imx-pcm-dma-v1-1-e7efc33c6bf3@quicinc.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in sound/soc/mxs/snd-soc-mxs-pcm.o
Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://msgid.link/r/20240602-md-snd-soc-mxs-pcm-v1-1-1e663d11328d@quicinc.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Not having linux-arm-msm@ in cc for audio-related changes for Qualcomm
platforms means that interested parties can easily miss the patches. Add
corresponding L: entry so that linux-arm-msm ML gets CC'ed for audio
patches too.
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Link: https://msgid.link/r/20240531-asoc-qcom-cc-lamsm-v1-1-f026ad618496@linaro.org
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
We just return 0 after the skip_tlv label. No need to use a label.
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://msgid.link/r/20240603073224.14726-3-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
sof_ipc4_dma_config_tlv{} is for Audio DSP firmware only.
Don't set it in dspless mode.
Fixes: 17386cb1b48b ("ASoC: SOF: Intel: hda-dai: set dma_stream_channel_map device")
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://msgid.link/r/20240603073224.14726-2-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
I accidentally picked up an earlier version of this patch, which had
already landed via mm. The patch I picked up contains a bug, which I
kept as I thought it was a fix. So let's just revert it.
This reverts commit 4c6c0020427a4547845a83f7e4d6085e16c3e24f.
Fixes: 4c6c0020427a ("riscv: mm: accelerate pagefault when badaccess")
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240530164451.21336-1-palmer@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
On riscv32, it is possible for the last page in virtual address space
(0xfffff000) to be allocated. This page overlaps with PTR_ERR, so that
shouldn't happen.
There is already some code to ensure memblock won't allocate the last page.
However, buddy allocator is left unchecked.
Fix this by reserving physical memory that would be mapped at virtual
addresses greater than 0xfffff000.
Reported-by: Björn Töpel <bjorn@kernel.org>
Closes: https://lore.kernel.org/linux-riscv/878r1ibpdn.fsf@all.your.base.are.belong.to.us
Fixes: 76d2a0493a17 ("RISC-V: Init and Halt Code")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: <stable@vger.kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Link: https://lore.kernel.org/r/20240425115201.3044202-1-namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
Fix the following allmodconfig "make W=1" issues:
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-celtic.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-centeuro.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-croatian.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-cyrillic.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-gaelic.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-greek.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-iceland.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-inuit.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-romanian.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-roman.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-turkish.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_ascii.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp1250.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp1251.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp1255.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp437.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp737.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp775.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp850.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp852.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp855.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp857.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp860.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp861.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp862.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp863.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp864.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp865.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp866.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp869.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp874.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp932.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp936.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp949.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp950.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_euc-jp.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-13.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-14.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-15.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-1.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-2.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-3.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-4.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-5.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-6.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-7.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-9.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_koi8-r.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_koi8-ru.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_koi8-u.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_ucs2_utils.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_utf8.o
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240516-md-fs-nls-v1-1-ed540d8239bf@quicinc.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
I have a program that sets up a periodic timer with 10ms interval. When
the program attempts to call fallocate(2) on tmpfs, it goes into an
infinite loop. fallocate(2) takes longer than 10ms, so it gets
interrupted by a signal and it returns EINTR. On EINTR, the fallocate
call is restarted, going into the same loop again.
Let's change the signal_pending() check in shmem_fallocate() loop to
fatal_signal_pending(). This solves the problem of shmem_fallocate()
constantly restarting. Since most other filesystem's fallocate methods
don't react to signals, it is unlikely userspace really relies on timely
delivery of non-fatal signals while fallocate is running. Also the
comment before the signal check:
/*
* Good, the fallocate(2) manpage permits EINTR: we may have
* been interrupted because we are using up too much memory.
*/
indicates that the check was mainly added for OOM situations in which
case the process will be sent a fatal signal so this change preserves
the behavior in OOM situations.
[JK: Update changelog and comment based on upstream discussion]
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240515221044.590-1-jack@suse.cz
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert the hostfs filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20240530120111.3794664-1-lihongbo22@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently, none of the X1E80100 supported boards upstream have enabled
DP. As for USB, the reason it is not broken when it's obvious that the
offsets are wrong is because the only difference with respect to USB is
the difference in register name. The V6 uses QPHY_V6_PCS_CDR_RESET_TIME
while V6 N4 uses QPHY_V6_N4_PCS_RX_CONFIG. Now, in order for the DP to
work, the DP serdes tables need to be added as they have different
values for V6 N4 when compared to V6 ones, even though they use the same
V6 offsets. While at it, switch swing and pre-emphasis tables to V6 as
well.
Fixes: d7b3579f84f7 ("phy: qcom-qmp-combo: Add x1e80100 USB/DP combo phys")
Co-developed-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Signed-off-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20240527-x1e80100-phy-qualcomm-combo-fix-dp-v1-3-be8a0b882117@linaro.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
The new X1E80100 SoC bumps up the HW version of QMP phy to v6 N4 for
combo USB and DP PHY. Currently, the X1E80100 uses the pure V6 PCS
register offsets, which are different. Add the offsets so the
mentioned platform can be fixed later on. Add the new PCS offsets
in a dedicated header file.
Fixes: d7b3579f84f7 ("phy: qcom-qmp-combo: Add x1e80100 USB/DP combo phys")
Co-developed-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Signed-off-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20240527-x1e80100-phy-qualcomm-combo-fix-dp-v1-2-be8a0b882117@linaro.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
Currently, the x1e80100 uses pure V6 register offsets for DP part of the
combo PHY. This hasn't been an issue because external DP is not yet
enabled on any of the boards yet. But in order to enabled it, all these
new V6 N4 register offsets are needed. So add them.
Fixes: 762c3565f3c8 ("phy: qcom-qmp: qserdes-txrx: Add V6 N4 register offsets")
Co-developed-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Signed-off-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20240527-x1e80100-phy-qualcomm-combo-fix-dp-v1-1-be8a0b882117@linaro.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.
Here are some of the notes that I took:
(1) The deny_write_access() mechanism is causing really pointless issues
such as [1]. If a thread in a thread-group opens a file writable,
then writes some stuff, then closing the file descriptor and then
calling execve() they can fail the execve() with ETXTBUSY because
another thread in the thread-group could have concurrently called
fork(). Multi-threaded libraries such as go suffer from this.
(2) There are userspace attacks that rely on overwriting the binary of a
running process. These attacks are _mitigated_ but _not at all
prevented_ from ocurring by the deny_write_access() mechanism.
I'll go over some details. The clearest example of such attacks was
the attack against runC in CVE-2019-5736 (cf. [3]).
An attack could compromise the runC host binary from inside a
_privileged_ runC container. The malicious binary could then be used
to take over the host.
(It is crucial to note that this attack is _not_ possible with
unprivileged containers. IOW, the setup here is already insecure.)
The attack can be made when attaching to a running container or when
starting a container running a specially crafted image. For example,
when runC attaches to a container the attacker can trick it into
executing itself.
This could be done by replacing the target binary inside the
container with a custom binary pointing back at the runC binary
itself. As an example, if the target binary was /bin/bash, this
could be replaced with an executable script specifying the
interpreter path #!/proc/self/exe.
As such when /bin/bash is executed inside the container, instead the
target of /proc/self/exe will be executed. That magic link will
point to the runc binary on the host. The attacker can then proceed
to write to the target of /proc/self/exe to try and overwrite the
runC binary on the host.
However, this will not succeed because of deny_write_access(). Now,
one might think that this would prevent the attack but it doesn't.
To overcome this, the attacker has multiple ways:
* Open a file descriptor to /proc/self/exe using the O_PATH flag and
then proceed to reopen the binary as O_WRONLY through
/proc/self/fd/<nr> and try to write to it in a busy loop from a
separate process. Ultimately it will succeed when the runC binary
exits. After this the runC binary is compromised and can be used
to attack other containers or the host itself.
* Use a malicious shared library annotating a function in there with
the constructor attribute making the malicious function run as an
initializor. The malicious library will then open /proc/self/exe
for creating a new entry under /proc/self/fd/<nr>. It'll then call
exec to a) force runC to exit and b) hand the file descriptor off
to a program that then reopens /proc/self/fd/<nr> for writing
(which is now possible because runC has exited) and overwriting
that binary.
To sum up: the deny_write_access() mechanism doesn't prevent such
attacks in insecure setups. It just makes them minimally harder.
That's all.
The only way back then to prevent this is to create a temporary copy
of the calling binary itself when it starts or attaches to
containers. So what I did back then for LXC (and Aleksa for runC)
was to create an anonymous, in-memory file using the memfd_create()
system call and to copy itself into the temporary in-memory file,
which is then sealed to prevent further modifications. This sealed,
in-memory file copy is then executed instead of the original on-disk
binary.
Any compromising write operations from a privileged container to the
host binary will then write to the temporary in-memory binary and
not to the host binary on-disk, preserving the integrity of the host
binary. Also as the temporary, in-memory binary is sealed, writes to
this will also fail.
The point is that deny_write_access() is uselss to prevent these
attacks.
(3) Denying write access to an inode because it's currently used in an
exec path could easily be done on an LSM level. It might need an
additional hook but that should be about it.
(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
ago so while we do protect the main executable the bigger portion of
the things you'd think need protecting such as the shared libraries
aren't. IOW, we let anyone happily overwrite shared libraries.
(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
(5.1) We removed the legacy uselib() protection for preventing
overwriting of shared libraries. Nobody cared in 3 years.
(5.2) We allow write access to the elf interpreter after exec
completed treating it on a par with shared libraries.
Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.
Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: https://github.com/golang/go/issues/22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: https://github.com/buildkite/agent/pull/2736
Link: https://github.com/rust-lang/rust/issues/114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: https://github.com/dotnet/runtime/issues/58964
Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a missing double quote in the unsafe_copy_dirent_name() macro
comment.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Link: https://lore.kernel.org/r/20240602004729.229634-2-thorsten.blum@toblux.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Since commit c512c6918719 ("uaccess: implement a proper
unsafe_copy_to_user() and switch filldir over to it") the header file is
no longer needed.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Link: https://lore.kernel.org/r/20240602101534.348159-2-thorsten.blum@toblux.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Note the macro used here works regardless of LOCKDEP.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240602123720.775702-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
TOMOYO project has moved to SourceForge.net .
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
|