Age | Commit message (Collapse) | Author |
|
Since commit ad440432d1f9 ("dt-bindings: mfd: Ensure 'syscon' has a
more specific compatible") it is required to provide at least 2 compatibles
string for syscon node.
This patch documents the new compatible for stm32f4 SoC to support
global/shared CAN registers access for bxCAN controllers.
Signed-off-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/all/20230328073328.3949796-2-dario.binacchi@amarulasolutions.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Sven Auhagen says:
====================
net: mvpp2: rss fixes
This patch series fixes up some rss problems
in the mvpp2 driver.
The classifier is missing some fragmentation flags,
the parser has the QinQ headers switched and
the PPPoE Layer 4 detecion is not working
correctly.
This is leading to no or bad rss for the default
settings.
====================
Link: https://lore.kernel.org/r/20230325163903.ofefgus43x66as7i@Svens-MacBookPro.local
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In PPPoE add all IPv4 header option length to the parser
and adjust the L3 and L4 offset accordingly.
Currently the L4 match does not work with PPPoE and
all packets are matched as L3 IP4 OPT.
Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The mvpp2 parser entry for QinQ has the inner and outer VLAN
in the wrong order.
Fix the problem by swapping them.
Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Reviewed-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add missing IP Fragmentation Flag.
Fixes: f9358e12a0af ("net: mvpp2: split ingress traffic into multiple flows")
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Reviewed-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
atomic_t based reference counting, including refcount_t, uses
atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
implemented with a atomic_try_cmpxchg() loop. High contention of the
reference count leads to retry loops and scales badly. There is nothing to
improve on this implementation as the semantics have to be preserved.
Provide rcuref as a scalable alternative solution which is suitable for RCU
managed objects. Similar to refcount_t it comes with overflow and underflow
detection and mitigation.
rcuref treats the underlying atomic_t as an unsigned integer and partitions
this space into zones:
0x00000000 - 0x7FFFFFFF valid zone (1 .. (INT_MAX + 1) references)
0x80000000 - 0xBFFFFFFF saturation zone
0xC0000000 - 0xFFFFFFFE dead zone
0xFFFFFFFF no reference
rcuref_get() unconditionally increments the reference count with
atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
reference count with atomic_add_negative_release().
This unconditional increment avoids the inc_not_zero() problem, but
requires a more complex implementation on the put() side when the count
drops from 0 to -1.
When this transition is detected then it is attempted to mark the reference
count dead, by setting it to the midpoint of the dead zone with a single
atomic_cmpxchg_release() operation. This operation can fail due to a
concurrent rcuref_get() elevating the reference count from -1 to 0 again.
If the unconditional increment in rcuref_get() hits a reference count which
is marked dead (or saturated) it will detect it after the fact and bring
back the reference count to the midpoint of the respective zone. The zones
provide enough tolerance which makes it practically impossible to escape
from a zone.
The racy implementation of rcuref_put() requires to protect rcuref_put()
against a grace period ending in order to prevent a subtle use after
free. As RCU is the only mechanism which allows to protect against that, it
is not possible to fully replace the atomic_inc_not_zero() based
implementation of refcount_t with this scheme.
The final drop is slightly more expensive than the atomic_dec_return()
counterpart, but that's not the case which this is optimized for. The
optimization is on the high frequeunt get()/put() pairs and their
scalability.
The performance of an uncontended rcuref_get()/put() pair where the put()
is not dropping the last reference is still on par with the plain atomic
operations, while at the same time providing overflow and underflow
detection and mitigation.
The performance of rcuref compared to plain atomic_inc_not_zero() and
atomic_dec_return() based reference counting under contention:
- Micro benchmark: All CPUs running a increment/decrement loop on an
elevated reference count, which means the 0 to -1 transition never
happens.
The performance gain depends on microarchitecture and the number of
CPUs and has been observed in the range of 1.3X to 4.7X
- Conversion of dst_entry::__refcnt to rcuref and testing with the
localhost memtier/memcached benchmark. That benchmark shows the
reference count contention prominently.
The performance gain depends on microarchitecture and the number of
CPUs and has been observed in the range of 1.1X to 2.6X over the
previous fix for the false sharing issue vs. struct
dst_entry::__refcnt.
When memtier is run over a real 1Gb network connection, there is a
small gain on top of the false sharing fix. The two changes combined
result in a 2%-5% total gain for that networked test.
Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230323102800.158429195@linutronix.de
|
|
atomic_add_negative() does not provide the relaxed/acquire/release
variants.
Provide them in preparation for a new scalable reference count algorithm.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230323102800.101763813@linutronix.de
|
|
Just skip the opcode(BPF_ST | BPF_NOSPEC) in the BPF JIT instead of
failing to JIT the entire program, given LoongArch currently has no
couterpart of a speculation barrier instruction. To verify the issue,
use the ltp testcase as shown below.
Also, Wang says:
I can confirm there's currently no speculation barrier equivalent
on LonogArch. (Loongson says there are builtin mitigations for
Spectre-V1 and V2 on their chips, and AFAIK efforts to port the
exploits to mips/LoongArch have all failed a few years ago.)
Without this patch:
$ ./bpf_prog02
[...]
bpf_common.c:123: TBROK: Failed verification: ??? (524)
[...]
Summary:
passed 0
failed 0
broken 1
skipped 0
warnings 0
With this patch:
$ ./bpf_prog02
[...]
Summary:
passed 0
failed 0
broken 0
skipped 0
warnings 0
Fixes: 5dc615520c4d ("LoongArch: Add BPF JIT support")
Signed-off-by: George Guo <guodongtai@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: WANG Xuerui <git@xen0n.name>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Link: https://lore.kernel.org/bpf/20230328071335.2664966-1-guodongtai@kylinos.cn
|
|
While reviewing the udp-iter batching patches, noticed the bpf_iter_tcp
calling sock_put() is incorrect. It should call sock_gen_put instead
because bpf_iter_tcp is iterating the ehash table which has the req sk
and tw sk. This patch replaces all sock_put with sock_gen_put in the
bpf_iter_tcp codepath.
Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230328004232.2134233-1-martin.lau@linux.dev
|
|
To determine whether the guest has caused an external interruption loop
upon code 20 (external interrupt) intercepts, the ext_new_psw needs to
be inspected to see whether external interrupts are enabled.
Under non-PV, ext_new_psw can simply be taken from guest lowcore. Under
PV, KVM can only access the encrypted guest lowcore and hence the
ext_new_psw must not be taken from guest lowcore.
handle_external_interrupt() incorrectly did that and hence was not able
to reliably tell whether an external interruption loop is happening or
not. False negatives cause spurious failures of my kvm-unit-test
for extint loops[1] under PV.
Since code 20 is only caused under PV if and only if the guest's
ext_new_psw is enabled for external interrupts, false positive detection
of a external interruption loop can not happen.
Fix this issue by instead looking at the guest PSW in the state
description. Since the PSW swap for external interrupt is done by the
ultravisor before the intercept is caused, this reliably tells whether
the guest is enabled for external interrupts in the ext_new_psw.
Also update the comments to explain better what is happening.
[1] https://lore.kernel.org/kvm/20220812062151.1980937-4-nrb@linux.ibm.com/
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes: 201ae986ead7 ("KVM: s390: protvirt: Implement interrupt injection")
Link: https://lore.kernel.org/r/20230213085520.100756-2-nrb@linux.ibm.com
Message-Id: <20230213085520.100756-2-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
|
|
With ISO 15765-2:2016 the PDU size is not limited to 2^12 - 1 (4095)
bytes but can be represented as a 32 bit unsigned integer value which
allows 2^32 - 1 bytes (~4GB). The use-cases like automotive unified
diagnostic services (UDS) and flashing of ECUs still use the small
static buffers which are provided at socket creation time.
When a use-case requires to transfer PDUs up to 1025 kByte the maximum
PDU size can now be extended by setting the module parameter
max_pdu_size. The extended size buffers are only allocated on a
per-socket/connection base when needed at run-time.
changes since v2: https://lore.kernel.org/all/20230313172510.3851-1-socketcan@hartkopp.net
- use ARRAY_SIZE() to reference DEFAULT_MAX_PDU_SIZE only at one place
changes since v1: https://lore.kernel.org/all/20230311143446.3183-1-socketcan@hartkopp.net
- limit the minimum 'max_pdu_size' to 4095 to maintain the classic
behavior before ISO 15765-2:2016
Link: https://github.com/raspberrypi/linux/issues/5371
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230326115911.15094-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Double-free error in bpf_linker__free() was reported by James Hilliard.
The error is caused by miss-use of realloc() in extend_sec().
The error occurs when two files with empty sections of the same name
are linked:
- when first file is processed:
- extend_sec() calls realloc(dst->raw_data, dst_align_sz)
with dst->raw_data == NULL and dst_align_sz == 0;
- dst->raw_data is set to a special pointer to a memory block of
size zero;
- when second file is processed:
- extend_sec() calls realloc(dst->raw_data, dst_align_sz)
with dst->raw_data == <special pointer> and dst_align_sz == 0;
- realloc() "frees" dst->raw_data special pointer and returns NULL;
- extend_sec() exits with -ENOMEM, and the old dst->raw_data value
is preserved (it is now invalid);
- eventually, bpf_linker__free() attempts to free dst->raw_data again.
This patch fixes the bug by avoiding -ENOMEM exit for dst_align_sz == 0.
The fix was suggested by Andrii Nakryiko <andrii.nakryiko@gmail.com>.
Reported-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: James Hilliard <james.hilliard1@gmail.com>
Link: https://lore.kernel.org/bpf/CADvTj4o7ZWUikKwNTwFq0O_AaX+46t_+Ca9gvWMYdWdRtTGeHQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20230328004738.381898-3-eddyz87@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2023-03-27
The first 2 patches by Geert Uytterhoeven add transceiver support and
improve the error messages in the rcar_canfd driver.
Cai Huoqing contributes 3 patches which remove a redundant call to
pci_clear_master() in the c_can, ctucanfd and kvaser_pciefd driver.
Frank Jungclaus's patch replaces the struct esd_usb_msg with a union
in the esd_usb driver to improve readability.
Markus Schneider-Pargmann contributes 5 patches to improve the
performance in the m_can driver, especially for SPI attached
controllers like the tcan4x5x.
* tag 'linux-can-next-for-6.4-20230327' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: m_can: Keep interrupts enabled during peripheral read
can: m_can: Disable unused interrupts
can: m_can: Remove double interrupt enable
can: m_can: Always acknowledge all interrupts
can: m_can: Remove repeated check for is_peripheral
can: esd_usb: Improve code readability by means of replacing struct esd_usb_msg with a union
can: kvaser_pciefd: Remove redundant pci_clear_master
can: ctucanfd: Remove redundant pci_clear_master
can: c_can: Remove redundant pci_clear_master
can: rcar_canfd: Improve error messages
can: rcar_canfd: Add transceiver support
====================
Link: https://lore.kernel.org/r/20230327073354.1003134-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Shay Agroskin says:
====================
Add tx push buf len param to ethtool
This patchset adds a new sub-configuration to ethtool get/set queue
params (ethtool -g) called 'tx-push-buf-len'.
This configuration specifies the maximum number of bytes of a
transmitted packet a driver can push directly to the underlying
device ('push' mode). The motivation for pushing some of the bytes to
the device has the advantages of
- Allowing a smart device to take fast actions based on the packet's
header
- Reducing latency for small packets that can be copied completely into
the device
This new param is practically similar to tx-copybreak value that can be
set using ethtool's tunable but conceptually serves a different purpose.
While tx-copybreak is used to reduce the overhead of DMA mapping and
makes no sense to use if less than the whole segment gets copied,
tx-push-buf-len allows to improve performance by analyzing the packet's
data (usually headers) before performing the DMA operation.
The configuration can be queried and set using the commands:
$ ethtool -g [interface]
# ethtool -G [interface] tx-push-buf-len [number of bytes]
This patchset also adds support for the new configuration in ENA driver
for which this parameter ensures efficient resources management on the
device side.
====================
Link: https://lore.kernel.org/r/20230323163610.1281468-1-shayagr@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
LLQ is auto enabled by the device and disabling it isn't supported on
new ENA generations while on old ones causes sub-optimal performance.
This patch adds advertisement of push-mode when LLQ mode is used, but
rejects an attempt to modify it.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The ENA driver allows for two distinct values for the number of bytes
of the packet's payload that can be written directly to the device.
For a value of 224 the driver turns on Large LLQ Header mode in which
the first 224 of the packet's payload are written to the LLQ.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the ability to modify LLQ entry size, the size of packet's
payload that can be written directly to the device changes.
This patch makes the driver recalculate this information every device
negotiation (also called device reset).
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Allow configuring the device with large LLQ headers. The Low Latency
Queue (LLQ) allows the driver to write the first N bytes of the packet,
along with the rest of the TX descriptors directly into device (N can be
either 96 or 224 for large LLQ headers configuration).
Having L4 TCP/UDP headers contained in the first 96 bytes of the packet
is required to get maximum performance from the device.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move ena_calc_io_queue_size() implementation closer to the file's
beginning so that it can be later called from ena_device_init()
function without adding a function declaration.
Also add an empty line at some spots to separate logical blocks in
funcitons.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This attribute, which is part of ethtool's ring param configuration
allows the user to specify the maximum number of the packet's payload
that can be written directly to the device.
Example usage:
# ethtool -G [interface] tx-push-buf-len [number of bytes]
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Similar to NL_SET_ERR_MSG_FMT, add a macro which sets netlink policy
error message with a format string.
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2023-03-27
Oleksij Rempel and Hillf Danton contribute a patch for the CAN J1939
protocol that prevents a potential deadlock in j1939_sk_errqueue().
Ivan Orlov fixes an uninit-value in the CAN BCM protocol in the
bcm_tx_setup() function.
* tag 'linux-can-fixes-for-6.3-20230327' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: bcm: bcm_tx_setup(): fix KMSAN uninit-value in vfs_write
can: j1939: prevent deadlock by moving j1939_sk_errqueue()
====================
Link: https://lore.kernel.org/r/20230327124807.1157134-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
clang with W=1 reports
drivers/net/ethernet/qlogic/qed/qed_ll2.c:649:6: error: variable
'num_ooo_add_to_peninsula' set but not used [-Werror,-Wunused-but-set-variable]
u32 num_ooo_add_to_peninsula = 0, cid;
^
This variable is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230326001733.1343274-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some MAINTAINERS sections mention to mail patches to the list
linux-nfc@lists.01.org. Probably due to changes on Intel's 01.org website
and servers, the list server lists.01.org/ml01.01.org is simply gone.
Considering emails recorded on lore.kernel.org, only a handful of emails
where sent to the linux-nfc@lists.01.org list, and they are usually also
sent to the netdev mailing list as well, where they are then picked up.
So, there is no big benefit in restoring the linux-nfc elsewhere.
Remove all occurrences of the linux-nfc@lists.01.org list in MAINTAINERS.
Suggested-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/all/CAKXUXMzggxQ43DUZZRkPMGdo5WkzgA=i14ySJUFw4kZfE5ZaZA@mail.gmail.com/
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230324081613.32000-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I've read through or reworked a good portion of this driver. Add myself
as a reviewer.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Link: https://lore.kernel.org/r/20230323145957.2999211-1-sean.anderson@seco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, MAX_SKB_FRAGS value is 17.
For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
attempts order-3 allocations, stuffing 32768 bytes per frag.
But with zero copy, we use order-0 pages.
For BIG TCP to show its full potential, we add a config option
to be able to fit up to 45 segments per skb.
This is also needed for BIG TCP rx zerocopy, as zerocopy currently
does not support skbs with frag list.
We have used MAX_SKB_FRAGS=45 value for years at Google before
we deployed 4K MTU, with no adverse effect, other than
a recent issue in mlx4, fixed in commit 26782aad00cc
("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
Back then, goal was to be able to receive full size (64KB) GRO
packets without the frag_list overhead.
Note that /proc/sys/net/core/max_skb_frags can also be used to limit
the number of fragments TCP can use in tx packets.
By default we keep the old/legacy value of 17 until we get
more coverage for the updated values.
Sizes of struct skb_shared_info on 64bit arches
MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
==============================================
17 320
21 320+64 = 384
25 320+128 = 448
29 320+192 = 512
33 320+256 = 576
37 320+320 = 640
41 320+384 = 704
45 320+448 = 768
This inflation might cause problems for drivers assuming they could pack
both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
using build_skb().
v3: fix build error when CONFIG_NET=n
v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20230323162842.1935061-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A system with more than one of these SSDs will only have one usable.
The kernel fails to detect more than one nvme device due to duplicate
cntlids.
before:
[ 9.395229] nvme 0000:01:00.0: platform quirk: setting simple suspend
[ 9.395262] nvme nvme0: pci function 0000:01:00.0
[ 9.395282] nvme 0000:03:00.0: platform quirk: setting simple suspend
[ 9.395305] nvme nvme1: pci function 0000:03:00.0
[ 9.409873] nvme nvme0: Duplicate cntlid 1 with nvme1, subsys nqn.2022-07.com.siliconmotion:nvm-subsystem-sn- , rejecting
[ 9.409982] nvme nvme0: Removing after probe failure status: -22
[ 9.427487] nvme nvme1: allocated 64 MiB host memory buffer.
[ 9.445088] nvme nvme1: 16/0/0 default/read/poll queues
[ 9.449898] nvme nvme1: Ignoring bogus Namespace Identifiers
after:
[ 1.161890] nvme 0000:01:00.0: platform quirk: setting simple suspend
[ 1.162660] nvme nvme0: pci function 0000:01:00.0
[ 1.162684] nvme 0000:03:00.0: platform quirk: setting simple suspend
[ 1.162707] nvme nvme1: pci function 0000:03:00.0
[ 1.191354] nvme nvme0: allocated 64 MiB host memory buffer.
[ 1.193378] nvme nvme1: allocated 64 MiB host memory buffer.
[ 1.211044] nvme nvme1: 16/0/0 default/read/poll queues
[ 1.211080] nvme nvme0: 16/0/0 default/read/poll queues
[ 1.216145] nvme nvme0: Ignoring bogus Namespace Identifiers
[ 1.216261] nvme nvme1: Ignoring bogus Namespace Identifiers
Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk to resolves the issue.
Signed-off-by: Juraj Pecigos <kernel@juraj.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
da7213.c is missing pm_runtime_disable(), thus we will get
below error when rmmod -> insmod.
$ rmmod snd-soc-da7213.ko
$ insmod snd-soc-da7213.ko
da7213 0-001a: Unbalanced pm_runtime_enable!"
[Kuninori adjusted to latest upstream]
Signed-off-by: Duy Nguyen <duy.nguyen.rh@renesas.com>
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Tested-by: Khanh Le <khanh.le.xr@renesas.com>
Link: https://lore.kernel.org/r/87mt3xg2tk.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Before relocating a block group we pause scrub, then do the relocation and
then unpause scrub. The relocation process requires starting and committing
a transaction, and if we have a failure in the critical section of the
transaction commit path (transaction state >= TRANS_STATE_COMMIT_START),
we will deadlock if there is a paused scrub.
That results in stack traces like the following:
[42.479] BTRFS info (device sdc): relocating block group 53876686848 flags metadata|raid6
[42.936] BTRFS warning (device sdc): Skipping commit of aborted transaction.
[42.936] ------------[ cut here ]------------
[42.936] BTRFS: Transaction aborted (error -28)
[42.936] WARNING: CPU: 11 PID: 346822 at fs/btrfs/transaction.c:1977 btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Modules linked in: dm_flakey dm_mod loop btrfs (...)
[42.936] CPU: 11 PID: 346822 Comm: btrfs Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[42.936] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[42.936] RIP: 0010:btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Code: ff ff 45 8b (...)
[42.936] RSP: 0018:ffffb58649633b48 EFLAGS: 00010282
[42.936] RAX: 0000000000000000 RBX: ffff8be6ef4d5bd8 RCX: 0000000000000000
[42.936] RDX: 0000000000000002 RSI: ffffffffb35e7782 RDI: 00000000ffffffff
[42.936] RBP: ffff8be6ef4d5c98 R08: 0000000000000000 R09: ffffb586496339e8
[42.936] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8be6d38c7c00
[42.936] R13: 00000000ffffffe4 R14: ffff8be6c268c000 R15: ffff8be6ef4d5cf0
[42.936] FS: 00007f381a82b340(0000) GS:ffff8beddfcc0000(0000) knlGS:0000000000000000
[42.936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42.936] CR2: 00007f1e35fb7638 CR3: 0000000117680006 CR4: 0000000000370ee0
[42.936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42.936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42.936] Call Trace:
[42.936] <TASK>
[42.936] ? start_transaction+0xcb/0x610 [btrfs]
[42.936] prepare_to_relocate+0x111/0x1a0 [btrfs]
[42.936] relocate_block_group+0x57/0x5d0 [btrfs]
[42.936] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[42.936] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[42.936] ? __pfx_autoremove_wake_function+0x10/0x10
[42.936] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[42.936] btrfs_balance+0x8ff/0x11d0 [btrfs]
[42.936] ? __kmem_cache_alloc_node+0x14a/0x410
[42.936] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[42.937] ? mod_objcg_state+0xd2/0x360
[42.937] ? refill_obj_stock+0xb0/0x160
[42.937] ? seq_release+0x25/0x30
[42.937] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[42.937] ? percpu_counter_add_batch+0x2e/0xa0
[42.937] ? __x64_sys_ioctl+0x88/0xc0
[42.937] __x64_sys_ioctl+0x88/0xc0
[42.937] do_syscall_64+0x38/0x90
[42.937] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[42.937] RIP: 0033:0x7f381a6ffe9b
[42.937] Code: 00 48 89 44 24 (...)
[42.937] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[42.937] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[42.937] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[42.937] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[42.937] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[42.937] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[42.937] </TASK>
[42.937] ---[ end trace 0000000000000000 ]---
[42.937] BTRFS: error (device sdc: state A) in cleanup_transaction:1977: errno=-28 No space left
[59.196] INFO: task btrfs:346772 blocked for more than 120 seconds.
[59.196] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.196] task:btrfs state:D stack:0 pid:346772 ppid:1 flags:0x00004002
[59.196] Call Trace:
[59.196] <TASK>
[59.196] __schedule+0x392/0xa70
[59.196] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.196] schedule+0x5d/0xd0
[59.196] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.197] ? __pfx_autoremove_wake_function+0x10/0x10
[59.197] scrub_pause_off+0x21/0x50 [btrfs]
[59.197] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.197] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] scrub_stripe+0x20d/0x740 [btrfs]
[59.198] scrub_chunk+0xc4/0x130 [btrfs]
[59.198] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.199] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.199] ? _copy_from_user+0x7b/0x80
[59.199] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.199] ? refill_stock+0x33/0x50
[59.199] ? should_failslab+0xa/0x20
[59.199] ? kmem_cache_alloc_node+0x151/0x460
[59.199] ? alloc_io_context+0x1b/0x80
[59.199] ? preempt_count_add+0x70/0xa0
[59.199] ? __x64_sys_ioctl+0x88/0xc0
[59.199] __x64_sys_ioctl+0x88/0xc0
[59.199] do_syscall_64+0x38/0x90
[59.199] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.199] RIP: 0033:0x7f82ffaffe9b
[59.199] RSP: 002b:00007f82ff9fcc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.199] RAX: ffffffffffffffda RBX: 000055b191e36310 RCX: 00007f82ffaffe9b
[59.199] RDX: 000055b191e36310 RSI: 00000000c400941b RDI: 0000000000000003
[59.199] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.199] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff9fd640
[59.199] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.199] </TASK>
[59.199] INFO: task btrfs:346773 blocked for more than 120 seconds.
[59.200] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.201] task:btrfs state:D stack:0 pid:346773 ppid:1 flags:0x00004002
[59.201] Call Trace:
[59.201] <TASK>
[59.201] __schedule+0x392/0xa70
[59.201] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.201] schedule+0x5d/0xd0
[59.201] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.201] ? __pfx_autoremove_wake_function+0x10/0x10
[59.201] scrub_pause_off+0x21/0x50 [btrfs]
[59.202] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.202] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.202] ? __pfx_autoremove_wake_function+0x10/0x10
[59.202] scrub_stripe+0x20d/0x740 [btrfs]
[59.202] scrub_chunk+0xc4/0x130 [btrfs]
[59.203] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.203] ? __pfx_autoremove_wake_function+0x10/0x10
[59.203] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.203] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.203] ? _copy_from_user+0x7b/0x80
[59.203] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.204] ? should_failslab+0xa/0x20
[59.204] ? kmem_cache_alloc_node+0x151/0x460
[59.204] ? alloc_io_context+0x1b/0x80
[59.204] ? preempt_count_add+0x70/0xa0
[59.204] ? __x64_sys_ioctl+0x88/0xc0
[59.204] __x64_sys_ioctl+0x88/0xc0
[59.204] do_syscall_64+0x38/0x90
[59.204] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.204] RIP: 0033:0x7f82ffaffe9b
[59.204] RSP: 002b:00007f82ff1fbc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.204] RAX: ffffffffffffffda RBX: 000055b191e36790 RCX: 00007f82ffaffe9b
[59.204] RDX: 000055b191e36790 RSI: 00000000c400941b RDI: 0000000000000003
[59.204] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.204] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff1fc640
[59.204] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.204] </TASK>
[59.204] INFO: task btrfs:346774 blocked for more than 120 seconds.
[59.205] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.206] task:btrfs state:D stack:0 pid:346774 ppid:1 flags:0x00004002
[59.206] Call Trace:
[59.206] <TASK>
[59.206] __schedule+0x392/0xa70
[59.206] schedule+0x5d/0xd0
[59.206] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.206] ? __pfx_autoremove_wake_function+0x10/0x10
[59.206] scrub_pause_off+0x21/0x50 [btrfs]
[59.207] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.207] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.207] ? __pfx_autoremove_wake_function+0x10/0x10
[59.207] scrub_stripe+0x20d/0x740 [btrfs]
[59.208] scrub_chunk+0xc4/0x130 [btrfs]
[59.208] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.208] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.208] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.208] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.209] ? _copy_from_user+0x7b/0x80
[59.209] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.209] ? should_failslab+0xa/0x20
[59.209] ? kmem_cache_alloc_node+0x151/0x460
[59.209] ? alloc_io_context+0x1b/0x80
[59.209] ? preempt_count_add+0x70/0xa0
[59.209] ? __x64_sys_ioctl+0x88/0xc0
[59.209] __x64_sys_ioctl+0x88/0xc0
[59.209] do_syscall_64+0x38/0x90
[59.209] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.209] RIP: 0033:0x7f82ffaffe9b
[59.209] RSP: 002b:00007f82fe9fac50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.209] RAX: ffffffffffffffda RBX: 000055b191e36c10 RCX: 00007f82ffaffe9b
[59.209] RDX: 000055b191e36c10 RSI: 00000000c400941b RDI: 0000000000000003
[59.209] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.209] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe9fb640
[59.209] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.209] </TASK>
[59.209] INFO: task btrfs:346775 blocked for more than 120 seconds.
[59.210] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.211] task:btrfs state:D stack:0 pid:346775 ppid:1 flags:0x00004002
[59.211] Call Trace:
[59.211] <TASK>
[59.211] __schedule+0x392/0xa70
[59.211] schedule+0x5d/0xd0
[59.211] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.211] ? __pfx_autoremove_wake_function+0x10/0x10
[59.211] scrub_pause_off+0x21/0x50 [btrfs]
[59.212] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.212] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.212] ? __pfx_autoremove_wake_function+0x10/0x10
[59.212] scrub_stripe+0x20d/0x740 [btrfs]
[59.213] scrub_chunk+0xc4/0x130 [btrfs]
[59.213] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.213] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.213] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.213] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.214] ? _copy_from_user+0x7b/0x80
[59.214] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.214] ? should_failslab+0xa/0x20
[59.214] ? kmem_cache_alloc_node+0x151/0x460
[59.214] ? alloc_io_context+0x1b/0x80
[59.214] ? preempt_count_add+0x70/0xa0
[59.214] ? __x64_sys_ioctl+0x88/0xc0
[59.214] __x64_sys_ioctl+0x88/0xc0
[59.214] do_syscall_64+0x38/0x90
[59.214] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.214] RIP: 0033:0x7f82ffaffe9b
[59.214] RSP: 002b:00007f82fe1f9c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.214] RAX: ffffffffffffffda RBX: 000055b191e37090 RCX: 00007f82ffaffe9b
[59.214] RDX: 000055b191e37090 RSI: 00000000c400941b RDI: 0000000000000003
[59.214] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.214] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe1fa640
[59.214] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.214] </TASK>
[59.214] INFO: task btrfs:346776 blocked for more than 120 seconds.
[59.215] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.217] task:btrfs state:D stack:0 pid:346776 ppid:1 flags:0x00004002
[59.217] Call Trace:
[59.217] <TASK>
[59.217] __schedule+0x392/0xa70
[59.217] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.217] schedule+0x5d/0xd0
[59.217] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.217] ? __pfx_autoremove_wake_function+0x10/0x10
[59.217] scrub_pause_off+0x21/0x50 [btrfs]
[59.217] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.217] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.218] ? __pfx_autoremove_wake_function+0x10/0x10
[59.218] scrub_stripe+0x20d/0x740 [btrfs]
[59.218] scrub_chunk+0xc4/0x130 [btrfs]
[59.218] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.219] ? __pfx_autoremove_wake_function+0x10/0x10
[59.219] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.219] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.219] ? _copy_from_user+0x7b/0x80
[59.219] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.219] ? should_failslab+0xa/0x20
[59.219] ? kmem_cache_alloc_node+0x151/0x460
[59.219] ? alloc_io_context+0x1b/0x80
[59.219] ? preempt_count_add+0x70/0xa0
[59.219] ? __x64_sys_ioctl+0x88/0xc0
[59.219] __x64_sys_ioctl+0x88/0xc0
[59.219] do_syscall_64+0x38/0x90
[59.219] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.219] RIP: 0033:0x7f82ffaffe9b
[59.219] RSP: 002b:00007f82fd9f8c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.219] RAX: ffffffffffffffda RBX: 000055b191e37510 RCX: 00007f82ffaffe9b
[59.219] RDX: 000055b191e37510 RSI: 00000000c400941b RDI: 0000000000000003
[59.219] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.219] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fd9f9640
[59.219] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.219] </TASK>
[59.219] INFO: task btrfs:346822 blocked for more than 120 seconds.
[59.220] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.222] task:btrfs state:D stack:0 pid:346822 ppid:1 flags:0x00004002
[59.222] Call Trace:
[59.222] <TASK>
[59.222] __schedule+0x392/0xa70
[59.222] schedule+0x5d/0xd0
[59.222] btrfs_scrub_cancel+0x91/0x100 [btrfs]
[59.222] ? __pfx_autoremove_wake_function+0x10/0x10
[59.222] btrfs_commit_transaction+0x572/0xeb0 [btrfs]
[59.223] ? start_transaction+0xcb/0x610 [btrfs]
[59.223] prepare_to_relocate+0x111/0x1a0 [btrfs]
[59.223] relocate_block_group+0x57/0x5d0 [btrfs]
[59.223] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[59.223] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[59.224] ? __pfx_autoremove_wake_function+0x10/0x10
[59.224] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[59.224] btrfs_balance+0x8ff/0x11d0 [btrfs]
[59.224] ? __kmem_cache_alloc_node+0x14a/0x410
[59.224] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[59.225] ? mod_objcg_state+0xd2/0x360
[59.225] ? refill_obj_stock+0xb0/0x160
[59.225] ? seq_release+0x25/0x30
[59.225] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[59.225] ? percpu_counter_add_batch+0x2e/0xa0
[59.225] ? __x64_sys_ioctl+0x88/0xc0
[59.225] __x64_sys_ioctl+0x88/0xc0
[59.225] do_syscall_64+0x38/0x90
[59.225] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.225] RIP: 0033:0x7f381a6ffe9b
[59.225] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.225] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[59.225] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[59.225] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[59.225] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[59.225] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[59.225] </TASK>
What happens is the following:
1) A scrub is running, so fs_info->scrubs_running is 1;
2) Task A starts block group relocation, and at btrfs_relocate_chunk() it
pauses scrub by calling btrfs_scrub_pause(). That increments
fs_info->scrub_pause_req from 0 to 1 and waits for the scrub task to
pause (for fs_info->scrubs_paused to be == to fs_info->scrubs_running);
3) The scrub task pauses at scrub_pause_off(), waiting for
fs_info->scrub_pause_req to decrease to 0;
4) Task A then enters btrfs_relocate_block_group(), and down that call
chain we start a transaction and then attempt to commit it;
5) When task A calls btrfs_commit_transaction(), it either will do the
commit itself or wait for some other task that already started the
commit of the transaction - it doesn't matter which case;
6) The transaction commit enters state TRANS_STATE_COMMIT_START;
7) An error happens during the transaction commit, like -ENOSPC when
running delayed refs or delayed items for example;
8) This results in calling transaction.c:cleanup_transaction(), where
we call btrfs_scrub_cancel(), incrementing fs_info->scrub_cancel_req
from 0 to 1, and blocking this task waiting for fs_info->scrubs_running
to decrease to 0;
9) From this point on, both the transaction commit and the scrub task
hang forever:
1) The transaction commit is waiting for fs_info->scrubs_running to
be decreased to 0;
2) The scrub task is at scrub_pause_off() waiting for
fs_info->scrub_pause_req to decrease to 0 - so it can not proceed
to stop the scrub and decrement fs_info->scrubs_running from 0 to 1.
Therefore resulting in a deadlock.
Fix this by having cleanup_transaction(), called if a transaction commit
fails, not call btrfs_scrub_cancel() if relocation is in progress, and
having btrfs_relocate_block_group() call btrfs_scrub_cancel() instead if
the relocation failed and a transaction abort happened.
This was triggered with btrfs/061 from fstests.
Fixes: 55e3a601c81c ("btrfs: Fix data checksum error cause by replace with io-load.")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This fixes mkfs/mount/check failures due to race with systemd-udevd
scan.
During the device scan initiated by systemd-udevd, other user space
EXCL operations such as mkfs, mount, or check may get blocked and result
in a "Device or resource busy" error. This is because the device
scan process opens the device with the EXCL flag in the kernel.
Two reports were received:
- btrfs/179 test case, where the fsck command failed with the -EBUSY
error
- LTP pwritev03 test case, where mkfs.vfs failed with
the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem
on the device.
In both cases, fsck and mkfs (respectively) were racing with a
systemd-udevd device scan, and systemd-udevd won, resulting in the
-EBUSY error for fsck and mkfs.
Reproducing the problem has been difficult because there is a very
small window during which these userspace threads can race to
acquire the exclusive device open. Even on the system where the problem
was observed, the problem occurrences were anywhere between 10 to 400
iterations and chances of reproducing decreases with debug printk()s.
However, an exclusive device open is unnecessary for the scan process,
as there are no write operations on the device during scan. Furthermore,
during the mount process, the superblock is re-read in the below
function call chain:
btrfs_mount_root
btrfs_open_devices
open_fs_devices
btrfs_open_one_device
btrfs_get_bdev_and_sb
So, to fix this issue, removes the FMODE_EXCL flag from the scan
operation, and add a comment.
The case where mkfs may still write to the device and a scan is running,
the btrfs signature is not written at that time so scan will not
recognize such device.
Reported-by: Sherry Yang <sherry.yang@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The quota assign ioctl can currently run in parallel with a quota disable
ioctl call. The assign ioctl uses the quota root, while the disable ioctl
frees that root, and therefore we can have a use-after-free triggered in
the assign ioctl, leading to a trace like the following when KASAN is
enabled:
[672.723][T736] BUG: KASAN: slab-use-after-free in btrfs_search_slot+0x2962/0x2db0
[672.723][T736] Read of size 8 at addr ffff888022ec0208 by task btrfs_search_sl/27736
[672.724][T736]
[672.725][T736] CPU: 1 PID: 27736 Comm: btrfs_search_sl Not tainted 6.3.0-rc3 #37
[672.723][T736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[672.727][T736] Call Trace:
[672.728][T736] <TASK>
[672.728][T736] dump_stack_lvl+0xd9/0x150
[672.725][T736] print_report+0xc1/0x5e0
[672.720][T736] ? __virt_addr_valid+0x61/0x2e0
[672.727][T736] ? __phys_addr+0xc9/0x150
[672.725][T736] ? btrfs_search_slot+0x2962/0x2db0
[672.722][T736] kasan_report+0xc0/0xf0
[672.729][T736] ? btrfs_search_slot+0x2962/0x2db0
[672.724][T736] btrfs_search_slot+0x2962/0x2db0
[672.723][T736] ? fs_reclaim_acquire+0xba/0x160
[672.722][T736] ? split_leaf+0x13d0/0x13d0
[672.726][T736] ? rcu_is_watching+0x12/0xb0
[672.723][T736] ? kmem_cache_alloc+0x338/0x3c0
[672.722][T736] update_qgroup_status_item+0xf7/0x320
[672.724][T736] ? add_qgroup_rb+0x3d0/0x3d0
[672.739][T736] ? do_raw_spin_lock+0x12d/0x2b0
[672.730][T736] ? spin_bug+0x1d0/0x1d0
[672.737][T736] btrfs_run_qgroups+0x5de/0x840
[672.730][T736] ? btrfs_qgroup_rescan_worker+0xa70/0xa70
[672.738][T736] ? __del_qgroup_relation+0x4ba/0xe00
[672.738][T736] btrfs_ioctl+0x3d58/0x5d80
[672.735][T736] ? tomoyo_path_number_perm+0x16a/0x550
[672.737][T736] ? tomoyo_execute_permission+0x4a0/0x4a0
[672.731][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50
[672.737][T736] ? __sanitizer_cov_trace_switch+0x54/0x90
[672.734][T736] ? do_vfs_ioctl+0x132/0x1660
[672.730][T736] ? vfs_fileattr_set+0xc40/0xc40
[672.730][T736] ? _raw_spin_unlock_irq+0x2e/0x50
[672.732][T736] ? sigprocmask+0xf2/0x340
[672.737][T736] ? __fget_files+0x26a/0x480
[672.732][T736] ? bpf_lsm_file_ioctl+0x9/0x10
[672.738][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50
[672.736][T736] __x64_sys_ioctl+0x198/0x210
[672.736][T736] do_syscall_64+0x39/0xb0
[672.731][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.739][T736] RIP: 0033:0x4556ad
[672.742][T736] </TASK>
[672.743][T736]
[672.748][T736] Allocated by task 27677:
[672.743][T736] kasan_save_stack+0x22/0x40
[672.741][T736] kasan_set_track+0x25/0x30
[672.741][T736] __kasan_kmalloc+0xa4/0xb0
[672.749][T736] btrfs_alloc_root+0x48/0x90
[672.746][T736] btrfs_create_tree+0x146/0xa20
[672.744][T736] btrfs_quota_enable+0x461/0x1d20
[672.743][T736] btrfs_ioctl+0x4a1c/0x5d80
[672.747][T736] __x64_sys_ioctl+0x198/0x210
[672.749][T736] do_syscall_64+0x39/0xb0
[672.744][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.756][T736]
[672.757][T736] Freed by task 27677:
[672.759][T736] kasan_save_stack+0x22/0x40
[672.759][T736] kasan_set_track+0x25/0x30
[672.756][T736] kasan_save_free_info+0x2e/0x50
[672.751][T736] ____kasan_slab_free+0x162/0x1c0
[672.758][T736] slab_free_freelist_hook+0x89/0x1c0
[672.752][T736] __kmem_cache_free+0xaf/0x2e0
[672.752][T736] btrfs_put_root+0x1ff/0x2b0
[672.759][T736] btrfs_quota_disable+0x80a/0xbc0
[672.752][T736] btrfs_ioctl+0x3e5f/0x5d80
[672.756][T736] __x64_sys_ioctl+0x198/0x210
[672.753][T736] do_syscall_64+0x39/0xb0
[672.765][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.769][T736]
[672.768][T736] The buggy address belongs to the object at ffff888022ec0000
[672.768][T736] which belongs to the cache kmalloc-4k of size 4096
[672.769][T736] The buggy address is located 520 bytes inside of
[672.769][T736] freed 4096-byte region [ffff888022ec0000, ffff888022ec1000)
[672.760][T736]
[672.764][T736] The buggy address belongs to the physical page:
[672.761][T736] page:ffffea00008bb000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22ec0
[672.766][T736] head:ffffea00008bb000 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[672.779][T736] flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
[672.770][T736] raw: 00fff00000010200 ffff888012842140 ffffea000054ba00 dead000000000002
[672.770][T736] raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000
[672.771][T736] page dumped because: kasan: bad access detected
[672.778][T736] page_owner tracks the page as allocated
[672.777][T736] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2040(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 88
[672.779][T736] get_page_from_freelist+0x119c/0x2d50
[672.779][T736] __alloc_pages+0x1cb/0x4a0
[672.776][T736] alloc_pages+0x1aa/0x270
[672.773][T736] allocate_slab+0x260/0x390
[672.771][T736] ___slab_alloc+0xa9a/0x13e0
[672.778][T736] __slab_alloc.constprop.0+0x56/0xb0
[672.771][T736] __kmem_cache_alloc_node+0x136/0x320
[672.789][T736] __kmalloc+0x4e/0x1a0
[672.783][T736] tomoyo_realpath_from_path+0xc3/0x600
[672.781][T736] tomoyo_path_perm+0x22f/0x420
[672.782][T736] tomoyo_path_unlink+0x92/0xd0
[672.780][T736] security_path_unlink+0xdb/0x150
[672.788][T736] do_unlinkat+0x377/0x680
[672.788][T736] __x64_sys_unlink+0xca/0x110
[672.789][T736] do_syscall_64+0x39/0xb0
[672.783][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.784][T736] page last free stack trace:
[672.787][T736] free_pcp_prepare+0x4e5/0x920
[672.787][T736] free_unref_page+0x1d/0x4e0
[672.784][T736] __unfreeze_partials+0x17c/0x1a0
[672.797][T736] qlist_free_all+0x6a/0x180
[672.796][T736] kasan_quarantine_reduce+0x189/0x1d0
[672.797][T736] __kasan_slab_alloc+0x64/0x90
[672.793][T736] kmem_cache_alloc+0x17c/0x3c0
[672.799][T736] getname_flags.part.0+0x50/0x4e0
[672.799][T736] getname_flags+0x9e/0xe0
[672.792][T736] vfs_fstatat+0x77/0xb0
[672.791][T736] __do_sys_newlstat+0x84/0x100
[672.798][T736] do_syscall_64+0x39/0xb0
[672.796][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.790][T736]
[672.791][T736] Memory state around the buggy address:
[672.799][T736] ffff888022ec0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.805][T736] ffff888022ec0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.802][T736] >ffff888022ec0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.809][T736] ^
[672.809][T736] ffff888022ec0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.809][T736] ffff888022ec0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fix this by having the qgroup assign ioctl take the qgroup ioctl mutex
before calling btrfs_run_qgroups(), which is what all qgroup ioctls should
call.
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAFcO6XN3VD8ogmHwqRk4kbiwtpUSNySu2VAxN8waEPciCHJvMA@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to
combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall. When
using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for
each partition found on the loop device after the second ioctl(), but
when using LOOP_CONFIGURE, no such uevent was being sent.
In the old setup, uevents are disabled for LOOP_SET_FD, but not for
LOOP_SET_STATUS64. This makes sense, as it prevents uevents being
sent for a partially configured device during LOOP_SET_FD - they're
only sent at the end of LOOP_SET_STATUS64. But for LOOP_CONFIGURE,
uevents were disabled for the entire operation, so that final
notification was never issued. To fix this, reduce the critical
section to exclude the loop_reread_partitions() call, which causes
the uevents to be issued, to after uevents are re-enabled, matching
the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination.
I noticed this because Busybox's losetup program recently changed from
using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke
my setup, for which I want a notification from the kernel any time a
new partition becomes available.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
[hch: reduced the critical section]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: 3448914e8cc5 ("loop: Add LOOP_CONFIGURE ioctl")
Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull kvm fixes from Paolo Bonzini:
"RISC-V:
- Fix VM hang in case of timer delta being zero
ARM:
- MMU fixes:
- Read the MMU notifier seq before dropping the mmap lock to guard
against reading a potentially stale VMA
- Disable interrupts when walking user page tables to protect
against the page table being freed
- Read the MTE permissions for the VMA within the mmap lock
critical section, avoiding the use of a potentally stale VMA
pointer
- vPMU fixes:
- Return the sum of the current perf event value and PMC snapshot
for reads from userspace
- Don't save the value of guest writes to PMCR_EL0.{C,P}, which
could otherwise lead to userspace erroneously resetting the vPMU
during VM save/restore"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
riscv/kvm: Fix VM hang in case of timer delta being zero.
KVM: arm64: Check for kvm_vma_mte_allowed in the critical section
KVM: arm64: Disable interrupts while walking userspace PTs
KVM: arm64: Retry fault if vma_lookup() results become invalid
KVM: arm64: PMU: Don't save PMCR_EL0.{C,P} for the vCPU
KVM: arm64: PMU: Fix GET_ONE_REG for vPMC regs to return the current value
|
|
The verifier test creates BPF ringbuf maps using hard-coded
4096 as max_entries. Some tests will fail if the page size
of the running kernel is not 4096. Use getpagesize() instead.
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230326095341.816023-1-hengqi.chen@gmail.com
|
|
This patch prevents races on the print function pointer, allowing the
libbpf_set_print() function to become thread-safe.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230325010845.46000-1-inwardvessel@gmail.com
|
|
For ACPI drivers that provide a ->notify() callback and set
ACPI_DRIVER_ALL_NOTIFY_EVENTS in their flags, that callback can be
invoked while either the ->add() or the ->remove() callback is running
without any synchronization at the bus type level which is counter to
the common-sense expectation that notification handling should only be
enabled when the driver is actually bound to the device. As a result,
if the driver is not careful enough, it's ->notify() callback may crash
when it is invoked too early or too late [1].
This issue has been amplified by commit d6fb6ee1820c ("ACPI: bus: Drop
driver member of struct acpi_device") that made acpi_bus_notify() check
for the presence of the driver and its ->notify() callback directly
instead of using an extra driver pointer that was only set and cleared
by the bus type code, but it was present before that commit although
it was harder to reproduce then.
It can be addressed by using the observation that
acpi_device_install_notify_handler() can be modified to install the
handler for all types of events when ACPI_DRIVER_ALL_NOTIFY_EVENTS is
set in the driver flags, in which case acpi_bus_notify() will not need
to invoke the driver's ->notify() callback any more and that callback
will only be invoked after acpi_device_install_notify_handler() has run
and before acpi_device_remove_notify_handler() runs, which implies the
correct ordering with respect to the other ACPI driver callbacks.
Modify the code accordingly and while at it, drop two redundant local
variables from acpi_bus_notify() and turn its description comment into
a proper kerneldoc one.
Fixes: d6fb6ee1820c ("ACPI: bus: Drop driver member of struct acpi_device")
Link: https://lore.kernel.org/linux-acpi/9f6cba7a8a57e5a687c934e8e406e28c.squirrel@mail.panix.com # [1]
Reported-by: Pierre Asselin <pa@panix.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Pierre Asselin <pa@panix.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
- Intel tpmi/vsec fixes
- think-lmi fixes
- two other small fixes / hw-id additions
* tag 'platform-drivers-x86-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/surface: aggregator: Add missing fwnode_handle_put()
platform/x86: think-lmi: Add possible_values for ThinkStation
platform/x86: think-lmi: only display possible_values if available
platform/x86: think-lmi: use correct possible_values delimiters
platform/x86: think-lmi: add missing type attribute
platform/x86 (gigabyte-wmi): Add support for A320M-S2H V2
platform/x86/intel: tpmi: Revise the comment of intel_vsec_add_aux
platform/x86/intel: tpmi: Fix double free in tpmi_create_device()
platform/x86/intel: vsec: Fix a memory leak in intel_vsec_add_aux
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull MTD fixes from Miquel Raynal:
"Raw NAND controller driver fixes:
- meson:
- Invalidate cache on polling ECC bit
- Initialize struct with zeroes
- nandsim: Artificially prevent sequential page reads
ECC engine driver fixes:
- mxic-ecc: Fix mxic_ecc_data_xfer_wait_for_completion() when irq is
used
Binging fixes:
- jedec,spi-nor: Document CPOL/CPHA support"
* tag 'mtd/fixes-for-6.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: rawnand: meson: invalidate cache on polling ECC bit
mtd: rawnand: nandsim: Artificially prevent sequential page reads
dt-bindings: mtd: jedec,spi-nor: Document CPOL/CPHA support
mtd: nand: mxic-ecc: Fix mxic_ecc_data_xfer_wait_for_completion() when irq is used
mtd: rawnand: meson: initialize struct with zeroes
|
|
Commit "perf/amlogic: resolve conflict between canvas & pmu"
changed the base address.
Fixes: 2016e2113d35 ("perf/amlogic: Add support for Amlogic meson G12 SoC DDR PMU driver")
Signed-off-by: Marc Gonzalez <mgonzalez@freebox.fr>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://lore.kernel.org/r/20230327120932.2158389-4-mgonzalez@freebox.fr
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
|
|
Return -EFAULT if put_user() for the PTRACE_GET_LAST_BREAK
request fails, instead of silently ignoring it.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Expolines depend on scripts/basic/fixdep. And build of expolines can now
race with the fixdep build:
make[1]: *** Deleting file 'arch/s390/lib/expoline/expoline.o'
/bin/sh: line 1: scripts/basic/fixdep: Permission denied
make[1]: *** [../scripts/Makefile.build:385: arch/s390/lib/expoline/expoline.o] Error 126
make: *** [../arch/s390/Makefile:166: expoline_prepare] Error 2
The dependence was removed in the below Fixes: commit. So reintroduce
the dependence on scripts.
Fixes: a0b0987a7811 ("s390/nospec: remove unneeded header includes")
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: stable@vger.kernel.org
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20230316112809.7903-1-jirislaby@kernel.org
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
The device release callback function invoked to release the matrix device
uses the dev_get_drvdata(device *dev) function to retrieve the
pointer to the vfio_matrix_dev object in order to free its storage. The
problem is, this object is not stored as drvdata with the device; since the
kfree function will accept a NULL pointer, the memory for the
vfio_matrix_dev object is never freed.
Since the device being released is contained within the vfio_matrix_dev
object, the container_of macro will be used to retrieve its pointer.
Fixes: 1fde573413b5 ("s390: vfio-ap: base implementation of VFIO AP device driver")
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Link: https://lore.kernel.org/r/20230320150447.34557-1-akrowiak@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Add missing earlyclobber annotation to size, to, and tmp2 operands of the
__clear_user() inline assembly since they are modified or written to before
the last usage of all input operands. This can lead to incorrect register
allocation for the inline assembly.
Fixes: 6c2a9e6df604 ("[S390] Use alternative user-copy operations for new hardware.")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/all/20230321122514.1743889-3-mark.rutland@arm.com/
Cc: stable@vger.kernel.org
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Don't report an error code to L1 when synthesizing a nested VM-Exit and
L2 is in Real Mode. Per Intel's SDM, regarding the error code valid bit:
This bit is always 0 if the VM exit occurred while the logical processor
was in real-address mode (CR0.PE=0).
The bug was introduced by a recent fix for AMD's Paged Real Mode, which
moved the error code suppression from the common "queue exception" path
to the "inject exception" path, but missed VMX's "synthesize VM-Exit"
path.
Fixes: b97f07458373 ("KVM: x86: determine if an exception has an error code only when injecting it.")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322143300.2209476-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When injecting an exception into a vCPU in Real Mode, suppress the error
code by clearing the flag that tracks whether the error code is valid, not
by clearing the error code itself. The "typo" was introduced by recent
fix for SVM's funky Paged Real Mode.
Opportunistically hoist the logic above the tracepoint so that the trace
is coherent with respect to what is actually injected (this was also the
behavior prior to the buggy commit).
Fixes: b97f07458373 ("KVM: x86: determine if an exception has an error code only when injecting it.")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322143300.2209476-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
According to S905X2 Datasheet - Revision 07:
DMC_MON area spans 0xff638080-0xff6380c0
DDR_PLL area spans 0xff638c00-0xff638c34
Round DDR_PLL area size up to 0x40
Fixes: 90cf8e21016fa3 ("arm64: dts: meson: Add DDR PMU node")
Signed-off-by: Marc Gonzalez <mgonzalez@freebox.fr>
Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://lore.kernel.org/r/20230327120932.2158389-3-mgonzalez@freebox.fr
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
|
|
Clear vcpu->mmio_needed when injecting an exception from the emulator to
squash a (legitimate) warning about vcpu->mmio_needed being true at the
start of KVM_RUN without a callback being registered to complete the
userspace MMIO exit. Suppressing the MMIO write exit is inarguably wrong
from an architectural perspective, but it is the least awful hack-a-fix
due to shortcomings in KVM's uAPI, not to mention that KVM already
suppresses MMIO writes in this scenario.
Outside of REP string instructions, KVM doesn't provide a way to resume
an instruction at the exact point where it was "interrupted" if said
instruction partially completed before encountering an MMIO access. For
MMIO reads, KVM immediately exits to userspace upon detecting MMIO as
userspace provides the to-be-read value in a buffer, and so KVM can safely
(more or less) restart the instruction from the beginning. When the
emulator re-encounters the MMIO read, KVM will service the MMIO by getting
the value from the buffer instead of exiting to userspace, i.e. KVM won't
put the vCPU into an infinite loop.
On an emulated MMIO write, KVM finishes the instruction before exiting to
userspace, as exiting immediately would ultimately hang the vCPU due to
the aforementioned shortcoming of KVM not being able to resume emulation
in the middle of an instruction.
For the vast majority of _emulated_ instructions, deferring the userspace
exit doesn't cause problems as very few x86 instructions (again ignoring
string operations) generate multiple writes. But for instructions that
generate multiple writes, e.g. PUSHA (multiple pushes onto the stack),
deferring the exit effectively results in only the final write triggering
an exit to userspace. KVM does support multiple MMIO "fragments", but
only for page splits; if an instruction performs multiple distinct MMIO
writes, the number of fragments gets reset when the next MMIO write comes
along and any previous MMIO writes are dropped.
Circling back to the warning, if a deferred MMIO write coincides with an
exception, e.g. in this case a #SS due to PUSHA underflowing the stack
after queueing a write to an MMIO page on a previous push, KVM injects
the exceptions and leaves the deferred MMIO pending without registering a
callback, thus triggering the splat.
Sweep the problem under the proverbial rug as dropping MMIO writes is not
unique to the exception scenario (see above), i.e. instructions like PUSHA
are fundamentally broken with respect to MMIO, and have been since KVM's
inception.
Reported-by: zhangjianguo <zhangjianguo18@huawei.com>
Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com
Reported-by: syzbot+8accb43ddc6bd1f5713a@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322141220.2206241-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
According to S905X2 Datasheet - Revision 07:
DRAM Memory Controller (DMC) register area spans ff638000-ff63a000.
According to DeviceTree Specification - Release v0.4-rc1:
simple-bus nodes do not require reg property.
Fixes: 1499218c80c99a ("arm64: dts: move common G12A & G12B modes to meson-g12-common.dtsi")
Signed-off-by: Marc Gonzalez <mgonzalez@freebox.fr>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Link: https://lore.kernel.org/r/20230327120932.2158389-2-mgonzalez@freebox.fr
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
|
|
KVM irqfd based emulation of level-triggered interrupts doesn't work
quite correctly in some cases, particularly in the case of interrupts
that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT).
Such an interrupt is acked to the device in its threaded irq handler,
i.e. later than it is acked to the interrupt controller (EOI at the end
of hardirq), not earlier.
Linux keeps such interrupt masked until its threaded handler finishes,
to prevent the EOI from re-asserting an unacknowledged interrupt.
However, with KVM + vfio (or whatever is listening on the resamplefd)
we always notify resamplefd at the EOI, so vfio prematurely unmasks the
host physical IRQ, thus a new physical interrupt is fired in the host.
This extra interrupt in the host is not a problem per se. The problem is
that it is unconditionally queued for injection into the guest, so the
guest sees an extra bogus interrupt. [*]
There are observed at least 2 user-visible issues caused by those
extra erroneous interrupts for a oneshot irq in the guest:
1. System suspend aborted due to a pending wakeup interrupt from
ChromeOS EC (drivers/platform/chrome/cros_ec.c).
2. Annoying "invalid report id data" errors from ELAN0000 touchpad
(drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
every time the touchpad is touched.
The core issue here is that by the time when the guest unmasks the IRQ,
the physical IRQ line is no longer asserted (since the guest has
acked the interrupt to the device in the meantime), yet we
unconditionally inject the interrupt queued into the guest by the
previous resampling. So to fix the issue, we need a way to detect that
the IRQ is no longer pending, and cancel the queued interrupt in this
case.
With IOAPIC we are not able to probe the physical IRQ line state
directly (at least not if the underlying physical interrupt controller
is an IOAPIC too), so in this patch we use irqfd resampler for that.
Namely, instead of injecting the queued interrupt, we just notify the
resampler that this interrupt is done. If the IRQ line is actually
already deasserted, we are done. If it is still asserted, a new
interrupt will be shortly triggered through irqfd and injected into the
guest.
In the case if there is no irqfd resampler registered for this IRQ, we
cannot fix the issue, so we keep the existing behavior: immediately
unconditionally inject the queued interrupt.
This patch fixes the issue for x86 IOAPIC only. In the long run, we can
fix it for other irqchips and other architectures too, possibly taking
advantage of reading the physical state of the IRQ line, which is
possible with some other irqchips (e.g. with arm64 GIC, maybe even with
the legacy x86 PIC).
[*] In this description we assume that the interrupt is a physical host
interrupt forwarded to the guest e.g. by vfio. Potentially the same
issue may occur also with a purely virtual interrupt from an
emulated device, e.g. if the guest handles this interrupt, again, as
a oneshot interrupt.
Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/
Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/
Message-Id: <20230322204344.50138-3-dmy@semihalf.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
It is useful to be able to do read-only traversal of the list of all the
registered irqfd resamplers without locking the resampler_lock mutex.
In particular, we are going to traverse it to search for a resampler
registered for the given irq of an irqchip, and that will be done with
an irqchip spinlock (ioapic->lock) held, so it is undesirable to lock a
mutex in this context. So turn this list into an RCU list.
For protecting the read side, reuse kvm->irq_srcu which is already used
for protecting a number of irq related things (kvm->irq_routing,
irqfd->resampler->list, kvm->irq_ack_notifier_list,
kvm->arch.mask_notifier_list).
Signed-off-by: Dmytro Maluka <dmy@semihalf.com>
Message-Id: <20230322204344.50138-2-dmy@semihalf.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|