Age | Commit message (Collapse) | Author |
|
Original logic only sets the return value but doesn't jump out of the
loop if the bus is kept active by a client. This is not expected. A
malicious or buggy i2c client can hang the kernel in this case and
should be avoided. This is observed during a long time test with a
PCA953x GPIO extender.
Fix it by changing the logic to not only sets the return value, but also
jumps out of the loop and return to the caller with -ETIMEDOUT.
Fixes: fbfab1ab0658 ("i2c: qup: reorganization of driver code to remove polling for qup v1")
Signed-off-by: Yang Xiwen <forbidden405@outlook.com>
Cc: <stable@vger.kernel.org> # v4.17+
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20250616-qca-i2c-v1-1-2a8d37ee0a30@outlook.com
|
|
The current implementation uses wait_for_completion(), which can cause
the caller to hang indefinitely if the transfer never completes.
Switch to wait_for_completion_interruptible() so that the operation can
be interrupted by signals.
Fixes: 84e1d0bf1d71 ("i2c: virtio: disable timeout handling")
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: <stable@vger.kernel.org> # v5.16+
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/b8944e9cab8eb959d888ae80add6f2a686159ba2.1751541962.git.viresh.kumar@linaro.org
|
|
The acpi_evaluate_object() returns an ACPI error code and not
Linux one. For the some platforms the err will have positive code
which may be interpreted incorrectly. Use device_reset() for
reset control which handles it correctly.
Fixes: bd2fdedbf2ba ("i2c: tegra: Add the ACPI support")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Cc: <stable@vger.kernel.org> # v5.17+
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20250710131206.2316-2-akhilrajeev@nvidia.com
|
|
My CC-adding automation returned nothing on a future patch to the
include/linux/in6.h file, and I went looking for why. Add the missed
in6.h to MAINTAINERS.
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250722165645.work.047-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull kvm fix from Paolo Bonzini:
- Fix cleanup mistake (probably a cut-and-paste error) in a Xen
hypercall
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/xen: Fix cleanup logic in emulation of Xen schedop poll hypercalls
|
|
kvm_xen_schedop_poll does a kmalloc_array() when a VM polls the host
for more than one event channel potr (nr_ports > 1).
After the kmalloc_array(), the error paths need to go through the
"out" label, but the call to kvm_read_guest_virt() does not.
Fixes: 92c58965e965 ("KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly")
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
[Adjusted commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.16-rc8/final?:
- Revert all uses of drm_gem_object->dmabuf to
drm_gem_object->import_attach->dmabuf.
- Fix amdgpu returning BIOS cluttered VRAM after resume.
- Scheduler hang fix.
- Revert nouveau ioctl fix as it caused regressions.
- Fix null pointer deref in nouveau.
- Fix unnecessary semicolon in ti_sn_bridge_probe.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/72235afd-c849-49fe-9cc1-2b1781abdf08@linux.intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ufs fix from Al Viro:
"Fix regression in ufs options parsing"
* tag 'pull-ufs-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the regression in ufs options parsing
|
|
A really dumb braino on rebasing and a dumber fuckup with managing #for-next
Fixes: b70cb459890b ("ufs: convert ufs to the new mount API")
Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
dma_fence_wait_timeout returns a long type but the driver is
only using the lower 32 bits of the retval and discarding the
upper 32 bits.
This is particularly problematic if there are already signalled
or stub fences on some of the hw planes. In this case the
dma_fence_wait_timeout function will immediately return with
timeout value MAX_SCHEDULE_TIMEOUT (0x7fffffffffffffff) since
the fence is already signalled. If the driver only uses the lower
32 bits of this return value then it'll interpret it as an error
code (0xFFFFFFFF or (-1)) and skip the wait on the remaining fences.
This issue was first observed in the xe driver with the Android
compositor where the GPU composited layer was not properly waited
on when there were stub fences in other overlay planes resulting in
visual artifacts.
Fixes: d59cf7bb73f3c ("drm/i915/display: Use dma_fence interfaces instead of i915_sw_fence")
Signed-off-by: Aakash Deep Sarkar <aakash.deep.sarkar@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Link: https://lore.kernel.org/r/20250708074540.1948068-1-aakash.deep.sarkar@intel.com
(cherry picked from commit cdb16039515a5ac4d2c923f7a651cf19a803a3fe)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
NAND devices with different page sizes requires different number
of ECC steps, yet the qcom_spi_ecc_init_ctx_pipelined() function
sets 4 steps in 'ecc_cfg' unconditionally.
The correct number of the steps is calculated earlier in the
function already, so use that instead of the hardcoded value.
Fixes: 7304d1909080 ("spi: spi-qpic: add driver for QCOM SPI NAND flash Interface")
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Link: https://patch.msgid.link/20250723-qpic-snand-fix-steps-v1-1-d800695dde4c@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Make sure to drop the references to the accdet OF node and platform
device taken by of_parse_phandle() and of_find_device_by_node() after
looking up the sound component during probe.
Fixes: cf536e2622e2 ("ASoC: mediatek: common: Handle mediatek,accdet property")
Cc: stable@vger.kernel.org # 6.15
Cc: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20250722092542.32754-1-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
syzbot reported a bug in in afs_put_vlserverlist.
kAFS: bad VL server IP address
BUG: unable to handle page fault for address: fffffffffffffffa
...
Oops: Oops: 0002 [#1] SMP KASAN PTI
...
RIP: 0010:refcount_dec_and_test include/linux/refcount.h:450 [inline]
RIP: 0010:afs_put_vlserverlist+0x3a/0x220 fs/afs/vl_list.c:67
...
Call Trace:
<TASK>
afs_alloc_cell fs/afs/cell.c:218 [inline]
afs_lookup_cell+0x12a5/0x1680 fs/afs/cell.c:264
afs_cell_init+0x17a/0x380 fs/afs/cell.c:386
afs_proc_rootcell_write+0x21f/0x290 fs/afs/proc.c:247
proc_simple_write+0x114/0x1b0 fs/proc/generic.c:825
pde_write fs/proc/inode.c:330 [inline]
proc_reg_write+0x23d/0x330 fs/proc/inode.c:342
vfs_write+0x25c/0x1180 fs/read_write.c:682
ksys_write+0x12a/0x240 fs/read_write.c:736
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0x260 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Because afs_parse_text_addrs() parses incorrectly, its return value -EINVAL
is assigned to vllist, which results in -EINVAL being used as the vllist
address when afs_put_vlserverlist() is executed.
Set the vllist value to NULL when a parsing error occurs to avoid this
issue.
Fixes: e2c2cb8ef07a ("afs: Simplify cell record handling")
Reported-by: syzbot+5c042fbab0b292c98fc6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5c042fbab0b292c98fc6
Tested-by: syzbot+5c042fbab0b292c98fc6@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/4119365.1753108011@warthog.procyon.org.uk
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a missing check for reaching the end of the string while attempting
to split a command.
Fixes: f94f70d39cc2 ("afs: Provide a way to configure address priorities")
Reported-by: syzbot+7741f872f3c53385a2e2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7741f872f3c53385a2e2
Signed-off-by: Leo Stone <leocstone@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/4119428.1753108152@warthog.procyon.org.uk
Acked-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-07-22
The patch is by me and fixes a potential NULL pointer deref in the CAN
device driver infrastructure. It can be triggered from user space.
* tag 'linux-can-fixes-for-6.16-20250722' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: netlink: can_changelink(): fix NULL pointer deref of struct can_priv::do_set_mode
====================
Link: https://patch.msgid.link/20250722110059.3664104-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The test is supposed to observe that the 'clash_resolve' stat counter
incremented (i.e., the code path was covered).
This check was incorrect, 'conntrack -S' needs to be called in the
revevant namespace, not the initial netns.
The clash resolution logic in conntrack is only exercised when multiple
packets with the same udp quadruple race. Depending on kernel config,
number of CPUs, scheduling policy etc. this might not trigger even
after several retries. Thus the script eventually returns SKIP if the
retry count is exceeded.
The udpclash tool with also exit with a failure if it did not observe
the expected number of replies.
In the script, make a note of this but do not fail anymore, just check if
the clash resolution logic triggered after all.
Remove the 'single-core' test: while unlikely, with preemptible kernel it
should be possible to also trigger clash resolution logic.
With this change the test will either SKIP or pass.
Hard error could be restored later once its clear whats going on, so
also dump 'conntrack -S' when some packets went missing to see if
conntrack dropped them on insert.
Fixes: 78a588363587 ("selftests: netfilter: add conntrack clash resolution test case")
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20250721223652.6956-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-07-21 (i40e, ice, e1000e)
For i40e:
Dennis Chen adjusts reporting of VF Tx dropped to a more appropriate
field.
Jamie Bainbridge fixes a check which can cause a PF set VF MAC address
to be lost.
For ice:
Haoxiang Li adds an error check in DDP load to prevent NULL pointer
dereference.
For e1000e:
Jacek Kowalski adds workarounds for issues surrounding Tiger Lake
platforms with uninitialized NVMs.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
e1000e: ignore uninitialized checksum word on tgp
e1000e: disregard NVM checksum on tgp when valid checksum bit is not set
ice: Fix a null pointer dereference in ice_copy_and_init_pkg()
i40e: When removing VF MAC filters, only check PF-set MAC
i40e: report VF tx_dropped with tx_errors instead of tx_discards
====================
Link: https://patch.msgid.link/20250721173733.2248057-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Sometimes the netdev triggers causes tasks to get blocked for more then
120 seconds, which in turn makes the (WAN) network port on the NanoPi
R5S fail to come up.
This results in the following (partial) trace:
INFO: task kworker/0:1:11 blocked for more than 120 seconds.
Not tainted 6.16-rc6+unreleased-arm64-cknow #1 Debian 6.16~rc6-1~exp1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/0:1 state:D stack:0 pid:11 tgid:11 ppid:2 task_flags:0x4208060 flags:0x00000010
Workqueue: events_power_efficient reg_check_chans_work [cfg80211]
Call trace:
__switch_to+0xf8/0x168 (T)
__schedule+0x3f8/0xda8
schedule+0x3c/0x120
schedule_preempt_disabled+0x2c/0x58
__mutex_lock.constprop.0+0x4d0/0xab8
__mutex_lock_slowpath+0x1c/0x30
mutex_lock+0x50/0x68
rtnl_lock+0x20/0x38
reg_check_chans_work+0x40/0x478 [cfg80211]
process_one_work+0x178/0x3e0
worker_thread+0x260/0x390
kthread+0x150/0x250
ret_from_fork+0x10/0x20
INFO: task kworker/0:1:11 is blocked on a mutex likely owned by task dhcpcd:615.
task:dhcpcd state:D stack:0 pid:615 tgid:615 ppid:614 task_flags:0x400140 flags:0x00000018
Call trace:
__switch_to+0xf8/0x168 (T)
__schedule+0x3f8/0xda8
schedule+0x3c/0x120
schedule_preempt_disabled+0x2c/0x58
rwsem_down_write_slowpath+0x1e4/0x750
down_write+0x98/0xb0
led_trigger_register+0x134/0x1c0
phy_led_triggers_register+0xf4/0x258 [libphy]
phy_attach_direct+0x30c/0x390 [libphy]
phylink_fwnode_phy_connect+0xb0/0x138 [phylink]
__stmmac_open+0xec/0x520 [stmmac]
stmmac_open+0x4c/0xe8 [stmmac]
__dev_open+0x130/0x2e0
__dev_change_flags+0x1c4/0x248
netif_change_flags+0x2c/0x80
dev_change_flags+0x88/0xc8
devinet_ioctl+0x35c/0x610
inet_ioctl+0x204/0x260
sock_do_ioctl+0x6c/0x140
sock_ioctl+0x2e4/0x388
__arm64_sys_ioctl+0xb4/0x120
invoke_syscall+0x6c/0x100
el0_svc_common.constprop.0+0x48/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x3c/0x188
el0t_64_sync_handler+0x10c/0x140
el0t_64_sync+0x198/0x1a0
In order to not introduce a regression with kernel 6.16, drop the netdev
triggers for now while the problem is being investigated further.
Fixes: 1631cbdb8089 ("arm64: dts: rockchip: Improve LED config for NanoPi R5S")
Helped-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Diederik de Haas <didi.debian@cknow.org>
Link: https://lore.kernel.org/r/20250722123628.25660-1-didi.debian@cknow.org
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
If devicetree describes power supplies related to a PCI device, we
unnecessarily created a pwrctrl device even if CONFIG_PCI_PWRCTL was not
enabled.
We only need pci_pwrctrl_create_device() when CONFIG_PCI_PWRCTRL is
enabled. Compile it out when CONFIG_PCI_PWRCTRL is not enabled.
When pci_pwrctrl_create_device() creates and returns a pwrctrl device,
pci_scan_device() doesn't enumerate the PCI device. It assumes the pwrctrl
core will rescan the bus after turning on the power. However, if
CONFIG_PCI_PWRCTRL is not enabled, the rescan never happens, which breaks
PCI enumeration on any system that describes power supplies in devicetree
but does not use pwrctrl.
Jim reported that some brcmstb platforms break this way. The brcmstb
driver is still broken if CONFIG_PCI_PWRCTRL is enabled, but this commit at
least allows brcmstb to work when it's NOT enabled.
Fixes: 957f40d039a9 ("PCI/pwrctrl: Move creation of pwrctrl devices to pci_scan_device()")
Reported-by: Jim Quinlan <james.quinlan@broadcom.com>
Link: https://lore.kernel.org/r/CA+-6iNwgaByXEYD3j=-+H_PKAxXRU78svPMRHDKKci8AGXAUPg@mail.gmail.com
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v6.15
Link: https://patch.msgid.link/20250701064731.52901-1-manivannan.sadhasivam@linaro.org
|
|
this patch is to fix my previous Commit <e5182305a519> i have fixed mute
led but for by This patch corrects the coefficient mask value introduced
in commit <e5182305a519>, which was intended to enable the mute LED
functionality. During testing, multiple values were evaluated, and
an incorrect value was mistakenly included in the final commit.
This update fixes that error by applying the correct mask value for
proper mute LED behavior.
Tested on 6.15.5-arch1-1
Fixes: e5182305a519 ("ALSA: hda/realtek: Enable Mute LED on HP OMEN 16 Laptop xd000xx")
Signed-off-by: SHARAN KUMAR M <sharweshraajan@gmail.com>
Link: https://patch.msgid.link/20250722172224.15359-1-sharweshraajan@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
`cpu_switch_to()` and `call_on_irq_stack()` manipulate SP to change
to different stacks along with the Shadow Call Stack if it is enabled.
Those two stack changes cannot be done atomically and both functions
can be interrupted by SErrors or Debug Exceptions which, though unlikely,
is very much broken : if interrupted, we can end up with mismatched stacks
and Shadow Call Stack leading to clobbered stacks.
In `cpu_switch_to()`, it can happen when SP_EL0 points to the new task,
but x18 stills points to the old task's SCS. When the interrupt handler
tries to save the task's SCS pointer, it will save the old task
SCS pointer (x18) into the new task struct (pointed to by SP_EL0),
clobbering it.
In `call_on_irq_stack()`, it can happen when switching from the task stack
to the IRQ stack and when switching back. In both cases, we can be
interrupted when the SCS pointer points to the IRQ SCS, but SP points to
the task stack. The nested interrupt handler pushes its return addresses
on the IRQ SCS. It then detects that SP points to the task stack,
calls `call_on_irq_stack()` and clobbers the task SCS pointer with
the IRQ SCS pointer, which it will also use !
This leads to tasks returning to addresses on the wrong SCS,
or even on the IRQ SCS, triggering kernel panics via CONFIG_VMAP_STACK
or FPAC if enabled.
This is possible on a default config, but unlikely.
However, when enabling CONFIG_ARM64_PSEUDO_NMI, DAIF is unmasked and
instead the GIC is responsible for filtering what interrupts the CPU
should receive based on priority.
Given the goal of emulating NMIs, pseudo-NMIs can be received by the CPU
even in `cpu_switch_to()` and `call_on_irq_stack()`, possibly *very*
frequently depending on the system configuration and workload, leading
to unpredictable kernel panics.
Completely mask DAIF in `cpu_switch_to()` and restore it when returning.
Do the same in `call_on_irq_stack()`, but restore and mask around
the branch.
Mask DAIF even if CONFIG_SHADOW_CALL_STACK is not enabled for consistency
of behaviour between all configurations.
Introduce and use an assembly macro for saving and masking DAIF,
as the existing one saves but only masks IF.
Cc: <stable@vger.kernel.org>
Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Reported-by: Cristian Prundeanu <cpru@amazon.com>
Fixes: 59b37fe52f49 ("arm64: Stash shadow stack pointer in the task struct on interrupt")
Tested-by: Cristian Prundeanu <cpru@amazon.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250718142814.133329-1-ada.coupriediaz@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
As reported by the kernel test robot, a recent patch introduced an
unnecessary semicolon. Remove it.
Fixes: 55e8ff842051 ("drm/bridge: ti-sn65dsi86: Add HPD for DisplayPort connector type")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506301704.0SBj6ply-lkp@intel.com/
Reviewed-by: Devarsh Thakkar <devarsht@ti.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20250714130631.1.I1cfae3222e344a3b3c770d079ee6b6f7f3b5d636@changeid
|
|
drivers
Most drivers only populate the fields cycles and cs_id of system_counterval
in their get_time_fn() callback for get_device_system_crosststamp(), unless
they explicitly provide nanosecond values.
When the use_nsecs field was added to struct system_counterval, most
drivers did not care. Clock sources other than CSID_GENERIC could then get
converted in convert_base_to_cs() based on an uninitialized use_nsecs field,
which usually results in -EINVAL during the following range check.
Pass in a fully zero initialized system_counterval_t to cure that.
Fixes: 6b2e29977518 ("timekeeping: Provide infrastructure for converting to/from a base clock")
Signed-off-by: Markus Blöchl <markus@blochl.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250720-timekeeping_uninit_crossts-v2-1-f513c885b7c2@blochl.de
|
|
My previous patch ended up causing a regression for the
DRM_IOCTL_NOUVEAU_NVIF ioctl. The intention of my patch was to only
pass ioctl commands that have the correct dir/type/nr bits into the
nouveau_abi16_ioctl() function.
This turned out to be too strict, as userspace does use at least
write-only and write-read direction settings. Checking for both of these
still did not fix the issue, so the best we can do for the 6.16 release
is to revert back to what we've had since linux-3.16.
This version is still fragile, but at least it is known to work with
existing userspace. Fixing this properly requires a better understanding
of what commands are being passed from userspace in practice, and how
that relies on the undocumented (miss)behavior in nouveau_drm_ioctl().
Fixes: e5478166dffb ("drm/nouveau: check ioctl command codes better")
Reported-by: Satadru Pramanik <satadru@gmail.com>
Closes: https://lore.kernel.org/lkml/CAFrh3J85tsZRpOHQtKgNHUVnn=EG=QKBnZTRtWS8eWSc1K1xkA@mail.gmail.com/
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Closes: https://lore.kernel.org/lkml/aH9n_QGMFx2ZbKlw@debian.local/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20250722115830.2587297-1-arnd@kernel.org
[ Add Closes: tags, fix minor typo in commit message. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
can_priv::do_set_mode
Andrei Lalaev reported a NULL pointer deref when a CAN device is
restarted from Bus Off and the driver does not implement the struct
can_priv::do_set_mode callback.
There are 2 code path that call struct can_priv::do_set_mode:
- directly by a manual restart from the user space, via
can_changelink()
- delayed automatic restart after bus off (deactivated by default)
To prevent the NULL pointer deference, refuse a manual restart or
configure the automatic restart delay in can_changelink() and report
the error via extack to user space.
As an additional safety measure let can_restart() return an error if
can_priv::do_set_mode is not set instead of dereferencing it
unchecked.
Reported-by: Andrei Lalaev <andrey.lalaev@gmail.com>
Closes: https://lore.kernel.org/all/20250714175520.307467-1-andrey.lalaev@gmail.com
Fixes: 39549eef3587 ("can: CAN Network device driver and Netlink interface")
Link: https://patch.msgid.link/20250718-fix-nullptr-deref-do_set_mode-v1-1-0b520097bb96@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
qfq_delete_class
might_sleep could be trigger in the atomic context in qfq_delete_class.
qfq_destroy_class was moved into atomic context locked
by sch_tree_lock to avoid a race condition bug on
qfq_aggregate. However, might_sleep could be triggered by
qfq_destroy_class, which introduced sleeping in atomic context (path:
qfq_destroy_class->qdisc_put->__qdisc_destroy->lockdep_unregister_key
->might_sleep).
Considering the race is on the qfq_aggregate objects, keeping
qfq_rm_from_agg in the lock but moving the left part out can solve
this issue.
Fixes: 5e28d5a3f774 ("net/sched: sch_qfq: Fix race condition on qfq_aggregate")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/4a04e0cc-a64b-44e7-9213-2880ed641d77@sabinyo.mountain
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/20250717230128.159766-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The mutexes qdev_mutex and chip->mutex are acquired in that order
throughout the driver. To preserve proper lock hierarchy and avoid
potential deadlocks, they must be released in the reverse
order of acquisition.
This change reorders the unlock sequence to first release chip->mutex
followed by qdev_mutex, ensuring consistency with the locking pattern.
[ fixed the code indentations and Fixes tag by tiwai ]
Fixes: 326bbc348298a ("ALSA: usb-audio: qcom: Introduce QC USB SND offloading support")
Signed-off-by: Erick Karanja <karanja99erick@gmail.com>
Link: https://patch.msgid.link/20250721114554.1666104-1-karanja99erick@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
gve_tx_timeout was calculating missed completions in a way that is only
relevant in the GQ queue format. Additionally, it was attempting to
disable device interrupts, which is not needed in either GQ or DQ queue
formats.
As a result, TX timeouts with the DQ queue format likely would have
triggered early resets without kicking the queue at all.
This patch drops the check for pending work altogether and always kicks
the queue after validating the queue has not seen a TX timeout too
recently.
Cc: stable@vger.kernel.org
Fixes: 87a7f321bb6a ("gve: Recover from queue stall due to missed IRQ")
Co-developed-by: Tim Hostetler <thostet@google.com>
Signed-off-by: Tim Hostetler <thostet@google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20250717192024.1820931-1-hramamurthy@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The AARP proxy‐probe routine (aarp_proxy_probe_network) sends a probe,
releases the aarp_lock, sleeps, then re-acquires the lock. During that
window an expire timer thread (__aarp_expire_timer) can remove and
kfree() the same entry, leading to a use-after-free.
race condition:
cpu 0 | cpu 1
atalk_sendmsg() | atif_proxy_probe_device()
aarp_send_ddp() | aarp_proxy_probe_network()
mod_timer() | lock(aarp_lock) // LOCK!!
timeout around 200ms | alloc(aarp_entry)
and then call | proxies[hash] = aarp_entry
aarp_expire_timeout() | aarp_send_probe()
| unlock(aarp_lock) // UNLOCK!!
lock(aarp_lock) // LOCK!! | msleep(100);
__aarp_expire_timer(&proxies[ct]) |
free(aarp_entry) |
unlock(aarp_lock) // UNLOCK!! |
| lock(aarp_lock) // LOCK!!
| UAF aarp_entry !!
==================================================================
BUG: KASAN: slab-use-after-free in aarp_proxy_probe_network+0x560/0x630 net/appletalk/aarp.c:493
Read of size 4 at addr ffff8880123aa360 by task repro/13278
CPU: 3 UID: 0 PID: 13278 Comm: repro Not tainted 6.15.2 #3 PREEMPT(full)
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xc1/0x630 mm/kasan/report.c:521
kasan_report+0xca/0x100 mm/kasan/report.c:634
aarp_proxy_probe_network+0x560/0x630 net/appletalk/aarp.c:493
atif_proxy_probe_device net/appletalk/ddp.c:332 [inline]
atif_ioctl+0xb58/0x16c0 net/appletalk/ddp.c:857
atalk_ioctl+0x198/0x2f0 net/appletalk/ddp.c:1818
sock_do_ioctl+0xdc/0x260 net/socket.c:1190
sock_ioctl+0x239/0x6a0 net/socket.c:1311
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl fs/ioctl.c:892 [inline]
__x64_sys_ioctl+0x194/0x200 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcb/0x250 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Allocated:
aarp_alloc net/appletalk/aarp.c:382 [inline]
aarp_proxy_probe_network+0xd8/0x630 net/appletalk/aarp.c:468
atif_proxy_probe_device net/appletalk/ddp.c:332 [inline]
atif_ioctl+0xb58/0x16c0 net/appletalk/ddp.c:857
atalk_ioctl+0x198/0x2f0 net/appletalk/ddp.c:1818
Freed:
kfree+0x148/0x4d0 mm/slub.c:4841
__aarp_expire net/appletalk/aarp.c:90 [inline]
__aarp_expire_timer net/appletalk/aarp.c:261 [inline]
aarp_expire_timeout+0x480/0x6e0 net/appletalk/aarp.c:317
The buggy address belongs to the object at ffff8880123aa300
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 96 bytes inside of
freed 192-byte region [ffff8880123aa300, ffff8880123aa3c0)
Memory state around the buggy address:
ffff8880123aa200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8880123aa280: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8880123aa300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880123aa380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff8880123aa400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com>
Link: https://patch.msgid.link/20250717012843.880423-1-hxzene@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
On ASP versions v2.x we need to program the TX map vector register to
properly exercise end-to-end flow control, otherwise the TX engine can
either lock-up, or cause the hardware calculated checksum to be
wrong/corrupted when multiple back to back packets are being submitted
for transmission. This register defaults to 0, which means no flow
control being applied.
Fixes: e9f31435ee7d ("net: bcmasp: Add support for asp-v3.0")
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250718212242.3447751-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently holes are sent as writes full of zeroes, which results in
unnecessarily using disk space at the receiving end and increasing the
stream size.
In some cases we avoid sending writes of zeroes, like during a full
send operation where we just skip writes for holes.
But for some cases we fill previous holes with writes of zeroes too, like
in this scenario:
1) We have a file with a hole in the range [2M, 3M), we snapshot the
subvolume and do a full send. The range [2M, 3M) stays as a hole at
the receiver since we skip sending write commands full of zeroes;
2) We punch a hole for the range [3M, 4M) in our file, so that now it
has a 2M hole in the range [2M, 4M), and snapshot the subvolume.
Now if we do an incremental send, we will send write commands full
of zeroes for the range [2M, 4M), removing the hole for [2M, 3M) at
the receiver.
We could improve cases such as this last one by doing additional
comparisons of file extent items (or their absence) between the parent
and send snapshots, but that's a lot of code to add plus additional CPU
and IO costs.
Since the send stream v2 already has a fallocate command and btrfs-progs
implements a callback to execute fallocate since the send stream v2
support was added to it, update the kernel to use fallocate for punching
holes for V2+ streams.
Test coverage is provided by btrfs/284 which is a version of btrfs/007
that exercises send stream v2 instead of v1, using fsstress with random
operations and fssum to verify file contents.
Link: https://github.com/kdave/btrfs-progs/issues/1001
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Matthieu Baerts says:
====================
selftests: mptcp: connect: cover alt modes
mptcp_connect.sh can be executed manually with "-m <MODE>" and "-C" to
make sure everything works as expected when using "mmap" and "sendfile"
modes instead of "poll", and with the MPTCP checksum support.
These modes should be validated, but they are not when the selftests are
executed via the kselftest helpers. It means that most CIs validating
these selftests, like NIPA for the net development trees and LKFT for
the stable ones, are not covering these modes.
To fix that, new test programs have been added, simply calling
mptcp_connect.sh with the right parameters.
The first patch can be backported up to v5.6, and the second one up to
v5.14.
v1: https://lore.kernel.org/20250714-net-mptcp-sft-connect-alt-v1-0-bf1c5abbe575@kernel.org
====================
Link: https://patch.msgid.link/20250715-net-mptcp-sft-connect-alt-v2-0-8230ddd82454@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The checksum mode has been added a while ago, but it is only validated
when manually launching mptcp_connect.sh with "-C".
The different CIs were then not validating these MPTCP Connect tests
with checksum enabled. To make sure they do, add a new test program
executing mptcp_connect.sh with the checksum mode.
Fixes: 94d66ba1d8e4 ("selftests: mptcp: enable checksum in mptcp_connect.sh")
Cc: stable@vger.kernel.org
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250715-net-mptcp-sft-connect-alt-v2-2-8230ddd82454@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The "mmap" and "sendfile" alternate modes for mptcp_connect.sh/.c are
available from the beginning, but only tested when mptcp_connect.sh is
manually launched with "-m mmap" or "-m sendfile", not via the
kselftests helpers.
The MPTCP CI was manually running "mptcp_connect.sh -m mmap", but not
"-m sendfile". Plus other CIs, especially the ones validating the stable
releases, were not validating these alternate modes.
To make sure these modes are validated by these CIs, add two new test
programs executing mptcp_connect.sh with the alternate modes.
Fixes: 048d19d444be ("mptcp: add basic kselftest for mptcp")
Cc: stable@vger.kernel.org
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250715-net-mptcp-sft-connect-alt-v2-1-8230ddd82454@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We have a single transaction abort call that can be due to an error from
one of two calls to update_block_group_item(). Unfold the transaction
abort calls so that if they happen we know which update_block_group_item()
call failed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are using a variable named 'log_ref_ver' of type int to indicate if we
are processing an extref item or not, using a value of 1 if so, otherwise
0. This is an odd name and type, so rename it to 'is_extref_item' and
change its type to bool.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During log replay, at add_inode_ref(), if we have an extref item that
contains multiple extrefs and one of them points to a directory that does
not exist in the subvolume tree, we are supposed to ignore it and process
the remaining extrefs encoded in the extref item, since each extref can
point to a different parent inode. However when that happens we just
return from the function and ignore the remaining extrefs.
The problem has been around since extrefs were introduced, in commit
f186373fef00 ("btrfs: extended inode refs"), but it's hard to hit in
practice because getting extref items encoding multiple extref requires
getting a hash collision when computing the offset of the extref's
key. The offset if computed like this:
key.offset = btrfs_extref_hash(dir_ino, name->name, name->len);
and btrfs_extref_hash() is just a wrapper around crc32c().
Fix this by moving to next iteration of the loop when we don't find
the parent directory that an extref points to.
Fixes: f186373fef00 ("btrfs: extended inode refs")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During log replay, at add_inode_ref(), we return -ENOENT if our current
inode isn't found on the subvolume tree or if a parent directory isn't
found. The error comes from btrfs_iget_logging() <- btrfs_iget() <-
btrfs_read_locked_inode().
The single caller of add_inode_ref(), replay_one_buffer(), ignores an
-ENOENT error because it expects that error to mean only that a parent
directory wasn't found and that is ok.
Before commit 5f61b961599a ("btrfs: fix inode lookup error handling during
log replay") we were converting any error when getting a parent directory
to -ENOENT and any error when getting the current inode to -EIO, so our
caller would fail log replay in case we can't find the current inode.
After that commit however in case the current inode is not found we return
-ENOENT to the caller and therefore it ignores the critical fact that the
current inode was not found in the subvolume tree.
Fix this by converting -ENOENT to 0 when we don't find a parent directory,
returning -ENOENT when we don't find the current inode and making the
caller, replay_one_buffer(), not ignore -ENOENT anymore.
Fixes: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay")
CC: stable@vger.kernel.org # 6.16
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For data reloc inodes, they are a special type of inodes that are not
exposed to user space, and are only utilized during data block groups
relocation.
They do not go under regular read-write operations, but have their file
extents manually created to have the same layout of a block group, then
its content is read from the original block group, and written back to
the new location which is in a new block group.
Previously all the handling was done in page units, and commit
c2832898126f ("btrfs: make relocate_one_page() handle subpage case")
changed the handling to subpage blocks.
On the other hand, data reloc inodes are a perfect match for large data
folios, as each relocation cluster represents one or more data extents
that are contiguous in their logical addresses.
This patch enables large folios for data reloc inodes by:
- Remove the special handling of data reloc inodes when setting folio
order
- Change relocate_one_folio() to return the file offset of the next
folio
Originally it's designed to handle fixed page sized blocks, but with
large folios, we can handle a large folio, thus we have to return the
end of the current folio.
- Remove the warning on folio_order()
- Use folio_size() to replace fixed PAGE_SIZE usage
- Use file_offset as iterator inside relocate_file_extent_cluster
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function btrfs_subpage_assert() is a very commonly utilized assert
to make sure the range passed in is correct inside the folio.
And when some code is not properly subpage/large folio compatible
btrfs_subpage_assert() will be the first to be triggered.
E.g. when I incorrectly enabled large folios for data reloc inodes, it
immediately triggered btrfs_subpage_assert().
In that case, outputting all the involved members will be very helpful,
this includes:
- start
- len
- folio position inside the mapping
- folio size
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 9d9ea1e68a05 ("btrfs: subpage: fix relocation potentially
overwriting last page data") fixed a bug when relocating data block
groups for subpage cases.
However for the incoming large folios for data reloc inode, we can hit
the same situation where block size is the same as page size, but the
folio we got is still larger than a block.
In that case, the old subpage specific check is no longer reliable.
Here we have to enhance the handling by:
- Unconditionally invalidate the page cache for the current cluster
We set the @flush to true so that any dirty folios are properly
written back first.
And this time instead of dropping the whole page cache, just drop the
range covered by the current cluster.
This will bring some minor performance drop, as for a large folio, the
heading half will be read twice (read by previous cluster, then
invalidated, then read again by the current cluster).
However that is required to support large folios, and this gets rid of
the kinda tricky manual uptodate flag clearing for each block.
- Remove the special handling of writing back the whole page cache
filemap_invalidate_inode() handles the write back already, and since
we're invalidating all pages in the range, we no longer need to
manually clear the uptodate flags for involved blocks.
Thus there is no need to manually write back the whole page cache.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the defrag ioctl cannot rewrite the extents without
compression. Add a new flag for that, as setting compression to 0 (or
"no compression") means to do no changes to compression so take what is
the current default, like mount options or properties.
The defrag setting overrides mount or properties. The compression
BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and
does not need to have a fixed value.
Mount with zstd:9, copy test file from /usr/bin/ (about 260KB):
$ mount -o compress=zstd:9 /dev/vda /mnt
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13364.. 13491: 128: 13440: encoded
2: 256.. 291: 13424.. 13459: 36: 13492: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 42% 124K 292K 292K
zstd 42% 124K 292K 292K
Defrag to uncompressed:
$ btrfs fi defrag --nocomp testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 291: 291840.. 292131: 292: last,eof
testfile: 1 extent found
$ compsize testfile
Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 292K 292K 292K
none 100% 292K 292K 292K
Compress again with LZO:
$ btrfs fi defrag -clzo testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13392.. 13519: 128: 13440: encoded
2: 256.. 291: 13480.. 13515: 36: 13520: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 64% 188K 292K 292K
lzo 64% 188K 292K 292K
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If the ssd_spread mount option is enabled, then we run the so called
clustered allocator for data block groups. In practice, this results in
creating a btrfs_free_cluster which caches a block_group and borrows its
free extents for allocation.
Since the introduction of allocation size classes in 6.1, there has been
a bug in the interaction between that feature and ssd_spread.
find_free_extent() has a number of nested loops. The loop going over the
allocation stages, stored in ffe_ctl->loop and managed by
find_free_extent_update_loop(), the loop over the raid levels, and the
loop over all the block_groups in a space_info. The size class feature
relies on the block_group loop to ensure it gets a chance to see a
block_group of a given size class. However, the clustered allocator
uses the cached cluster block_group and breaks that loop. Each call to
do_allocation() will really just go back to the same cached block_group.
Normally, this is OK, as the allocation either succeeds and we don't
want to loop any more or it fails, and we clear the cluster and return
its space to the block_group.
But with size classes, the allocation can succeed, then later fail,
outside of do_allocation() due to size class mismatch. That latter
failure is not properly handled due to the highly complex multi loop
logic. The result is a painful loop where we continue to allocate the
same num_bytes from the cluster in a tight loop until it fails and
releases the cluster and lets us try a new block_group. But by then, we
have skipped great swaths of the available block_groups and are likely
to fail to allocate, looping the outer loop. In pathological cases like
the reproducer below, the cached block_group is often the very last one,
in which case we don't perform this tight bg loop but instead rip
through the ffe stages to LOOP_CHUNK_ALLOC and allocate a chunk, which
is now the last one, and we enter the tight inner loop until an
allocation failure. Then allocation succeeds on the final block_group
and if the next allocation is a size mismatch, the exact same thing
happens again.
Triggering this is as easy as mounting with -o ssd_spread and then
running:
mount -o ssd_spread $dev $mnt
dd if=/dev/zero of=$mnt/big bs=16M count=1 &>/dev/null
dd if=/dev/zero of=$mnt/med bs=4M count=1 &>/dev/null
sync
if you do the two writes + sync in a loop, you can force btrfs to spin
an excessive amount on semi-successful clustered allocations, before
ultimately failing and advancing to the stage where we force a chunk
allocation. This results in 2G of data allocated per iteration, despite
only using ~20M of data. By using a small size classed extent, the inner
loop takes longer and we can spin for longer.
The simplest, shortest term fix to unbreak this is to make the clustered
allocator size_class aware in the dumbest way, where it fails on size
class mismatch. This may hinder the operation of the clustered
allocator, but better hindered than completely broken and terribly
overallocating.
Further re-design improvements are also in the works.
Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator")
CC: stable@vger.kernel.org # 6.1+
Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_zone_finish() can fail for several reason. If it is -EAGAIN, we need
to try it again later. So, put the block group to the retry list properly.
Failing to do so will keep the removable block group intact until remount
and can causes unnecessary ENOSPC.
Fixes: 74e91b12b115 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are some reports of "unable to find chunk map for logical 2147483648
length 16384" error message appears in dmesg. This means some IOs are
occurring after a block group is removed.
When a metadata tree node is cleaned on a zoned setup, we keep that node
still dirty and write it out not to create a write hole. However, this can
make a block group's used bytes == 0 while there is a dirty region left.
Such an unused block group is moved into the unused_bg list and processed
for removal. When the removal succeeds, the block group is removed from the
transaction->dirty_bgs list, so the unused dirty nodes in the block group
are not sent at the transaction commit time. It will be written at some
later time e.g, sync or umount, and causes "unable to find chunk map"
errors.
This can happen relatively easy on SMR whose zone size is 256MB. However,
calling do_zone_finish() on such block group returns -EAGAIN and keep that
block group intact, which is why the issue is hidden until now.
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's just a simple wrapper around btrfs_clear_extent_bit() that passes a
NULL for its last argument (a cached extent state record), plus there is
not counter part - we have a btrfs_set_extent_bit() but we do not have a
btrfs_set_extent_bits() (plural version). So just remove it and make all
callers use btrfs_clear_extent_bit() directly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have a cached extent state record from the previous extent locking so
we can use when setting the EXTENT_NORESERVE in the range, allowing the
operation to be faster if the extent io tree is relatively large.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Set the EXTENT_NORESERVE bit in the io tree before unlocking the range so
that we can use the cached state and speedup the operation, since the
unlock operation releases the cached state.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When BTRFS is doing automatic block-group reclaim, it is spamming the
kernel log messages a lot.
Add a 'verbose' parameter to btrfs_relocate_chunk() and
btrfs_relocate_block_group() to control the verbosity of these log
message. This way the old behaviour of printing log messages on a
user-space initiated balance operation can be kept while excessive log
spamming due to auto reclaim is mitigated.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove the log message before reclaiming a chunk in
btrfs_reclaim_bgs_work(). Especially with automatic block-group
reclaiming these messages spam the kernel log.
Note there is also a tracepoint for the same condition to ease debugging.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|