Age | Commit message (Collapse) | Author |
|
Notice that ignore_dl_rate_limit() need not piggy back on the
limits_changed handling to achieve its goal (which is to enforce a
frequency update before its due time).
Namely, if sugov_should_update_freq() is updated to check
sg_policy->need_freq_update and return 'true' if it is set when
sg_policy->limits_changed is not set, ignore_dl_rate_limit() may
set the former directly instead of setting the latter, so it can
avoid hitting the memory barrier in sugov_should_update_freq().
Update the code accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/10666429.nUPlyArG6x@rjwysocki.net
|
|
The handling of the limits_changed flag in struct sugov_policy needs to
be explicitly synchronized to ensure that cpufreq policy limits updates
will not be missed in some cases.
Without that synchronization it is theoretically possible that
the limits_changed update in sugov_should_update_freq() will be
reordered with respect to the reads of the policy limits in
cpufreq_driver_resolve_freq() and in that case, if the limits_changed
update in sugov_limits() clobbers the one in sugov_should_update_freq(),
the new policy limits may not take effect for a long time.
Likewise, the limits_changed update in sugov_limits() may theoretically
get reordered with respect to the updates of the policy limits in
cpufreq_set_policy() and if sugov_should_update_freq() runs between
them, the policy limits change may be missed.
To ensure that the above situations will not take place, add memory
barriers preventing the reordering in question from taking place and
add READ_ONCE() and WRITE_ONCE() annotations around all of the
limits_changed flag updates to prevent the compiler from messing up
with that code.
Fixes: 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change")
Cc: 5.3+ <stable@vger.kernel.org> # 5.3+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net
|
|
Commit 8e461a1cb43d ("cpufreq: schedutil: Fix superfluous updates caused
by need_freq_update") modified sugov_should_update_freq() to set the
need_freq_update flag only for drivers with CPUFREQ_NEED_UPDATE_LIMITS
set, but that flag generally needs to be set when the policy limits
change because the driver callback may need to be invoked for the new
limits to take effect.
However, if the return value of cpufreq_driver_resolve_freq() after
applying the new limits is still equal to the previously selected
frequency, the driver callback needs to be invoked only in the case
when CPUFREQ_NEED_UPDATE_LIMITS is set (which means that the driver
specifically wants its callback to be invoked every time the policy
limits change).
Update the code accordingly to avoid missing policy limits changes for
drivers without CPUFREQ_NEED_UPDATE_LIMITS.
Fixes: 8e461a1cb43d ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update")
Closes: https://lore.kernel.org/lkml/Z_Tlc6Qs-tYpxWYb@linaro.org/
Reported-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3010358.e9J7NaK4W3@rjwysocki.net
|
|
The QDMA packet scheduler suffers from a performance issue.
Fix this by picking up changes from MediaTek's SDK which change to use
Token Bucket instead of Leaky Bucket and fix the SPEED_1000 configuration.
Fixes: 160d3a9b1929 ("net: ethernet: mtk_eth_soc: introduce MTK_NETSYS_V2 support")
Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/18040f60f9e2f5855036b75b28c4332a2d2ebdd8.1744764277.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
100Mbps
Without this patch, the maximum weight of the queue limit will be
incorrect when linked at 100Mbps due to an apparent typo.
Fixes: f63959c7eec31 ("net: ethernet: mtk_eth_soc: implement multi-queue support for per-port queues")
Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/74111ba0bdb13743313999ed467ce564e8189006.1744764277.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the current method, the MDC divider was reset to the default setting
of 2.5MHz after the NETSYS SER. Therefore, we need to reapply the MDC
divider configuration function in mtk_hw_init() after reset.
Fixes: c0a440031d431 ("net: ethernet: mtk_eth_soc: set MDIO bus clock frequency")
Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/8ab7381447e6cdcb317d5b5a6ddd90a1734efcb0.1744764277.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit <5518f239aff1> ("iommu/vt-d: Move scalable mode ATS enablement to
probe path") changed the PCI ATS enablement logic to run earlier,
specifically before the default domain attachment.
On some client platforms, this change resulted in boot failures, causing
the kernel to panic with the following message and call trace:
Kernel panic - not syncing: DMAR hardware is malfunctioning
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #175
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
dump_stack+0x10/0x16
panic+0x10a/0x2b7
iommu_enable_translation.cold+0xc/0xc
intel_iommu_init+0xe39/0xec0
? trace_hardirqs_on+0x1e/0xd0
? __pfx_pci_iommu_init+0x10/0x10
pci_iommu_init+0xd/0x40
do_one_initcall+0x5b/0x390
kernel_init_freeable+0x26d/0x2b0
? __pfx_kernel_init+0x10/0x10
kernel_init+0x15/0x120
ret_from_fork+0x35/0x60
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1a/0x30
RIP: 1f0f:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0000:0000000000000000 EFLAGS: 841f0f2e66 ORIG_RAX:
1f0f2e6600000000
RAX: 0000000000000000 RBX: 1f0f2e6600000000 RCX:
2e66000000000084
RDX: 0000000000841f0f RSI: 000000841f0f2e66 RDI:
00841f0f2e660000
RBP: 00841f0f2e660000 R08: 00841f0f2e660000 R09:
000000841f0f2e66
R10: 0000000000841f0f R11: 2e66000000000084 R12:
000000841f0f2e66
R13: 0000000000841f0f R14: 2e66000000000084 R15:
1f0f2e6600000000
</TASK>
---[ end Kernel panic - not syncing: DMAR hardware is malfunctioning ]---
Fix this by reverting the timing change for ATS enablement introduced by
the offending commit and restoring the previous behavior.
Fixes: 5518f239aff1 ("iommu/vt-d: Move scalable mode ATS enablement to probe path")
Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Closes: https://lore.kernel.org/linux-iommu/01b9c72f-460d-4f77-b696-54c6825babc9@linux.intel.com/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250416073608.1799578-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
In the review for iommu_copy_struct_to_user() helper, Matt pointed out that
a NULL pointer should be rejected prior to dereferencing it:
https://lore.kernel.org/all/86881827-8E2D-461C-BDA3-FA8FD14C343C@nvidia.com
And Alok pointed out a typo at the same time:
https://lore.kernel.org/all/480536af-6830-43ce-a327-adbd13dc3f1d@oracle.com
Since both issues were copied from iommu_copy_struct_from_user(), fix them
first in the current header.
Fixes: e9d36c07bb78 ("iommu: Add iommu_copy_struct_from_user helper")
Cc: stable@vger.kernel.org
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Matthew R. Ochs <mochs@nvidia.com>
Link: https://lore.kernel.org/r/20250414191635.450472-1-nicolinc@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
There is a string parsing logic error which can lead to an overflow of hid
or uid buffers. Comparing ACPIID_LEN against a total string length doesn't
take into account the lengths of individual hid and uid buffers so the
check is insufficient in some cases. For example if the length of hid
string is 4 and the length of the uid string is 260, the length of str
will be equal to ACPIID_LEN + 1 but uid string will overflow uid buffer
which size is 256.
The same applies to the hid string with length 13 and uid string with
length 250.
Check the length of hid and uid strings separately to prevent
buffer overflow.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: ca3bf5d47cec ("iommu/amd: Introduces ivrs_acpihid kernel parameter")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Paklov <Pavel.Paklov@cyberprotect.ru>
Link: https://lore.kernel.org/r/20250325092259.392844-1-Pavel.Paklov@cyberprotect.ru
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Sending exact nr_segs, avoids bio split check and processing in
block layer, which takes around 5%[1] of overall CPU utilization.
In our setup, we see overall improvement of IOPS from 7.15M to 7.65M [2]
and 5% less CPU utilization.
[1]
3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
[2]
sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
-r4 /dev/nvme0n1 /dev/nvme1n1
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
[Pavel: fixed for kbuf, rebased and reworked on top of cleanups]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a1a49a8d053bd617c244291d63dbfbc07afde36.1744882081.git.asml.silence@gmail.com
[axboe: fold in fix factoring in buf reg offset]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fix for net
The following batch contains one Netfilter fix for net:
1) conntrack offload bit is erroneously unset in a race scenario,
from Florian Westphal.
netfilter pull request 25-04-17
* tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: fix erronous removal of offload bit
====================
Link: https://patch.msgid.link/20250417102847.16640-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
io_import_fixed is a mess. Even though we know the final len of the
iterator, we still assign offset + len and do some magic after to
correct for that.
Do offset calculation first and finalise it with iov_iter_bvec at the
end.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2d5107fed24f8b23245ef2ede9a5a7f7c426df61.1744882081.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Kernel registered buffers are special because segments are not uniform
in size, and we have a bunch of optimisations based on that uniformity
for normal buffers. Handle kbuf separately, it'll be cleaner this way.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4e9e5990b0ab5aee723c0be5cd9b5bcf810375f9.1744882081.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't optimise for requests with offset=0. Large registered buffers are
the preference and hence the user is likely to pass an offset, and the
adjustments are not expensive and will be made even cheaper in following
patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1c2beb20470ee3c886a363d4d8340d3790db19f3.1744882081.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
From PMU's perspective, Panther Lake is similar to the previous
generation Lunar Lake. Both are hybrid platforms, with e-core and
p-core.
The key differences are the ARCH PEBS feature and several new events.
The ARCH PEBS is supported in the following patches.
The new events will be supported later in perf tool.
Share the code path with the Lunar Lake. Only update the name.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250415114428.341182-2-dapeng1.mi@linux.intel.com
|
|
Currently when a user samples user space GPRs (--user-regs option) with
PEBS, the user space GPRs actually always come from software PMI
instead of from PEBS hardware. This leads to the sampled GPRs to
possibly be inaccurate for single PEBS record case because of the
skid between counter overflow and GPRs sampling on PMI.
For the large PEBS case, it is even worse. If user sets the
exclude_kernel attribute, large PEBS would be used to sample user space
GPRs, but since PEBS GPRs group is not really enabled, it leads to all
samples in the large PEBS record to share the same piece of user space
GPRs, like this reproducer shows:
$ perf record -e branches:pu --user-regs=ip,ax -c 100000 ./foo
$ perf report -D | grep "AX"
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
So enable GPRs group for user space GPRs sampling and prioritize reading
GPRs from PEBS. If the PEBS sampled GPRs is not user space GPRs (single
PEBS record case), perf_sample_regs_user() modifies them to user space
GPRs.
[ mingo: Clarified the changelog. ]
Fixes: c22497f5838c ("perf/x86/intel: Support adaptive PEBS v4")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250415104135.318169-2-dapeng1.mi@linux.intel.com
|
|
The below code would always unconditionally clear other status bits like
perf metrics overflow bit once PEBS buffer overflows:
status &= intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
This is incorrect. Perf metrics overflow bit should be cleared only when
fixed counter 3 in PEBS counter group. Otherwise perf metrics overflow
could be missed to handle.
Closes: https://lore.kernel.org/all/20250225110012.GK31462@noisy.programming.kicks-ass.net/
Fixes: 7b2c05a15d29 ("perf/x86/intel: Generic support for hardware TopDown metrics")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250415104135.318169-1-dapeng1.mi@linux.intel.com
|
|
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 6.15
- fix scan failure for non-ANA multipath controllers (Hannes Reinecke)
- fix multipath sysfs links creation for some cases (Hannes Reinecke)
- PCIe endpoint fixes (Damien Le Moal)
- use NULL instead of 0 in the auth code (Damien Le Moal)"
* tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme:
nvmet: pci-epf: cleanup link state management
nvmet: pci-epf: clear CC and CSTS when disabling the controller
nvmet: pci-epf: always fully initialize completion entries
nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free()
nvme-multipath: sysfs links may not be created for devices
nvme: fixup scan failure for non-ANA multipath controllers
|
|
We've never supported StreamID aliasing between devices, and as such
they will never have had functioning DMA, but this is not fatal to the
SMMU itself. Although aliasing between hard-wired platform device
StreamIDs would tend to raise questions about the whole system, in
practice it's far more likely to occur relatively innocently due to
legacy PCI bridges, where the underlying StreamID mappings are still
perfectly reasonable.
As such, return a more benign -ENODEV when failing probe for such an
unsupported device (and log a more obvious error message), so that it
doesn't break the entire SMMU probe now that bus_iommu_probe() runs in
the right order and can propagate that error back. The end result is
still that the device doesn't get an IOMMU group and probably won't
work, same as before.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/39d54e49c8476efc4653e352150d44b185d6d50f.1744380554.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
ASPEED VGA card has two built-in devices:
0008:06:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 06)
0008:07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
Its toplogy looks like this:
+-[0008:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0 Sandisk Corp Device 5017
| +-01.0-[04]--
| +-02.0-[05]----00.0 NVIDIA Corporation Device
| +-03.0-[06-07]----00.0-[07]----00.0 ASPEED Technology, Inc. ASPEED Graphics Family
| +-04.0-[08]----00.0 Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
| \-05.0-[09]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
\-00.1 PMC-Sierra Inc. Device 4028
The IORT logic populaties two identical IDs into the fwspec->ids array via
DMA aliasing in iort_pci_iommu_init() called by pci_for_each_dma_alias().
Though the SMMU driver had been able to handle this situation since commit
563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs"), that
got broken by the later commit cdf315f907d4 ("iommu/arm-smmu-v3: Maintain
a SID->device structure"), which ended up with allocating separate streams
with the same stuffing.
On a kernel prior to v6.15-rc1, there has been an overlooked warning:
pci 0008:07:00.0: vgaarb: setting as boot VGA device
pci 0008:07:00.0: vgaarb: bridge control possible
pci 0008:07:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
pcieport 0008:06:00.0: Adding to iommu group 14
ast 0008:07:00.0: stream 67328 already in tree <===== WARNING
ast 0008:07:00.0: enabling device (0002 -> 0003)
ast 0008:07:00.0: Using default configuration
ast 0008:07:00.0: AST 2600 detected
ast 0008:07:00.0: [drm] Using analog VGA
ast 0008:07:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
[drm] Initialized ast 0.1.0 for 0008:07:00.0 on minor 0
ast 0008:07:00.0: [drm] fb0: astdrmfb frame buffer device
With v6.15-rc, since the commit bcb81ac6ae3c ("iommu: Get DT/ACPI parsing
into the proper probe path"), the error returned with the warning is moved
to the SMMU device probe flow:
arm_smmu_probe_device+0x15c/0x4c0
__iommu_probe_device+0x150/0x4f8
probe_iommu_group+0x44/0x80
bus_for_each_dev+0x7c/0x100
bus_iommu_probe+0x48/0x1a8
iommu_device_register+0xb8/0x178
arm_smmu_device_probe+0x1350/0x1db0
which then fails the entire SMMU driver probe:
pci 0008:06:00.0: Adding to iommu group 21
pci 0008:07:00.0: stream 67328 already in tree
arm-smmu-v3 arm-smmu-v3.9.auto: Failed to register iommu
arm-smmu-v3 arm-smmu-v3.9.auto: probe with driver arm-smmu-v3 failed with error -22
Since SMMU driver had been already expecting a potential duplicated Stream
ID in arm_smmu_install_ste_for_dev(), change the arm_smmu_insert_master()
routine to ignore a duplicated ID from the fwspec->sids array as well.
Note: this has been failing the iommu_device_probe() since 2021, although a
recent iommu commit in v6.15-rc1 that moves iommu_device_probe() started to
fail the SMMU driver probe. Since nobody has cared about DMA Alias support,
leave that as it was but fix the fundamental iommu_device_probe() breakage.
Fixes: cdf315f907d4 ("iommu/arm-smmu-v3: Maintain a SID->device structure")
Cc: stable@vger.kernel.org
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250415185620.504299-1-nicolinc@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
UBSan caught a bug with IOMMU SVA domains, where the reported exponent
value in __arm_smmu_tlb_inv_range() was >= 64.
__arm_smmu_tlb_inv_range() uses the domain's pgsize_bitmap to compute
the number of pages to invalidate and the invalidation range. Currently
arm_smmu_sva_domain_alloc() does not setup the iommu domain's
pgsize_bitmap. This leads to __ffs() on the value returning 64 and that
leads to undefined behaviour w.r.t. shift operations
Fix this by initializing the iommu_domain's pgsize_bitmap to PAGE_SIZE.
Effectively the code needs to use the smallest page size for
invalidation
Cc: stable@vger.kernel.org
Fixes: eb6c97647be2 ("iommu/arm-smmu-v3: Avoid constructing invalid range commands")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250412002354.3071449-1-balbirs@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
pci1xxxx enables its PCI device with pcim_enable_device(). This,
implicitly, switches the function pci_request_regions() into managed
mode, where it becomes a devres function.
The PCI subsystem wants to remove this hybrid nature from its
interfaces. To do so, users of the aforementioned combination of
functions must be ported to non-hybrid functions.
Moreover, since both functions are already managed in this driver, the
call to pci_release_regions() is unnecessary.
Remove the call to pci_release_regions().
Replace the call to sometimes-managed pci_request_regions() with one to
the always-managed pcim_request_all_regions().
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://patch.msgid.link/20250417083902.23483-4-phasta@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
cavium-thunderx enables its PCI device with pcim_enable_device(). This,
implicitly, switches the function pci_request_regions() into managed
mode, where it becomes a devres function.
The PCI subsystem wants to remove this hybrid nature from its
interfaces. To do so, users of the aforementioned combination of
functions must be ported to non-hybrid functions.
Moreover, since both functions are already managed in this driver, the
calls to pci_release_regions() are unnecessary.
Remove the calls to pci_release_regions().
Replace the call to sometimes-managed pci_request_regions() with one to
the always-managed pcim_request_all_regions().
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://patch.msgid.link/20250417083902.23483-2-phasta@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Commit 67e4fe398513 ("iommu/arm-smmu-v3: Use S2FWB for NESTED domains")
introduced S2FWB usage but omitted the corresponding feature detection.
As a result, vIOMMU allocation fails on FVP in arm_vsmmu_alloc(), due to
the following check:
if (!arm_smmu_master_canwbs(master) &&
!(smmu->features & ARM_SMMU_FEAT_S2FWB))
return ERR_PTR(-EOPNOTSUPP);
This patch adds the missing detection logic to prevent allocation
failure when S2FWB is supported.
Fixes: 67e4fe398513 ("iommu/arm-smmu-v3: Use S2FWB for NESTED domains")
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20250408033351.1012411-1-aneesh.kumar@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add check for the return value of spi_imx_setupxfer().
spi_imx->rx and spi_imx->tx function pointer can be NULL when
spi_imx_setupxfer() return error, and make NULL pointer dereference.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Call trace:
0x0
spi_imx_pio_transfer+0x50/0xd8
spi_imx_transfer_one+0x18c/0x858
spi_transfer_one_message+0x43c/0x790
__spi_pump_transfer_message+0x238/0x5d4
__spi_sync+0x2b0/0x454
spi_write_then_read+0x11c/0x200
Signed-off-by: Tamura Dai <kirinode0@gmail.com>
Reviewed-by: Carlos Song <carlos.song@nxp.com>
Link: https://patch.msgid.link/20250417011700.14436-1-kirinode0@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
error handling
wmax_thermal_information() may also return -ENOMSG, which would leave
`id` uninitialized in thermal_profile_probe.
Reorder and modify logic to catch all errors.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/Z_-KVqNbD9ygvE2X@stanley.mountain
Fixes: 27e9e6339896 ("platform/x86: alienware-wmi: Refactor thermal control methods")
Signed-off-by: Kurt Borja <kuurtb@gmail.com>
Link: https://lore.kernel.org/r/20250416-smatch-fix-v1-1-35491b462d8f@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
In certain situations, the sysfs for uncore may not be present when all
CPUs in a package are offlined and then brought back online after boot.
This issue can occur if there is an error in adding the sysfs entry due
to a memory allocation failure. Retrying to bring the CPUs online will
not resolve the issue, as the uncore_cpu_mask is already set for the
package before the failure condition occurs.
This issue does not occur if the failure happens during module
initialization, as the module will fail to load in the event of any
error.
To address this, ensure that the uncore_cpu_mask is not set until the
successful return of uncore_freq_add_entry().
Fixes: dbce412a7733 ("platform/x86/intel-uncore-freq: Split common and enumeration part")
Signed-off-by: Shouye Liu <shouyeliu@tencent.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250417032321.75580-1-shouyeliu@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
When an APU exits HW sleep with no active wake sources the Linux kernel will
rapidly assert that the APU can enter back into HW sleep. This happens in a
few ms. Contrasting this to Windows, Windows can take 10s of seconds to
enter back into the resiliency phase for Modern Standby.
For some situations this can be problematic because it can cause leakage
from VDDCR_SOC to VDD_MISC and force VDD_MISC outside of the electrical
design guide specifications. On some designs this will trip the over
voltage protection feature (OVP) of the voltage regulator module, but it
could cause APU damage as well.
To prevent this risk, add an explicit sleep call so that future attempts
to enter into HW sleep will have enough time to settle. This will occur
while the screen is dark and only on cases that the APU should enter HW
sleep again, so it shouldn't be noticeable to any user.
Cc: stable@vger.kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Link: https://lore.kernel.org/r/20250414162446.3853194-1-superm1@kernel.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- l2cap: Process valid commands in too long frame
- vhci: Avoid needless snprintf() calls
* tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: vhci: Avoid needless snprintf() calls
Bluetooth: l2cap: Process valid commands in too long frame
====================
Link: https://patch.msgid.link/20250416210126.2034212-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The scale of IIO bandwidth in free running counters is inherited from
the ICX. The counter increments for every 32 bytes rather than 4 bytes.
The IIO bandwidth out free running counters don't increment with a
consistent size. The increment depends on the requested size. It's
impossible to find a fixed increment. Remove it from the event_descs.
Fixes: 0378c93a92e2 ("perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server")
Reported-by: Tang Jun <dukang.tj@alibaba-inc.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250416142426.3933977-3-kan.liang@linux.intel.com
|
|
There was a mistake in the ICX uncore spec too. The counter increments
for every 32 bytes rather than 4 bytes.
The same as SNR, there are 1 ioclk and 8 IIO bandwidth in free running
counters. Reuse the snr_uncore_iio_freerunning_events().
Fixes: 2b3b76b5ec67 ("perf/x86/intel/uncore: Add Ice Lake server uncore support")
Reported-by: Tang Jun <dukang.tj@alibaba-inc.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250416142426.3933977-2-kan.liang@linux.intel.com
|
|
There was a mistake in the SNR uncore spec. The counter increments for
every 32 bytes of data sent from the IO agent to the SOC, not 4 bytes
which was documented in the spec.
The event list has been updated:
"EventName": "UNC_IIO_BANDWIDTH_IN.PART0_FREERUN",
"BriefDescription": "Free running counter that increments for every 32
bytes of data sent from the IO agent to the SOC",
Update the scale of the IIO bandwidth in free running counters as well.
Fixes: 210cc5f9db7a ("perf/x86/intel/uncore: Add uncore support for Snow Ridge server")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250416142426.3933977-1-kan.liang@linux.intel.com
|
|
Meghana Malladi says:
====================
Bug fixes from XDP and perout series
This patch series consists of bug fixes from the XDP series:
1. Fixes a kernel warning that occurs when bringing down the
network interface.
2. Resolves a potential NULL pointer dereference in the
emac_xmit_xdp_frame() function.
3. Resolves a potential NULL pointer dereference in the
icss_iep_perout_enable() function
v3: https://lore.kernel.org/all/20250328102403.2626974-1-m-malladi@ti.com/
====================
Link: https://patch.msgid.link/20250415090543.717991-1-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The ICSS IEP driver tracks perout and pps enable state with flags.
Currently when disabling pps and perout signals during icss_iep_exit(),
results in NULL pointer dereference for perout.
To fix the null pointer dereference issue, the icss_iep_perout_enable_hw
function can be modified to directly clear the IEP CMP registers when
disabling PPS or PEROUT, without referencing the ptp_perout_request
structure, as its contents are irrelevant in this case.
Fixes: 9b115361248d ("net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_init")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/7b1c7c36-363a-4085-b26c-4f210bee1df6@stanley.mountain/
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250415090543.717991-4-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
emac_xmit_xdp_frame()
There is an error check inside emac_xmit_xdp_frame() function which
is called when the driver wants to transmit XDP frame, to check if
the allocated tx descriptor is NULL, if true to exit and return
ICSSG_XDP_CONSUMED implying failure in transmission.
In this case trying to free a descriptor which is NULL will result
in kernel crash due to NULL pointer dereference. Fix this error handling
and increase netdev tx_dropped stats in the caller of this function
if the function returns ICSSG_XDP_CONSUMED.
Fixes: 62aa3246f462 ("net: ti: icssg-prueth: Add XDP support")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/70d8dd76-0c76-42fc-8611-9884937c82f5@stanley.mountain/
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250415090543.717991-3-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
During network interface initialization, the NIC driver needs to register
its Rx queue with the XDP, to ensure the incoming XDP buffer carries a
pointer reference to this info and is stored inside xdp_rxq_info.
While this struct isn't tied to XDP prog, if there are any changes in
Rx queue, the NIC driver needs to stop the Rx queue by unregistering
with XDP before purging and reallocating memory. Drop page_pool destroy
during Rx channel reset as this is already handled by XDP during
xdp_rxq_info_unreg (Rx queue unregister), failing to do will cause the
following warning:
warning logs: https://gist.github.com/MeghanaMalladiTI/eb627e5dc8de24e42d7d46572c13e576
Fixes: 46eeb90f03e0 ("net: ti: icssg-prueth: Use page_pool API for RX buffer allocation")
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250415090543.717991-2-m-malladi@ti.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
There is a potential deadlock if we do report zones in an IO context, detailed
in below lockdep report. When one process do a report zones and another process
freezes the block device, the report zones side cannot allocate a tag because
the freeze is already started. This can thus result in new block group creation
to hang forever, blocking the write path.
Thankfully, a new block group should be created on empty zones. So, reporting
the zones is not necessary and we can set the write pointer = 0 and load the
zone capacity from the block layer using bdev_zone_capacity() helper.
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc1 #252 Not tainted
------------------------------------------------------
modprobe/1110 is trying to acquire lock:
ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60
but task is already holding lock:
ffff8881205b6f20 (&q->q_usage_counter(queue)#16){++++}-{0:0}, at: sd_remove+0x85/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&q->q_usage_counter(queue)#16){++++}-{0:0}:
blk_queue_enter+0x3d9/0x500
blk_mq_alloc_request+0x47d/0x8e0
scsi_execute_cmd+0x14f/0xb80
sd_zbc_do_report_zones+0x1c1/0x470
sd_zbc_report_zones+0x362/0xd60
blkdev_report_zones+0x1b1/0x2e0
btrfs_get_dev_zones+0x215/0x7e0 [btrfs]
btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs]
btrfs_make_block_group+0x36b/0x870 [btrfs]
btrfs_create_chunk+0x147d/0x2320 [btrfs]
btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs]
start_transaction+0xce6/0x1620 [btrfs]
btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs]
kthread+0x39d/0x750
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
-> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}:
down_read+0x9b/0x470
btrfs_map_block+0x2ce/0x2ce0 [btrfs]
btrfs_submit_chunk+0x2d4/0x16c0 [btrfs]
btrfs_submit_bbio+0x16/0x30 [btrfs]
btree_write_cache_pages+0xb5a/0xf90 [btrfs]
do_writepages+0x17f/0x7b0
__writeback_single_inode+0x114/0xb00
writeback_sb_inodes+0x52b/0xe00
wb_writeback+0x1a7/0x800
wb_workfn+0x12a/0xbd0
process_one_work+0x85a/0x1460
worker_thread+0x5e2/0xfc0
kthread+0x39d/0x750
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}:
__mutex_lock+0x1aa/0x1360
btree_write_cache_pages+0x252/0xf90 [btrfs]
do_writepages+0x17f/0x7b0
__writeback_single_inode+0x114/0xb00
writeback_sb_inodes+0x52b/0xe00
wb_writeback+0x1a7/0x800
wb_workfn+0x12a/0xbd0
process_one_work+0x85a/0x1460
worker_thread+0x5e2/0xfc0
kthread+0x39d/0x750
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
-> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
__lock_acquire+0x2f52/0x5ea0
lock_acquire+0x1b1/0x540
__flush_work+0x3ac/0xb60
wb_shutdown+0x15b/0x1f0
bdi_unregister+0x172/0x5b0
del_gendisk+0x841/0xa20
sd_remove+0x85/0x130
device_release_driver_internal+0x368/0x520
bus_remove_device+0x1f1/0x3f0
device_del+0x3bd/0x9c0
__scsi_remove_device+0x272/0x340
scsi_forget_host+0xf7/0x170
scsi_remove_host+0xd2/0x2a0
sdebug_driver_remove+0x52/0x2f0 [scsi_debug]
device_release_driver_internal+0x368/0x520
bus_remove_device+0x1f1/0x3f0
device_del+0x3bd/0x9c0
device_unregister+0x13/0xa0
sdebug_do_remove_host+0x1fb/0x290 [scsi_debug]
scsi_debug_exit+0x17/0x70 [scsi_debug]
__do_sys_delete_module.isra.0+0x321/0x520
do_syscall_64+0x93/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
(work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)#16
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->q_usage_counter(queue)#16);
lock(&fs_info->dev_replace.rwsem);
lock(&q->q_usage_counter(queue)#16);
lock((work_completion)(&(&wb->dwork)->work));
*** DEADLOCK ***
5 locks held by modprobe/1110:
#0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520
#1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0
#2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520
#3: ffff8881205b6f20 (&q->q_usage_counter(queue)#16){++++}-{0:0}, at: sd_remove+0x85/0x130
#4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60
stack backtrace:
CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 #252
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x90
print_circular_bug.cold+0x1e0/0x274
check_noncircular+0x306/0x3f0
? __pfx_check_noncircular+0x10/0x10
? mark_lock+0xf5/0x1650
? __pfx_check_irq_usage+0x10/0x10
? lockdep_lock+0xca/0x1c0
? __pfx_lockdep_lock+0x10/0x10
__lock_acquire+0x2f52/0x5ea0
? __pfx___lock_acquire+0x10/0x10
? __pfx_mark_lock+0x10/0x10
lock_acquire+0x1b1/0x540
? __flush_work+0x38f/0xb60
? __pfx_lock_acquire+0x10/0x10
? __pfx_lock_release+0x10/0x10
? mark_held_locks+0x94/0xe0
? __flush_work+0x38f/0xb60
__flush_work+0x3ac/0xb60
? __flush_work+0x38f/0xb60
? __pfx_mark_lock+0x10/0x10
? __pfx___flush_work+0x10/0x10
? __pfx_wq_barrier_func+0x10/0x10
? __pfx___might_resched+0x10/0x10
? mark_held_locks+0x94/0xe0
wb_shutdown+0x15b/0x1f0
bdi_unregister+0x172/0x5b0
? __pfx_bdi_unregister+0x10/0x10
? up_write+0x1ba/0x510
del_gendisk+0x841/0xa20
? __pfx_del_gendisk+0x10/0x10
? _raw_spin_unlock_irqrestore+0x35/0x60
? __pm_runtime_resume+0x79/0x110
sd_remove+0x85/0x130
device_release_driver_internal+0x368/0x520
? kobject_put+0x5d/0x4a0
bus_remove_device+0x1f1/0x3f0
device_del+0x3bd/0x9c0
? __pfx_device_del+0x10/0x10
__scsi_remove_device+0x272/0x340
scsi_forget_host+0xf7/0x170
scsi_remove_host+0xd2/0x2a0
sdebug_driver_remove+0x52/0x2f0 [scsi_debug]
? kernfs_remove_by_name_ns+0xc0/0xf0
device_release_driver_internal+0x368/0x520
? kobject_put+0x5d/0x4a0
bus_remove_device+0x1f1/0x3f0
device_del+0x3bd/0x9c0
? __pfx_device_del+0x10/0x10
? __pfx___mutex_unlock_slowpath+0x10/0x10
device_unregister+0x13/0xa0
sdebug_do_remove_host+0x1fb/0x290 [scsi_debug]
scsi_debug_exit+0x17/0x70 [scsi_debug]
__do_sys_delete_module.isra.0+0x321/0x520
? __pfx___do_sys_delete_module.isra.0+0x10/0x10
? __pfx_slab_free_after_rcu_debug+0x10/0x10
? kasan_save_stack+0x2c/0x50
? kasan_record_aux_stack+0xa3/0xb0
? __call_rcu_common.constprop.0+0xc4/0xfb0
? kmem_cache_free+0x3a0/0x590
? __x64_sys_close+0x78/0xd0
do_syscall_64+0x93/0x180
? lock_is_held_type+0xd5/0x130
? __call_rcu_common.constprop.0+0x3c0/0xfb0
? lockdep_hardirqs_on+0x78/0x100
? __call_rcu_common.constprop.0+0x3c0/0xfb0
? __pfx___call_rcu_common.constprop.0+0x10/0x10
? kmem_cache_free+0x3a0/0x590
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? __pfx___x64_sys_openat+0x10/0x10
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f436712b68b
RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8
RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000
R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
CC: <stable@vger.kernel.org> # 6.13+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
{bdev,disk}_zone_capacity() takes block_device or gendisk and sector position
and returns the zone capacity of the corresponding zone.
With that, move disk_nr_zones() and blk_zone_plug_bio() to consolidate them in
the same #ifdef block.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The whole tree checker returns EUCLEAN, except the one check in
btrfs_verify_level_key(). This was inherited from the function that was
moved from disk-io.c in 2cac5af16537 ("btrfs: move
btrfs_verify_level_key into tree-checker.c") but this should be unified
with the rest.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we have a failure at create_reloc_inode(), under the 'out' label we
assign an error pointer to the 'inode' variable and then return a weird
pointer because we return the expression "&inode->vfs_inode":
static noinline_for_stack struct inode *create_reloc_inode(
const struct btrfs_block_group *group)
{
(...)
out:
(...)
if (ret) {
if (inode)
iput(&inode->vfs_inode);
inode = ERR_PTR(ret);
}
return &inode->vfs_inode;
}
This can make us return a pointer that is not an error pointer and make
the caller proceed as if an error didn't happen and later result in an
invalid memory access when dereferencing the inode pointer.
Syzbot reported reported such a case with the following stack trace:
R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790
</TASK>
BTRFS info (device loop0): relocating block group 6881280 flags data|metadata
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000045: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000228-0x000000000000022f]
CPU: 0 UID: 0 PID: 5332 Comm: syz-executor215 Not tainted 6.14.0-syzkaller-13423-ga8662bcd2ff1 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971
Code: 00 74 08 (...)
RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203
RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000
RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000
FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
relocate_block_group+0xa1e/0xd50 fs/btrfs/relocation.c:3657
btrfs_relocate_block_group+0x777/0xd80 fs/btrfs/relocation.c:4011
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3511
__btrfs_balance+0x1a93/0x25e0 fs/btrfs/volumes.c:4292
btrfs_balance+0xbde/0x10c0 fs/btrfs/volumes.c:4669
btrfs_ioctl_balance+0x3f5/0x660 fs/btrfs/ioctl.c:3586
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb4ef537dd9
Code: 28 00 00 (...)
RSP: 002b:00007ffc55de5728 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc55de5750 RCX: 00007fb4ef537dd9
RDX: 0000200000000440 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 0000000000000002 R08: 00007ffc55de54c6 R09: 00007ffc55de5770
R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971
Code: 00 74 08 (...)
RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203
RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000
RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000
FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
0: 00 74 08 48 add %dh,0x48(%rax,%rcx,1)
4: 89 df mov %ebx,%edi
6: e8 f8 36 24 fe call 0xfe243703
b: 48 89 9c 24 30 01 00 mov %rbx,0x130(%rsp)
12: 00
13: 4c 89 74 24 28 mov %r14,0x28(%rsp)
18: 4d 8b 76 10 mov 0x10(%r14),%r14
1c: 49 8d 9e 98 fe ff ff lea -0x168(%r14),%rbx
23: 48 89 d8 mov %rbx,%rax
26: 48 c1 e8 03 shr $0x3,%rax
* 2a: 42 80 3c 20 00 cmpb $0x0,(%rax,%r12,1) <-- trapping instruction
2f: 74 08 je 0x39
31: 48 89 df mov %rbx,%rdi
34: e8 ca 36 24 fe call 0xfe243703
39: 4c 8b 3b mov (%rbx),%r15
3c: 48 rex.W
3d: 8b .byte 0x8b
3e: 44 rex.R
3f: 24 .byte 0x24
So fix this by returning the error immediately.
Reported-by: syzbot+7481815bb47ef3e702e2@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/67f14ee9.050a0220.0a13.023e.GAE@google.com/
Fixes: b204e5c7d4dc ("btrfs: make btrfs_iget() return a btrfs inode instead")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There was a bug report about a NULL pointer dereference in
__btrfs_add_free_space_zoned() that ultimately happens because a
conversion from the default metadata profile DUP to a RAID1 profile on two
disks.
The stack trace has the following signature:
BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
BUG: kernel NULL pointer dereference, address: 0000000000000058
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:__btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
RSP: 0018:ffffa236b6f3f6d0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff96c8132f3400 RCX: 0000000000000001
RDX: 0000000010000000 RSI: 0000000000000000 RDI: ffff96c8132f3410
RBP: 0000000010000000 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000
R13: ffff96c758f65a40 R14: 0000000000000001 R15: 000011aac0000000
FS: 00007fdab1cb2900(0000) GS:ffff96e60ca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000058 CR3: 00000001a05ae000 CR4: 0000000000350ef0
Call Trace:
<TASK>
? __die_body.cold+0x19/0x27
? page_fault_oops+0x15c/0x2f0
? exc_page_fault+0x7e/0x180
? asm_exc_page_fault+0x26/0x30
? __btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
btrfs_add_free_space_async_trimmed+0x34/0x40
btrfs_add_new_free_space+0x107/0x120
btrfs_make_block_group+0x104/0x2b0
btrfs_create_chunk+0x977/0xf20
btrfs_chunk_alloc+0x174/0x510
? srso_return_thunk+0x5/0x5f
btrfs_inc_block_group_ro+0x1b1/0x230
btrfs_relocate_block_group+0x9e/0x410
btrfs_relocate_chunk+0x3f/0x130
btrfs_balance+0x8ac/0x12b0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? __kmalloc_cache_noprof+0x14c/0x3e0
btrfs_ioctl+0x2686/0x2a80
? srso_return_thunk+0x5/0x5f
? ioctl_has_perm.constprop.0.isra.0+0xd2/0x120
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x82/0x160
? srso_return_thunk+0x5/0x5f
? __memcg_slab_free_hook+0x11a/0x170
? srso_return_thunk+0x5/0x5f
? kmem_cache_free+0x3f0/0x450
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? syscall_exit_to_user_mode+0x10/0x210
? srso_return_thunk+0x5/0x5f
? do_syscall_64+0x8e/0x160
? sysfs_emit+0xaf/0xc0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? seq_read_iter+0x207/0x460
? srso_return_thunk+0x5/0x5f
? vfs_read+0x29c/0x370
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? syscall_exit_to_user_mode+0x10/0x210
? srso_return_thunk+0x5/0x5f
? do_syscall_64+0x8e/0x160
? srso_return_thunk+0x5/0x5f
? exc_page_fault+0x7e/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fdab1e0ca6d
RSP: 002b:00007ffeb2b60c80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fdab1e0ca6d
RDX: 00007ffeb2b60d80 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 00007ffeb2b60cd0 R08: 0000000000000000 R09: 0000000000000013
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffeb2b6343b R14: 00007ffeb2b60d80 R15: 0000000000000001
</TASK>
CR2: 0000000000000058
---[ end trace 0000000000000000 ]---
The 1st line is the most interesting here:
BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
When a RAID1 block-group is created and a write pointer mismatch between
the disks in the RAID set is detected, btrfs sets the alloc_offset to the
length of the block group marking it as full. Afterwards the code expects
that a balance operation will evacuate the data in this block-group and
repair the problems.
But before this is possible, the new space of this block-group will be
accounted in the free space cache. But in __btrfs_add_free_space_zoned()
it is being checked if it is a initial creation of a block group and if
not a reclaim decision will be made. But the decision if a block-group's
free space accounting is done for an initial creation depends on if the
size of the added free space is the whole length of the block-group and
the allocation offset is 0.
But as btrfs_load_block_group_zone_info() sets the allocation offset to
the zone capacity (i.e. marking the block-group as full) this initial
decision is not met, and the space_info pointer in the 'struct
btrfs_block_group' has not yet been assigned.
Fail creation of the block group and rely on manual user intervention to
re-balance the filesystem.
Afterwards the filesystem can be unmounted, mounted in degraded mode and
the missing device can be removed after a full balance of the filesystem.
Reported-by: 西木野羰基 <yanqiyu01@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAB_b4sBhDe3tscz=duVyhc9hNE+gu=B8CrgLO152uMyanR8BEA@mail.gmail.com/
Fixes: b1934cd60695 ("btrfs: zoned: handle broken write pointer on zones")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After enabling large data folios for tests, I hit the ASSERT() inside
GET_SUBPAGE_BITMAP() where blocks_per_folio matches BITS_PER_LONG.
The ASSERT() itself is only based on the original subpage fs block size,
where we have at most 16 blocks per page, thus
"ASSERT(blocks_per_folio < BITS_PER_LONG)".
However the experimental large data folio support will set the max folio
order according to the BITS_PER_LONG, so we can have a case where a large
folio contains exactly BITS_PER_LONG blocks.
So the ASSERT() is too strict, change it to
"ASSERT(blocks_per_folio <= BITS_PER_LONG)" to avoid the false alert.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When running btrfs/004 with 4K fs block size and 64K page size,
sometimes fsstress workload can take 100% CPU for a while, but not long
enough to trigger a 120s hang warning.
[CAUSE]
When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is
always in the call trace.
One example when this problem happens, the function
btrfs_punch_hole_lock_range() got the following parameters:
lock_start = 4096, lockend = 20469
Then we calculate @page_lockstart by rounding up lock_start to page
boundary, which is 64K (page size is 64K).
For @page_lockend, we round down the value towards page boundary, which
result 0. Then since we need to pass an inclusive end to
filemap_range_has_page(), we subtract 1 from the rounded down value,
resulting in (u64)-1.
In the above case, the range is inside the same page, and we do not even
need to call filemap_range_has_page(), not to mention to call it with
(u64)-1 at the end.
This behavior will cause btrfs_punch_hole_lock_range() to busy loop
waiting for irrelevant range to have its pages dropped.
[FIX]
Calculate @page_lockend by just rounding down @lockend, without
decreasing the value by one. So @page_lockend will no longer overflow.
Then exit early if @page_lockend is no larger than @page_lockstart.
As it means either the range is inside the same page, or the two pages
are adjacent already.
Finally only decrease @page_lockend when calling filemap_range_has_page().
Fixes: 0528476b6ac7 ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
subpage_calc_start_bit()
Inside the macro, subpage_calc_start_bit(), we need to calculate the
offset to the beginning of the folio.
But we're using offset_in_page(), on systems with 4K page size and 4K fs
block size, this means we will always return offset 0 for a large folio,
causing all kinds of errors.
Fix it by using offset_in_folio() instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The blamed commit exposes a possible issue with flow_offload_teardown():
We might remove the offload bit of a conntrack entry that has been
offloaded again.
1. conntrack entry c1 is offloaded via flow f1 (f1->ct == c1).
2. f1 times out and is pushed back to slowpath, c1 offload bit is
removed. Due to bug, f1 is not unlinked from rhashtable right away.
3. a new packet arrives for the flow and re-offload is triggered, i.e.
f2->ct == c1. This is because lookup in flowtable skip entries with
teardown bit set.
4. Next flowtable gc cycle finds f1 again
5. flow_offload_teardown() is called again for f1 and c1 offload bit is
removed again, even though we have f2 referencing the same entry.
This is harmless, but clearly not correct.
Fix the bug that exposes this: set 'teardown = true' to have the gc
callback unlink the flowtable entry from the table right away instead of
the unintentional defer to the next round.
Also prevent flow_offload_teardown() from fixing up the ct state more than
once: We could also be called from the data path or a notifier, not only
from the flowtable gc callback.
NF_FLOW_TEARDOWN can never be unset, so we can use it as synchronization
point: if we observe did not see a 0 -> 1 transition, then another CPU
is already doing the ct state fixups for us.
Fixes: 03428ca5cee9 ("netfilter: conntrack: rework offload nf_conn timeout extension logic")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Fix and improve documentation related to landlock_restrict_self(2)'s
flags. Update the LANDLOCK_RESTRICT_SELF_LOG_SAME_EXEC_OFF
documentation according to the current semantic.
Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20250416154716.1799902-3-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
|
|
Fix, deduplicate, and improve rendering of landlock_restrict_self(2)'s
flags documentation.
The flags are now rendered like the syscall's parameters and
description.
Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20250416154716.1799902-2-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
|
|
Move and fix the flags documentation, and improve formatting.
It makes more sense and it eases maintenance to document syscall flags
in landlock.h, where they are defined. This is already the case for
landlock_restrict_self(2)'s flags.
The flags are now rendered like the syscall's parameters and
description.
Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20250416154716.1799902-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
|
|
Currently bdex_statx is only called from the very high-level
vfs_statx_path function, and thus bypassing it for in-kernel calls
to vfs_getattr or vfs_getattr_nosec.
This breaks querying the block ѕize of the underlying device in the
loop driver and also is a pitfall for any other new kernel caller.
Move the call into the lowest level helper to ensure all callers get
the right results.
Fixes: 2d985f8c6b91 ("vfs: support STATX_DIOALIGN on block devices")
Fixes: f4774e92aab8 ("loop: take the file system minimum dio alignment into account")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250417064042.712140-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
GCC 15's new -Wunterminated-string-initialization notices that the
character lookup tables "fscache_cache_states" and "fscache_cookie_states"
(which are not used as a C-String) need to be marked as "nonstring":
fs/netfs/fscache_cache.c:375:67: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (6 chars into 5 available) [-Wunterminated-string-initialization]
375 | static const char fscache_cache_states[NR__FSCACHE_CACHE_STATE] = "-PAEW";
| ^~~~~~~
fs/netfs/fscache_cookie.c:32:69: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (11 chars into 10 available) [-Wunterminated-string-initialization]
32 | static const char fscache_cookie_states[FSCACHE_COOKIE_STATE__NR] = "-LCAIFUWRD";
| ^~~~~~~~~~~~
Annotate the arrays.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/20250416221654.work.028-kees@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|