|
This adds support for proactive reclaim in general on a NUMA
system. A per-node interface extends support beyond the memcg-specific
interface while respecting the current semantics of memory.reclaim: it
honors the aging LRU and does not support artificially triggering
eviction on nodes belonging to non-bottom tiers.
This patch allows userspace to do:
echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
One of the premises for this is to align semantically as closely as
possible with memory.reclaim. For a brief time memcg did support a
nodemask, until 55ab834a86a9 (Revert "mm: add nodes= arg to
memory.reclaim"), because the semantics around reclaim (eviction) vs
demotion were not clear, breaking charging expectations.
With this approach:
1. Users who do not use memcg can benefit from proactive reclaim. The
memcg interface is not NUMA aware and there are use cases that focus
on NUMA balancing rather than workload memory footprint.
2. Proactive reclaim on top tiers will trigger demotion, for which
memory is still byte-addressable. Reclaiming on the bottom nodes will
trigger eviction to swap (the traditional sense of reclaim). This
follows the semantics of what is today part of the aging process on
tiered memory, mirroring what every other form of reclaim does
(reactive and memcg proactive reclaim). Furthermore per-node proactive
reclaim is not as susceptible to the memcg charging problem mentioned
above.
3. Unlike the nodes= arg, this interface avoids confusing semantics,
such as what exactly the user wants when mixing top-tier and low-tier
nodes in the nodemask. Further, the per-node interface is less exposed
to "free up memory in my container" use cases, where eviction is
intended.
4. Users that *really* want to free up memory can use proactive
reclaim on nodes known to be on the bottom tiers to force eviction in
a natural way - higher access latencies are still better than swap.
If compelled, while no guarantees and perhaps not worth the effort,
users could also potentially follow a ladder-like approach to
eventually free up the memory. Alternatively, perhaps an 'evict'
option could be added to the parameters for both memory.reclaim and
per-node interfaces to force this action unconditionally.
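As an illustration only, a monitoring daemon could drive the same
interface from C (the node number and request string below are
assumptions, mirroring the echo above):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* node1 is assumed to be a bottom-tier node; adjust as needed */
	const char *path = "/sys/devices/system/node/node1/reclaim";
	const char *req = "512M swappiness=10";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* a failed write can mean reclaim is already running (-EBUSY) */
	if (write(fd, req, strlen(req)) < 0)
		perror("write");
	close(fd);
	return 0;
}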
[akpm@linux-foundation.org: user_proactive_reclaim(): return -EBUSY on PGDAT_RECLAIM_LOCKED contention, per Roman]
[dave@stgolabs.net: memcg && node is also a bogus case, per Shakeel]
Link: https://lkml.kernel.org/r/20250717235604.2atyx2aobwowpge3@offworld
Link: https://lkml.kernel.org/r/20250623185851.830632-5-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There are no callers of unmap_and_put_page() left. Remove it.
Link: https://lkml.kernel.org/r/20250709194017.927978-6-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jordan Rome <linux@jordanrome.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a new field to 'struct damos', namely migrate_dests, to allow DAMON
API callers to specify multiple migration destination nodes and their
weights. Also update 'struct damos' creation and destruction functions
accordingly to initialize the new field and free up the API
caller-allocated buffers on those, respectively.
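For reference, the container plausibly has this shape (a sketch;
member names follow this series' description and should be treated as
indicative):

/* destinations and weights for a DAMOS migration action */
struct damos_migrate_dests {
	unsigned int *node_id_arr;	/* destination node ids */
	unsigned int *weight_arr;	/* weight of each destination */
	size_t nr_dests;		/* length of both arrays */
};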
Link: https://lkml.kernel.org/r/20250709005952.17776-3-bijan311@gmail.com
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold}
actions", v4.
A recent patchset automatically sets the interleave weight for each node
according to the node's maximum bandwidth [1]. In another thread, the
patch set's author, Joshua Hahn, wondered if/how these weights should be
changed if the bandwidth utilization of the system changes [2].
This patch set adds the mechanism for dynamically changing how application
data is interleaved across nodes while leaving the policy of what the
interleave weights should be to userspace. It does this by having the
migrate_{hot,cold} operating schemes interleave application data according
to the list of migration nodes and weights passed in via the DAMON sysfs
interface. This functionality can be used to dynamically adjust how
folios are interleaved by having a userspace process adjust those weights.
If no specific destination nodes or weights are provided, the
migrate_{hot,cold} actions will only migrate folios to damos->target_nid
as before.
The algorithm used to interleave the folios is similar to the one used for
the weighted interleave mempolicy [3]. It uses the offset from which a
folio is mapped into a VMA to determine the node the folio should be
placed in. This method is convenient because for a given set of
interleave weights, a folio has only one valid node it can be placed in,
limiting the amount of unnecessary data movement. However, finding out how
a folio is mapped inside of a VMA requires a costly rmap walk when using a
paddr scheme. As such, we have decided that this functionality makes more
sense as a vaddr scheme [4]. To this end, this patch set also adds vaddr
versions of the migrate_{hot,cold} actions.
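As a sketch of that computation (illustrative names; the weighted
interleave mempolicy code is the authoritative reference):

/*
 * Map a folio's page offset within its VMA to a destination node.
 * For a fixed set of weights, each offset has exactly one valid node,
 * limiting unnecessary data movement.
 */
static int damos_interleave_nid(unsigned long pgoff,
				const unsigned int *nids,
				const unsigned int *weights,
				unsigned int nr_dests)
{
	unsigned int i, total = 0;
	unsigned long pos;

	for (i = 0; i < nr_dests; i++)
		total += weights[i];

	pos = pgoff % total;	/* deterministic slot for this offset */
	for (i = 0; i < nr_dests; i++) {
		if (pos < weights[i])
			return nids[i];
		pos -= weights[i];
	}
	return nids[0];	/* unreachable when weights sum to total */
}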
Motivation
==========
There have been prior discussions about how changing the interleave
weights in response to the system's bandwidth utilization can be
beneficial [2]. However, the interleave weights are currently only
applied when data is allocated. Migrating already allocated pages
according to the dynamically changing weights will better help balance the
bandwidth utilization across nodes.
As a toy example, imagine some application that uses 75% of the local
bandwidth. Assuming sufficient capacity, when running alone, we want to
keep that application's data in local memory. However, if a second
instance of that application begins, using the same amount of bandwidth,
it would be best to interleave the data of both processes to alleviate the
bandwidth pressure from the local node. Likewise, when one of the
processes ends, the data should be moved back to local memory.
We imagine there would be a userspace application that would monitor
system performance characteristics, such as bandwidth utilization or
memory access latency, and use that information to tune the interleave
weights. Others seem to have come to a similar conclusion in previous
discussions [5]. We are currently working on a userspace program that
does this, but it is not quite ready to be published yet.
After the userspace application tunes the interleave weights, there must
be some mechanism that actually migrates pages to be consistent with those
weights. This patchset is what provides this mechanism.
We believe DAMON is the correct venue for the interleaving mechanism for a
few reasons. First, we noticed that we don't have to migrate all of the
application's pages to improve performance; we just need to migrate the
frequently accessed pages. DAMON's existing hotness tracking is very
useful for this. Second, DAMON's quota system can be used to ensure we
are not using too much bandwidth for migrations. Finally, as Ying pointed
out [6], a complete solution must also handle when a memory node is at
capacity. The existing migrate_cold action can be used in conjunction
with the functionality added in this patch set to provide that complete
solution.
Functionality Test
==================
Below is an example of this new functionality in use to confirm that these
patches behave as intended.
In this example, the user starts an application, alloc_data, which
allocates 1GB using the default memory policy (i.e. allocate to local
memory) then sleeps. Afterwards, we start DAMON to interleave the data at
a 1:1 ratio. Using numastat, we show that DAMON has migrated the
application's data to match the new interleave ratio.
For this example, I modified the userspace damo tool [8] to write to the
migration_dest sysfs files. I plan to upstream these changes when these
patches are merged.
$ # Allocate the data initially
$ ./alloc_data 1G &
[1] 6587
$ numastat -c -p alloc_data
Per-node process memory usage (in MBs) for PID 6587 (alloc_data)
         Node 0 Node 1 Total
         ------ ------ -----
Huge          0      0     0
Heap          0      0     0
Stack         0      0     0
Private    1027      0  1027
-------  ------ ------ -----
Total      1027      0  1027
$ # Start DAMON to interleave data at a 1:1 ratio
$ cat ./interleave_vaddr.yaml
kdamonds:
- contexts:
  - ops: vaddr
    addr_unit: null
    targets:
    - pid: 6587
      regions: []
    intervals:
      sample_us: 500 ms
      aggr_us: 5 s
      ops_update_us: 20 s
      intervals_goal:
        access_bp: 0 %
        aggrs: '0'
        min_sample_us: 0 ns
        max_sample_us: 0 ns
    nr_regions:
      min: '20'
      max: '50'
    schemes:
    - action: migrate_hot
      dests:
      - nid: 0
        weight: 1
      - nid: 1
        weight: 1
      access_pattern:
        sz_bytes:
          min: 0 B
          max: max
        nr_accesses:
          min: 0 %
          max: 100 %
        age:
          min: 0 ns
          max: max
$ sudo ./damo/damo interleave_vaddr.yaml
$ # Verify that DAMON has migrated data to match the 1:1 ratio
$ numastat -c -p alloc_data
Per-node process memory usage (in MBs) for PID 6587 (alloc_data)
         Node 0 Node 1 Total
         ------ ------ -----
Huge          0      0     0
Heap          0      0     0
Stack         0      0     0
Private     514    514  1027
-------  ------ ------ -----
Total       514    514  1027
Performance Test
================
Below is a simple example showing that interleaving application data using
these patches can improve application performance. To do this, we run a
bandwidth intensive embedding reduction application [7]. This workload is
useful for this test because it reports the time it takes each iteration
to run and each iteration reuses the same allocation, allowing us to see
the benefits of the migration.
We evaluate this on a 128 core/256 thread AMD CPU with 72 GB/s of local
DDR bandwidth and 26 GB/s of CXL bandwidth.
Before we start the workload, the system bandwidth utilization is low, so
we start with the interleave weights of 1:0, i.e. allocating all data to
local memory. When the workload begins, it saturates the local bandwidth,
making the page placement suboptimal. To alleviate this, we modify the
interleave weights, triggering DAMON to migrate the workload's data.
We use the same interleave_vaddr.yaml file to setup DAMON, except we
configure it to begin with a 1:0 interleave ratio, and attach it to the
shell and its children processes.
$ sudo ./damo/damo start interleave_vaddr.yaml --include_child_tasks &
$ <path>/eval_baseline -d amazon_All -c 255 -r 100
<clip startup output>
Eval Phase 3: Running Baseline...
REPEAT # 0 Baseline Total time : 7323.54 ms
REPEAT # 1 Baseline Total time : 7624.56 ms
REPEAT # 2 Baseline Total time : 7619.61 ms
REPEAT # 3 Baseline Total time : 7617.12 ms
REPEAT # 4 Baseline Total time : 7638.64 ms
REPEAT # 5 Baseline Total time : 7611.27 ms
REPEAT # 6 Baseline Total time : 7629.32 ms
REPEAT # 7 Baseline Total time : 7695.63 ms
# Interleave weights set to 3:1
REPEAT # 8 Baseline Total time : 7077.5 ms
REPEAT # 9 Baseline Total time : 5633.23 ms
REPEAT # 10 Baseline Total time : 5644.6 ms
REPEAT # 11 Baseline Total time : 5627.66 ms
REPEAT # 12 Baseline Total time : 5629.76 ms
REPEAT # 13 Baseline Total time : 5633.05 ms
REPEAT # 14 Baseline Total time : 5641.24 ms
REPEAT # 15 Baseline Total time : 5631.18 ms
REPEAT # 16 Baseline Total time : 5631.33 ms
Updating the interleave weights and having DAMON migrate the workload data
according to the weights resulted in an approximately 25% speedup.
Patches Sequence
================
Patches 1-7 extend the DAMON API to specify multiple destination nodes and
weights for the migrate_{hot,cold} actions. These patches are from SJ's
RFC [8].
Patches 8-10 add a vaddr implementation of the migrate_{hot,cold} schemes.
Patch 11 modifies the vaddr migrate_{hot,cold} schemes to interleave data
according to the weights provided by damos->migrate_dests.
Patches 12-13 allow the vaddr migrate_{hot,cold} implementation to filter
out folios like the paddr version.
This patch (of 13):
Introduce a new struct, namely damos_migrate_dests, for specifying
multiple DAMOS' migration destination nodes and their weights.
Link: https://lkml.kernel.org/r/20250709005952.17776-1-bijan311@gmail.com
Link: https://lkml.kernel.org/r/20250709005952.17776-2-bijan311@gmail.com
Link: https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/ [2]
Link: https://elixir.bootlin.com/linux/v6.15.4/source/mm/mempolicy.c#L2015 [3]
Link: https://lore.kernel.org/damon/20250624223310.55786-1-sj@kernel.org/ [4]
Link: https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/ [5]
Link: https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/ [6]
Link: https://github.com/SNU-ARC/MERCI [7]
Link: https://lore.kernel.org/damon/20250702051558.54138-1-sj@kernel.org/ [8]
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The only user of the counter (FUSE) was removed in commit 0c58a97f919c
("fuse: remove tmp folio for writebacks and internal rb tree") so follow
the established pattern of removing the counter and hardcoding 0 in
meminfo output, as done recently with NR_BOUNCE. Update documentation for
procfs, including for the value for Bounce that was missed when removing
its counter.
Also remove the mention of NR_WRITEBACK_TEMP implications from a comment
in wb_position_ratio(). The rest of the comment there about fuse setting
bdi->max_ratio to 1% is still correct.
[vbabka@suse.cz: v2]
Link: https://lkml.kernel.org/r/5a848e15-6a57-4ecb-a015-d4f358b8a5d3@suse.cz
Link: https://lkml.kernel.org/r/20250625-nr_writeback_removal-v1-1-7f2a0df70faa@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanne Koong <joannelkoong@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The vmstat_text array contains labels for counters displayed in
/proc/vmstat. It is important to keep the labels in sync with the
counters.
There is a BUILD_BUG_ON() check in vmstat_start() that ensures the size of
the vmstat_text array is not smaller than NR_VMSTAT_ITEMS. This helps to
catch cases where a new counter is added but the label is not. However,
it does not help if a counter is removed but the label remains.
It would be nice to make the BUILD_BUG_ON() check more strict to catch
such cases. However, when compiling with MEMCG enabled but
VM_EVENT_COUNTERS disabled, the vmstat_text array is larger than
NR_VMSTAT_ITEMS.
This issue arises because some elements of the vmstat_text array are
present when either MEMCG or VM_EVENT_COUNTERS is enabled, but
NR_VMSTAT_ITEMS only accounts for these elements if VM_EVENT_COUNTERS is
enabled.
Instead of adjusting the NR_VMSTAT_ITEMS definition to account for MEMCG,
make MEMCG select VM_EVENT_COUNTERS. VM_EVENT_COUNTERS is enabled in most
configurations anyway.
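The stricter check this enables might then look like (sketch only):

/* every label must now correspond to exactly one counter */
BUILD_BUG_ON(ARRAY_SIZE(vmstat_text) != NR_VMSTAT_ITEMS);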
Link: https://lkml.kernel.org/r/20250604095111.533783-1-kirill.shutemov@linux.intel.com
Fixes: ebc5d83d0443 ("mm/memcontrol: use vmstat names for printing statistics")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON core implements a static function to see if a given DAMON context is
running. DAMON sysfs interface is implementing the same one on its own.
Make the core function non-static and reuse it from the DAMON sysfs
interface.
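The core helper plausibly reduces to the following (a sketch of the
now-shared function):

/* check whether the context's kdamond worker thread is running */
bool damon_is_running(struct damon_ctx *ctx)
{
	bool running;

	mutex_lock(&ctx->kdamond_lock);
	running = ctx->kdamond != NULL;
	mutex_unlock(&ctx->kdamond_lock);
	return running;
}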
Link: https://lkml.kernel.org/r/20250705175000.56259-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix a memory leak in fcntl_dirnotify()
- Raise SB_I_NOEXEC on the secretmem superblock instead of messing with
flags on the mount
- Add fsdevel and block mailing lists to uio entry. We had a few
instances where very questionable stuff was added without either block
or the VFS being aware of it
- Fix netfs copy-to-cache so that it performs collection with
ceph+fscache
- Fix netfs race between cache write completion and ALL_QUEUED being
set
- Verify the inode mode when loading entries from disk in isofs
- Avoid state_lock in iomap_set_range_uptodate()
- Fix PIDFD_INFO_COREDUMP check in PIDFD_GET_INFO ioctl
- Fix the incorrect return value in __cachefiles_write()
* tag 'vfs-6.16-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
MAINTAINERS: add block and fsdevel lists to iov_iter
netfs: Fix race between cache write completion and ALL_QUEUED being set
netfs: Fix copy-to-cache so that it performs collection with ceph+fscache
fix a leak in fcntl_dirnotify()
iomap: avoid unnecessary ifs_set_range_uptodate() with locks
isofs: Verify inode mode when loading from disk
cachefiles: Fix the incorrect return value in __cachefiles_write()
secretmem: use SB_I_NOEXEC
coredump: fix PIDFD_INFO_COREDUMP ioctl check
|
|
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio into char-misc-next
Jonathan writes:
IIO: New device support, features, late breaking fixes and cleanup for 6.17
The normal mixed bag. A few more fixes than usual as I failed to send
them out earlier.
New device support
==================
adi,ad4080
- New driver for this high speed ADC. Includes extensions to iio-backends
necessary to support filter config, variable data lanes and data
alignment control.
adi,ad4170-4
- New driver for this 24-bit very feature rich ADC suited for weigh scale
and thermocouple applications.
adi,ad7405
- New driver for this single channel isolated ADC with backend support
(adi-axi-adc)
google,cros_ec_activity
- Add activity detection to the existing set of cros_ec drivers covering
both human body and significant motion detection.
mediatek,mt6359
- Add support for MT6363 and MT6373 PMIC Auxiliary ADCs.
nicera,d3-323-aa
- New driver for this configurable Passive InfraRed sensor.
Device ID only
==============
mediatek,mt7981-auxadc
- Add ID to mt2701 driver as fully compatible with mt7986-auxadc.
rohm,bu79100g
- Add ID to ad7476 driver as fully compatible with TI ADS7866.
Features
========
Core
- New in_voltageY_convdelay to allow for devices to control timing
offsets between sampling different channels.
adi,ad-sigma-delta-library
- Support SPI offload (later fix for missing Kconfig dependency)
adi,ad4851
- SPI 3-wire support.
adi,ad7606
- Power supply control.
- convdelay and calibbias support for calibration purposes.
- gain calibration support based on external filter resistance provided
from device tree.
adi,ad7768-1
- Add output regulator for VCM output, typically used for preconditioning
circuits.
- Add gpio controller for the 4 GPIOs.
- Multiple scan type support to enable 16-bit modes.
- Support synchronization over SPI.
- Filter type and oversampling ratio control.
- Low pass filter cut off read only attribute.
adi,adxl313
- FIFO support
- DC activity, inactivity detection with power-save on inactivity
- AC coupled activity detection
- Documentation for this complex driver.
- debugfs register access.
adi,adxl345
- Sampling frequency and sensor range controls.
bosch,bmi270
- Add step counter support.
invensense,icm42600
- Wake on motion support.
Cleanup and fixes
=================
backend
- Drop unused parameter from iio_backend_oversampling_ratio_set()
docs
- Fix ABI docs around I and Q modifiers.
treewide
- Switch remaining drivers to use maple tree regcache.
- Drop use of DRIVER_NAME style definitions when only used in one
place.
- Drop unused export.h includes.
- Use = { } in place of memset in various drivers.
- Constify various info structures and related.
- Switch some drivers from array of chip_info structures to individual
named structures.
adi,ad-sigma_delta library
- Fix over allocation of scan buffer. (bits/bytes confusion)
- Sort includes and apply iwyu principles to ensure sensible set.
- Use u8 instead of uint8_t
- Replace hard coded type sizes with sizeof() and BITS_TO_BYTES() as
appropriate.
- Factor out setting of read address to reduce duplication.
- Switch to buffer predisable so error handling on buffer enable
functions correctly (balanced against postenable).
adi,ad4000
- Don't use sift_right() on an unsigned value.
adi,ad7173
- Add missing check on spi_setup() succeeding.
- Simplify clock enable disable code using devm_clk_get_enabled()
- Fix channel index for syscalib_mode
- Fix number of configuration slots for some devices.
- Fix the channel used for calibration.
- Fix setting ODR up in probe.
adi,ad7380
- Drop unused oversampling_ratio getter function call as value never
used.
adi,ad7606
- Exit if invalid dt_schema encountered rather than carrying on with
unknown config.
adi,ad7768-1
- Ensure SYNC_IN pulse is long enough.
- Switch sampling_frequency_available to read_avail() callback.
adi,ada4250
- Ensure a dma-safe buffer for regmap_bulk_read()
- Use a local dev variable to simplify code
- Relax chip ID matching to allow for fallback dt compatibles.
- Make use of devm_regulator_get_enabled_read_voltage() to replace
equivalent code.
- Shuffle elements around in struct to improve logical groupings and
reduce holes.
- Use dev_err_probe()
adi,adxl313
- Use regcache to reduce traffic.
- Factor out enabling of measurement.
adi,adxl345
- Drop irq from struct as only used locally in code
- Simplify measure enable function using regmap_update_bits()
- Replace some magic numbers by units.h defines
- Simplify interrupt mapping code
- Simplify FIFO read out.
adi,axi-dac
- Factor out code to check for bus free to reduce duplication.
avago,apds9306
- Use a helper to get register address in both get and set functions.
bosch,bmi160+bmi270
- Ensure triggers suspended and resumed correctly.
bosch,bno055
- Fix theoretical OOB access to hw_xlate array.
freescale,vf610
- Drop -ENOMEM error message as there are plenty of existing prints if
memory allocation fails.
- Use dev_err_probe() and devm_clk_get_enabled() to simplify probe().
kionix,kx022a
- Apply include what you use principles to includes.
invensense,itg3200
- Add missing dt-binding for this gyroscope.
invensense,icm42600
- Switch from int64_t and similar to s64 and other kernel types.
- Simplify arrangement of DMA safe buffers and potentially reduce
structure size a little.
invensense,mpu6050
- Reduce duplication in aux read/write code.
- Use sysfs_emit() to replace scnprintf()
murata,irsd200
- Drop duplicate printing of ret in dev_err_probe()
nxp,lpc3220-adc
- Add missing clocks property to dt-binding.
st,spear600
- Convert dt-binding that got left behind in staging to yaml in the main
tree.
st,stm32-adc
- Use dev_fwnode() rather than directly accessing the of_node.
vti,sca3000
- Use direct returns instead of gotos where simple.
Various other minor typo and white space fixes.
* tag 'iio-for-6.17a' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio: (201 commits)
iio: adc: ad_sigma_delta: Select IIO_BUFFER_DMAENGINE and SPI_OFFLOAD
iio: adc: ad7173: fix setting ODR in probe
iio: adc: ad7173: fix calibration channel
iio: adc: ad7173: fix num_slots
iio: adc: ad7173: fix channels index for syscalib_mode
iio: adc: ad_sigma_delta: change to buffer predisable
iio: ABI: fix correctness of I and Q modifiers
iio: Add driver for Nicera D3-323-AA PIR sensor
dt-bindings: iio: proximity: Add Nicera D3-323-AA PIR sensor
dt-bindings: vendor-prefixes: Add Nicera
iio: dac: vf610: Simplify with devm_clk_get_enabled()
iio: adc: vf610: Simplify with dev_err_probe
iio: adc: vf610: Drop -ENOMEM error message
iio: imu: bno055: make bno055_sysfs_attr const
iio: imu: bno055: fix OOB access of hw_xlate array
dt-bindings: iio: adc: Add support for MT7981
iio: accel: kionix-kx022a: Apply approximate iwyu principles to includes
iio: adc: ad4170-4: Add support for weigh scale, thermocouple, and RTD sens
iio: adc: ad4170-4: Add support for internal temperature sensor
iio: adc: ad4170-4: Add GPIO controller support
...
|
|
This adds the usual scoped_guard(srcu_fast, &my_srcu) and
guard(srcu_fast)(&my_srcu).
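Usage then follows the usual guard pattern, e.g.:

DEFINE_SRCU(my_srcu);

void reader_scoped(void)
{
	/* read-side critical section bounded by the braces */
	scoped_guard(srcu_fast, &my_srcu) {
		/* ... access SRCU-protected data ... */
	}
}

void reader_fn_scoped(void)
{
	/* read lock dropped automatically at end of scope */
	guard(srcu_fast)(&my_srcu);
	/* ... access SRCU-protected data ... */
}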
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
netif_close_many is used only by vlan/dsa and one mtk driver, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-8-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Note that one dev_set_threaded call still remains in mt76 for debugfs file.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
__netif_set_mtu is used only by bond, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
netif_pre_changeaddr_notify is used only by ipvlan/bond, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
netif_get_mac_address is used only by tun/tap, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets
dropped due to memory pressure. In production environments, we've observed
memory exhaustion reported by memory layer stack traces, but these drops
were not properly tracked in the SKB drop reason infrastructure.
While most network code paths now properly report pfmemalloc drops, some
protocol-specific socket implementations still use sk_filter() without
drop reason tracking:
- Bluetooth L2CAP sockets
- CAIF sockets
- IUCV sockets
- Netlink sockets
- SCTP sockets
- Unix domain sockets
These remaining cases represent less common paths and could be converted
in a follow-up patch if needed. The current implementation provides
significantly improved observability into memory pressure events in the
network stack, especially for key protocols like TCP and UDP, helping to
diagnose problems in production environments.
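The converted call sites follow roughly this pattern (a sketch;
my_proto_rcv() and the exact placement are illustrative, not taken
from the patch):

static void my_proto_rcv(struct sock *sk, struct sk_buff *skb)
{
	if (sk_filter(sk, skb)) {
		enum skb_drop_reason reason = SKB_DROP_REASON_SOCKET_FILTER;

		/* pfmemalloc skb on a socket not allowed to use reserves */
		if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
			reason = SKB_DROP_REASON_PFMEMALLOC;
		kfree_skb_reason(skb, reason);
		return;
	}
	/* ... deliver skb ... */
}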
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Update Common Event Record to CXL r3.2 definition.
Add additional validity check for event records.
Add memory sparing event record tracing.
|
|
Rename the shortterm-related identifiers to wait-related.
The usage of shortterm_users refcount is now beyond its name. It is
also used for references which live longer than an ioctl execution.
E.g. vdev holds idev's shortterm_users refcount on vdev allocation and
releases it during idev's pre_destroy(). Rename the refcount to
wait_cnt, since it is always used to sync the referencing & the
destruction of the object by waiting for it to go to zero.
List all changed identifiers:
iommufd_object::shortterm_users -> iommufd_object::wait_cnt
REMOVE_WAIT_SHORTTERM -> REMOVE_WAIT
iommufd_object_dec_wait_shortterm() -> iommufd_object_dec_wait()
zerod_shortterm -> zerod_wait_cnt
No functional change intended.
Link: https://patch.msgid.link/r/20250716070349.1807226-9-yilun.xu@linux.intel.com
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Remove struct device *dev from struct vdevice.
The dev pointer is the Plan B for vdevice to reference the physical
device. Now that vdev->idev is added without refcounting concerns, just
use vdev->idev->dev when needed. To avoid exposing
struct iommufd_device in the public header, export a
iommufd_vdevice_to_device() helper.
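The helper is presumably a thin accessor along these lines (sketch):

/* resolve the physical device without exposing struct iommufd_device */
struct device *iommufd_vdevice_to_device(struct iommufd_vdevice *vdev)
{
	return vdev->idev->dev;
}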
Link: https://patch.msgid.link/r/20250716070349.1807226-6-yilun.xu@linux.intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Destroy iommufd_vdevice (vdev) on iommufd_idevice (idev) destruction so
that vdev can't outlive idev.
idev represents the physical device bound to iommufd, while the vdev
represents the virtual instance of the physical device in the VM. The
lifecycle of the vdev should not be longer than idev. This doesn't
cause real problems in existing use cases because vdev doesn't impact
the physical device and only provides virtualization information. But
to extend vdev for Confidential Computing (CC), there is a need to do
secure configuration for the vdev, e.g. TSM Bind/Unbind. These
configurations should be rolled back on idev destroy, or the external
driver (VFIO) functionality may be impacted.
The idev is created by external driver so its destruction can't fail.
The idev implements pre_destroy() op to actively remove its associated
vdev before destroying itself. There are 3 cases on idev pre_destroy():
1. vdev is already destroyed by userspace. No extra handling needed.
2. vdev is still alive. Use iommufd_object_tombstone_user() to
destroy vdev and tombstone the vdev ID.
3. vdev is being destroyed by userspace. The vdev ID is already
freed, but vdev destroy handler is not completed. This requires
multi-threads syncing - vdev holds idev's short term users
reference until vdev destruction completes, idev leverages
existing wait_shortterm mechanism for syncing.
idev should also block any new reference to it after pre_destroy(),
or the following shortterm wait would time out. Introduce a 'destroying'
flag, set it to true on idev pre_destroy(). Any attempt to reference
idev should honor this flag under the protection of
idev->igroup->lock.
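A sketch of honoring the flag when taking a new reference (helper and
field names are illustrative):

static int iommufd_idev_try_get(struct iommufd_device *idev)
{
	int rc = 0;

	mutex_lock(&idev->igroup->lock);
	if (idev->destroying)
		rc = -ENOENT;	/* pre_destroy() already ran */
	else
		refcount_inc(&idev->obj.shortterm_users);
	mutex_unlock(&idev->igroup->lock);
	return rc;
}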
Link: https://patch.msgid.link/r/20250716070349.1807226-5-yilun.xu@linux.intel.com
Originally-by: Nicolin Chen <nicolinc@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Co-developed-by: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org>
Signed-off-by: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Cross-merge BPF and other fixes after downstream PR.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
There are no more users of struct io_uring_cmd_data and its op_data
field. Remove it to shave 8 bytes from struct io_async_cmd and eliminate
a store and load for every uring_cmd.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250708202212.2851548-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a flag IORING_URING_CMD_REISSUE that ->uring_cmd() implementations
can use to tell whether this is the first or subsequent issue of the
uring_cmd. This will allow ->uring_cmd() implementations to store
information in the io_uring_cmd's pdu across issues.
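An implementation might use it like so (sketch; the pdu layout is
hypothetical):

struct my_cmd_pdu {
	int state;	/* survives across issues of the same cmd */
};

static int my_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
	struct my_cmd_pdu *pdu = io_uring_cmd_to_pdu(cmd, struct my_cmd_pdu);

	if (!(cmd->flags & IORING_URING_CMD_REISSUE)) {
		/* first issue: initialize per-command state */
		pdu->state = 0;
	}
	/* subsequent issues see the state stored above */
	return -EIOCBQUEUED;
}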
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250708202212.2851548-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
Pull phy fixes from Vinod Koul:
"Core:
- use per-PHY lockdep keys, in order to fix a phy using internal phys
Drivers:
- tegra:
- fixes for unbalanced regulator
- decouple pad calibration fix
- disable periodic updates
- qualcomm:
- error code fix for driver probe"
* tag 'phy-fix-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
phy: qcom: fix error code in snps_eusb2_hsphy_probe()
phy: use per-PHY lockdep keys
phy: tegra: xusb: Fix unbalanced regulator disable in UTMI PHY mode
phy: tegra: xusb: Disable periodic tracking on Tegra234
phy: tegra: xusb: Decouple CYA_TRK_CODE_UPDATE_ON_IDLE from trk_hw_mode
|
|
CXL rev 3.2 section 8.2.10.2.1.4 Table 8-60 defines the Memory Sparing
Event Record.
Determine if the event read is a memory sparing record and, if so,
trace the record.
A memory device shall produce a memory sparing event record:
1. After completion of a PPR maintenance operation if the memory sparing
event record enable bit is set (Field: sPPR/hPPR Operation Mode in
Table 8-128/Table 8-131).
2. In response to a query request by the host (see section 8.2.10.7.1.4)
to determine the availability of sparing resources.
The device shall report the resource availability by producing the Memory
Sparing Event Record (see Table 8-60) in which the channel, rank, nibble
mask, bank group, bank, row, column, sub-channel fields are a copy of the
values specified in the request. If the controller does not support
reporting whether a resource is available, and a perform maintenance
operation for memory sparing is issued with query resources set to 1, the
controller shall return invalid input.
Example trace log for producing a memory sparing event record on
completion of a soft PPR operation:
cxl_memory_sparing: memdev=mem1 host=0000:0f:00.0 serial=3
log=Informational : time=55045163029
uuid=e71f3a40-2d29-4092-8a39-4d1c966c7c65 len=128 flags='0x1' handle=1
related_handle=0 maint_op_class=2 maint_op_sub_class=1
ld_id=0 head_id=0 : flags='' result=0
validity_flags='CHANNEL|RANK|NIBBLE|BANK GROUP|BANK|ROW|COLUMN'
spare resource avail=1 channel=2 rank=5 nibble_mask=a59c bank_group=2
bank=4 row=13 column=23 sub_channel=0
comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
comp_id_pldm_valid_flags='' pldm_entity_id=0x00 pldm_resource_id=0x00
Note: For memory sparing event record, fields 'maintenance operation
class' and 'maintenance operation subclass' are defined twice, first
in the common event record (Table 8-55) and second in the memory
sparing event record (Table 8-60). Thus those in the sparing event
record are coded as reserved, to be removed when the spec is updated.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Link: https://patch.msgid.link/20250717101817.2104-5-shiju.jose@huawei.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
CXL spec 3.2 section 8.2.10.2.1 Table 8-55, Common Event Record format
defined new fields LD-ID and Head ID.
LD-ID: ID of logical device from where the event originated, which is
valid only if LD-ID valid flag is set to 1.
As CXL spec 3.2 Section 2.4 describes, a Type 3 Multi-Logical Device (MLD)
can partition its resources into up to 16 isolated Logical Devices.
Each Logical Device is identified by a Logical Device Identifier (LD-ID)
in CXL.mem and CXL.io protocols. LD-ID is a 16-bit Logical Device
identifier applicable for CXL.io and CXL.mem requests and responses.
CXL.mem supports only the lower 4 bits of LD-ID and therefore can support
up to 16 unique LD-ID values over the link. Requests and responses
forwarded over an MLD Port are tagged with LD-ID.
Head ID: ID of the device head, from where the event originated, which is
valid only if head valid flag is set to 1.
Add updates for the above spec changes in the CXL events record and CXL
common trace event implementation.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Link: https://patch.msgid.link/20250717101817.2104-2-shiju.jose@huawei.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
Introduce the ability to parse the short beacon data and long
beacon period. The long beacon period represents the number of beacon
intervals between each long beacon transmission. Additionally,
as a BSS cannot change its configuration such that short beaconing
is dynamically disabled/enabled without tearing down the interface,
we ensure we have an existing short beacon before performing
the update.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-3-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
S1G short beacons are an optional frame type used in an S1G BSS
that contain a limited set of elements. While they are optional,
they are a fundamental part of S1G that enables significant
power saving.
Expose 2 additional netlink attributes,
NL80211_ATTR_S1G_LONG_BEACON_PERIOD which denotes the number of beacon
intervals between each long beacon and NL80211_ATTR_S1G_SHORT_BEACON
which is a nested attribute containing the short beacon tail and
head. We split them as the long beacon period cannot be updated,
and is only used when initialising the interface, whereas the short
beacon data can be used to both initialise and update the templates.
This follows how things such as the beacon interval and DTIM period
currently operate.
During the initialisation path, we ensure we have the long beacon
period if the short beacon data is being passed down, whereas
the update path will simply update the template if it's sent down.
The short beacon data is validated using the same routines for regular
beacons as they support correctly parsing the short beacon format
while ensuring the frame is well-formed.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-2-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add the SDIO ID and firmware matching for the 43751 device.
Based on the previous work from Marc Gonzalez <mgonzalez@freebox.fr>.
Tested on an i.MX6DL board connected to an AP6398SV chip with the
brcmfmac43752-sdio.bin firmware taken from:
https://source.puri.sm/Librem5/firmware-brcm43752-nonfree
Signed-off-by: Fabio Estevam <festevam@gmail.com>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://patch.msgid.link/20250712215307.1310802-1-festevam@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Expose the auxiliary clocks through the vDSO.
Architectures not using the generic vDSO time framework,
namely SPARC64, are not supported.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-12-df7d9f87b9b8@linutronix.de
|
|
Expose the auxiliary clock data so it can be read from the vDSO.
Architectures not using the generic vDSO time framework,
namely SPARC64, are not supported.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-11-df7d9f87b9b8@linutronix.de
|
|
Move the constant resolution to a shared header,
so the vDSO can use it and return it without going through a syscall.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-10-df7d9f87b9b8@linutronix.de
|
|
The {prepare,unprepare}_crypt_hardware callbacks were added back in 2016
by commit 735d37b5424b ("crypto: engine - Introduce the block request
crypto engine framework"), but they were never implemented by any driver.
Remove them as they are unused.
Since the 'engine->idling' and 'was_busy' flags are no longer needed,
remove them as well.
Signed-off-by: Ovidiu Panait <ovidiu.panait.oss@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Remove request batching support from crypto_engine, as there are no
drivers using this feature and it doesn't really work that well.
Instead of doing batching based on backlog, a more optimal approach
would be for the user to handle the batching (similar to how IPsec
can hook into GSO to get 64K of data each time or how block encryption
can use unit sizes much greater than 4K).
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ovidiu Panait <ovidiu.panait.oss@gmail.com>
Reviewed-by: Horia Geantă <horia.geanta@nxp.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
To avoid a crash when control flow integrity is enabled, make the
workspace ("stream") free function use a consistent type, and call it
through a function pointer that has that same type.
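The pattern, sketched generically (CFI verifies that an indirect call
target's prototype matches the pointer's type):

/* one type used for both the definition and the indirect call */
typedef void (*stream_free_fn)(void *stream);

static void my_stream_free(void *stream)
{
	kfree(stream);	/* matching prototype: no CFI trap */
}

static const stream_free_fn free_stream = my_stream_free;

static void put_stream(void *stream)
{
	free_stream(stream);	/* CFI-checked indirect call */
}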
Fixes: 42d9f6c77479 ("crypto: acomp - Move scomp stream allocation code into acomp")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into head
Local lock changes required by net/crypto
|
|
Add internal helper backing_file_set_user_path() for the only
two cases that need to modify backing_file fields.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/20250607115304.2521155-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Stephen reports:
Documentation/core-api/cleanup:7: include/linux/cleanup.h:73: ERROR: Unexpected indentation. [docutils]
Documentation/core-api/cleanup:7: include/linux/cleanup.h:74: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
This points out that the ACQUIRE() example in cleanup.h missed the "::"
suffix to mark the following text as a code-block.
Fixes: 857d18f23ab1 ("cleanup: Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: http://lore.kernel.org/20250717173354.34375751@canb.auug.org.au
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20250717163036.1275791-1-dan.j.williams@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
The two helpers str_has_prefix() and strstarts() are about the same,
with a slight difference in what they return. Group them together in
the header.
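Their return values differ as follows (do_mode() is a hypothetical
consumer):

static void parse_opt(const char *opt)
{
	size_t len;

	/* strstarts(): returns bool */
	if (strstarts(opt, "mode="))
		do_mode(opt + strlen("mode="));

	/* str_has_prefix(): returns the prefix length, 0 on mismatch */
	len = str_has_prefix(opt, "mode=");
	if (len)
		do_mode(opt + len);	/* no hardcoded offset needed */
}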
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20250711085514.1294428-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
neigh_add() updates the pneigh_entry found or created by pneigh_create().
This update is serialised by RTNL, but we will remove it.
Let's move the update part to pneigh_create() and make it return errno
instead of a pointer to pneigh_entry.
Now, the pneigh code is RTNL free.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tbl->phash_buckets[] is only modified in the slow path by pneigh_create()
and pneigh_delete() under the table lock.
Both of them are called under RTNL, so no extra lock is needed, but we
will remove RTNL from the paths.
pneigh_create() looks up a pneigh_entry, and this part can be lockless,
but it would complicate the logic like
1. lookup
2. allocate pneigh_entry with GFP_KERNEL
3. lookup again but under lock
4. if found, return it after freeing the allocated memory
5. else, return the new one
Instead, let's add a per-table mutex and run lookup and allocation
under it.
Note that the pneigh_entry update part in neigh_add() is still protected
by RTNL and will be moved to pneigh_create() in the next patch.
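A sketch of the serialized lookup-or-create (lock and allocation
helper names are illustrative):

static struct pneigh_entry *pneigh_create(struct neigh_table *tbl,
					  struct net *net, const void *pkey,
					  struct net_device *dev)
{
	struct pneigh_entry *n;

	mutex_lock(&tbl->phash_lock);
	n = pneigh_lookup(tbl, net, pkey, dev);
	if (!n)	/* safe to sleep here, so GFP_KERNEL is fine */
		n = __pneigh_alloc(tbl, net, pkey, dev);
	mutex_unlock(&tbl->phash_lock);
	return n;
}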
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
__pneigh_lookup() is the lockless version of pneigh_lookup(),
but its only caller pndisc_is_router() holds the table lock and
reads pneigh_entry.flags.
This is because accessing pneigh_entry after pneigh_lookup() was
illegal unless the caller holds RTNL or the table lock.
Now, pneigh_entry is guaranteed to be alive during the RCU critical
section.
Let's call pneigh_lookup() and use READ_ONCE() for n->flags in
pndisc_is_router() and remove __pneigh_lookup().
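The resulting reader would look roughly like (sketch):

static int pndisc_is_router(const void *pkey, struct net_device *dev)
{
	struct pneigh_entry *n;
	int ret = -1;

	rcu_read_lock();
	n = pneigh_lookup(&nd_tbl, dev_net(dev), pkey, dev);
	if (n)
		ret = !!(READ_ONCE(n->flags) & NTF_ROUTER);
	rcu_read_unlock();
	return ret;
}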
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will convert RTM_GETNEIGH to RCU.
neigh_get() looks up pneigh_entry by pneigh_lookup() and passes
it to pneigh_fill_info().
Then, we must ensure that the entry is alive till pneigh_fill_info()
completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not
guarantee that.
Also, we will convert all readers of tbl->phash_buckets[] to RCU.
Let's use call_rcu() to free pneigh_entry and update phash_buckets[]
and ->next by rcu_assign_pointer().
pneigh_ifdown_and_unlock() uses list_head to avoid overwriting
->next and moving RCU iterators to another list.
pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it
is not delayed to call_rcu(), where we cannot sleep. This is fine
because the mcast code works with RCU and ipv6_dev_mc_dec() frees
mcast objects after RCU grace period.
While at it, we change the return type of pneigh_ifdown_and_unlock()
to void.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The next patch will free pneigh_entry with call_rcu().
Then, we need to annotate neigh_table.phash_buckets[] and
pneigh_entry.next with __rcu.
To make the next patch cleaner, let's annotate the fields in advance.
Currently, all accesses to the fields are under the neigh table lock,
so rcu_dereference_protected() is used with 1 for now, but most of them
(except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be
replaced with rcu_dereference() and rcu_dereference_check().
Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a
local list, which is illegal because the RCU iterator could be moved
to another list. This part will be fixed in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.
When called with the last argument, creat, set to 0, pneigh_lookup()
literally looks up a proxy neighbour entry. This is the case for the
reader paths: the fast path and RTM_GETNEIGH.
pneigh_lookup(), however, creates a pneigh_entry when called with creat
set to 1 from RTM_NEWNEIGH and SIOCSARP, which require RTNL.
Let's split pneigh_lookup() into two functions.
We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add initial support for RSS_SET, for now only operations on
the indirection table are supported.
Unlike the ioctl, don't check if at least one parameter is
being changed. This is how other ethtool-nl ops behave,
so pick ethtool-nl consistency over copying the ioctl behavior.
There are two special cases here:
1) resetting the table to defaults;
2) support for tables of different size.
For (1) I use an empty Netlink attribute (array of size 0).
(2) may require some background. AFAICT a lot of modern devices
allow allocating RSS tables of different sizes. mlx5 can upsize
its tables, bnxt has some "table size calculation", and Intel
folks asked about RSS table sizing in context of resource allocation
in the past. The ethtool IOCTL API has a concept of table size,
but right now the user is expected to provide a table exactly
the size the device requests. Some drivers may change the table
size at runtime (in response to queue count changes) but the
user is not in control of this. What's not great is that all
RSS contexts share the same table size. For example a device
with 128 queues enabled, 16 RSS contexts 8 queues in each will
likely have 256 entry tables for each of the 16 contexts,
while 32 would be more than enough given each context only has
8 queues. To address this, the Netlink API should avoid enforcing
table size at the uAPI level, and should allow the user to express
the min table size they expect.
To fully solve (2) we will need more driver plumbing but
at the uAPI level this patch allows the user to specify
a table size smaller than what the device advertises. The device
table size must be a multiple of the user requested table size.
We then replicate the user-provided table to fill the full device
size table. This addresses the "allow the user to express the min
table size" objective, while not enforcing any fixed size.
From the Netlink perspective, .get_rxfh_indir_size() is now de facto
the "max" table size supported by the device.
We may choose to support table replication in ethtool, too,
when we actually plumb this thru the device APIs.
Initially I was considering moving full pattern generation
to the kernel (which queues to use, at which frequency and
what min sequence length). I don't think this complexity
would buy us much and most if not all devices have pow-2
table sizes, which simplifies the replication a lot.
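Replication itself is then straightforward (sketch; names are
illustrative):

/* dev_size must be a multiple of user_size, as validated earlier */
static void rss_replicate_indir(u32 *dev_table, unsigned int dev_size,
				const u32 *user_table,
				unsigned int user_size)
{
	unsigned int i;

	for (i = 0; i < dev_size; i++)
		dev_table[i] = user_table[i % user_size];
}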
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Several places in the kernel shift the class code to check whether a PCI
device is display class. Add pci_is_display() for those places to use.
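The helper presumably reduces to the familiar shift (sketch):

/* the base class lives in the top byte of the 24-bit class code */
static inline bool pci_is_display(struct pci_dev *pdev)
{
	return (pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY;
}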
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Daniel Dadap <ddadap@nvidia.com>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Link: https://patch.msgid.link/20250717173812.3633478-2-superm1@kernel.org
|
|
Add more detail to the kernel-doc function-header comments for
stop_machine(), stop_machine_cpuslocked(), and stop_core_cpuslocked().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|