summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2025-07-19mm/damon/core: introduce repeat mode damon_call()SeongJae Park
damon_call() can be useful for reading or writing DAMON internal data for one time. A common pattern of DAMON core usage from DAMON modules is doing such reads and writes repeatedly, for example, to periodically update the DAMOS stats. To do that with damon_call(), callers should call damon_call() repeatedly, with their own delay loop. Each caller doing that is repetitive. Introduce a repeat mode damon_call(). Callers can use the mode by setting a new field in damon_call_control. If the mode is turned on, damon_call() returns success immediately, and DAMON repeats invoking the callback function inside the kdamond main loop. Link: https://lkml.kernel.org/r/20250712195016.151108-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19mm/damon: accept parallel damon_call() requestsSeongJae Park
Patch series "mm/damon: remove damon_callback". damon_callback was the only way for communicating with DAMON for contexts running on its worker thread. The interface is flexible and simple. But as DAMON evolves with more features, damon_callback has become somewhat too old. With runtime parameters update, for example, its lack of synchronization support was found to be inconvenient. Arguably it is also not easy to use correctly since the callers should understand when each callback is called, and implication of the return values from the callbacks. To replace it, damon_call() and damos_walk() are introduced. And those replaced a few damon_callback use cases. Some use cases of damon_callback such as parallel or repetitive DAMON internal data reading and additional cleanups cannot simply be replaced by damon_call() and damos_walk(), though. To allow those replaceable, extend damon_call() for parallel and/or repeated callbacks and modify the core/ops layers for additional resources cleanup. With the updates, replace the remaining damon_callback usages and finally say goodbye to damon_callback. This patch (of 14): Calling damon_call() while it is serving for another parallel thread immediately fails with -EBUSY. The caller should call it again, later. Each caller implementing such retry logic would be redundant. Accept parallel damon_call() requests and do the wait instead of the caller. Link: https://lkml.kernel.org/r/20250712195016.151108-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250712195016.151108-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19mm: introduce per-node proactive reclaim interfaceDavidlohr Bueso
This adds support for allowing proactive reclaim in general on a NUMA system. A per-node interface extends support for beyond a memcg-specific interface, respecting the current semantics of memory.reclaim: respecting aging LRU and not supporting artificially triggering eviction on nodes belonging to non-bottom tiers. This patch allows userspace to do: echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim One of the premises for this is to semantically align as best as possible with memory.reclaim. During a brief time memcg did support nodemask until 55ab834a86a9 (Revert "mm: add nodes= arg to memory.reclaim"), for which semantics around reclaim (eviction) vs demotion were not clear, rendering charging expectations to be broken. With this approach: 1. Users who do not use memcg can benefit from proactive reclaim. The memcg interface is not NUMA aware and there are usecases that are focusing on NUMA balancing rather than workload memory footprint. 2. Proactive reclaim on top tiers will trigger demotion, for which memory is still byte-addressable. Reclaiming on the bottom nodes will trigger evicting to swap (the traditional sense of reclaim). This follows the semantics of what is today part of the aging process on tiered memory, mirroring what every other form of reclaim does (reactive and memcg proactive reclaim). Furthermore per-node proactive reclaim is not as susceptible to the memcg charging problem mentioned above. 3. Unlike the nodes= arg, this interface avoids confusing semantics, such as what exactly the user wants when mixing top-tier and low-tier nodes in the nodemask. Further per-node interface is less exposed to "free up memory in my container" usecases, where eviction is intended. 4. Users that *really* want to free up memory can use proactive reclaim on nodes knowingly to be on the bottom tiers to force eviction in a natural way - higher access latencies are still better than swap. If compelled, while no guarantees and perhaps not worth the effort, users could also also potentially follow a ladder-like approach to eventually free up the memory. Alternatively, perhaps an 'evict' option could be added to the parameters for both memory.reclaim and per-node interfaces to force this action unconditionally. [akpm@linux-foundation.org: user_proactive_reclaim(): return -EBUSY on PGDAT_RECLAIM_LOCKED contention, per Roman] [dave@stgolabs.net: memcg && node is also a bogus case, per Shakeel] Link: https://lkml.kernel.org/r/20250717235604.2atyx2aobwowpge3@offworld Link: https://lkml.kernel.org/r/20250623185851.830632-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19mm: remove unmap_and_put_page()Vishal Moola (Oracle)
There are no callers of unmap_and_put_page() left. Remove it. Link: https://lkml.kernel.org/r/20250709194017.927978-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jordan Rome <linux@jordanrome.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19mm/damon/core: add damos->migrate_dests fieldSeongJae Park
Add a new field to 'struct damos', namely migrate_dests, to allow DAMON API callers specify multiple migration destination nodes and their weights. Also update 'struct damos' creation and destruction functions accordingly to initialize the new field and free up the API caller-allocated buffers on those, respectively. Link: https://lkml.kernel.org/r/20250709005952.17776-3-bijan311@gmail.com Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19mm/damon: add struct damos_migrate_destsSeongJae Park
Patch series "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions", v4. A recent patchset automatically sets the interleave weight for each node according to the node's maximum bandwidth [1]. In another thread, the patch set's author, Joshua Hahn, wondered if/how thes weights should be changed if the bandwidth utilization of the system changes [2]. This patch set adds the mechanism for dynamically changing how application data is interleaved across nodes while leaving the policy of what the interleave weights should be to userspace. It does this by having the migrate_{hot,cold} operating schemes interleave application data according to the list of migration nodes and weights passed in via the DAMON sysfs interface. This functionality can be used to dynamically adjust how folios are interleaved by having a userspace process adjust those weights. If no specific destination nodes or weights are provided, the migrate_{hot,cold} actions will only migrate folios to damos->target_nid as before. The algorithm used to interleave the folios is similar to the one used for the weighted interleave mempolicy [3]. It uses the offset from which a folio is mapped into a VMA to determine the node the folio should be placed in. This method is convenient because for a given set of interleave weights, a folio has only one valid node it can be placed in, limitng the amount of unnecessary data movement. However, finding out how a folio is mapped inside of a VMA requires a costly rmap walk when using a paddr scheme. As such, we have decided that this functionality makes more sense as a vaddr scheme [4]. To this end, this patch set also adds vaddr versions of the migrate_{hot,cold}. Motivation ========== There have been prior discussions about how changing the interleave weights in response to the system's bandwidth utilization can be beneficial [2]. However, currently the interleave weights only are applied when data is allocated. Migrating already allocated pages according to the dynamically changing weights will better help balance the bandwidth utilization across nodes. As a toy example, imagine some application that uses 75% of the local bandwidth. Assuming sufficient capacity, when running alone, we want to keep that application's data in local memory. However, if a second instance of that application begins, using the same amount of bandwidth, it would be best to interleave the data of both processes to alleviate the bandwidth pressure from the local node. Likewise, when one of the processes ends, the data should be moves back to local memory. We imagine there would be a userspace application that would monitor system performance characteristics, such as bandwidth utilization or memory access latency, and uses that information to tune the interleave weights. Others seem to have come to a similar conclusion in previous discussions [5]. We are currently working on a userspace program that does this, but it is not quite ready to be published yet. After the userspace application tunes the interleave weights, there must be some mechanism that actually migrates pages to be consistent with those weights. This patchset is what provides this mechanism. We believe DAMON is the correct venue for the interleaving mechanism for a few reasons. First, we noticed that we don't have to migrate all of the application's pages to improve performance. we just need to migrate the frequently accessed pages. DAMON's existing hotness traching is very useful for this. Second, DAMON's quota system can be used to ensure we are not using too much bandwidth for migrations. Finally, as Ying pointed out [6], a complete solution must also handle when a memory node is at capacity. The existing migrate_cold action can be used in conjunction with the functionality added in this patch set to provide that complete solution. Functionality Test ================== Below is an example of this new functionality in use to confirm that these patches behave as intended. In this example, the user starts an application, alloc_data, which allocates 1GB using the default memory policy (i.e. allocate to local memory) then sleeps. Afterwards, we start DAMON to interleave the data at a 1:1 ratio. Using numastat, we show that DAMON has migrated the application's data to match the new interleave ratio. For this example, I modified the userspace damo tool [8] to write to the migration_dest sysfs files. I plan to upstream these changes when these patches are merged. $ # Allocate the data initially $ ./alloc_data 1G & [1] 6587 $ numastat -c -p alloc_data Per-node process memory usage (in MBs) for PID 6587 (alloc_data) Node 0 Node 1 Total ------ ------ ----- Huge 0 0 0 Heap 0 0 0 Stack 0 0 0 Private 1027 0 1027 ------- ------ ------ ----- Total 1027 0 1027 $ # Start DAMON to interleave data at a 1:1 ratio $ cat ./interleave_vaddr.yaml kdamonds: - contexts: - ops: vaddr addr_unit: null targets: - pid: 6587 regions: [] intervals: sample_us: 500 ms aggr_us: 5 s ops_update_us: 20 s intervals_goal: access_bp: 0 % aggrs: '0' min_sample_us: 0 ns max_sample_us: 0 ns nr_regions: min: '20' max: '50' schemes: - action: migrate_hot dests: - nid: 0 weight: 1 - nid: 1 weight: 1 access_pattern: sz_bytes: min: 0 B max: max nr_accesses: min: 0 % max: 100 % age: min: 0 ns max: max $ sudo ./damo/damo interleave_vaddr.yaml $ # Verify that DAMON has migrated data to match the 1:1 ratio $ numastat -c -p alloc_data Per-node process memory usage (in MBs) for PID 6587 (alloc_data) Node 0 Node 1 Total ------ ------ ----- Huge 0 0 0 Heap 0 0 0 Stack 0 0 0 Private 514 514 1027 ------- ------ ------ ----- Total 514 514 1027 Performance Test ================ Below is a simple example showing that interleaving application data using these patches can improve application performance. To do this, we run a bandwidth intensive embedding reduction application [7]. This workload is useful for this test because it reports the time it takes each iteration to run and each iteration reuses the same allocation, allowing us to see the benefits of the migration. We evaluate this on a 128 core/256 thread AMD CPU with 72GB/s of local DDR bandwidth and 26 GB/s of CXL bandwidth. Before we start the workload, the system bandwidth utilization is low, so we start with the interleave weights of 1:0, i.e. allocating all data to local memory. When the workload beings, it saturates the local bandwidth, making the page placement suboptimal. To alleviate this, we modify the interleave weights, triggering DAMON to migrate the workload's data. We use the same interleave_vaddr.yaml file to setup DAMON, except we configure it to begin with a 1:0 interleave ratio, and attach it to the shell and its children processes. $ sudo ./damo/damo start interleave_vaddr.yaml --include_child_tasks & $ <path>/eval_baseline -d amazon_All -c 255 -r 100 <clip startup output> Eval Phase 3: Running Baseline... REPEAT # 0 Baseline Total time : 7323.54 ms REPEAT # 1 Baseline Total time : 7624.56 ms REPEAT # 2 Baseline Total time : 7619.61 ms REPEAT # 3 Baseline Total time : 7617.12 ms REPEAT # 4 Baseline Total time : 7638.64 ms REPEAT # 5 Baseline Total time : 7611.27 ms REPEAT # 6 Baseline Total time : 7629.32 ms REPEAT # 7 Baseline Total time : 7695.63 ms # Interleave weights set to 3:1 REPEAT # 8 Baseline Total time : 7077.5 ms REPEAT # 9 Baseline Total time : 5633.23 ms REPEAT # 10 Baseline Total time : 5644.6 ms REPEAT # 11 Baseline Total time : 5627.66 ms REPEAT # 12 Baseline Total time : 5629.76 ms REPEAT # 13 Baseline Total time : 5633.05 ms REPEAT # 14 Baseline Total time : 5641.24 ms REPEAT # 15 Baseline Total time : 5631.18 ms REPEAT # 16 Baseline Total time : 5631.33 ms Updating the interleave weights and having DAMON migrate the workload data according to the weights resulted in an approximarely 25% speedup. Patches Sequence ================ Patches 1-7 extend the DAMON API to specify multiple destination nodes and weights for the migrate_{hot,cold} actions. These patches are from SJ'S RFC [8]. Patches 8-10 add a vaddr implementation of the migrate_{hot,cold} schemes. Patch 11 modifies the vaddr migrate_{hot,cold} schemes to interleave data according to the weights provided by damos->migrate_dest. Patches 12-13 allow the vaddr migrate_{hot,cold} implementation to filter out folios like the paddr version. This patch (of 13): Introduce a new struct, namely damos_migrate_dests, for specifying multiple DAMOS' migration destination nodes and their weights. Link: https://lkml.kernel.org/r/20250709005952.17776-1-bijan311@gmail.com Link: https://lkml.kernel.org/r/20250709005952.17776-2-bijan311@gmail.com Link: https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/ [1] Link: https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/ [2] Link: https://elixir.bootlin.com/linux/v6.15.4/source/mm/mempolicy.c#L2015 [3] Link: https://lore.kernel.org/damon/20250624223310.55786-1-sj@kernel.org/ [4] Link: https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/ [5] Link: https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/ [6] Link: https://github.com/SNU-ARC/MERCI [7] Link: https://lore.kernel.org/damon/20250702051558.54138-1-sj@kernel.org/ [8] Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19mm, vmstat: remove the NR_WRITEBACK_TEMP node_stat_item counterVlastimil Babka
The only user of the counter (FUSE) was removed in commit 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree") so follow the established pattern of removing the counter and hardcoding 0 in meminfo output, as done recently with NR_BOUNCE. Update documentation for procfs, including for the value for Bounce that was missed when removing its counter. Also remove the mention of NR_WRITEBACK_TEMP implications from a comment in wb_position_ratio(). The rest of the comment there about fuse setting bdi->max_ratio to 1% is still correct. [vbabka@suse.cz: v2] Link: https://lkml.kernel.org/r/5a848e15-6a57-4ecb-a015-d4f358b8a5d3@suse.cz Link: https://lkml.kernel.org/r/20250625-nr_writeback_removal-v1-1-7f2a0df70faa@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Brendan Jackman <jackmanb@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Maxim Patlasov <mpatlasov@parallels.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19mm/vmstat: make MEMCG select VM_EVENT_COUNTERSKirill A. Shutemov
The vmstat_text array contains labels for counters displayed in /proc/vmstat. It is important to keep the labels in sync with the counters. There is a BUILD_BUG_ON() check in vmstat_start() that ensures the size of the vmstat_text is not smaller than VM_EVENT_COUNTERS. This helps to catch cases where a new counter is added but the label is not. However, it does not help if a counter is removed but the label remains. It would be nice to make the BUILD_BUG_ON() check more strict to catch such cases. However, when compiling with MEMCG enabled but VM_EVENT_COUNTERS disabled, the vmstat_text array is larger than NR_VMSTAT_ITEMS. This issue arises because some elements of the vmstat_text array are present when either MEMCG or VM_EVENT_COUNTERS is enabled, but NR_VMSTAT_ITEMS only accounts for these elements if VM_EVENT_COUNTERS is enabled. Instead of adjusting the NR_VMSTAT_ITEMS definition to account for MEMCG, make MEMCG select VM_EVENT_COUNTERS. VM_EVENT_COUNTERS is enabled in most configurations anyway. Link: https://lkml.kernel.org/r/20250604095111.533783-1-kirill.shutemov@linux.intel.com Fixes: ebc5d83d0443 ("mm/memcontrol: use vmstat names for printing statistics") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19mm/damon/sysfs: use DAMON core API damon_is_running()SeongJae Park
DAMON core implements a static function to see if a given DAMON context is running. DAMON sysfs interface is implementing the same one on its own. Make the core function non-static and reuse it from the DAMON sysfs interface. Link: https://lkml.kernel.org/r/20250705175000.56259-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19Merge tag 'iio-for-6.17a' of ↵Greg Kroah-Hartman
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio into char-misc-next Jonathan writes: IIO: New device support, features, late breaking fixes and cleanup for 6.17 The normal mixed bag. A few more fixes than usual as I failed to send them out earlier. New device support ================== adi,ad4080 - New driver for this high speed ADC. Includes extensions to iio-backends necessary to support filter config, variable data lands and data alignment control. adi,ad4170-4 - New driver for this 24-bit very feature rich ADC suited for weigh scale and thermocouple applications. adi,ad7405 - New driver for this single channel isolated ADC with backend support (adi-axi-adc) google,cros_ec_activity - Add activity detection to the existing set of cros_ec drivers covering both human body and significant motion detection. mediatek,mt6359 - Add support for MT6363 and MT6373 PMIC Auxiliary ADCs. nicera,d3-323-aa - New driver for this configurable Passive InfraRed sensor. Device ID only ============== mediatek,mt7981-auxadc - Add ID to mt2701 driver as fully compatible with mt7986-auxadc. rohm,bu79100g - Add ID to ad7476 driver as fully compatible with TI ADS7866. Features ======== Core - New in_voltageY_convdelay to allow for devices to control timing offsets between sampling different channels. adi,ad-sigma-delta-library - Support SPI offload (later fix for missing Kconfig dependency) adi,ad4851 - SPI 3-wire support. adi,ad7606 - Power supply control. - convdelay and calibbias support for calibration purposes. - gain calibration support based on external filter resistance provided from device tree. adi,ad7768-1 - Add output regulator for VCM output, typically used for preconditioning circuits. - Add gpio controller for the 4 GPIOs. - Multiple scan type support to enable 16-bit modes. - Support synchronization over SPI. - Filter type and oversampling ratio control. - Low pass filter cut off read only attribute. adi,adxl313 - FIFO support - DC activity, inactivity detection with power-save on inactivity - AC coupled activity detection - Documentation for this complex driver. - debugfs register access. adi,adxl345 - Sampling frequency and sensor range controls. bosch,bmi270 - Add step counter support. invensense,icm42600 - Wake on motion support. Cleanup and fixes ================= backend - Drop unused parameter from iio_backend_ovesampling_ratio_set() docs - Fix ABI docs around I and Q modifiers. treewide - Switch remaining drives to use maple tree regcache. - Drop use of DRIVER_NAME style definitions when only used in one place. - Drop unused export.h includes. - Use = { } in place of memset in various drivers. - Constify various info structures and related. - Switch some drivers from array of chip_info structures to individual named structures. adi,ad-sigma_delta library - Fix over allocation of scan buffer. (bits/bytes confusion) - Sort includes and apply iwyu principles to ensure sensible set. - Use u8 instead of uint8_t - Replace hard coded type sizes with sizeof() and BITS_TO_BYTES() as appropriate. - Factor out setting of read address to reduce duplication. - Switch to buffer predisable so error handling on buffer enable functions correctly (balanced against postenable). adi,ad4000 - Don't use sift_right() on an unsigned value. adi,ad7173 - Add missing check on spi_setup() succeeding. - Simplify clock enable disable code using devm_clk_get_enabled() - Fix channel index for syscalib_mode - Fix number of configuration slots for some devics. - Fix the channel used for calibration. - Fix setting ODR up in probe. adi,ad7380 - Drop unused oversampling_ratio getter function call as value never used. adi,ad7606 - Exit if invalid dt_schema encountered rather than carrying on with unknown config. adi,ad7768-1 - Ensure SYNC_IN pulse is long enough. - Switch sampling_frequency_available to read_avail() callback. adi,ada4250 - Ensuring a dma-safe buffer for regmap_bulk_read() - Use a local dev variable to simplify code - Relax chip ID matching to allow for fallback dt compatibles. - Make use of devm_regulator_get_enabled_read_voltage() to replace equivalent code. - Shuffle elements around in struct to improve logical groupings and reduce holes. - Use dev_err_probe() adi,adxl313 - Use regcache to reduce traffic. - Factor out enabling of measurement. adi,adxl345 - Drop irq from struct as only used locally in code - Simplify measure enable function using regmap_update_bits() - Replace some magic numbers by units.h defines - Simplify interrupt mapping code - Simplify FIFO read out. adi,axi-dac - Factor out code to check for bus free to reduce duplication. avago,apds9306 - Use a helper to get register address in both get and set functions. bosch,bmi160+bmi270 - Ensure triggers suspended and resumed correctly. bosch,bmo055 - Fix theoretical OOB acces to hw_xlate array. freescale,vf610 - Drop -ENOMEM error message as plenty of existing prints if memory allocation fails. - Use dev_err_probe() and devm_clk_geT_enabled() to simplify probe(). kionix,kx022a - Apply include what you use principles to includes. invensense,itg3200 - Add missing dt-binding for this gyroscope. invensense,icm42600 - Switch from int64_t and similar to s64 and other kernel types. - Simplify arrangement of DMA safe buffers and potentially reduce structure size a little. invensense,mpu6050 - Reduce duplication in aux read/write code. - Use sysfs_emit() to replace scnprintf() murata,irsd200 - Drop duplicate printing of ret in dev_err_probe() nxp,lpc3220-adc - Add missing clocks property to dt-binding. st,spear600 - Convert dt-binding that got left behind in staging to yaml in the main tree. st,stm32-adc - Use dev_fwnode() rather than directly accessing the of_node. vti,sca3000 - Use direct returns instead of gotos where simple. Various other minor typo and white space fixes. * tag 'iio-for-6.17a' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio: (201 commits) iio: adc: ad_sigma_delta: Select IIO_BUFFER_DMAENGINE and SPI_OFFLOAD iio: adc: ad7173: fix setting ODR in probe iio: adc: ad7173: fix calibration channel iio: adc: ad7173: fix num_slots iio: adc: ad7173: fix channels index for syscalib_mode iio: adc: ad_sigma_delta: change to buffer predisable iio: ABI: fix correctness of I and Q modifiers iio: Add driver for Nicera D3-323-AA PIR sensor dt-bindings: iio: proximity: Add Nicera D3-323-AA PIR sensor dt-bindings: vendor-prefixes: Add Nicera iio: dac: vf610: Simplify with devm_clk_get_enabled() iio: adc: vf610: Simplify with dev_err_probe iio: adc: vf610: Drop -ENOMEM error message iio: imu: bno055: make bno055_sysfs_attr const iio: imu: bno055: fix OOB access of hw_xlate array dt-bindings: iio: adc: Add support for MT7981 iio: accel: kionix-kx022a: Apply approximate iwyu principles to includes iio: adc: ad4170-4: Add support for weigh scale, thermocouple, and RTD sens iio: adc: ad4170-4: Add support for internal temperature sensor iio: adc: ad4170-4: Add GPIO controller support ...
2025-07-19srcu: Add guards for SRCU-fast readersPaul E. McKenney
This adds the usual scoped_guard(srcu_fast, &my_srcu) and guard(srcu_fast)(&my_srcu). Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-18net: s/dev_close_many/netif_close_many/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. netif_close_many is used only by vlan/dsa and one mtk driver, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-8-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_set_threaded/netif_set_threaded/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Note that one dev_set_threaded call still remains in mt76 for debugfs file. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-7-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_get_flags/netif_get_flags/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/__dev_set_mtu/__netif_set_mtu/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. __netif_set_mtu is used only by bond, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-5-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_pre_changeaddr_notify/netif_pre_changeaddr_notify/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. netif_pre_changeaddr_notify is used only by ipvlan/bond, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-4-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_get_mac_address/netif_get_mac_address/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. netif_get_mac_address is used only by tun/tap, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_get_port_parent_id/netif_get_port_parent_id/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: track pfmemalloc drops via SKB_DROP_REASON_PFMEMALLOCJesper Dangaard Brouer
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets dropped due to memory pressure. In production environments, we've observed memory exhaustion reported by memory layer stack traces, but these drops were not properly tracked in the SKB drop reason infrastructure. While most network code paths now properly report pfmemalloc drops, some protocol-specific socket implementations still use sk_filter() without drop reason tracking: - Bluetooth L2CAP sockets - CAIF sockets - IUCV sockets - Netlink sockets - SCTP sockets - Unix domain sockets These remaining cases represent less common paths and could be converted in a follow-up patch if needed. The current implementation provides significantly improved observability into memory pressure events in the network stack, especially for key protocols like TCP and UDP, helping to diagnose problems in production environments. Reported-by: Matt Fleming <mfleming@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18iommufd: Rename some shortterm-related identifiersXu Yilun
Rename the shortterm-related identifiers to wait-related. The usage of shortterm_users refcount is now beyond its name. It is also used for references which live longer than an ioctl execution. E.g. vdev holds idev's shortterm_users refcount on vdev allocation, releases it during idev's pre_destroy(). Rename the refcount as wait_cnt, since it is always used to sync the referencing & the destruction of the object by waiting for it to go to zero. List all changed identifiers: iommufd_object::shortterm_users -> iommufd_object::wait_cnt REMOVE_WAIT_SHORTTERM -> REMOVE_WAIT iommufd_object_dec_wait_shortterm() -> iommufd_object_dec_wait() zerod_shortterm -> zerod_wait_cnt No functional change intended. Link: https://patch.msgid.link/r/20250716070349.1807226-9-yilun.xu@linux.intel.com Suggested-by: Kevin Tian <kevin.tian@intel.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-07-18iommufd/vdevice: Remove struct device reference from struct vdeviceXu Yilun
Remove struct device *dev from struct vdevice. The dev pointer is the Plan B for vdevice to reference the physical device. As now vdev->idev is added without refcounting concern, just use vdev->idev->dev when needed. To avoid exposing struct iommufd_device in the public header, export a iommufd_vdevice_to_device() helper. Link: https://patch.msgid.link/r/20250716070349.1807226-6-yilun.xu@linux.intel.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-07-18iommufd: Destroy vdevice on idevice destroyXu Yilun
Destroy iommufd_vdevice (vdev) on iommufd_idevice (idev) destruction so that vdev can't outlive idev. idev represents the physical device bound to iommufd, while the vdev represents the virtual instance of the physical device in the VM. The lifecycle of the vdev should not be longer than idev. This doesn't cause real problem on existing use cases cause vdev doesn't impact the physical device, only provides virtualization information. But to extend vdev for Confidential Computing (CC), there are needs to do secure configuration for the vdev, e.g. TSM Bind/Unbind. These configurations should be rolled back on idev destroy, or the external driver (VFIO) functionality may be impact. The idev is created by external driver so its destruction can't fail. The idev implements pre_destroy() op to actively remove its associated vdev before destroying itself. There are 3 cases on idev pre_destroy(): 1. vdev is already destroyed by userspace. No extra handling needed. 2. vdev is still alive. Use iommufd_object_tombstone_user() to destroy vdev and tombstone the vdev ID. 3. vdev is being destroyed by userspace. The vdev ID is already freed, but vdev destroy handler is not completed. This requires multi-threads syncing - vdev holds idev's short term users reference until vdev destruction completes, idev leverages existing wait_shortterm mechanism for syncing. idev should also block any new reference to it after pre_destroy(), or the following wait shortterm would timeout. Introduce a 'destroying' flag, set it to true on idev pre_destroy(). Any attempt to reference idev should honor this flag under the protection of idev->igroup->lock. Link: https://patch.msgid.link/r/20250716070349.1807226-5-yilun.xu@linux.intel.com Originally-by: Nicolin Chen <nicolinc@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Co-developed-by: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> Signed-off-by: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-07-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc6Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-18io_uring/cmd: remove struct io_uring_cmd_dataCaleb Sander Mateos
There are no more users of struct io_uring_cmd_data and its op_data field. Remove it to shave 8 bytes from struct io_async_cmd and eliminate a store and load for every uring_cmd. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-18io_uring/cmd: introduce IORING_URING_CMD_REISSUE flagCaleb Sander Mateos
Add a flag IORING_URING_CMD_REISSUE that ->uring_cmd() implementations can use to tell whether this is the first or subsequent issue of the uring_cmd. This will allow ->uring_cmd() implementations to store information in the io_uring_cmd's pdu across issues. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-18Merge tag 'phy-fix-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy Pull phy fixes from Vinod Koul: "Core: - use per-PHY lockdep keys, in order to fix a phy using internal phys Drivers: - tegra: - fixes for unbalanced regulator - decouple pad calibration fix - disable periodic updates - qualcomm: - error code fix for driver probe" * tag 'phy-fix-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: phy: qcom: fix error code in snps_eusb2_hsphy_probe() phy: use per-PHY lockdep keys phy: tegra: xusb: Fix unbalanced regulator disable in UTMI PHY mode phy: tegra: xusb: Disable periodic tracking on Tegra234 phy: tegra: xusb: Decouple CYA_TRK_CODE_UPDATE_ON_IDLE from trk_hw_mode
2025-07-18wifi: brcmfmac: Add support for the SDIO 43751 deviceFabio Estevam
Add the SDIO ID and firmware matching for the 43751 device. Based on the previous work from Marc Gonzalez <mgonzalez@freebox.fr>. Tested on an i.MX6DL board connected to an AP6398SV chip with the brcmfmac43752-sdio.bin firmware taken from: https://source.puri.sm/Librem5/firmware-brcm43752-nonfree Signed-off-by: Fabio Estevam <festevam@gmail.com> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>> Link: https://patch.msgid.link/20250712215307.1310802-1-festevam@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-18vdso/vsyscall: Update auxiliary clock data in the datapageThomas Weißschuh
Expose the auxiliary clock data so it can be read from the vDSO. Architectures not using the generic vDSO time framework, namely SPARC64, are not supported. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250701-vdso-auxclock-v1-11-df7d9f87b9b8@linutronix.de
2025-07-18Merge tag 'local-lock-for-net' of ↵Herbert Xu
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into head Local lock changes required by net/crypto
2025-07-18fs: constify file ptr in backing_file accessor helpersAmir Goldstein
Add internal helper backing_file_set_user_path() for the only two cases that need to modify backing_file fields. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250607115304.2521155-2-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17cleanup: Fix documentation build error for ACQUIRE updatesDan Williams
Stephen reports: Documentation/core-api/cleanup:7: include/linux/cleanup.h:73: ERROR: Unexpected indentation. [docutils] Documentation/core-api/cleanup:7: include/linux/cleanup.h:74: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils] Which points out that the ACQUIRE() example in cleanup.h missed the "::" suffix to mark the following text as a code-block. Fixes: 857d18f23ab1 ("cleanup: Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: http://lore.kernel.org/20250717173354.34375751@canb.auug.org.au Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20250717163036.1275791-1-dan.j.williams@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-07-17string: Group str_has_prefix() and strstarts()Andy Shevchenko
The two str_has_prefix() and strstarts() are about the same with a slight difference on what they return. Group them in the header. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250711085514.1294428-1-andriy.shevchenko@linux.intel.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-17PCI: Add pci_is_display() to check if device is a display controllerMario Limonciello
Several places in the kernel do class shifting to match whether a PCI device is display class. Add pci_is_display() for those places to use. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Daniel Dadap <ddadap@nvidia.com> Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch> Link: https://patch.msgid.link/20250717173812.3633478-2-superm1@kernel.org
2025-07-17stop_machine: Improve kernel-doc function-header commentsPaul E. McKenney
Add more detail to the kernel-doc function-header comments for stop_machine(), stop_machine_cpuslocked(), and stop_core_cpuslocked(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-07-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge back earlier material related to system sleepRafael J. Wysocki
2025-07-17cgroup: llist: avoid memory tears for llist_nodeShakeel Butt
Before the commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe"), the struct llist_node is expected to be private to the one inserting the node to the lockless list or the one removing the node from the lockless list. After the mentioned commit, the llist_node in the rstat code is per-cpu shared between the stacked contexts i.e. process, softirq, hardirq & nmi. It is possible the compiler may tear the loads or stores of llist_node. Let's avoid that. KCSAN reported the following race: Reported by Kernel Concurrency Sanitizer on: CPU: 60 UID: 0 PID: 5425 ... 6.16.0-rc3-next-20250626 #1 NONE Tainted: [E]=UNSIGNED_MODULE Hardware name: ... ================================================================== ================================================================== BUG: KCSAN: data-race in css_rstat_flush / css_rstat_updated write to 0xffffe8fffe1c85f0 of 8 bytes by task 1061 on cpu 1: css_rstat_flush+0x1b8/0xeb0 __mem_cgroup_flush_stats+0x184/0x190 flush_memcg_stats_dwork+0x22/0x50 process_one_work+0x335/0x630 worker_thread+0x5f1/0x8a0 kthread+0x197/0x340 ret_from_fork+0xd3/0x110 ret_from_fork_asm+0x11/0x20 read to 0xffffe8fffe1c85f0 of 8 bytes by task 3551 on cpu 15: css_rstat_updated+0x81/0x180 mod_memcg_lruvec_state+0x113/0x2d0 __mod_lruvec_state+0x3d/0x50 lru_add+0x21e/0x3f0 folio_batch_move_lru+0x80/0x1b0 __folio_batch_add_and_move+0xd7/0x160 folio_add_lru_vma+0x42/0x50 do_anonymous_page+0x892/0xe90 __handle_mm_fault+0xfaa/0x1520 handle_mm_fault+0xdc/0x350 do_user_addr_fault+0x1dc/0x650 exc_page_fault+0x5c/0x110 asm_exc_page_fault+0x22/0x30 value changed: 0xffffe8fffe18e0d0 -> 0xffffe8fffe1c85f0 $ ./scripts/faddr2line vmlinux css_rstat_flush+0x1b8/0xeb0 css_rstat_flush+0x1b8/0xeb0: init_llist_node at include/linux/llist.h:86 (inlined by) llist_del_first_init at include/linux/llist.h:308 (inlined by) css_process_update_tree at kernel/cgroup/rstat.c:148 (inlined by) css_rstat_updated_list at kernel/cgroup/rstat.c:258 (inlined by) css_rstat_flush at kernel/cgroup/rstat.c:389 $ ./scripts/faddr2line vmlinux css_rstat_updated+0x81/0x180 css_rstat_updated+0x81/0x180: css_rstat_updated at kernel/cgroup/rstat.c:90 (discriminator 1) These are expected race and a simple READ_ONCE/WRITE_ONCE resolves these reports. However let's add comments to explain the race and the need for memory barriers if stronger guarantees are needed. More specifically the rstat updater and the flusher can race and cause a scenario where the stats updater skips adding the css to the lockless list but the flusher might not see those updates done by the skipped updater. This is benign race and the subsequent flusher will flush those stats and at the moment there aren't any rstat users which are not fine with this kind of race. However some future user might want more stricter guarantee, so let's add appropriate comments to ease the job of future users. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Fixes: 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-17ilog2: add max_pow_of_two_factor()John Garry
Relocate the function max_pow_of_two_factor() to common ilog2.h from the xfs code, as it will be used elsewhere. Also simplify the function, as advised by Mikulas Patocka. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250711105258.3135198-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-17nvme: fix typo in status code constant for self-test in progressAlok Tiwari
Correct a typo error in the NVMe status code constant from NVME_SC_SELT_TEST_IN_PROGRESS to NVME_SC_SELF_TEST_IN_PROGRESS to accurately reflect its meaning. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-17Merge branch 'for-next' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux Tony Nguyen says: ==================== Add RDMA support for Intel IPU E2000 in idpf Tatyana Nikolova says: This idpf patch series is the second part of the staged submission for introducing RDMA RoCEv2 support for the IPU E2000 line of products, referred to as GEN3. To support RDMA GEN3 devices, the idpf driver uses common definitions of the IIDC interface and implements specific device functionality in iidc_rdma_idpf.h. The IPU model can host one or more logical network endpoints called vPorts per PCI function that are flexibly associated with a physical port or an internal communication port. Other features as it pertains to GEN3 devices include: * MMIO learning * RDMA capability negotiation * RDMA vectors discovery between idpf and control plane These patches are split from the submission "Add RDMA support for Intel IPU E2000 (GEN3)" [1]. The patches have been tested on a range of hosts and platforms with a variety of general RDMA applications which include standalone verbs (rping, perftest, etc.), storage and HPC applications. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> [1] https://lore.kernel.org/all/20240724233917.704-1-tatyana.e.nikolova@intel.com/ This idpf patch series is the second part of the staged submission for introducing RDMA RoCEv2 support for the IPU E2000 line of products, referred to as GEN3. To support RDMA GEN3 devices, the idpf driver uses common definitions of the IIDC interface and implements specific device functionality in iidc_rdma_idpf.h. The IPU model can host one or more logical network endpoints called vPorts per PCI function that are flexibly associated with a physical port or an internal communication port. Other features as it pertains to GEN3 devices include: * MMIO learning * RDMA capability negotiation * RDMA vectors discovery between idpf and control plane These patches are split from the submission "Add RDMA support for Intel IPU E2000 (GEN3)" [1]. The patches have been tested on a range of hosts and platforms with a variety of general RDMA applications which include standalone verbs (rping, perftest, etc.), storage and HPC applications. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> [1] https://lore.kernel.org/all/20240724233917.704-1-tatyana.e.nikolova@intel.com/ IWL reviews: v3: https://lore.kernel.org/all/20250708210554.1662-1-tatyana.e.nikolova@intel.com/ v2: https://lore.kernel.org/all/20250612220002.1120-1-tatyana.e.nikolova@intel.com/ v1 (split from previous series): https://lore.kernel.org/all/20250523170435.668-1-tatyana.e.nikolova@intel.com/ v3: https://lore.kernel.org/all/20250207194931.1569-1-tatyana.e.nikolova@intel.com/ RFC v2: https://lore.kernel.org/all/20240824031924.421-1-tatyana.e.nikolova@intel.com/ RFC: https://lore.kernel.org/all/20240724233917.704-1-tatyana.e.nikolova@intel.com/ * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux: idpf: implement get LAN MMIO memory regions idpf: implement IDC vport aux driver MTU change handler idpf: implement remaining IDC RDMA core callbacks and handlers idpf: implement RDMA vport auxiliary dev create, init, and destroy idpf: implement core RDMA auxiliary dev create, init, and destroy idpf: use reserved RDMA vectors from control plane ==================== Link: https://patch.msgid.link/20250714181002.2865694-1-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-17watchdog: Don't use "proxy" headersAndy Shevchenko
Update header inclusions to follow IWYU (Include What You Use) principle. Note that kernel.h is discouraged to be included as it's written at the top of that file. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250708133646.70384-3-andriy.shevchenko@linux.intel.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2025-07-16firmware: qcom: scm: take struct device as argument in SHM bridge enableBartosz Golaszewski
qcom_scm_shm_bridge_enable() is used early in the SCM initialization routine. It makes an SCM call and so expects the internal __scm pointer in the SCM driver to be assigned. For this reason the tzmem memory pool is allocated *after* this pointer is assigned. However, this can lead to a crash if another consumer of the SCM API makes a call using the memory pool between the assignment of the __scm pointer and the initialization of the tzmem memory pool. As qcom_scm_shm_bridge_enable() is a special case, not meant to be called by ordinary users, pull it into the local SCM header. Make it take struct device as argument. This is the device that will be used to make the SCM call as opposed to the global __scm pointer. This will allow us to move the tzmem initialization *before* the __scm assignment in the core SCM driver. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250630-qcom-scm-race-v2-2-fa3851c98611@linaro.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2025-07-16firmware: qcom: scm: remove unused arguments from SHM bridge routinesBartosz Golaszewski
qcom_scm_shm_bridge_create() and qcom_scm_shm_bridge_delete() take struct device as argument but don't use it. Remove it from these functions' prototypes. Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Link: https://lore.kernel.org/r/20250630-qcom-scm-race-v2-1-fa3851c98611@linaro.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2025-07-16bpf: Add struct bpf_token_infoTao Chen
The 'commit 35f96de04127 ("bpf: Introduce BPF token object")' added BPF token as a new kind of BPF kernel object. And BPF_OBJ_GET_INFO_BY_FD already used to get BPF object info, so we can also get token info with this cmd. One usage scenario, when program runs failed with token, because of the permission failure, we can report what BPF token is allowing with this API for debugging. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> Link: https://lore.kernel.org/r/20250716134654.1162635-1-chen.dylane@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-16ACPI: APEI: handle synchronous exceptions in task workShuai Xue
The memory uncorrected error could be signaled by asynchronous interrupt (specifically, SPI in arm64 platform), e.g. when an error is detected by a background scrubber, or signaled by synchronous exception (specifically, data abort exception in arm64 platform), e.g. when a CPU tries to access a poisoned cache line. Currently, both synchronous and asynchronous errors use memory_failure_queue() to schedule memory_failure() to exectute in a kworker context. As a result, when a user-space process is accessing a poisoned data, a data abort is taken and the memory_failure() is executed in the kworker context, which: - will send wrong si_code by SIGBUS signal in early_kill mode, and - can not kill the user-space in some cases resulting a synchronous error infinite loop Issue 1: send wrong si_code in early_kill mode Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED could be used to determine whether a synchronous exception occurs on ARM64 platform. When a synchronous exception is detected, the kernel is expected to terminate the current process which has accessed a poisoned page. This is done by sending a SIGBUS signal with error code BUS_MCEERR_AR, indicating an action-required machine check error on read. However, when kill_proc() is called to terminate the processes who has the poisoned page mapped, it sends the incorrect SIGBUS error code BUS_MCEERR_AO because the context in which it operates is not the one where the error was triggered. To reproduce this problem: #sysctl -w vm.memory_failure_early_kill=1 vm.memory_failure_early_kill = 1 # STEP2: inject an UCE error and consume it to trigger a synchronous error #einj_mem_uc single 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 injecting ... triggering ... signal 7 code 5 addr 0xffffb0d75000 page not present Test passed The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error and it is not factually correct. After this change: # STEP1: enable early kill mode #sysctl -w vm.memory_failure_early_kill=1 vm.memory_failure_early_kill = 1 # STEP2: inject an UCE error and consume it to trigger a synchronous error #einj_mem_uc single 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 injecting ... triggering ... signal 7 code 4 addr 0xffffb0d75000 page not present Test passed The si_code (code 4) from einj_mem_uc indicates that it is a BUS_MCEERR_AR error as expected. Issue 2: a synchronous error infinite loop If a user-space process, e.g. devmem, accesses a poisoned page for which the HWPoison flag is set, kill_accessing_process() is called to send SIGBUS to current processs with error info. Since the memory_failure() is executed in the kworker context, it will just do nothing but return EFAULT. So, devmem will access the posioned page and trigger an exception again, resulting in a synchronous error infinite loop. Such exception loop may cause platform firmware to exceed some threshold and reboot when Linux could have recovered from this error. To reproduce this problem: # STEP 1: inject an UCE error, and kernel will set HWPosion flag for related page #einj_mem_uc single 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 injecting ... triggering ... signal 7 code 4 addr 0xffffb0d75000 page not present Test passed # STEP 2: access the same page and it will trigger a synchronous error infinite loop devmem 0x4092d55b400 To fix above two issues, queue memory_failure() as a task_work so that it runs in the context of the process that is actually consuming the poisoned data. Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Tested-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Link: https://patch.msgid.link/20250714114212.31660-3-xueshuai@linux.alibaba.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-07-16pmdomain: core: introduce dev_pm_genpd_is_on()Hiago De Franco
This helper function returns the current power status of a given generic power domain. As example, remoteproc/imx_rproc.c can now use this function to check the power status of the remote core to properly set "attached" or "offline" modes. Suggested-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Bjorn Andersson <andersson@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Hiago De Franco <hiago.franco@toradex.com> Link: https://lore.kernel.org/r/20250629172512.14857-2-hiagofranco@gmail.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2025-07-16cxl: Convert to ACQUIRE() for conditional rwsem lockingDan Williams
Use ACQUIRE() to cleanup conditional locking paths in the CXL driver The ACQUIRE() macro and its associated ACQUIRE_ERR() helpers, like scoped_cond_guard(), arrange for scoped-based conditional locking. Unlike scoped_cond_guard(), these macros arrange for an ERR_PTR() to be retrieved representing the state of the conditional lock. The goal of this conversion is to complete the removal of all explicit unlock calls in the subsystem. I.e. the methods to acquire a lock are solely via guard(), scoped_guard() (for limited cases), or ACQUIRE(). All unlock is implicit / scope-based. In order to make sure all lock sites are converted, the existing rwsem's are consolidated and renamed in 'struct cxl_rwsem'. While that makes the patch noisier it gives a clean cut-off between old-world (explicit unlock allowed), and new world (explicit unlock deleted). Cc: David Lechner <dlechner@baylibre.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Shiju Jose <shiju.jose@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Shiju Jose <shiju.jose@huawei.com> Link: https://patch.msgid.link/20250711234932.671292-9-dan.j.williams@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-07-16cleanup: Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locksPeter Zijlstra
scoped_cond_guard(), automatic cleanup for conditional locks, has a couple pain points: * It causes existing straight-line code to be re-indented into a new bracketed scope. While this can be mitigated by a new helper function to contain the scope, that is not always a comfortable conversion. * The return code from the conditional lock is tossed in favor of a scheme to pass a 'return err;' statement to the macro. Other attempts to clean this up, to behave more like guard() [1], got hung up trying to both establish and evaluate the conditional lock in one statement. ACQUIRE() solves this by reflecting the result of the condition in the automatic variable established by the lock CLASS(). The result is separately retrieved with the ACQUIRE_ERR() helper, effectively a PTR_ERR() operation. Link: http://lore.kernel.org/all/Z1LBnX9TpZLR5Dkf@gmail.com [1] Link: http://patch.msgid.link/20250512105026.GP4439@noisy.programming.kicks-ass.net Link: http://patch.msgid.link/20250512185817.GA1808@noisy.programming.kicks-ass.net Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Lechner <dlechner@baylibre.com> Cc: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [djbw: wrap Peter's proposal with changelog and comments] Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20250711234932.671292-2-dan.j.williams@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-07-16mm/pagemap: add write_begin_get_folio() helper functionTaotao Chen
Add write_begin_get_folio() to simplify the common folio lookup logic used by filesystem ->write_begin() implementations. This helper wraps __filemap_get_folio() with common flags such as FGP_WRITEBEGIN, conditional FGP_DONTCACHE, and set folio order based on the write length. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-5-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16fs: change write_begin/write_end interface to take struct kiocb *Taotao Chen
Change the address_space_operations callbacks write_begin() and write_end() to take struct kiocb * as the first argument instead of struct file *. Update all affected function prototypes, implementations, call sites, and related documentation across VFS, filesystems, and block layer. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>