summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2025-02-05srcu: Add SRCU_READ_FLAVOR_SLOWGP to flag need for synchronize_rcu()Paul E. McKenney
This commit switches from a direct test of SRCU_READ_FLAVOR_LITE to a new SRCU_READ_FLAVOR_SLOWGP macro to check for substituting synchronize_rcu() for smp_mb() in SRCU grace periods. Right now, SRCU_READ_FLAVOR_SLOWGP is exactly SRCU_READ_FLAVOR_LITE, but the addition of the _fast() flavor of SRCU will change that. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05srcu: Rename srcu_check_read_flavor_lite() to srcu_check_read_flavor_force()Paul E. McKenney
This commit renames the srcu_check_read_flavor_lite() function to srcu_check_read_flavor_force() and adds a read_flavor argument in order to support an srcu_read_lock_fast() variant that is to avoid array indexing in both the lock and unlock primitives. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05srcu: Make Tree SRCU updates independent of ->srcu_idxPaul E. McKenney
This commit makes Tree SRCU updates independent of ->srcu_idx, then drop ->srcu_idx. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05srcu: Make SRCU readers use ->srcu_ctrs for counter selectionPaul E. McKenney
This commit causes SRCU readers to use ->srcu_ctrs for counter selection instead of ->srcu_idx. This takes another step towards array-indexing-free SRCU readers. [ paulmck: Apply kernel test robot feedback. ] Co-developed-by: Z qiang <qiang.zhang1211@gmail.com> Signed-off-by: Z qiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05srcu: Pull ->srcu_{un,}lock_count into a new srcu_ctr structurePaul E. McKenney
This commit prepares for array-index-free srcu_read_lock*() by moving the ->srcu_{un,}lock_count fields into a new srcu_ctr structure. This will permit ->srcu_index to be replaced by a per-CPU pointer to this structure. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05srcu: Define SRCU_READ_FLAVOR_ALL in terms of symbolsPaul E. McKenney
This commit defines SRCU_READ_FLAVOR_ALL in terms of the SRCU_READ_FLAVOR_* definitions instead of a hexadecimal constant. Suggested-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <bpf@vger.kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05rcu: handle unstable rdp in rcu_read_unlock_strict()Ankur Arora
rcu_read_unlock_strict() can be called with preemption enabled which can make for an unstable rdp and a racy norm value. Fix this by dropping the preempt-count in __rcu_read_unlock() after the call to rcu_read_unlock_strict(), adjusting the preempt-count check appropriately. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05rcu: rename PREEMPT_AUTO to PREEMPT_LAZYAnkur Arora
Replace mentions of PREEMPT_AUTO with PREEMPT_LAZY. Also, since PREMPT_LAZY implies PREEMPTION, we can reduce the TASKS_RCU selection criteria from this: NEED_TASKS_RCU && (PREEMPTION || PREEMPT_AUTO) to this: NEED_TASKS_RCU && PREEMPTION CC: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05rcu: fix header guard for rcu_all_qs()Ankur Arora
rcu_all_qs() is defined for !CONFIG_PREEMPT_RCU but the declaration is conditioned on CONFIG_PREEMPTION. With CONFIG_PREEMPT_LAZY, CONFIG_PREEMPTION=y does not imply CONFIG_PREEMPT_RCU=y. Decouple the two. Cc: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05Revert "i2c: Replace list-based mechanism for handling auto-detected clients"Wolfram Sang
This reverts commit 56a50667cbcfaf95eea9128d5676af94e54b51a8. Mux handling is not sufficiently implemented. It needs more time. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2025-02-05Revert "i2c: Replace list-based mechanism for handling userspace-created ↵Wolfram Sang
clients" This reverts commit 3cfe39b3a845593a485ab1c716615979004ef9f6. Mux handling is not sufficiently implemented. It needs more time. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2025-02-05slab: don't batch kvfree_rcu() with SLUB_TINYVlastimil Babka
kvfree_rcu() is batched for better performance except on TINY_RCU, which is a simple implementation for small UP systems. Similarly SLUB_TINY is an option intended for small systems, whether or not used together with TINY_RCU. In case SLUB_TINY is used with !TINY_RCU, it makes arguably sense to not do the batching and limit the memory footprint. It's also suboptimal to have RCU-specific #ifdefs in slab code. With that, add CONFIG_KVFREE_RCU_BATCHED to determine whether batching kvfree_rcu() implementation is used. It is not set by a user prompt, but enabled by default and disabled in case TINY_RCU or SLUB_TINY are enabled. Use the new config for #ifdef's in slab code and extend their scope to cover all code used by the batched kvfree_rcu(). For example there's no need to perform kvfree_rcu_init() if the batching is disabled. Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-02-05rcu, slab: use a regular callback function for kvfree_rcuVlastimil Babka
RCU has been special-casing callback function pointers that are integers lower than 4096 as offsets of rcu_head for kvfree() instead. The tree RCU implementation no longer does that as the batched kvfree_rcu() is not a simple call_rcu(). The tiny RCU still does, and the plan is also to make tree RCU use call_rcu() for SLUB_TINY configurations. Instead of teaching tree RCU again to special case the offsets, let's remove the special casing completely. Since there's no SLOB anymore, it is possible to create a callback function that can take a pointer to a middle of slab object with unknown offset and determine the object's pointer before freeing it, so implement that as kvfree_rcu_cb(). Large kmalloc and vmalloc allocations are handled simply by aligning down to page size. For that we retain the requirement that the offset is smaller than 4096. But we can remove __is_kvfree_rcu_offset() completely and instead just opencode the condition in the BUILD_BUG_ON() check. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-02-05slab, rcu: move TINY_RCU variant of kvfree_rcu() to SLABVlastimil Babka
Following the move of TREE_RCU implementation, let's move also the TINY_RCU one for consistency and subsequent refactoring. For simplicity, remove the separate inline __kvfree_call_rcu() as TINY_RCU is not meant for high-performance hardware anyway. Declare kvfree_call_rcu() in rcupdate.h to avoid header dependency issues. Also move the kvfree_rcu_barrier() declaration to slab.h Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-02-05perf: Avoid the read if the count is already updatedPeter Zijlstra (Intel)
The event may have been updated in the PMU-specific implementation, e.g., Intel PEBS counters snapshotting. The common code should not read and overwrite the value. The PERF_SAMPLE_READ in the data->sample_type can be used to detect whether the PMU-specific value is available. If yes, avoid the pmu->read() in the common code. Add a new flag, skip_read, to track the case. Factor out a perf_pmu_read() to clean up the code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250121152303.3128733-3-kan.liang@linux.intel.com
2025-02-05uprobes: Remove the spinlock within handle_singlestep()Liao Chang
This patch introduces a flag to track TIF_SIGPENDING is suppress temporarily during the uprobe single-step. Upon uprobe singlestep is handled and the flag is confirmed, it could resume the TIF_SIGPENDING directly without acquiring the siglock in most case, then reducing contention and improving overall performance. I've use the script developed by Andrii in [1] to run benchmark. The CPU used was Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz running the kernel on next tree + the optimization for get_xol_insn_slot() [2]. before-opt ---------- uprobe-nop ( 1 cpus): 0.907 ± 0.003M/s ( 0.907M/s/cpu) uprobe-nop ( 2 cpus): 1.676 ± 0.008M/s ( 0.838M/s/cpu) uprobe-nop ( 4 cpus): 3.210 ± 0.003M/s ( 0.802M/s/cpu) uprobe-nop ( 8 cpus): 4.457 ± 0.003M/s ( 0.557M/s/cpu) uprobe-nop (16 cpus): 3.724 ± 0.011M/s ( 0.233M/s/cpu) uprobe-nop (32 cpus): 2.761 ± 0.003M/s ( 0.086M/s/cpu) uprobe-nop (64 cpus): 1.293 ± 0.015M/s ( 0.020M/s/cpu) uprobe-push ( 1 cpus): 0.883 ± 0.001M/s ( 0.883M/s/cpu) uprobe-push ( 2 cpus): 1.642 ± 0.005M/s ( 0.821M/s/cpu) uprobe-push ( 4 cpus): 3.086 ± 0.002M/s ( 0.771M/s/cpu) uprobe-push ( 8 cpus): 3.390 ± 0.003M/s ( 0.424M/s/cpu) uprobe-push (16 cpus): 2.652 ± 0.005M/s ( 0.166M/s/cpu) uprobe-push (32 cpus): 2.713 ± 0.005M/s ( 0.085M/s/cpu) uprobe-push (64 cpus): 1.313 ± 0.009M/s ( 0.021M/s/cpu) uprobe-ret ( 1 cpus): 1.774 ± 0.000M/s ( 1.774M/s/cpu) uprobe-ret ( 2 cpus): 3.350 ± 0.001M/s ( 1.675M/s/cpu) uprobe-ret ( 4 cpus): 6.604 ± 0.000M/s ( 1.651M/s/cpu) uprobe-ret ( 8 cpus): 6.706 ± 0.005M/s ( 0.838M/s/cpu) uprobe-ret (16 cpus): 5.231 ± 0.001M/s ( 0.327M/s/cpu) uprobe-ret (32 cpus): 5.743 ± 0.003M/s ( 0.179M/s/cpu) uprobe-ret (64 cpus): 4.726 ± 0.016M/s ( 0.074M/s/cpu) after-opt --------- uprobe-nop ( 1 cpus): 0.985 ± 0.002M/s ( 0.985M/s/cpu) uprobe-nop ( 2 cpus): 1.773 ± 0.005M/s ( 0.887M/s/cpu) uprobe-nop ( 4 cpus): 3.304 ± 0.001M/s ( 0.826M/s/cpu) uprobe-nop ( 8 cpus): 5.328 ± 0.002M/s ( 0.666M/s/cpu) uprobe-nop (16 cpus): 6.475 ± 0.002M/s ( 0.405M/s/cpu) uprobe-nop (32 cpus): 4.831 ± 0.082M/s ( 0.151M/s/cpu) uprobe-nop (64 cpus): 2.564 ± 0.053M/s ( 0.040M/s/cpu) uprobe-push ( 1 cpus): 0.964 ± 0.001M/s ( 0.964M/s/cpu) uprobe-push ( 2 cpus): 1.766 ± 0.002M/s ( 0.883M/s/cpu) uprobe-push ( 4 cpus): 3.290 ± 0.009M/s ( 0.823M/s/cpu) uprobe-push ( 8 cpus): 4.670 ± 0.002M/s ( 0.584M/s/cpu) uprobe-push (16 cpus): 5.197 ± 0.004M/s ( 0.325M/s/cpu) uprobe-push (32 cpus): 5.068 ± 0.161M/s ( 0.158M/s/cpu) uprobe-push (64 cpus): 2.605 ± 0.026M/s ( 0.041M/s/cpu) uprobe-ret ( 1 cpus): 1.833 ± 0.001M/s ( 1.833M/s/cpu) uprobe-ret ( 2 cpus): 3.384 ± 0.003M/s ( 1.692M/s/cpu) uprobe-ret ( 4 cpus): 6.677 ± 0.004M/s ( 1.669M/s/cpu) uprobe-ret ( 8 cpus): 6.854 ± 0.005M/s ( 0.857M/s/cpu) uprobe-ret (16 cpus): 6.508 ± 0.006M/s ( 0.407M/s/cpu) uprobe-ret (32 cpus): 5.793 ± 0.009M/s ( 0.181M/s/cpu) uprobe-ret (64 cpus): 4.743 ± 0.016M/s ( 0.074M/s/cpu) Above benchmark results demonstrates a obivious improvement in the scalability of trig-uprobe-nop and trig-uprobe-push, the peak throughput of which are from 4.5M/s to 6.4M/s and 3.3M/s to 5.1M/s individually. [1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org [2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Liao Chang <liaochang1@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250124093826.2123675-3-liaochang1@huawei.com
2025-02-04rcu: Remove references to old grace-period-wait primitivesPaul E. McKenney
The rcu_barrier_sched(), synchronize_sched(), and synchronize_rcu_bh() RCU API members have been gone for many years. This commit therefore removes non-historical instances of them. Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-04mlx4: Remove unused functionsDr. David Alan Gilbert
The last use of mlx4_find_cached_mac() was removed in 2014 by commit 2f5bb473681b ("mlx4: Add ref counting to port MAC table for RoCE") mlx4_zone_free_entries() was added in 2014 by commit 7a89399ffad7 ("net/mlx4: Add mlx4_bitmap zone allocator") but hasn't been used. (The _unique version is used) Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250203185229.204279-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04KVM: remove kvm_arch_post_init_vmPaolo Bonzini
The only statement in a kvm_arch_post_init_vm implementation can be moved into the x86 kvm_arch_init_vm. Do so and remove all traces from architecture-independent code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04tty/ldsem: Remove unused ldsem_down_write_trylockDr. David Alan Gilbert
ldsem_down_write_trylock() was added in 2013 by commit 4898e640caf0 ("tty: Add timed, writer-prioritized rw semaphore") but hasn't been used. Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Link: https://lore.kernel.org/r/20250122012559.441006-1-linux@treblig.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-04efi: Use BIT_ULL() constants for memory attributesArd Biesheuvel
For legibility, use the existing BIT_ULL() to generate the u64 type EFI memory attribute macros. Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-02-04efi: Avoid cold plugged memory for placing the kernelArd Biesheuvel
UEFI 2.11 introduced EFI_MEMORY_HOT_PLUGGABLE to annotate system memory regions that are 'cold plugged' at boot, i.e., hot pluggable memory that is available from early boot, and described as system RAM by the firmware. Existing loaders and EFI applications running in the boot context will happily use this memory for allocating data structures that cannot be freed or moved at runtime, and this prevents the memory from being unplugged. Going forward, the new EFI_MEMORY_HOT_PLUGGABLE attribute should be tested, and memory annotated as such should be avoided for such allocations. In the EFI stub, there are a couple of occurrences where, instead of the high-level AllocatePages() UEFI boot service, a low-level code sequence is used that traverses the EFI memory map and carves out the requested number of pages from a free region. This is needed, e.g., for allocating as low as possible, or for allocating pages at random. While AllocatePages() should presumably avoid special purpose memory and cold plugged regions, this manual approach needs to incorporate this logic itself, in order to prevent the kernel itself from ending up in a hot unpluggable region, preventing it from being unplugged. So add the EFI_MEMORY_HOTPLUGGABLE macro definition, and check for it where appropriate. Cc: stable@vger.kernel.org Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-02-04fsnotify: add mount notification infrastructureMiklos Szeredi
This is just the plumbing between the event source (fs/namespace.c) and the event consumer (fanotify). In itself it does nothing. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250129165803.72138-2-mszeredi@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-03net: harmonize tstats and dstatsPaolo Abeni
After the blamed commits below, some UDP tunnel use dstats for accounting. On the xmit path, all the UDP-base tunnels ends up using iptunnel_xmit_stats() for stats accounting, and the latter assumes the relevant (tunnel) network device uses tstats. The end result is some 'funny' stat report for the mentioned UDP tunnel, e.g. when no packet is actually dropped and a bunch of packets are transmitted: gnv2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue \ state UNKNOWN mode DEFAULT group default qlen 1000 link/ether ee:7d:09:87:90:ea brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped missed mcast 14916 23 0 15 0 0 TX: bytes packets errors dropped carrier collsns 0 1566 0 0 0 0 Address the issue ensuring the same binary layout for the overlapping fields of dstats and tstats. While this solution is a bit hackish, is smaller and with no performance pitfall compared to other alternatives i.e. supporting both dstat and tstat in iptunnel_xmit_stats() or reverting the blamed commit. With time we should possibly move all the IP-based tunnel (and virtual devices) to dstats. Fixes: c77200c07491 ("bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Fixes: 6fa6de302246 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Fixes: be226352e8dc ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/2e1c444cf0f63ae472baff29862c4c869be17031.1738432804.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-03iio: adc: ad7173: move fwnode_irq_get_byname() call siteDavid Lechner
Move the call to fwnode_irq_get_byname() from the driver-specific ad7173_fw_parse_device_config() to the shared ad_sd_init() function. The main reason for this is that we want struct ad_sigma_delta_info to be static const data that describes the actual ADC chip, not the application-specific configuration or any runtime state. Previously, this struct was being used to pass the IRQ number to the shared ad_sd_init() function. Now, this is replaced by a boolean flag that is set at compile time and the ad_sd_init() function handles looking up the IRQ number instead. This also has the added benefit that if any other drivers need to make use of this in the future, they just have to set the flag and the shared code will take care of the rest rather than duplicating the code in each driver. Signed-off-by: David Lechner <dlechner@baylibre.com> Link: https://patch.msgid.link/20250113-iio-adc-ad7313-fix-non-const-info-struct-v4-1-b63be3ecac4a@baylibre.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2025-02-03Merge tag 'timers-urgent-2025-02-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: - Properly cast the input to secs_to_jiffies() to unsigned long as otherwise the result uses the data type of the input variable, which causes result range checks to fail if the input data type is signed and smaller than unsigned long. - Handle late armed hrtimers gracefully on CPU hotplug There are legitimate cases where a hrtimer is (re)armed on an outgoing CPU after the timers have been migrated away. This triggers warnings and caused people to implement horrible workarounds in RCU. But those workarounds are incomplete and do not cover e.g. the scheduler hrtimers. Stop this by force moving timer which are enqueued on the current CPU after timer migration to be queued on a remote online CPU. This allows to undo the workarounds in a seperate step. - Demote a warning level printk() to info level in the clocksource watchdog code as there is no point to emit a warning level message for a purely informational message. - Mark a helper function __always_inline and move it into the existing #ifdef block to avoid 'unused function' warnings from CLANG * tag 'timers-urgent-2025-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: jiffies: Cast to unsigned long in secs_to_jiffies() conversion clocksource: Use pr_info() for "Checking clocksource synchronization" message hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING hrtimers: Mark is_migration_base() with __always_inline
2025-02-03usb: musb: Constify struct musb_fifo_cfgChristophe JAILLET
'struct musb_fifo_cfg' are not modified in these drivers. Constifying these structures moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 64381 5537 202 70120 111e8 drivers/usb/musb/musb_core.o After: ===== text data bss dec hex filename 64957 4929 202 70088 111c8 drivers/usb/musb/musb_core.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/26e6f3e25dc8e785d0034dd7e59918e455563e60.1737234596.git.christophe.jaillet@wanadoo.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-03HID: pidff: Move all hid-pidff definitions to a dedicated headerTomasz Pakuła
Do not clutter hid includes with stuff not needed outside of the kernel. Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Reviewed-by: Michał Kopeć <michal@nozomi.space> Reviewed-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Cristóferson Bueno <cbueno81@gmail.com> Tested-by: Pablo Cisneros <patchkez@protonmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-02-03HID: pidff: Add PERIODIC_SINE_ONLY quirkTomasz Pakuła
Some devices only support SINE periodic effect although they advertise support for all PERIODIC effect in their HID descriptor. Some just do nothing when trying to play such an effect (upload goes fine), some express undefined behavior like turning to one side. This quirk forces all the periodic effects to be uploaded as SINE. This is acceptable as all these effects are similar in nature and are mostly used as rumble. SINE is the most popular with others seldom used (especially SAW_UP and SAW_DOWN). Fixes periodic effects for PXN and LITE STAR wheels Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Reviewed-by: Michał Kopeć <michal@nozomi.space> Reviewed-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Cristóferson Bueno <cbueno81@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-02-03HID: pidff: Add FIX_WHEEL_DIRECTION quirkTomasz Pakuła
Most steering wheels simply ignore DIRECTION field, but some try to be compliant with the PID standard and use it in force calculations. Games often ignore setting this field properly and/or there can be issues with dinput8 -> wine -> SDL -> Linux API translation, and this value can be incorrect. This can lead to partial/complete loss of Force Feedback or even unexpected force reversal. Sadly, this quirk can't be detected automatically without sending out effects that would move an axis. This fixes FFB on Moza Racing devices and others where effect direction is not simply ignored. Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Reviewed-by: Michał Kopeć <michal@nozomi.space> Reviewed-by: Paul Dino Jones <paul@spacefreak18.xyz> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-02-03HID: pidff: Add hid_pidff_init_with_quirks and export as GPL symbolTomasz Pakuła
This lays out a way to provide an initial set of quirks to enable before device initialization takes place. GPL symbol export needed for the possibility of building HID drivers which use this function as modules. Adding a wrapper function to ensure compatibility with the old behavior of hid_pidff_init. Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Reviewed-by: Michał Kopeć <michal@nozomi.space> Reviewed-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Cristóferson Bueno <cbueno81@gmail.com> Tested-by: Pablo Cisneros <patchkez@protonmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-02-03HID: pidff: Add PERMISSIVE_CONTROL quirkTomasz Pakuła
With this quirk, a PID device isn't required to have a strict logical_minimum of 1 for the the PID_DEVICE_CONTROL usage page. Some devices come with weird values in their device descriptors and this quirk enables their initialization even if the logical minimum of the DEVICE_CONTROL page is not 1. Fixes initialization of VRS Direct Force Pro Changes in v6: - Change quirk name to better reflect it's intention Co-developed-by: Makarenko Oleg <oleg@makarenk.ooo> Signed-off-by: Makarenko Oleg <oleg@makarenk.ooo> Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Reviewed-by: Michał Kopeć <michal@nozomi.space> Reviewed-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Cristóferson Bueno <cbueno81@gmail.com> Tested-by: Pablo Cisneros <patchkez@protonmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-02-03HID: pidff: Add MISSING_PBO quirk and its detectionTomasz Pakuła
Some devices with only one axis are missing PARAMETER_BLOCK_OFFSET field for conditional effects. They can only have one axis, so we're limiting the max_axis when setting the report for those effects. Automatic detection ensures compatibility even if such device won't be explicitly defined in the kernel. Fixes initialization of VRS DirectForce PRO and possibly other devices. Changes in v6: - Fixed NULL pointer dereference. When PBO is missing, make sure not to set it anyway Co-developed-by: Makarenko Oleg <oleg@makarenk.ooo> Signed-off-by: Makarenko Oleg <oleg@makarenk.ooo> Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Reviewed-by: Michał Kopeć <michal@nozomi.space> Reviewed-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Cristóferson Bueno <cbueno81@gmail.com> Tested-by: Pablo Cisneros <patchkez@protonmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-02-03HID: pidff: Add MISSING_DELAY quirk and its detectionTomasz Pakuła
A lot of devices do not include this field, and it's seldom used in force feedback implementations. I tested about three dozen applications and none of them make use of the delay. This fixes initialization of a lot of PID wheels like Cammus, VRS, FFBeast This change has no effect on fully compliant devices Co-developed-by: Makarenko Oleg <oleg@makarenk.ooo> Signed-off-by: Makarenko Oleg <oleg@makarenk.ooo> Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Reviewed-by: Michał Kopeć <michal@nozomi.space> Reviewed-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Paul Dino Jones <paul@spacefreak18.xyz> Tested-by: Cristóferson Bueno <cbueno81@gmail.com> Tested-by: Pablo Cisneros <patchkez@protonmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-02-03module: drop unused module_writable_address()Mike Rapoport (Microsoft)
module_writable_address() is unused and can be removed. Signed-off-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250126074733.1384926-9-rppt@kernel.org
2025-02-03module: switch to execmem API for remapping as RW and restoring ROXMike Rapoport (Microsoft)
Instead of using writable copy for module text sections, temporarily remap the memory allocated from execmem's ROX cache as writable and restore its ROX permissions after the module is formed. This will allow removing nasty games with writable copy in alternatives patching on x86. Signed-off-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250126074733.1384926-7-rppt@kernel.org
2025-02-03execmem: add API for temporal remapping as RW and restoring ROX afterwardsMike Rapoport (Microsoft)
Using a writable copy for ROX memory is cumbersome and error prone. Add API that allow temporarily remapping of ranges in the ROX cache as writable and then restoring their read-only-execute permissions. This API will be later used in modules code and will allow removing nasty games with writable copy in alternatives patching on x86. The restoring of the ROX permissions relies on the ability of architecture to reconstruct large pages in its set_memory_rox() method. Signed-off-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250126074733.1384926-6-rppt@kernel.org
2025-02-03x86/mm/pat: restore large ROX pages after fragmentationKirill A. Shutemov
Change of attributes of the pages may lead to fragmentation of direct mapping over time and performance degradation when these pages contain executable code. With current code it's one way road: kernel tries to avoid splitting large pages, but it doesn't restore them back even if page attributes got compatible again. Any change to the mapping may potentially allow to restore large page. Add a hook to cpa_flush() path that will check if the pages in the range that were just touched can be mapped at PMD level. If the collapse at the PMD level succeeded, also attempt to collapse PUD level. The collapse logic runs only when a set_memory_ method explicitly sets CPA_COLLAPSE flag, for now this is only enabled in set_memory_rox(). CPUs don't like[1] to have to have TLB entries of different size for the same memory, but looks like it's okay as long as these entries have matching attributes[2]. Therefore it's critical to flush TLB before any following changes to the mapping. Note that we already allow for multiple TLB entries of different sizes for the same memory now in split_large_page() path. It's not a new situation. set_memory_4k() provides a way to use 4k pages on purpose. Kernel must not remap such pages as large. Re-use one of software PTE bits to indicate such pages. [1] See Erratum 383 of AMD Family 10h Processors [2] https://lore.kernel.org/linux-mm/1da1b025-cabc-6f04-bde5-e50830d1ecf0@amd.com/ [rppt@kernel.org: * s/restore/collapse/ * update formatting per peterz * use 'struct ptdesc' instead of 'struct page' for list of page tables to be freed * try to collapse PMD first and if it succeeds move on to PUD as peterz suggested * flush TLB twice: for changes done in the original CPA call and after collapsing of large pages * update commit message ] Signed-off-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Co-developed-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250126074733.1384926-4-rppt@kernel.org
2025-02-02sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACKChangwoo Min
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times ops.select_cpu() returns a CPU that the task can't use. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02cgroup: fix race between fork and cgroup.killShakeel Butt
Tejun reported the following race between fork() and cgroup.kill at [1]. Tejun: I was looking at cgroup.kill implementation and wondering whether there could be a race window. So, __cgroup_kill() does the following: k1. Set CGRP_KILL. k2. Iterate tasks and deliver SIGKILL. k3. Clear CGRP_KILL. The copy_process() does the following: c1. Copy a bunch of stuff. c2. Grab siglock. c3. Check fatal_signal_pending(). c4. Commit to forking. c5. Release siglock. c6. Call cgroup_post_fork() which puts the task on the css_set and tests CGRP_KILL. The intention seems to be that either a forking task gets SIGKILL and terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I don't see what guarantees that k3 can't happen before c6. ie. After a forking task passes c5, k2 can take place and then before the forking task reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child. What am I missing? This is indeed a race. One way to fix this race is by taking cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork() side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork() to cgroup_post_fork(). However that would be heavy handed as this adds one more potential stall scenario for cgroup.kill which is usually called under extreme situation like memory pressure. To fix this race, let's maintain a sequence number per cgroup which gets incremented on __cgroup_kill() call. On the fork() side, the cgroup_can_fork() will cache the sequence number locally and recheck it against the cgroup's sequence number at cgroup_post_fork() site. If the sequence numbers mismatch, it means __cgroup_kill() can been called and we should send SIGKILL to the newly created task. Reported-by: Tejun Heo <tj@kernel.org> Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1] Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill") Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-01Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs cleanups from Al Viro: "Two unrelated patches - one is a removal of long-obsolete include in overlayfs (it used to need fs/internal.h, but the extern it wanted has been moved back to include/linux/namei.h) and another introduces convenience helper constructing struct qstr by a NUL-terminated string" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add a string-to-qstr constructor fs/overlayfs/namei.c: get rid of include ../internal.h
2025-02-01Merge tag 'mm-hotfixes-stable-2025-02-01-03-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 13 are for MM and 8 are for non-MM. All are singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) MAINTAINERS: include linux-mm for xarray maintenance revert "xarray: port tests to kunit" MAINTAINERS: add lib/test_xarray.c mailmap, MAINTAINERS, docs: update Carlos's email address mm/hugetlb: fix hugepage allocation for interleaved memory nodes mm: gup: fix infinite loop within __get_longterm_locked mm, swap: fix reclaim offset calculation error during allocation .mailmap: update email address for Christopher Obbard kfence: skip __GFP_THISNODE allocations on NUMA systems nilfs2: fix possible int overflows in nilfs_fiemap() mm: compaction: use the proper flag to determine watermarks kernel: be more careful about dup_mmap() failures and uprobe registering mm/fake-numa: handle cases with no SRAT info mm: kmemleak: fix upper boundary check for physical address objects mailmap: add an entry for Hamza Mahfooz MAINTAINERS: mailmap: update Yosry Ahmed's email address scripts/gdb: fix aarch64 userspace detection in get_current_task mm/vmscan: accumulate nr_demoted for accurate demotion statistics ocfs2: fix incorrect CPU endianness conversion causing mount failure mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc() ...
2025-02-01mm/vmscan: fix hard LOCKUP in function isolate_lru_foliosliuye
This fixes the following hard lockup in isolate_lru_folios() during memory reclaim. If the LRU mostly contains ineligible folios this may trigger watchdog. watchdog: Watchdog detected hard LOCKUP on cpu 173 RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0 Call Trace: _raw_spin_lock_irqsave+0x31/0x40 folio_lruvec_lock_irqsave+0x5f/0x90 folio_batch_move_lru+0x91/0x150 lru_add_drain_per_cpu+0x1c/0x40 process_one_work+0x17d/0x350 worker_thread+0x27b/0x3a0 kthread+0xe8/0x120 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1b/0x30 lruvec->lru_lock owner: PID: 2865 TASK: ffff888139214d40 CPU: 40 COMMAND: "kswapd0" #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde [exception RIP: isolate_lru_folios+403] RIP: ffffffffa597df53 RSP: ffffc90006fb7c28 RFLAGS: 00000002 RAX: 0000000000000001 RBX: ffffc90006fb7c60 RCX: ffffea04a2196f88 RDX: ffffc90006fb7c60 RSI: ffffc90006fb7c60 RDI: ffffea04a2197048 RBP: ffff88812cbd3010 R8: ffffea04a2197008 R9: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffffea04a2197008 R13: ffffea04a2197048 R14: ffffc90006fb7de8 R15: 0000000003e3e937 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 <NMI exception stack> #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788 #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0 #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354 #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238 crash> Scenario: User processe are requesting a large amount of memory and keep page active. Then a module continuously requests memory from ZONE_DMA32 area. Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached. However pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. Reproduce: Terminal 1: Construct to continuously increase pages active(anon). mkdir /tmp/memory mount -t tmpfs -o size=1024000M tmpfs /tmp/memory dd if=/dev/zero of=/tmp/memory/block bs=4M tail /tmp/memory/block Terminal 2: vmstat -a 1 active will increase. procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ... r b swpd free inact active si so bi bo 1 0 0 1445623076 45898836 83646008 0 0 0 1 0 0 1445623076 43450228 86094616 0 0 0 1 0 0 1445623076 41003480 88541364 0 0 0 1 0 0 1445623076 38557088 90987756 0 0 0 1 0 0 1445623076 36109688 93435156 0 0 0 1 0 0 1445619552 33663256 95881632 0 0 0 1 0 0 1445619804 31217140 98327792 0 0 0 1 0 0 1445619804 28769988 100774944 0 0 0 1 0 0 1445619804 26322348 103222584 0 0 0 1 0 0 1445619804 23875592 105669340 0 0 0 cat /proc/meminfo | head Active(anon) increase. MemTotal: 1579941036 kB MemFree: 1445618500 kB MemAvailable: 1453013224 kB Buffers: 6516 kB Cached: 128653956 kB SwapCached: 0 kB Active: 118110812 kB Inactive: 11436620 kB Active(anon): 115345744 kB Inactive(anon): 945292 kB When the Active(anon) is 115345744 kB, insmod module triggers the ZONE_DMA32 watermark. perf record -e vmscan:mm_vmscan_lru_isolate -aR perf script isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2 nr_skipped=2 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29 nr_skipped=29 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon See nr_scanned=28835844. 28835844 * 4k = 115343376KB approximately equal to 115345744 kB. If increase Active(anon) to 1000G then insmod module triggers the ZONE_DMA32 watermark. hard lockup will occur. In my device nr_scanned = 0000000003e3e937 when hard lockup. Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB. [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 ffffc90006fb7c30: 0000000000000020 0000000000000000 ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000 ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8 ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48 ffffc90006fb7c70: 0000000000000000 0000000000000000 ffffc90006fb7c80: 0000000000000000 0000000000000000 ffffc90006fb7c90: 0000000000000000 0000000000000000 ffffc90006fb7ca0: 0000000000000000 0000000003e3e937 ffffc90006fb7cb0: 0000000000000000 0000000000000000 ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000 About the Fixes: Why did it take eight years to be discovered? The problem requires the following conditions to occur: 1. The device memory should be large enough. 2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. 3. The memory in ZONE_DMA32 needs to reach the watermark. If the memory is not large enough, or if the usage design of ZONE_DMA32 area memory is reasonable, this problem is difficult to detect. notes: The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL, but other suitable scenarios may also trigger the problem. Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: liuye <liuye@kylinos.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-31Merge tag 'riscv-for-linus-6.14-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - The PH1520 pinctrl and dwmac drivers are enabeled in defconfig - A redundant AQRL barrier has been removed from the futex cmpxchg implementation - Support for the T-Head vector extensions, which includes exposing these extensions to userspace on systems that implement them - Some more page table information is now printed on die() and systems that cause PA overflows * tag 'riscv-for-linus-6.14-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: add a warning when physical memory address overflows riscv/mm/fault: add show_pte() before die() riscv: Add ghostwrite vulnerability selftests: riscv: Support xtheadvector in vector tests selftests: riscv: Fix vector tests riscv: hwprobe: Document thead vendor extensions and xtheadvector extension riscv: hwprobe: Add thead vendor extension probing riscv: vector: Support xtheadvector save/restore riscv: Add xtheadvector instruction definitions riscv: csr: Add CSR encodings for CSR_VXRM/CSR_VXSAT RISC-V: define the elements of the VCSR vector CSR riscv: vector: Use vlenb from DT for thead riscv: Add thead and xtheadvector as a vendor extension riscv: dts: allwinner: Add xtheadvector to the D1/D1s devicetree dt-bindings: cpus: add a thead vlen register length property dt-bindings: riscv: Add xtheadvector ISA extension description RISC-V: Mark riscv_v_init() as __init riscv: defconfig: drop RT_GROUP_SCHED=y riscv/futex: Optimize atomic cmpxchg riscv: defconfig: enable pinctrl and dwmac support for TH1520
2025-01-31Merge tag 'kbuild-v6.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Support multiple hook locations for maint scripts of Debian package - Remove 'cpio' from the build tool requirement - Introduce gendwarfksyms tool, which computes CRCs for export symbols based on the DWARF information - Support CONFIG_MODVERSIONS for Rust - Resolve all conflicts in the genksyms parser - Fix several syntax errors in genksyms * tag 'kbuild-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (64 commits) kbuild: fix Clang LTO with CONFIG_OBJTOOL=n kbuild: Strip runtime const RELA sections correctly kconfig: fix memory leak in sym_warn_unmet_dep() kconfig: fix file name in warnings when loading KCONFIG_DEFCONFIG_LIST genksyms: fix syntax error for attribute before init-declarator genksyms: fix syntax error for builtin (u)int*x*_t types genksyms: fix syntax error for attribute after 'union' genksyms: fix syntax error for attribute after 'struct' genksyms: fix syntax error for attribute after abstact_declarator genksyms: fix syntax error for attribute before nested_declarator genksyms: fix syntax error for attribute before abstract_declarator genksyms: decouple ATTRIBUTE_PHRASE from type-qualifier genksyms: record attributes consistently for init-declarator genksyms: restrict direct-declarator to take one parameter-type-list genksyms: restrict direct-abstract-declarator to take one parameter-type-list genksyms: remove Makefile hack genksyms: fix last 3 shift/reduce conflicts genksyms: fix 6 shift/reduce conflicts and 5 reduce/reduce conflicts genksyms: reduce type_qualifier directly to decl_specifier genksyms: rename cvar_qualifier to type_qualifier ...
2025-01-31Merge tag 'block-6.14-20250131' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - MD pull request via Song: - Fix a md-cluster regression introduced - More sysfs race fixes - Mark anything inside queue freezing as not being able to do IO for memory allocations - Fix for a regression introduced in loop in this merge window - Fix for a regression in queue mapping setups introduced in this merge window - Fix for the block dio fops attempting an iov_iter revert upton getting -EIOCBQUEUED on the read side. This one is going to stable as well * tag 'block-6.14-20250131' of git://git.kernel.dk/linux: block: force noio scope in blk_mq_freeze_queue block: fix nr_hw_queue update racing with disk addition/removal block: get rid of request queue ->sysfs_dir_lock loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64} md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime blk-mq: create correct map for fallback case block: don't revert iter for -EIOCBQUEUED
2025-01-31Merge tag 'io_uring-6.14-20250131' of git://git.kernel.dk/linuxLinus Torvalds
Pull more io_uring updates from Jens Axboe: - Series cleaning up the alloc cache changes from this merge window, and then another series on top making it better yet. This also solves an issue with KASAN_EXTRA_INFO, by making io_uring resilient to KASAN using parts of the freed struct for storage - Cleanups and simplications to buffer cloning and io resource node management - Fix an issue introduced in this merge window where READ/WRITE_ONCE was used on an atomic_t, which made some archs complain - Fix for an errant connect retry when the socket has been shut down - Fix for multishot and provided buffers * tag 'io_uring-6.14-20250131' of git://git.kernel.dk/linux: io_uring/net: don't retry connect operation on EPOLLERR io_uring/rw: simplify io_rw_recycle() io_uring: remove !KASAN guards from cache free io_uring/net: extract io_send_select_buffer() io_uring/net: clean io_msg_copy_hdr() io_uring/net: make io_net_vec_assign() return void io_uring: add alloc_cache.c io_uring: dont ifdef io_alloc_cache_kasan() io_uring: include all deps for alloc_cache.h io_uring: fix multishots with selected buffers io_uring/register: use atomic_read/write for sq_flags migration io_uring/alloc_cache: get rid of _nocache() helper io_uring: get rid of alloc cache init_once handling io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout io_uring/uring_cmd: use cached cmd_op in io_uring_cmd_sock() io_uring/msg_ring: don't leave potentially dangling ->tctx pointer io_uring/rsrc: Move lockdep assert from io_free_rsrc_node() to caller io_uring/rsrc: remove unused parameter ctx for io_rsrc_node_alloc() io_uring: clean up io_uring_register_get_file() io_uring/rsrc: Simplify buffer cloning by locking both rings
2025-01-31Merge tag 'x86-mm-2025-01-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: - The biggest changes are the TLB flushing scalability optimizations, to update the mm_cpumask lazily and related changes. This feature has both a track record and a continued risk of performance regressions, so it was already delayed by a cycle - but it's all 100% perfect now™ (Rik van Riel) - Also miscellaneous fixes and cleanups. (Gautam Somani, Kirill Shutemov, Sebastian Andrzej Siewior) * tag 'x86-mm-2025-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Remove unnecessary include of <linux/extable.h> x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state() x86/mm/selftests: Fix typo in lam.c x86/mm/tlb: Only trim the mm_cpumask once a second x86/mm/tlb: Also remove local CPU from mm_cpumask if stale x86/mm/tlb: Add tracepoint for TLB flush IPI to stale CPU x86/mm/tlb: Update mm_cpumask lazily
2025-01-31Merge tag 'ceph-for-6.14-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "A fix for a memory leak from Antoine (marked for stable) and two cleanups from Liang and Slava" * tag 'ceph-for-6.14-rc1' of https://github.com/ceph/ceph-client: ceph: exchange hardcoded value on NAME_MAX ceph: streamline request head structures in MDS client ceph: fix memory leak in ceph_mds_auth_match()
2025-01-31block: force noio scope in blk_mq_freeze_queueChristoph Hellwig
When block drivers or the core block code perform allocations with a frozen queue, this could try to recurse into the block device to reclaim memory and deadlock. Thus all allocations done by a process that froze a queue need to be done without __GFP_IO and __GFP_FS. Instead of tying to track all of them down, force a noio scope as part of freezing the queue. Note that nvme is a bit of a mess here due to the non-owner freezes, and they will be addressed separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>