summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2024-11-11rtnetlink: Add peer_type in struct rtnl_link_ops.Kuniyuki Iwashima
In ops->newlink(), veth, vxcan, and netkit call rtnl_link_get_net() with a net pointer, which is the first argument of ->newlink(). rtnl_link_get_net() could return another netns based on IFLA_NET_NS_PID and IFLA_NET_NS_FD in the peer device's attributes. We want to get it and fill rtnl_nets->nets[] in advance in rtnl_newlink() for per-netns RTNL. All of the three get the peer netns in the same way: 1. Call rtnl_nla_parse_ifinfomsg() 2. Call ops->validate() (vxcan doesn't have) 3. Call rtnl_link_get_net_tb() Let's add a new field peer_type to struct rtnl_link_ops and prefetch netns in the peer ifla to add it to rtnl_nets in rtnl_newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11rtnetlink: Remove __rtnl_link_register()Kuniyuki Iwashima
link_ops is protected by link_ops_mutex and no longer needs RTNL, so we have no reason to have __rtnl_link_register() separately. Let's remove it and call rtnl_link_register() from ifb.ko and dummy.ko. Note that both modules' init() work on init_net only, so we need not export pernet_ops_rwsem and can use rtnl_net_lock() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11rtnetlink: Protect link_ops by mutex.Kuniyuki Iwashima
rtnl_link_unregister() holds RTNL and calls synchronize_srcu(), but rtnl_newlink() will acquire SRCU frist and then RTNL. Then, we need to unlink ops and call synchronize_srcu() outside of RTNL to avoid the deadlock. rtnl_link_unregister() rtnl_newlink() ---- ---- lock(rtnl_mutex); lock(&ops->srcu); lock(rtnl_mutex); sync(&ops->srcu); Let's move as such and add a mutex to protect link_ops. Now, link_ops is protected by its dedicated mutex and rtnl_link_register() no longer needs to hold RTNL. While at it, we move the initialisation of ops->dellink and ops->srcu out of the mutex scope. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11rtnetlink: Remove __rtnl_link_unregister().Kuniyuki Iwashima
rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(), where we call synchronize_srcu() to wait inflight RTM_NEWLINK requests for per-netns RTNL. We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko and dummy.ko. However, rtnl_newlink() will acquire SRCU before RTNL later in this series. Then, lockdep will detect the deadlock: rtnl_link_unregister() rtnl_newlink() ---- ---- lock(rtnl_mutex); lock(&ops->srcu); lock(rtnl_mutex); sync(&ops->srcu); To avoid the problem, we must call synchronize_srcu() before RTNL in rtnl_link_unregister(). As a preparation, let's remove __rtnl_link_unregister(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcountVlastimil Babka
Since 7d6be67cfdd4 ("mm: mmap_lock: replace get_memcg_path_buf() with on-stack buffer") we use trace_mmap_lock_reg()/unreg() only to maintain an atomic reg_refcount which is checked to avoid performing get_mm_memcg_path() in case none of the tracepoints using it is enabled. This can be achieved directly by putting all the work needed for the tracepoint behind the trace_mmap_lock_##type##_enabled(), as suggested by Documentation/trace/tracepoints.rst and with the following advantages: - uses the tracepoint's static key instead of evaluating a branch - the check tracepoint specific, not shared by all of them - we can get rid of trace_mmap_lock_reg()/unreg() completely Thus use the trace_..._enabled() check and remove unnecessary code. Link: https://lkml.kernel.org/r/20241105113456.95066-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: define general function pXd_init()Bibo Mao
pud_init(), pmd_init() and kernel_pte_init() are duplicated defined in file kasan.c and sparse-vmemmap.c as weak functions. Move them to generic header file pgtable.h, architecture can redefine them. Link: https://lkml.kernel.org/r/20241104070712.52902-1-maobibo@loongson.cn Signed-off-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Huacai Chen <chenhuacai@loongson.cn> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11kmemleak: iommu/iova: fix transient kmemleak false positiveCatalin Marinas
The introduction of iova_depot_pop() in 911aa1245da8 ("iommu/iova: Make the rcache depot scale better") confused kmemleak by moving a struct iova_magazine object from a singly linked list to rcache->depot and resetting the 'next' pointer referencing it. Unlike doubly linked lists, the content of the object being referred is never changed on removal from a singly linked list and the kmemleak checksum heuristics do not detect such scenario. This leads to false positives like: unreferenced object 0xffff8881a5301000 (size 1024): comm "softirq", pid 0, jiffies 4306297099 (age 462.991s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 e7 7d 05 00 00 00 00 00 .........}...... 0f b4 05 00 00 00 00 00 b4 96 05 00 00 00 00 00 ................ backtrace: [<ffffffff819f5f08>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff818a239a>] kmalloc_trace+0x2a/0x60 [<ffffffff8231d31e>] free_iova_fast+0x28e/0x4e0 [<ffffffff82310860>] fq_ring_free_locked+0x1b0/0x310 [<ffffffff8231225d>] fq_flush_timeout+0x19d/0x2e0 [<ffffffff813e95ba>] call_timer_fn+0x19a/0x5c0 [<ffffffff813ea16b>] __run_timers+0x78b/0xb80 [<ffffffff813ea5bd>] run_timer_softirq+0x5d/0xd0 [<ffffffff82f1d915>] __do_softirq+0x205/0x8b5 Introduce kmemleak_transient_leak() which resets the object checksum requiring another scan pass before it is reported (if still unreferenced). Call this new API in iova_depot_pop(). Link: https://lkml.kernel.org/r/20241104111944.2207155-1-catalin.marinas@arm.com Link: https://lore.kernel.org/r/ZY1osaGLyT-sdKE8@shredder/ Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Ido Schimmel <idosch@idosch.org> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm/list_lru: simplify the list_lru walk callback functionKairui Song
Now isolation no longer takes the list_lru global node lock, only use the per-cgroup lock instead. And this lock is inside the list_lru_one being walked, no longer needed to pass the lock explicitly. Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm/list_lru: split the lock to per-cgroup scopeKairui Song
Currently, every list_lru has a per-node lock that protects adding, deletion, isolation, and reparenting of all list_lru_one instances belonging to this list_lru on this node. This lock contention is heavy when multiple cgroups modify the same list_lru. This lock can be split into per-cgroup scope to reduce contention. To achieve this, we need a stable list_lru_one for every cgroup. This commit adds a lock to each list_lru_one and introduced a helper function lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg. Then reworked the reparenting process. Reparenting will switch the list_lru_one instances one by one. By locking each instance and marking it dead using the nr_items counter, reparenting ensures that all items in the corresponding cgroup (on-list or not, because items have a stable cgroup, see below) will see the list_lru_one switch synchronously. Objcg reparent is also moved after list_lru reparent so items will have a stable mem cgroup until all list_lru_one instances are drained. The only caller that doesn't work the *_obj interfaces are direct calls to list_lru_{add,del}. But it's only used by zswap and that's also based on objcg, so it's fine. This also changes the bahaviour of the isolation function when LRU_RETRY or LRU_REMOVED_RETRY is returned, because now releasing the lock could unblock reparenting and free the list_lru_one, isolation function will have to return withoug re-lock the lru. prepare() { mkdir /tmp/test-fs modprobe brd rd_nr=1 rd_size=33554432 mkfs.xfs -f /dev/ram0 mount -t xfs /dev/ram0 /tmp/test-fs for i in $(seq 1 512); do mkdir "/tmp/test-fs/$i" for j in $(seq 1 10240); do echo TEST-CONTENT > "/tmp/test-fs/$i/$j" done & done; wait } do_test() { read_worker() { sleep 1 tar -cv "$1" &>/dev/null } read_in_all() { cd "/tmp/test-fs" && ls for i in $(seq 1 512); do (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs" read_worker "$i" & done; wait } for i in $(seq 1 512); do mkdir -p "/sys/fs/cgroup/benchmark/$i" done echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control echo 512M > /sys/fs/cgroup/benchmark/memory.max echo 3 > /proc/sys/vm/drop_caches time read_in_all } Above script simulates compression of small files in multiple cgroups with memory pressure. Run prepare() then do_test for 6 times: Before: real 0m7.762s user 0m11.340s sys 3m11.224s real 0m8.123s user 0m11.548s sys 3m2.549s real 0m7.736s user 0m11.515s sys 3m11.171s real 0m8.539s user 0m11.508s sys 3m7.618s real 0m7.928s user 0m11.349s sys 3m13.063s real 0m8.105s user 0m11.128s sys 3m14.313s After this commit (about ~15% faster): real 0m6.953s user 0m11.327s sys 2m42.912s real 0m7.453s user 0m11.343s sys 2m51.942s real 0m6.916s user 0m11.269s sys 2m43.957s real 0m6.894s user 0m11.528s sys 2m45.346s real 0m6.911s user 0m11.095s sys 2m43.168s real 0m6.773s user 0m11.518s sys 2m40.774s Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm/list_lru: don't pass unnecessary key parametersKairui Song
Patch series "mm/list_lru: Split list_lru lock into per-cgroup scope". When LOCKDEP is not enabled, lock_class_key is an empty struct that is never used. But the list_lru initialization function still takes a placeholder pointer as parameter, and the compiler cannot optimize it because the function is not static and exported. Remove this parameter and move it inside the list_lru struct. Only use it when LOCKDEP is enabled. Kernel builds with LOCKDEP will be slightly larger, while !LOCKDEP builds without it will be slightly smaller (the common case). Link: https://lkml.kernel.org/r/20241104175257.60853-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20241104175257.60853-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11rxrpc: Add a tracepoint for aborts being proposedDavid Howells
Add a tracepoint to rxrpc to trace the proposal of an abort. The abort is performed asynchronously by the I/O thread. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/726356.1730898045@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11empty include/asm-generic/vga.hAl Viro
all places that use anything defined in it (vgacon, mdacon and vga16fb) are built only on architectures that have all that stuff in their native asm/vga.h allows to kill stub asm/vga.h on sh, while we are at it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-11-11vt_buffer.h: get rid of dead code in default scr_...() instancesAl Viro
Only 4 architectures define VT_BUF_HAVE_RW (alpha, mips, powerpc, sparc) and all of them define VT_BUF_HAVE_MEM{SET,CPY,MOVE}W. In other words, the code under #ifdef VT_BUF_HAVE_RW in default scr_mem...w() instances won't be compiled anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-11-11mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'Yunsheng Lin
Currently there is one 'struct page_frag' for every 'struct sock' and 'struct task_struct', we are about to replace the 'struct page_frag' with 'struct page_frag_cache' for them. Before begin the replacing, we need to ensure the size of 'struct page_frag_cache' is not bigger than the size of 'struct page_frag', as there may be tens of thousands of 'struct sock' and 'struct task_struct' instances in the system. By or'ing the page order & pfmemalloc with lower bits of 'va' instead of using 'u16' or 'u32' for page size and 'u8' for pfmemalloc, we are able to avoid 3 or 5 bytes space waste. And page address & pfmemalloc & order is unchanged for the same page in the same 'page_frag_cache' instance, it makes sense to fit them together. After this patch, the size of 'struct page_frag_cache' should be the same as the size of 'struct page_frag'. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-7-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mm: page_frag: avoid caller accessing 'page_frag_cache' directlyYunsheng Lin
Use appropriate frag_page API instead of caller accessing 'page_frag_cache' directly. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20241028115343.3405838-5-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mm: move the page fragment allocator from page_alloc into its own fileYunsheng Lin
Inspired by [1], move the page fragment allocator from page_alloc into its own c file and header file, as we are about to make more change for it to replace another page_frag implementation in sock.c As this patchset is going to replace 'struct page_frag' with 'struct page_frag_cache' in sched.h, including page_frag_cache.h in sched.h has a compiler error caused by interdependence between mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler error by moving 'struct page_frag_cache' to mm_types_task.h as suggested by Alexander, see [3]. 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowells@redhat.com/ 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/ 3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt_HOA@mail.gmail.com/ CC: David Howells <dhowells@redhat.com> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-3-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11Merge branch kvm-arm64/nv-pmu into kvmarm/nextOliver Upton
* kvm-arm64/nv-pmu: : Support for vEL2 PMU controls : : Align the vEL2 PMU support with the current state of non-nested KVM, : including: : : - Trap routing, with the annoying complication of EL2 traps that apply : in Host EL0 : : - PMU emulation, using the correct configuration bits depending on : whether a counter falls in the hypervisor or guest range of PMCs : : - Perf event swizzling across nested boundaries, as the event filtering : needs to be remapped to cope with vEL2 KVM: arm64: nv: Reprogram PMU events affected by nested transition KVM: arm64: nv: Apply EL2 event filtering when in hyp context KVM: arm64: nv: Honor MDCR_EL2.HLP KVM: arm64: nv: Honor MDCR_EL2.HPME KVM: arm64: Add helpers to determine if PMC counts at a given EL KVM: arm64: nv: Adjust range of accessible PMCs according to HPMN KVM: arm64: Rename kvm_pmu_valid_counter_mask() KVM: arm64: nv: Advertise support for FEAT_HPMN0 KVM: arm64: nv: Describe trap behaviour of MDCR_EL2.HPMN KVM: arm64: nv: Honor MDCR_EL2.{TPM, TPMCR} in Host EL0 KVM: arm64: nv: Reinject traps that take effect in Host EL0 KVM: arm64: nv: Rename BEHAVE_FORWARD_ANY KVM: arm64: nv: Allow coarse-grained trap combos to use complex traps KVM: arm64: Describe RES0/RES1 bits of MDCR_EL2 arm64: sysreg: Add new definitions for ID_AA64DFR0_EL1 arm64: sysreg: Migrate MDCR_EL2 definition to table arm64: sysreg: Describe ID_AA64DFR2_EL1 fields Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11Merge branch kvm-arm64/psci-1.3 into kvmarm/nextOliver Upton
* kvm-arm64/psci-1.3: : PSCI v1.3 support, courtesy of David Woodhouse : : Bump KVM's PSCI implementation up to v1.3, with the added bonus of : implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls, : this gets relayed to userspace for further processing with a new : KVM_SYSTEM_EVENT_SHUTDOWN flag. : : As an added bonus, implement client-side support for hibernation with : the SYSTEM_OFF2 call. arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call KVM: selftests: Add test for PSCI SYSTEM_OFF2 KVM: arm64: Add support for PSCI v1.2 and v1.3 KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation firmware/psci: Add definitions for PSCI v1.3 specification Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11net: netlink: add nla_get_*_default() accessorsJohannes Berg
There are quite a number of places that use patterns such as if (attr) val = nla_get_u16(attr); else val = DEFAULT; Add nla_get_u16_default() and friends like that to not have to type this out all the time. Acked-by: Toke Høiland-Jørgensen <toke@kernel.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20241108114145.acd2aadb03ac.I3df6aac71d38a5baa1c0a03d0c7e82d4395c030e@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11firmware: qcom: scm: Introduce CP_SMMU_APERTURE_IDBjorn Andersson
The QCOM_SCM_SVC_MP service provides QCOM_SCM_MP_CP_SMMU_APERTURE_ID, which is used to trigger the mapping of register banks into the SMMU context for per-processes page tables to function (in case this isn't statically setup by firmware). This is necessary on e.g. QCS6490 Rb3Gen2, in order to avoid "CP | AHB bus error"-errors from the GPU. Introduce a function to allow the msm driver to invoke this call. Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com> Reviewed-by: Rob Clark <robdclark@gmail.com> Link: https://lore.kernel.org/r/20241110-adreno-smmu-aparture-v2-1-9b1fb2ee41d4@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2024-11-11nvme: check ns's volatile write cache not presentGuixin Liu
When the VWC of a namespace does not exist, the BLK_FEAT_WRITE_CACHE flag should not be set when registering the block device, regardless of whether the controller supports VWC. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: support for csi identify nsKeith Busch
Implements reporting the I/O Command Set Independent Identify Namespace command. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement rotational media information logKeith Busch
Most of the information is stubbed. Supporting these commands is a requirement for supporting rotational media. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement endurance groupsKeith Busch
Most of the returned information is just stubbed data. The target must support these in order to report rotational media. Since this driver doesn't know any better, each namespace is its own endurance group with the engid value matching the nsid. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement supported features logKeith Busch
This log is required for nvme 2.1. Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement supported log pagesKeith Busch
This log is required for nvme 2.1. Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement active command set ns listKeith Busch
This is required for nvme 2.1 for targets that support multiple command sets. We support NVM and ZNS, so are required to support this identification. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: support reservation featureGuixin Liu
This patch implements the reservation feature, including: 1. reservation register(register, unregister and replace). 2. reservation acquire(acquire, preempt, preempt and abort). 3. reservation release(release and clear). 4. reservation report. 5. set feature and get feature of reservation notify mask. 6. get log page of reservation event. Not supported: 1. persistent reservation through power loss. Test cases: Use nvme-cli and fio to test all implemented sub features: 1. use nvme resv-register to register host a registrant or unregister or replace a new key. 2. use nvme resv-acquire to set host to the holder, and use fio to send read and write io in all reservation type. And also test preempt and "preempt and abort". 3. use nvme resv-report to show all registrants and reservation status. 4. use nvme resv-release to release all registrants. 5. use nvme get-log to get events generated by the preceding operations. In addition, make reservation configurable, one can set ns to support reservation before enable ns. The default of resv_enable is false. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11kunit: skb: use "gfp" variable instead of hardcoding GFP_KERNELDan Carpenter
The intent here was clearly to use the gfp variable flags instead of hardcoding GFP_KERNEL. All the callers pass GFP_KERNEL as the gfp flags so this doesn't affect runtime. Fixes: b3231d353a51 ("kunit: add a convenience allocation wrapper for SKBs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Gow <davidgow@google.com> Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-11-11drm/fourcc: add AMD_FMT_MOD_TILE_GFX9_4K_D_XQiang Yu
This is used when radeonsi export small texture's modifier to user with eglExportDMABUFImageQueryMESA(). mesa changes is available here: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31658 Reviewed-by: Marek Olšák <marek.olsak@amd.com> Signed-off-by: Qiang Yu <qiang.yu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-11block: pre-calculate max_zone_append_sectorsChristoph Hellwig
max_zone_append_sectors differs from all other queue limits in that the final value used is not stored in the queue_limits but needs to be obtained using queue_limits_max_zone_append_sectors helper. This not only adds (tiny) extra overhead to the I/O path, but also can be easily forgotten in file system code. Add a new max_hw_zone_append_sectors value to queue_limits which is set by the driver, and calculate max_zone_append_sectors from that and the other inputs in blk_validate_zoned_limits, similar to how max_sectors is calculated to fix this. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11bpf: Drop special callback reference handlingKumar Kartikeya Dwivedi
Logic to prevent callbacks from acquiring new references for the program (i.e. leaving acquired references), and releasing caller references (i.e. those acquired in parent frames) was introduced in commit 9d9d00ac29d0 ("bpf: Fix reference state management for synchronous callbacks"). This was necessary because back then, the verifier simulated each callback once (that could potentially be executed N times, where N can be zero). This meant that callbacks that left lingering resources or cleared caller resources could do it more than once, operating on undefined state or leaking memory. With the fixes to callback verification in commit ab5cfac139ab ("bpf: verify callbacks as if they are called unknown number of times"), all of this extra logic is no longer necessary. Hence, drop it as part of this commit. Cc: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241109231430.2475236-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Refactor active lock managementKumar Kartikeya Dwivedi
When bpf_spin_lock was introduced originally, there was deliberation on whether to use an array of lock IDs, but since bpf_spin_lock is limited to holding a single lock at any given time, we've been using a single ID to identify the held lock. In preparation for introducing spin locks that can be taken multiple times, introduce support for acquiring multiple lock IDs. For this purpose, reuse the acquired_refs array and store both lock and pointer references. We tag the entry with REF_TYPE_PTR or REF_TYPE_LOCK to disambiguate and find the relevant entry. The ptr field is used to track the map_ptr or btf (for bpf_obj_new allocations) to ensure locks can be matched with protected fields within the same "allocation", i.e. bpf_obj_new object or map value. The struct active_lock is changed to an int as the state is part of the acquired_refs array, and we only need active_lock as a cheap way of detecting lock presence. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241109231430.2475236-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Add support for uprobe multi session attachJiri Olsa
Adding support to attach BPF program for entry and return probe of the same function. This is common use case which at the moment requires to create two uprobe multi links. Adding new BPF_TRACE_UPROBE_SESSION attach type that instructs kernel to attach single link program to both entry and exit probe. It's possible to control execution of the BPF program on return probe simply by returning zero or non zero from the entry BPF program execution to execute or not the BPF program on return probe respectively. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org
2024-11-11block: lift bio_is_zone_append to bio.hChristoph Hellwig
Make bio_is_zone_append globally available, because file systems need to use to check for a zone append bio in their end_io handlers to deal with the block layer emulation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241104062647.91160-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11ASoC: add symmetric_ prefix for dai->rate/channels/sample_bitsKuninori Morimoto
snd_soc_dai has rate/channels/sample_bits parameter, but it is only valid if symmetry is being enforced by symmetric_xxx flag on driver. It is very difficult to know about it from current naming, and easy to misunderstand it. add symmetric_ prefix for it. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Link: https://patch.msgid.link/87zfmd8bnf.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-11-11Merge back thermal control material for 6.13Rafael J. Wysocki
2024-11-11btrfs: add new ioctl to wait for cleaned subvolumesDavid Sterba
Add a new unprivileged ioctl that will let the command 'btrfs subvolume sync' work without the (privileged) SEARCH_TREE ioctl. There are several modes of operation, where the most common ones are to wait on a specific subvolume or all currently queued for cleaning. This is utilized e.g. in backup applications that delete subvolumes and wait until they're cleaned to check for remaining space. The other modes are for flexibility, e.g. for monitoring or checkpoints in the queue of deleted subvolumes, again without the need to use SEARCH_TREE. Notes: - waiting is interruptible, the timeout is set to 1 second and is not configurable - repeated calls to the ioctl see a different state, so this is inherently racy when using e.g. the count or peek next/last Use cases: - a subvolume A was deleted, wait for cleaning (WAIT_FOR_ONE) - a bunch of subvolumes were deleted, wait for all (WAIT_FOR_QUEUED or PEEK_LAST + WAIT_FOR_ONE) - count how many are queued (not blocking), for monitoring purposes - report progress (PEEK_NEXT), may miss some if cleaning is quick - own waiting in user space (PEEK_LAST until it's 0) Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11io_uring/cmd: let cmds to know about dying taskPavel Begunkov
When the taks that submitted a request is dying, a task work for that request might get run by a kernel thread or even worse by a half dismantled task. We can't just cancel the task work without running the callback as the cmd might need to do some clean up, so pass a flag instead. If set, it's not safe to access any task resources and the callback is expected to cancel the cmd ASAP. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: rename extent map shrinker members from struct btrfs_fs_infoFilipe Manana
The names for the members of struct btrfs_fs_info related to the extent map shrinker are a bit too long, so rename them to be shorter by replacing the "extent_map_" prefix with the "em_" prefix. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: simplify tracking progress for the extent map shrinkerFilipe Manana
Now that the extent map shrinker can only be run by a single task (as a work queue item) there is no need to keep the progress of the shrinker protected by a spinlock and passing the progress to trace events as parameters. So remove the lock and simplify the arguments for the trace events. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: remove unused btrfs_try_tree_write_lock()Dr. David Alan Gilbert
btrfs_try_tree_write_lock() has been unused since commit 50b21d7a066f ("btrfs: submit a writeback bio per extent_buffer"). Remove it as we don't need it anymore. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: qgroups: remove bytenr field from struct btrfs_qgroup_extent_recordFilipe Manana
Now that we track qgroup extent records in a xarray we don't need to have a "bytenr" field in struct btrfs_qgroup_extent_record, since we can get it from the index of the record in the xarray. So remove the field and grab the bytenr from either the index key or any other place where it's available (delayed refs). This reduces the size of struct btrfs_qgroup_extent_record from 40 bytes down to 32 bytes, meaning that we now can store 128 instances of this structure instead of 102 per 4K page. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11printk: Introduce FORCE_CON flagMarcos Paulo de Souza
Introduce FORCE_CON flag to printk. The new flag will make it possible to create a context where printk messages will never be suppressed. This mechanism will be used in the next patch to create a force_con context on sysrq handling, removing an existing workaround on the loglevel global variable. The workaround existed to make sure that sysrq header messages were sent to all consoles, but this doesn't work with deferred messages because the loglevel might be restored to its original value before a console flushes the messages. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20241105-printk-loud-con-v2-1-bd3ecdf7b0e4@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-11-11Merge tag 'v6.12-rc7' into __tmp-hansg-linux-tags_media_atomisp_6_13_1Mauro Carvalho Chehab
Linux 6.12-rc7 * tag 'v6.12-rc7': (1909 commits) Linux 6.12-rc7 filemap: Fix bounds checking in filemap_read() i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set mailmap: add entry for Thorsten Blum ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove() signal: restore the override_rlimit logic fs/proc: fix compile warning about variable 'vmcore_mmap_ops' ucounts: fix counter leak in inc_rlimit_get_ucounts() selftests: hugetlb_dio: check for initial conditions to skip in the start mm: fix docs for the kernel parameter ``thp_anon=`` mm/damon/core: avoid overflow in damon_feed_loop_next_input() mm/damon/core: handle zero schemes apply interval mm/damon/core: handle zero {aggregation,ops_update} intervals mm/mlock: set the correct prev on failure objpool: fix to make percpu slot allocation more robust mm/page_alloc: keep track of free highatomic bcachefs: Fix UAF in __promote_alloc() error path bcachefs: Change OPT_STR max to be 1 less than the size of choices array bcachefs: btree_cache.freeable list fixes bcachefs: check the invalid parameter for perf test ...
2024-11-11uprobes: Re-order struct uprobe_task to save some spaceChristophe JAILLET
On x86_64, with allmodconfig, struct uprobe_task is 72 bytes long, with a hole and some padding. /* size: 72, cachelines: 2, members: 7 */ /* sum members: 64, holes: 1, sum holes: 4 */ /* padding: 4 */ /* forced alignments: 1, forced holes: 1, sum forced holes: 4 */ /* last cacheline: 8 bytes */ Reorder the structure to fill the hole and avoid the padding. This way, the whole structure fits in a single cacheline and some memory is saved when it is allocated. /* size: 64, cachelines: 1, members: 7 */ /* forced alignments: 1 */ Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Link: https://lore.kernel.org/r/a9f541d0cedf421f765c77a1fb93d6a979778a88.1730495562.git.christophe.jaillet@wanadoo.fr
2024-11-11rust: helpers: Avoid raw_spin_lock initialization for PREEMPT_RTEder Zulian
When PREEMPT_RT=y, spin locks are mapped to rt_mutex types, so using spinlock_check() + __raw_spin_lock_init() to initialize spin locks is incorrect, and would cause build errors. Introduce __spin_lock_init() to initialize a spin lock with lockdep rquired information for PREEMPT_RT builds, and use it in the Rust helper. Fixes: d2d6422f8bd1 ("x86: Allow to enable PREEMPT_RT.") Closes: https://lore.kernel.org/oe-kbuild-all/202409251238.vetlgXE9-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Eder Zulian <ezulian@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Tested-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/20241107163223.2092690-2-ezulian@redhat.com
2024-11-11cred: Add a light version of override/revert_creds()Vinicius Costa Gomes
Add a light version of override/revert_creds(), this should only be used when the credentials in question will outlive the critical section and the critical section doesn't change the ->usage of the credentials. Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-11backing-file: clean up the APIMiklos Szeredi
- Pass iocb to ctx->end_write() instead of file + pos - Get rid of ctx->user_file, which is redundant most of the time - Instead pass iocb to backing_file_splice_read and backing_file_splice_write Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-11memcg: add flush tracepointJP Kobryn
This tracepoint gives visibility on how often the flushing of memcg stats occurs and contains info on whether it was forced, skipped, and the value of stats updated. It can help with understanding how readers are affected by having to perform the flush, and the effectiveness of the flush by inspecting the number of stats updated. Paired with the recently added tracepoints for tracing rstat updates, it can also help show correlation where stats exceed thresholds frequently. Link: https://lkml.kernel.org/r/20241029021106.25587-3-inwardvessel@gmail.com Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>