summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2022-09-21perf/core: Convert snprintf() to scnprintf()Jules Irenge
Coccinelle reports a warning: WARNING: use scnprintf or sprintf This LWN article explains the rationale for this change: https: //lwn.net/Articles/69419/ Ie. snprintf() returns what *would* be the resulting length, while scnprintf() returns the actual length. Adding to that, there has also been some slow migration from snprintf to scnprintf, here's the shift in usage in the past 3.5 years, in all fs/ files: v5.0 v6.0-rc6 -------------------------------------- snprintf() uses: 63 213 scnprintf() uses: 374 186 No intended change in behavior. [ mingo: Improved the changelog & reviewed the usage sites. ] Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-09-21locking/lockdep: Print more debug information - report name and key when ↵Tetsuo Handa
look_up_lock_class() got confused Printing this information will be helpful: ------------[ cut here ]------------ Looking for class "l2tp_sock" with key l2tp_socket_class, but found a different class "slock-AF_INET6" with the same key WARNING: CPU: 1 PID: 14195 at kernel/locking/lockdep.c:940 look_up_lock_class+0xcc/0x140 Modules linked in: CPU: 1 PID: 14195 Comm: a.out Not tainted 6.0.0-rc6-dirty #863 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 RIP: 0010:look_up_lock_class+0xcc/0x140 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/bd99391e-f787-efe9-5ec6-3c6dc4c587b0@I-love.SAKURA.ne.jp
2022-09-21Merge tag 'v6.0-rc6' into locking/core, to refresh the branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-09-20lib/cpumask: add FORCE_NR_CPUS config optionYury Norov
The size of cpumasks is hard-limited by compile-time parameter NR_CPUS, but defined at boot-time when kernel parses ACPI/DT tables, and stored in nr_cpu_ids. In many practical cases, number of CPUs for a target is known at compile time, and can be provided with NR_CPUS. In that case, compiler may be instructed to rely on NR_CPUS as on actual number of CPUs, not an upper limit. It allows to optimize many cpumask routines and significantly shrink size of the kernel image. This patch adds FORCE_NR_CPUS option to teach the compiler to rely on NR_CPUS and enable corresponding optimizations. If FORCE_NR_CPUS=y, kernel will not set nr_cpu_ids at boot, but only check that the actual number of possible CPUs is equal to NR_CPUS, and WARN if that doesn't hold. The new option is especially useful in embedded applications because kernel configurations are unique for each SoC, the number of CPUs is constant and known well, and memory limitations are typically harder. For my 4-CPU ARM64 build with NR_CPUS=4, FORCE_NR_CPUS=y saves 46KB: add/remove: 3/4 grow/shrink: 46/729 up/down: 652/-46952 (-46300) Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-20Merge tag 'execve-v6.0-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve reverts from Kees Cook: "The recent work to support time namespace unsharing turns out to have some undesirable corner cases, so rather than allowing the API to stay exposed for another release, it'd be best to remove it ASAP, with the replacement getting another cycle of testing. Nothing is known to use this yet, so no userspace breakage is expected. For more details, see: https://lore.kernel.org/lkml/ed418e43ad28b8688cfea2b7c90fce1c@ispras.ru Summary: - Remove the recent 'unshare time namespace on vfork+exec' feature (Andrei Vagin)" * tag 'execve-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: Revert "fs/exec: allow to unshare a time namespace on vfork+exec" Revert "selftests/timens: add a test for vfork+exit"
2022-09-20bpf: Check whether or not node is NULL before free it in free_bulkHou Tao
llnode could be NULL if there are new allocations after the checking of c-free_cnt > c->high_watermark in bpf_mem_refill() and before the calling of __llist_del_first() in free_bulk (e.g. a PREEMPT_RT kernel or allocation in NMI context). And it will incur oops as shown below: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT_RT SMP CPU: 39 PID: 373 Comm: irq_work/39 Tainted: G W 6.0.0-rc6-rt9+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:bpf_mem_refill+0x66/0x130 ...... Call Trace: <TASK> irq_work_single+0x24/0x60 irq_work_run_list+0x24/0x30 run_irq_workd+0x18/0x20 smpboot_thread_fn+0x13f/0x2c0 kthread+0x121/0x140 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Simply fixing it by checking whether or not llnode is NULL in free_bulk(). Fixes: 8d5a8011b35d ("bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220919144811.3570825-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-20sched/psi: export psi_memstall_{enter,leave}Christoph Hellwig
To properly account for all refaults from file system logic, file systems need to call psi_memstall_enter directly, so export it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220915094200.139713-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20swiotlb: don't panic!Robin Murphy
The panics in swiotlb are relics of a bygone era, some of them inadvertently inherited from a memblock refactor, and all of them unnecessary since they are in places that may also fail gracefully anyway. Convert the panics in swiotlb_init_remap() into non-fatal warnings more consistent with the other bail-out paths there and in swiotlb_init_late() (but don't bother trying to roll anything back, since if anything does actually fail that early, the aim is merely to keep going as far as possible to get more diagnostic information out of the inevitably-dying kernel). It's not for SWIOTLB to decide that the system is terminally compromised just because there *might* turn out to be one or more 32-bit devices that might want to make streaming DMA mappings, especially since we already handle the no-buffer case later if it turns out someone did want it. Similarly though, downgrade that panic in swiotlb_tbl_map_single(), since even if we do get to that point it's an overly extreme reaction. It makes little difference to the DMA API caller whether a mapping fails because the buffer is full or because there is no buffer, and once again it's not for SWIOTLB to presume that any particular DMA mapping is so fundamental to the operation of the system that it must be terminal if it could never succeed. Even if the caller handles failure by futilely retrying forever, a single stuck thread is considerably less impactful to the user than a needless panic. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-20swiotlb: replace kmap_atomic() with memcpy_{from,to}_page()Fabio M. De Francesco
The use of kmap_atomic() is being deprecated in favor of kmap_local_page(), which can also be used in atomic context (including interrupts). Replace kmap_atomic() with kmap_local_page(). Instead of open coding mapping, memcpy(), and un-mapping, use the memcpy_{from,to}_page() helper. Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19smp: add set_nr_cpu_ids()Yury Norov
In preparation to support compile-time nr_cpu_ids, add a setter for the variable. This is a no-op for all arches. Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-19smp: don't declare nr_cpu_ids if NR_CPUS == 1Yury Norov
SMP and NR_CPUS are independent options, hence nr_cpu_ids may be declared even if NR_CPUS == 1, which is useless. Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-19genirq: Provide generic_handle_domain_irq_safe().Sebastian Andrzej Siewior
commit 509853f9e1e7b ("genirq: Provide generic_handle_irq_safe()") addressed the problem of demultiplexing interrupt handlers which are force threaded on PREEMPT_RT enabled kernels which means that the demultiplexed handler is invoked with interrupts enabled which triggers a lockdep warning due to a non-irq safe lock acquisition. The same problem exists for the irq domain based interrupt handling via generic_handle_domain_irq() which has been reported against the AMD pin-ctrl driver. Provide generic_handle_domain_irq_safe() which can used from any context. [ tglx: Split the usage sites out and massaged changelog ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/YnkfWFzvusFFktSt@linutronix.de Link: https://bugzilla.kernel.org/show_bug.cgi?id=215954
2022-09-16bpf/btf: Use btf_type_str() whenever possiblePeilin Ye
We have btf_type_str(). Use it whenever possible in btf.c, instead of "btf_kind_str[BTF_INFO_KIND(t->info)]". Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/20220916202800.31421-1-yepeilin.cs@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-16ftrace: Add HAVE_DYNAMIC_FTRACE_NO_PATCHABLEPeter Zijlstra (Intel)
x86 will shortly start using -fpatchable-function-entry for purposes other than ftrace, make sure the __patchable_function_entry section isn't merged in the mcount_loc section. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220903131154.420467-2-jolsa@kernel.org
2022-09-16bpf: use kvmemdup_bpfptr helperWang Yufen
Use kvmemdup_bpfptr helper instead of open-coding to simplify the code. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/1663058433-14089-1-git-send-email-wangyufen@huawei.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-16bpf: Ensure correct locking around vulnerable function find_vpid()Lee Jones
The documentation for find_vpid() clearly states: "Must be called with the tasklist_lock or rcu_read_lock() held." Presently we do neither for find_vpid() instance in bpf_task_fd_query(). Add proper rcu_read_lock/unlock() to fix the issue. Fixes: 41bdc4b40ed6f ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY") Signed-off-by: Lee Jones <lee@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220912133855.1218900-1-lee@kernel.org
2022-09-15locking: Add __sched to semaphore functionsNamhyung Kim
The internal functions are marked with __sched already, let's do the same for external functions too so that we can skip them in the stack trace. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220909000803.4181857-1-namhyung@kernel.org
2022-09-15locking/rwsem: Disable preemption while trying for rwsem lockGokul krishna Krishnakumar
Make the region inside the rwsem_write_trylock non preemptible. We observe RT task is hogging CPU when trying to acquire rwsem lock which was acquired by a kworker task but before the rwsem owner was set. Here is the scenario: 1. CFS task (affined to a particular CPU) takes rwsem lock. 2. CFS task gets preempted by a RT task before setting owner. 3. RT task (FIFO) is trying to acquire the lock, but spinning until RT throttling happens for the lock as the lock was taken by CFS task. This patch attempts to fix the above issue by disabling preemption until owner is set for the lock. While at it also fix the issues at the places where rwsem_{set,clear}_owner() are called. This also adds lockdep annotation of preemption disable in rwsem_{set,clear}_owner() on Peter Z. suggestion. Signed-off-by: Gokul krishna Krishnakumar <quic_gokukris@quicinc.com> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/1662661467-24203-1-git-send-email-quic_mojha@quicinc.com
2022-09-15sched/fair: Move call to list_last_entry() in detach_tasksVincent Guittot
Move the call to list_last_entry() in detach_tasks() after testing loop_max and loop_break. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-4-vincent.guittot@linaro.org
2022-09-15sched/fair: Cleanup loop_max and loop_breakVincent Guittot
sched_nr_migrate_break is set to a fix value and never changes so we can replace it by a define SCHED_NR_MIGRATE_BREAK. Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value of sysctl_sched_nr_migrate which can be init to different values. Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate. The behavior stays unchanged unless you modify sysctl_sched_nr_migrate trough debugfs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org
2022-09-15sched/fair: Make sure to try to detach at least one movable taskVincent Guittot
During load balance, we try at most env->loop_max time to move a task. But it can happen that the loop_max LRU tasks (ie tail of the cfs_tasks list) can't be moved to dst_cpu because of affinity. In this case, loop in the list until we found at least one. The maximum of detached tasks remained the same as before. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org
2022-09-15bpf: Add verifier check for BPF_PTR_POISON retval and argDave Marchevsky
BPF_PTR_POISON was added in commit c0a5a21c25f37 ("bpf: Allow storing referenced kptr in map") to denote a bpf_func_proto btf_id which the verifier will replace with a dynamically-determined btf_id at verification time. This patch adds verifier 'poison' functionality to BPF_PTR_POISON in order to prepare for expanded use of the value to poison ret- and arg-btf_id in ongoing work, namely rbtree and linked list patchsets [0, 1]. Specifically, when the verifier checks helper calls, it assumes that BPF_PTR_POISON'ed ret type will be replaced with a valid type before - or in lieu of - the default ret_btf_id logic. Similarly for arg btf_id. If poisoned btf_id reaches default handling block for either, consider this a verifier internal error and fail verification. Otherwise a helper w/ poisoned btf_id but no verifier logic replacing the type will cause a crash as the invalid pointer is dereferenced. Also move BPF_PTR_POISON to existing include/linux/posion.h header and remove unnecessary shift. [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com [1]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220912154544.1398199-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-13Revert "fs/exec: allow to unshare a time namespace on vfork+exec"Andrei Vagin
This reverts commit 133e2d3e81de5d9706cab2dd1d52d231c27382e5. Alexey pointed out a few undesirable side effects of the reverted change. First, it doesn't take into account that CLONE_VFORK can be used with CLONE_THREAD. Second, a child process doesn't enter a target time name-space, if its parent dies before the child calls exec. It happens because the parent clears vfork_done. Eric W. Biederman suggests installing a time namespace as a task gets a new mm. It includes all new processes cloned without CLONE_VM and all tasks that call exec(). This is an user API change, but we think there aren't users that depend on the old behavior. It is too late to make such changes in this release, so let's roll back this patch and introduce the right one in the next release. Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Christian Brauner <brauner@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220913102551.1121611-3-avagin@google.com
2022-09-13perf/bpf: Always use perf callchains if existNamhyung Kim
If the perf_event has PERF_SAMPLE_CALLCHAIN, BPF can use it for stack trace. The problematic cases like PEBS and IBS already handled in the PMU driver and they filled the callchain info in the sample data. For others, we can call perf_callchain() before the BPF handler. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220908214104.3851807-2-namhyung@kernel.org
2022-09-13perf: Use sample_flags for callchainNamhyung Kim
So that it can call perf_callchain() only if needed. Historically it used __PERF_SAMPLE_CALLCHAIN_EARLY but we can do that with sample_flags in the struct perf_sample_data. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220908214104.3851807-1-namhyung@kernel.org
2022-09-12Merge 6.0-rc5 into driver-core-nextGreg Kroah-Hartman
We need the driver core and debugfs changes in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-11latencytop: use the last element of latency_record of systemwuchi
In account_global_scheduler_latency(), when we don't find the matching latency_record we try to select one which is unused in latency_record[MAXLR], but the condition will skip the last one. if (i >= MAXLR-1) Fix that. Link: https://lkml.kernel.org/r/20220903135233.5225-1-wuchi.zero@gmail.com Signed-off-by: wuchi <wuchi.zero@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11kernel/utsname_sysctl.c: print kernel archPetr Vorel
Print the machine hardware name (UTS_MACHINE) in /proc/sys/kernel/arch. This helps people who debug kernel with initramfs with minimal environment (i.e. without coreutils or even busybox) or allow to open sysfs file instead of run 'uname -m' in high level languages. Link: https://lkml.kernel.org/r/20220901194403.3819-1-pvorel@suse.cz Signed-off-by: Petr Vorel <pvorel@suse.cz> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: David Sterba <dsterba@suse.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11kernel/profile.c: simplify duplicated code in profile_setup()wuchi
The code to parse option string "schedule/sleep/kvm" of cmdline in function profile_setup is redundant, so simplify that. Link: https://lkml.kernel.org/r/20220901003121.53597-1-wuchi.zero@gmail.com Signed-off-by: wuchi <wuchi.zero@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11fail_function: fix wrong use of fei_attr_remove()Yang Yingliang
If register_kprobe() fails, the new attr is not added to the list yet, so it should call fei_attr_free() intstead. Link: https://lkml.kernel.org/r/20220826073337.2085798-3-yangyingliang@huawei.com Fixes: 4b1a29a7f542 ("error-injection: Support fault injection framework") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11fail_function: refactor code of checking return value of register_kprobe()Yang Yingliang
Refactor the error handling of register_kprobe() to improve readability. No functional change. Link: https://lkml.kernel.org/r/20220826073337.2085798-2-yangyingliang@huawei.com Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11fail_function: switch to memdup_user_nul() helperYang Yingliang
Use memdup_user_nul() helper instead of open-coding to simplify the code. Link: https://lkml.kernel.org/r/20220826073337.2085798-1-yangyingliang@huawei.com Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11smpboot: use atomic_try_cmpxchg in cpu_wait_death and cpu_report_deathUros Bizjak
Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old in cpu_wait_death and cpu_report_death. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended. Link: https://lkml.kernel.org/r/20220825145603.5811-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11task_work: use try_cmpxchg in task_work_add, task_work_cancel_match and ↵Uros Bizjak
task_work_run Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in task_work_add, task_work_cancel_match and task_work_run. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. The patch avoids extra memory read in case cmpxchg fails. Link: https://lkml.kernel.org/r/20220823152632.4517-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11kexec: replace kmap() with kmap_local_page()Fabio M. De Francesco
kmap() is being deprecated in favor of kmap_local_page(). There are two main problems with kmap(): (1) It comes with an overhead as mapping space is restricted and protected by a global lock for synchronization and (2) it also requires global TLB invalidation when the kmap's pool wraps and it might block when the mapping space is fully utilized until a slot becomes available. With kmap_local_page() the mappings are per thread, CPU local, can take page faults, and can be called from any context (including interrupts). It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore, the tasks can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. Since its use in kexec_core.c is safe everywhere, it should be preferred. Therefore, replace kmap() with kmap_local_page() in kexec_core.c. Tested on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with HIGHMEM64GB enabled. Link: https://lkml.kernel.org/r/20220821182519.9483-1-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11kernel: exit: cleanup release_thread()Kefeng Wang
Only x86 has own release_thread(), introduce a new weak release_thread() function to clean empty definitions in other ARCHs. Link: https://lkml.kernel.org/r/20220819014406.32266-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Brian Cain <bcain@quicinc.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Stafford Horne <shorne@gmail.com> [openrisc] Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Huacai Chen <chenhuacai@kernel.org> [LoongArch] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> [csky] Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11panic, kexec: make __crash_kexec() NMI safeValentin Schneider
Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI panic() doesn't work. The cause of that lies in the PREEMPT_RT definition of mutex_trylock(): if (IS_ENABLED(CONFIG_DEBUG_RT_MUTEXES) && WARN_ON_ONCE(!in_task())) return 0; This prevents an nmi_panic() from executing the main body of __crash_kexec() which does the actual kexec into the kdump kernel. The warning and return are explained by: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context") [...] The reasons for this are: 1) There is a potential deadlock in the slowpath 2) Another cpu which blocks on the rtmutex will boost the task which allegedly locked the rtmutex, but that cannot work because the hard/softirq context borrows the task context. Furthermore, grabbing the lock isn't NMI safe, so do away with kexec_mutex and replace it with an atomic variable. This is somewhat overzealous as *some* callsites could keep using a mutex (e.g. the sysfs-facing ones like crash_shrink_memory()), but this has the benefit of involving a single unified lock and preventing any future NMI-related surprises. Tested by triggering NMI panics via: $ echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi $ echo 1 > /proc/sys/kernel/unknown_nmi_panic $ echo 1 > /proc/sys/kernel/panic $ ipmitool power diag Link: https://lkml.kernel.org/r/20220630223258.4144112-3-vschneid@redhat.com Fixes: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context") Signed-off-by: Valentin Schneider <vschneid@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11kexec: turn all kexec_mutex acquisitions into trylocksValentin Schneider
Patch series "kexec, panic: Making crash_kexec() NMI safe", v4. This patch (of 2): Most acquistions of kexec_mutex are done via mutex_trylock() - those were a direct "translation" from: 8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()") there have however been two additions since then that use mutex_lock(): crash_get_memory_size() and crash_shrink_memory(). A later commit will replace said mutex with an atomic variable, and locking operations will become atomic_cmpxchg(). Rather than having those mutex_lock() become while (atomic_cmpxchg(&lock, 0, 1)), turn them into trylocks that can return -EBUSY on acquisition failure. This does halve the printable size of the crash kernel, but that's still neighbouring 2G for 32bit kernels which should be ample enough. Link: https://lkml.kernel.org/r/20220630223258.4144112-1-vschneid@redhat.com Link: https://lkml.kernel.org/r/20220630223258.4144112-2-vschneid@redhat.com Signed-off-by: Valentin Schneider <vschneid@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11memory tiering: adjust hot threshold automaticallyHuang Ying
The promotion hot threshold is workload and system configuration dependent. So in this patch, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. If the hint page fault latency of a page is less than the hot threshold, we will try to promote the page, and the page is called the candidate promotion page. If the number of the candidate promotion pages in the statistics interval is much more than the promotion rate limit, the hot threshold will be decreased to reduce the number of the candidate promotion pages. Otherwise, the hot threshold will be increased to increase the number of the candidate promotion pages. To make the above method works, in each statistics interval, the total number of the pages to check (on which the hint page faults occur) and the hot/cold distribution need to be stable. Because the page tables are scanned linearly in NUMA balancing, but the hot/cold distribution isn't uniform along the address usually, the statistics interval should be larger than the NUMA balancing scan period. So in the patch, the max scan period is used as statistics interval and it works well in our tests. Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11memory tiering: rate limit NUMA migration throughputHuang Ying
In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patch as the page promotion rate limit mechanism. The number of the candidate pages to be promoted to the fast memory node via NUMA balancing is counted, if the count exceeds the limit specified by the users, the NUMA balancing promotion will be stopped until the next second. A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added for the users to specify the limit. Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11memory tiering: hot page selection with hint page fault latencyHuang Ying
Patch series "memory tiering: hot page selection", v4. To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory nodes need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). So in this patchset, we implement a new hot page identification algorithm based on the latency between NUMA balancing page table scanning and hint page fault. Which is a kind of mostly frequently accessed (MFU) algorithm. In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patchset as the page promotion rate limit mechanism. The promotion hot threshold is workload and system configuration dependent. So in this patchset, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. We used the pmbench memory accessing benchmark tested the patchset on a 2-socket server system with DRAM and PMEM installed. The test results are as follows, pmbench score promote rate (accesses/s) MB/s ------------- ------------ base 146887704.1 725.6 hot selection 165695601.2 544.0 rate limit 162814569.8 165.2 auto adjustment 170495294.0 136.9 From the results above, With hot page selection patch [1/3], the pmbench score increases about 12.8%, and promote rate (overhead) decreases about 25.0%, compared with base kernel. With rate limit patch [2/3], pmbench score decreases about 1.7%, and promote rate decreases about 69.6%, compared with hot page selection patch. With threshold auto adjustment patch [3/3], pmbench score increases about 4.7%, and promote rate decrease about 17.1%, compared with rate limit patch. Baolin helped to test the patchset with MySQL on a machine which contains 1 DRAM node (30G) and 1 PMEM node (126G). sysbench /usr/share/sysbench/oltp_read_write.lua \ ...... --tables=200 \ --table-size=1000000 \ --report-interval=10 \ --threads=16 \ --time=120 The tps can be improved about 5%. This patch (of 3): To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory node need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). The most frequently accessed (MFU) algorithm is better. So, in this patch we implemented a better hot page selection algorithm. Which is based on NUMA balancing page table scanning and hint page fault as follows, - When the page tables of the processes are scanned to change PTE/PMD to be PROT_NONE, the current time is recorded in struct page as scan time. - When the page is accessed, hint page fault will occur. The scan time is gotten from the struct page. And The hint page fault latency is defined as hint page fault time - scan time The shorter the hint page fault latency of a page is, the higher the probability of their access frequency to be higher. So the hint page fault latency is a better estimation of the page hot/cold. It's hard to find some extra space in struct page to hold the scan time. Fortunately, we can reuse some bits used by the original NUMA balancing. NUMA balancing uses some bits in struct page to store the page accessing CPU and PID (referring to page_cpupid_xchg_last()). Which is used by the multi-stage node selection algorithm to avoid to migrate pages shared accessed by the NUMA nodes back and forth. But for pages in the slow memory node, even if they are shared accessed by multiple NUMA nodes, as long as the pages are hot, they need to be promoted to the fast memory node. So the accessing CPU and PID information are unnecessary for the slow memory pages. We can reuse these bits in struct page to record the scan time. For the fast memory pages, these bits are used as before. For the hot threshold, the default value is 1 second, which works well in our performance test. All pages with hint page fault latency < hot threshold will be considered hot. It's hard for users to determine the hot threshold. So we don't provide a kernel ABI to set it, just provide a debugfs interface for advanced users to experiment. We will continue to work on a hot threshold automatic adjustment mechanism. The downside of the above method is that the response time to the workload hot spot changing may be much longer. For example, - A previous cold memory area becomes hot - The hint page fault will be triggered. But the hint page fault latency isn't shorter than the hot threshold. So the pages will not be promoted. - When the memory area is scanned again, maybe after a scan period, the hint page fault latency measured will be shorter than the hot threshold and the pages will be promoted. To mitigate this, if there are enough free space in the fast memory node, the hot threshold will not be used, all pages will be promoted upon the hint page fault for fast response. Thanks Zhong Jiang reported and tested the fix for a bug when disabling memory tiering mode dynamically. Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: osalvador <osalvador@suse.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-10bpf: Add verifier support for custom callback return rangeDave Marchevsky
Verifier logic to confirm that a callback function returns 0 or 1 was added in commit 69c087ba6225b ("bpf: Add bpf_for_each_map_elem() helper"). At the time, callback return value was only used to continue or stop iteration. In order to support callbacks with a broader return value range, such as those added in rbtree series[0] and others, add a callback_ret_range to bpf_func_state. Verifier's helpers which set in_callback_fn will also set the new field, which the verifier will later use to check return value bounds. Default to tnum_range(0, 0) instead of using tnum_unknown as a sentinel value as the latter would prevent the valid range (0, U64_MAX) being used. Previous global default tnum_range(0, 1) is explicitly set for extant callback helpers. The change to global default was made after discussion around this patch in rbtree series [1], goal here is to make it more obvious that callback_ret_range should be explicitly set. [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com/ [1]: lore.kernel.org/bpf/20220830172759.4069786-2-davemarchevsky@fb.com/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220908230716.2751723-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: btf: fix truncated last_member_type_id in btf_struct_resolveLorenz Bauer
When trying to finish resolving a struct member, btf_struct_resolve saves the member type id in a u16 temporary variable. This truncates the 32 bit type id value if it exceeds UINT16_MAX. As a result, structs that have members with type ids > UINT16_MAX and which need resolution will fail with a message like this: [67414] STRUCT ff_device size=120 vlen=12 effect_owners type_id=67434 bits_offset=960 Member exceeds struct_size Fix this by changing the type of last_member_type_id to u32. Fixes: a0791f0df7d2 ("bpf: fix BTF limits") Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Lorenz Bauer <oss@lmb.io> Link: https://lore.kernel.org/r/20220910110120.339242-1-oss@lmb.io Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: Export btf_type_by_id() and bpf_log()Daniel Xu
These symbols will be used in nf_conntrack.ko to support direct writes to `nf_conn`. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/3c98c19dc50d3b18ea5eca135b4fc3a5db036060.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: Remove duplicate PTR_TO_BTF_ID RO checkDaniel Xu
Since commit 27ae7997a661 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS") there has existed bpf_verifier_ops:btf_struct_access. When btf_struct_access is _unset_ for a prog type, the verifier runs the default implementation, which is to enforce read only: if (env->ops->btf_struct_access) { [...] } else { if (atype != BPF_READ) { verbose(env, "only read is supported\n"); return -EACCES; } [...] } When btf_struct_access is _set_, the expectation is that btf_struct_access has full control over accesses, including if writes are allowed. Rather than carve out an exception for each prog type that may write to BTF ptrs, delete the redundant check and give full control to btf_struct_access. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/962da2bff1238746589e332ff1aecc49403cd7ce.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: Simplify code by using for_each_cpu_wrap()Punit Agrawal
In the percpu freelist code, it is a common pattern to iterate over the possible CPUs mask starting with the current CPU. The pattern is implemented using a hand rolled while loop with the loop variable increment being open-coded. Simplify the code by using for_each_cpu_wrap() helper to iterate over the possible cpus starting with the current CPU. As a result, some of the special-casing in the loop also gets simplified. No functional change intended. Signed-off-by: Punit Agrawal <punit.agrawal@bytedance.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20220907155746.1750329-1-punit.agrawal@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: add missing percpu_counter_destroy() in htab_map_alloc()Tetsuo Handa
syzbot is reporting ODEBUG bug in htab_map_alloc() [1], for commit 86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated hash map.") added percpu_counter_init() to htab_map_alloc() but forgot to add percpu_counter_destroy() to the error path. Link: https://syzkaller.appspot.com/bug?extid=5d1da78b375c3b5e6c2b [1] Reported-by: syzbot <syzbot+5d1da78b375c3b5e6c2b@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated hash map.") Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/e2e4cc0e-9d36-4ca1-9bfa-ce23e6f8310b@I-love.SAKURA.ne.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10Merge tag 'dma-mapping-6.0-2022-09-10' of ↵Linus Torvalds
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - revert a panic on swiotlb initialization failure (Yu Zhao) - fix the lookup for partial syncs in dma-debug (Robin Murphy) - fix a shift overflow in swiotlb (Chao Gao) - fix a comment typo in swiotlb (Chao Gao) - mark a function static now that all abusers are gone (Christoph Hellwig) * tag 'dma-mapping-6.0-2022-09-10' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: mark dma_supported static swiotlb: fix a typo swiotlb: avoid potential left shift overflow dma-debug: improve search for partial syncs Revert "swiotlb: panic if nslabs is too small"
2022-09-09Merge tag 'driver-core-6.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are some small driver core and debugfs fixes for 6.0-rc5. Included in here are: - multiple attempts to get the arch_topology code to work properly on non-cluster SMT systems. First attempt caused build breakages in linux-next and 0-day, second try worked. - debugfs fixes for a long-suffering memory leak. The pattern of debugfs_remove(debugfs_lookup(...)) turns out to leak dentries, so add debugfs_lookup_and_remove() to fix this problem. Also fix up the scheduler debug code that highlighted this problem. Fixes for other subsystems will be trickling in over the next few months for this same issue once the debugfs function is merged. All of these have been in linux-next since Wednesday with no reported problems" * tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: arch_topology: Make cluster topology span at least SMT CPUs sched/debug: fix dentry leak in update_sched_domain_debugfs debugfs: add debugfs_lookup_and_remove() driver core: fix driver_set_override() issue with empty strings Revert "arch_topology: Make cluster topology span at least SMT CPUs" arch_topology: Make cluster topology span at least SMT CPUs
2022-09-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe: "Many bug fixes in several drivers: - Fix misuse of the DMA API in rtrs - Several irdma issues: hung task due to SQ flushing, incorrect capability reporting to userspace, improper error handling for MW corners, touching an uninitialized SGL for during invalidation. - hns was using the wrong page size limits for the HW, an incorrect calculation of wqe_shift causing WQE corruption, and mis computed a timer id. - Fix a crash in SRP triggered by blktests - Fix compiler errors by calling virt_to_page() with the proper type in siw - Userspace triggerable deadlock in ODP - mlx5 could use the wrong profile due to some driver loading races, counters were not working in some device configurations, and a crash on error unwind" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/irdma: Report RNR NAK generation in device caps RDMA/irdma: Use s/g array in post send only when its valid RDMA/irdma: Return correct WC error for bind operation failure RDMA/irdma: Return error on MR deregister CQP failure RDMA/irdma: Report the correct max cqes from query device MAINTAINERS: Update maintainers of HiSilicon RoCE RDMA/mlx5: Fix UMR cleanup on error flow of driver init RDMA/mlx5: Set local port to one when accessing counters RDMA/mlx5: Rely on RoCE fw cap instead of devlink when setting profile IB/core: Fix a nested dead lock as part of ODP flow RDMA/siw: Pass a pointer to virt_to_page() RDMA/srp: Set scmnd->result only when scmnd is not NULL RDMA/hns: Remove the num_qpc_timer variable RDMA/hns: Fix wrong fixed value of qp->rq.wqe_shift RDMA/hns: Fix supported page size RDMA/cma: Fix arguments order in net device validation RDMA/irdma: Fix drain SQ hang with no completion RDMA/rtrs-srv: Pass the correct number of entries for dma mapped SGL RDMA/rtrs-clt: Use the right sg_cnt after ib_dma_map_sg