path: root/kernel
Age | Commit message | Author
2022-09-06bpf: Replace __ksize with ksize.Alexei Starovoitov
__ksize() was made private. Use ksize() instead. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-06rv/reactor: add __init/__exit annotations to module init/exit funcsXiu Jianfeng
Add missing __init/__exit annotations to module init/exit funcs. Link: https://lkml.kernel.org/r/20220906141210.132607-1-xiujianfeng@huawei.com Fixes: 135b881ea885 ("rv/reactor: Add the printk reactor") Fixes: e88043c0ac16 ("rv/reactor: Add the panic reactor") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-06tracing: Fix to check event_mutex is held while accessing trigger listMasami Hiramatsu (Google)
Since check_user_trigger() is called outside of the RCU read lock, this list_for_each_entry_rcu() caused a suspicious RCU usage warning. # echo hist:keys=pid > events/sched/sched_stat_runtime/trigger # cat events/sched/sched_stat_runtime/trigger [ 43.167032] [ 43.167418] ============================= [ 43.167992] WARNING: suspicious RCU usage [ 43.168567] 5.19.0-rc5-00029-g19ebe4651abf #59 Not tainted [ 43.169283] ----------------------------- [ 43.169863] kernel/trace/trace_events_trigger.c:145 RCU-list traversed in non-reader section!! ... However, the file->triggers list is safe when it is accessed while event_mutex is held. To fix this warning, add a lockdep_is_held() check to the list_for_each_entry_rcu(). Link: https://lkml.kernel.org/r/166226474977.223837.1992182913048377113.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
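A minimal sketch of the fixed traversal in check_user_trigger(), using the optional cond argument of list_for_each_entry_rcu() (the loop body is abbreviated):

    struct event_trigger_data *data;

    /* file->triggers may be walked either under rcu_read_lock() or with
     * event_mutex held; tell lockdep about the second case explicitly.
     */
    list_for_each_entry_rcu(data, &file->triggers, list,
                            lockdep_is_held(&event_mutex)) {
            /* inspect data as before; unchanged by the fix */
    }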
2022-09-06tracing: hold caller_addr to hardirq_{enable,disable}_ipYipeng Zou
Currently, the argument passed to lockdep_hardirqs_{on,off}() is fixed to CALLER_ADDR0. The function trace_hardirqs_on_caller() is intended to use caller_addr to represent the address that the caller wants to be traced. For example, the lockdep log on riscv shows the last {enabled,disabled} at __trace_hardirqs_{on,off} all the time (when called through these helpers): [ 57.853175] hardirqs last enabled at (2519): __trace_hardirqs_on+0xc/0x14 [ 57.853848] hardirqs last disabled at (2520): __trace_hardirqs_off+0xc/0x14 After using trace_hardirqs_{on,off}_caller(), we get more useful information: [ 53.781428] hardirqs last enabled at (2595): restore_all+0xe/0x66 [ 53.782185] hardirqs last disabled at (2596): ret_from_exception+0xa/0x10 Link: https://lkml.kernel.org/r/20220901104515.135162-2-zouyipeng@huawei.com Cc: stable@vger.kernel.org Fixes: c3bc8fd637a96 ("tracing: Centralize preemptirq tracepoints and unify their usage") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
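A sketch of the substitution in kernel/trace/trace_preemptirq.c (tracepoint bookkeeping elided; treat the surrounding code as illustrative):

    void trace_hardirqs_on_caller(unsigned long caller_addr)
    {
            /* ... tracepoint bookkeeping elided ... */

            /* before: the helper itself was always reported */
            /* lockdep_hardirqs_on(CALLER_ADDR0); */

            /* after: report the call site the caller passed in */
            lockdep_hardirqs_on(caller_addr);
    }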
2022-09-06tracepoint: Allow trace events in modules with TAINT_TESTAlison Schofield
Commit 2852ca7fba9f ("panic: Taint kernel if tests are run") introduced a new taint type, TAINT_TEST, to signal that an in-kernel test module has been loaded. TAINT_TEST taint type defaults into a 'bad_taint' list for kernel tracing and blocks the creation of trace events. This causes a problem for CXL testing where loading the cxl_test module makes all CXL modules out-of-tree, blocking any trace events. Trace events are in development for CXL at the moment and this issue was found in test with v6.0-rc1. Link: https://lkml.kernel.org/r/20220829171048.263065-1-alison.schofield@intel.com Fixes: 2852ca7fba9f7 ("panic: Taint kernel if tests are run") Reported-by: Ira Weiny <ira.weiny@intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
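Roughly what the relaxed taint check in kernel/tracepoint.c looks like after this change (a sketch; the helper is quoted from memory, so read it as illustrative):

    static bool trace_module_has_bad_taint(struct module *mod)
    {
            /* TAINT_TEST is now tolerated alongside the existing exceptions */
            return mod->taints & ~((1 << TAINT_OOT_MODULE) |
                                   (1 << TAINT_CRAP) |
                                   (1 << TAINT_UNSIGNED_MODULE) |
                                   (1 << TAINT_TEST));
    }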
2022-09-06rv/monitors: Make monitor's automata definition staticDaniel Bristot de Oliveira
The monitors' automata definitions are only used locally, so make them static for all existing monitors. Link: https://lore.kernel.org/all/202208210332.gtHXje45-lkp@intel.com Link: https://lore.kernel.org/all/202208210358.6HH3OrVs-lkp@intel.com Link: https://lkml.kernel.org/r/a50e27c3738d6ef809f4201857229fed64799234.1661266564.git.bristot@kernel.org Fixes: ccc319dcb450 ("rv/monitor: Add the wwnr monitor") Fixes: 8812d21219b9 ("rv/monitor: Add the wip monitor skeleton created by dot2k") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-06Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextPaolo Abeni
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-09-05 The following pull-request contains BPF updates for your *net-next* tree. We've added 106 non-merge commits during the last 18 day(s) which contain a total of 159 files changed, 5225 insertions(+), 1358 deletions(-). There are two small merge conflicts, resolve them as follows: 1) tools/testing/selftests/bpf/DENYLIST.s390x Commit 27e23836ce22 ("selftests/bpf: Add lru_bug to s390x deny list") in bpf tree was needed to get BPF CI green on s390x, but it conflicted with newly added tests on bpf-next. Resolve by adding both hunks, result: [...] lru_bug # prog 'printk': failed to auto-attach: -524 setget_sockopt # attach unexpected error: -524 (trampoline) cb_refs # expected error message unexpected error: -524 (trampoline) cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc) htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline) [...] 2) net/core/filter.c Commit 1227c1771dd2 ("net: Fix data-races around sysctl_[rw]mem_(max|default).") from net tree conflicts with commit 29003875bd5b ("bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from bpf-next tree, result: [...] if (getopt) { if (optname == SO_BINDTODEVICE) return -EINVAL; return sk_getsockopt(sk, SOL_SOCKET, optname, KERNEL_SOCKPTR(optval), KERNEL_SOCKPTR(optlen)); } return sk_setsockopt(sk, SOL_SOCKET, optname, KERNEL_SOCKPTR(optval), *optlen); [...] The main changes are: 1) Add any-context BPF specific memory allocator which is useful in particular for BPF tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov. 2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort to reuse the existing core socket code as much as possible, from Martin KaFai Lau. 3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani. 4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo. 5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF integration with the rstat framework, from Yosry Ahmed. 6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev. 7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao. 8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF backend, from James Hilliard. 9) Fix verifier helper permissions and reference state management for synchronous callbacks, from Kumar Kartikeya Dwivedi. 10) Add support for BPF selftest's xskxceiver to also be used against real devices that support MAC loopback, from Maciej Fijalkowski. 11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet. 12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu. 13) Various minor misc improvements all over the place. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits) bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc. bpf: Remove usage of kmem_cache from bpf_mem_cache. bpf: Remove prealloc-only restriction for sleepable bpf programs. bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs. bpf: Remove tracing program restriction on map types bpf: Convert percpu hash map to per-cpu bpf_mem_alloc. bpf: Add percpu allocation support to bpf_mem_alloc. 
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU. bpf: Adjust low/high watermarks in bpf_mem_cache bpf: Optimize call_rcu in non-preallocated hash map. bpf: Optimize element count in non-preallocated hash map. bpf: Relax the requirement to use preallocated hash maps in tracing progs. samples/bpf: Reduce syscall overhead in map_perf_test. selftests/bpf: Improve test coverage of test_maps bpf: Convert hash map to bpf_mem_alloc. bpf: Introduce any context BPF specific memory allocator. selftest/bpf: Add test for bpf_getsockopt() bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt() bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt() bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt() ... ==================== Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-06cgroup: Remove CFTYPE_PRESSURETejun Heo
CFTYPE_PRESSURE is used to flag PSI related files so that they are not created if PSI is disabled during boot. It's a bit weird to use a generic flag to mark a specific file type. Let's instead move the PSI files into their own cftypes array and add/rm them conditionally. This is a bit more code but cleaner. No userland visible changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
2022-09-06cgroup: Improve cftype add/rm error handlingTejun Heo
Track whether a cftype is currently added using a new flag, __CFTYPE_ADDED, so that duplicate add/rm operations can fail safely and empty cftypes arrays are consistently allowed. Signed-off-by: Tejun Heo <tj@kernel.org>
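A simplified sketch of how such a flag can guard against double adds (the real cgroup_init_cftypes() also sets up kernfs ops; the body here is illustrative only):

    static int cgroup_init_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
    {
            struct cftype *cft;

            for (cft = cfts; cft->name[0] != '\0'; cft++) {
                    if (WARN_ON(cft->flags & __CFTYPE_ADDED))
                            return -EBUSY;          /* duplicate add */
                    cft->flags |= __CFTYPE_ADDED;
            }
            return 0;
    }

The rm path does the mirror check: refuse to remove a cftypes array whose entries never had __CFTYPE_ADDED set.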
2022-09-06perf: Use sample_flags for txnKan Liang
Use the new sample_flags to indicate whether the txn field is filled by the PMU driver. Remove the txn field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-7-kan.liang@linux.intel.com
2022-09-06perf: Use sample_flags for data_srcKan Liang
Use the new sample_flags to indicate whether the data_src field is filled by the PMU driver. Remove the data_src field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-6-kan.liang@linux.intel.com
2022-09-06perf: Use sample_flags for weightKan Liang
Use the new sample_flags to indicate whether the weight field is filled by the PMU driver. Remove the weight field from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-5-kan.liang@linux.intel.com
2022-09-06perf: Use sample_flags for branch stackKan Liang
Use the new sample_flags to indicate whether the branch stack is filled by the PMU driver. Remove the br_stack from the perf_sample_data_init() to minimize the number of cache lines touched. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-4-kan.liang@linux.intel.com
2022-09-06perf: Add sample_flags to indicate the PMU-filled sample dataKan Liang
On some platforms, some data, e.g. timestamps, can be retrieved from the PMU driver. Usually, the data from the PMU driver is more accurate. The perf kernel should output the PMU-filled sample data if it's available. To check the availability of the PMU-filled sample data, the current perf kernel initializes the related fields in perf_sample_data_init(). When outputting a sample, perf checks whether the field was updated by the PMU driver. If yes, the updated value is output. If not, perf calculates the value in software, or just outputs the initialized value if no software method is available. With more and more data being provided by the PMU driver, more fields have to be initialized in perf_sample_data_init(). That increases the number of cache lines touched in perf_sample_data_init() and is harmful to performance. Add a new "sample_flags" field to indicate the PMU-filled sample data. The PMU driver should set the corresponding PERF_SAMPLE_ flag when the field is updated. The initialization of the corresponding field is not required anymore. The following patches will make use of it and remove the corresponding fields from perf_sample_data_init(), which will further minimize the number of cache lines touched. Only clear the sample flags that have already been filled by the PMU driver in perf_prepare_sample() for PERF_RECORD_SAMPLE. For the other PERF_RECORD_ event types, the sample data is not available. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-2-kan.liang@linux.intel.com
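A minimal sketch of the scheme, using the branch stack as the example field (the driver-side names follow the x86 LBR code and are abbreviated; the core-side fallback is paraphrased):

    /* PMU driver: fill the field and flag it as PMU-provided */
    data->br_stack = &cpuc->lbr_stack;
    data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;

    /* perf core: only take the SW path when the driver did not
     * already provide the data
     */
    if (!(data->sample_flags & PERF_SAMPLE_BRANCH_STACK)) {
            /* compute or skip the branch stack in software */
    }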
2022-09-06perf/core: Fix reentry problem in perf_output_read_group()Yang Jihong
perf_output_read_group() may respond to an IPI request from another core and invoke the __perf_install_in_context() function. As a result, the hwc configuration is modified, causing inconsistency and unexpected consequences. Interrupts are not disabled when perf_output_read_group() reads the PMU counter. In this case, an IPI request may be received from another core. As a result, the PMU configuration is modified and an error occurs when reading the PMU counter: CPU0 CPU1 __se_sys_perf_event_open perf_install_in_context perf_output_read_group smp_call_function_single for_each_sibling_event(sub, leader) { generic_exec_single if ((sub != event) && remote_function (sub->state == PERF_EVENT_STATE_ACTIVE)) | <enter IPI handler: __perf_install_in_context> <----RAISE IPI-----+ __perf_install_in_context ctx_resched event_sched_out armpmu_del ... hwc->idx = -1; // event->hwc.idx is set to -1 ... <exit IPI> sub->pmu->read(sub); armpmu_read armv8pmu_read_counter armv8pmu_read_hw_counter int idx = event->hw.idx; // idx = -1 u64 val = armv8pmu_read_evcntr(idx); u32 counter = ARMV8_IDX_TO_COUNTER(idx); // invalid counter = 30 read_pmevcntrn(counter) // undefined instruction Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220902082918.179248-1-yangjihong1@huawei.com
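The message frames the race as reading PMU counters with interrupts enabled. A hedged sketch of the kind of guard that addresses it (an illustration of the idea, not necessarily the exact upstream patch):

    static void perf_output_read_group(struct perf_output_handle *handle,
                                       struct perf_event *event,
                                       u64 enabled, u64 running)
    {
            unsigned long flags;

            /* keep IPIs such as __perf_install_in_context from changing
             * hwc state while the leader and sibling counters are read
             */
            local_irq_save(flags);
            /* ... existing reads of leader and sibling events ... */
            local_irq_restore(flags);
    }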
2022-09-05bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.Alexei Starovoitov
User space might be creating and destroying a lot of hash maps. Synchronous rcu_barrier-s in the destruction path of a hash map delay freeing of hash buckets and other map memory and may cause an artificial OOM situation under stress. Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc: - remove rcu_barrier from hash map, since htab doesn't use call_rcu directly and there are no callbacks to wait for. - bpf_mem_alloc has a call_rcu_in_progress flag that indicates pending callbacks. Use it to avoid barriers in the fast path. - When barriers are needed, copy bpf_mem_alloc into a temporary structure and wait for the rcu barrier-s in the worker to let the rest of the hash map freeing proceed. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
2022-09-05bpf: Remove usage of kmem_cache from bpf_mem_cache.Alexei Starovoitov
For bpf_mem_cache based hash maps the following stress test: for (i = 1; i <= 512; i <<= 1) for (j = 1; j <= 1 << 18; j <<= 1) fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, i, j, 2, 0); creates many kmem_cache-s that are not mergeable in debug kernels and consume an unnecessary amount of memory. It turned out that bpf_mem_cache's free_list logic does batching well, so using kmem_cache for fixed size allocations doesn't bring any performance benefit vs normal kmalloc. Hence get rid of kmem_cache in bpf_mem_cache. That saves memory and speeds up map create/destroy operations, while maintaining hash map update/delete performance. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220902211058.60789-16-alexei.starovoitov@gmail.com
2022-09-05bpf: Remove prealloc-only restriction for sleepable bpf programs.Alexei Starovoitov
Since the hash map is now converted to bpf_mem_alloc, and it waits for rcu and rcu_tasks_trace GPs before freeing elements into global memory slabs, it's safe to use dynamically allocated hash maps in sleepable bpf programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-15-alexei.starovoitov@gmail.com
2022-09-05bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.Alexei Starovoitov
Use call_rcu_tasks_trace() to wait for sleepable progs to finish. Then use call_rcu() to wait for normal progs to finish and finally do free_one() on each element when freeing objects into global memory pool. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-14-alexei.starovoitov@gmail.com
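A rough sketch of the callback chaining the commit describes (the struct field and the free_all_elems() helper are abbreviations of what lives in kernel/bpf/memalloc.c and should be read as illustrative):

    static void __free_rcu(struct rcu_head *head)
    {
            struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);

            /* a normal RCU GP has also elapsed: really free the objects,
             * i.e. free_one() on each element as described above
             */
            free_all_elems(c);
    }

    static void __free_rcu_tasks_trace(struct rcu_head *head)
    {
            struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);

            /* sleepable progs are done; now wait for regular RCU readers */
            call_rcu(&c->rcu, __free_rcu);
    }

    /* when returning a batch of objects to the global memory pool */
    call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);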
2022-09-05bpf: Remove tracing program restriction on map typesAlexei Starovoitov
The hash map is now fully converted to bpf_mem_alloc. Its implementation is not allocating synchronously and not calling call_rcu() directly. It's now safe to use non-preallocated hash maps in all types of tracing programs including BPF_PROG_TYPE_PERF_EVENT that runs out of NMI context. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-13-alexei.starovoitov@gmail.com
2022-09-05bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.Alexei Starovoitov
Convert dynamic allocations in the percpu hash map from alloc_percpu() to bpf_mem_cache_alloc() from a per-cpu bpf_mem_alloc. Since bpf_mem_alloc frees objects after an RCU gp, the call_rcu() is removed. pcpu_init_value() now needs to zero-fill per-cpu allocations, since dynamically allocated map elements are now similar to full prealloc: alloc_percpu() is not called inline and the elements are reused from the freelist. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-12-alexei.starovoitov@gmail.com
2022-09-05bpf: Add percpu allocation support to bpf_mem_alloc.Alexei Starovoitov
Extend bpf_mem_alloc to cache a free list of fixed size per-cpu allocations. Once such a cache is created, bpf_mem_cache_alloc() will return per-cpu objects. bpf_mem_cache_free() will free them back into the global per-cpu pool after observing an RCU grace period. The per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps. The free list cache consists of tuples { llist_node, per-cpu pointer }. Unlike alloc_percpu(), which returns a per-cpu pointer, bpf_mem_cache_alloc() returns a pointer to a per-cpu pointer and bpf_mem_cache_free() expects to receive it back. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-11-alexei.starovoitov@gmail.com
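A hedged sketch of the cached-object layout described above (the struct name is hypothetical; the in-tree code open-codes this inside bpf_mem_alloc):

    /* each cached per-cpu object is a { llist_node, per-cpu pointer } pair */
    struct pcpu_cache_obj {
            struct llist_node llnode;   /* links the object into the free list */
            void __percpu *pptr;        /* the actual per-cpu area */
    };

    /* callers receive a pointer to the per-cpu pointer (conceptually
     * &obj->pptr) and must hand the same pointer back to
     * bpf_mem_cache_free(), unlike alloc_percpu()/free_percpu()
     */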
2022-09-05bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.Alexei Starovoitov
SLAB_TYPESAFE_BY_RCU makes kmem_caches non-mergeable and slows down kmem_cache_destroy. All bpf_mem_cache are safe to share across different maps and programs. Convert SLAB_TYPESAFE_BY_RCU to batched call_rcu. This change solves the memory consumption issue, avoids kmem_cache_destroy latency and keeps bpf hash map performance the same. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-10-alexei.starovoitov@gmail.com
2022-09-05bpf: Adjust low/high watermarks in bpf_mem_cacheAlexei Starovoitov
The same low/high watermarks for every bucket in bpf_mem_cache consume a significant amount of memory. Preallocating 64 elements of 4096 bytes each in the free list is not efficient. Make the low/high watermarks and batching value dependent on the element size. This change brings significant memory savings. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-9-alexei.starovoitov@gmail.com
2022-09-05bpf: Optimize call_rcu in non-preallocated hash map.Alexei Starovoitov
Doing call_rcu() a million times a second becomes a bottleneck. Convert the non-preallocated hash map from call_rcu to SLAB_TYPESAFE_BY_RCU. The rcu critical section is no longer observed for one htab element, which makes the non-preallocated hash map behave just like the preallocated hash map. The map elements are released back to kernel memory after observing an rcu critical section. This improves 'map_perf_test 4' performance from 100k events per second to 250k events per second. bpf_mem_alloc + percpu_counter + typesafe_by_rcu provide a 10x performance boost to the non-preallocated hash map and make it within a few % of the preallocated map while consuming a fraction of the memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-8-alexei.starovoitov@gmail.com
2022-09-05bpf: Optimize element count in non-preallocated hash map.Alexei Starovoitov
The atomic_inc/dec might cause extreme cache line bouncing when multiple cpus access the same bpf map. Based on the specified max_entries for the hash map, calculate when percpu_counter becomes faster than atomic_t and use it for such maps. For example, samples/bpf/map_perf_test is using a hash map with max_entries 1000. On a system with 16 cpus the 'map_perf_test 4' shows 14k events per second using atomic_t. On a system with 15 cpus it shows 100k events per second using percpu. map_perf_test is an extreme case where all cpus collide on atomic_t, which causes extreme cache bouncing. Note that the slow path of percpu_counter is 5k events per second vs 14k for atomic, so the heuristic is necessary. See the comment in the code for why the heuristic is based on num_online_cpus(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-7-alexei.starovoitov@gmail.com
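A sketch of the kind of heuristic the commit describes (the exact threshold expression and field names mirror the description here and should be treated as illustrative; the in-tree comment carries the authoritative reasoning):

    /* percpu_counter's slow path is slower than a bare atomic_t, so only
     * switch to it when the map is large enough that most updates stay on
     * the per-cpu fast path
     */
    htab->use_percpu_counter = attr->max_entries / 2 >
                               num_online_cpus() * PERCPU_COUNTER_BATCH;
    if (htab->use_percpu_counter)
            err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL);
    else
            atomic_set(&htab->count, 0);

percpu_counter_init() and PERCPU_COUNTER_BATCH come from <linux/percpu_counter.h>.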
2022-09-05bpf: Relax the requirement to use preallocated hash maps in tracing progs.Alexei Starovoitov
Since the bpf hash map was converted to use bpf_mem_alloc, it is safe to use from tracing programs and in RT kernels. But the per-cpu hash map is still using dynamic allocation for per-cpu map values, hence keep the warning for this map type. In the future alloc_percpu_gfp can be front-end-ed with bpf_mem_cache and this restriction will be completely lifted. perf_event (NMI) bpf programs have to use preallocated hash maps, because free_htab_elem() is using call_rcu which might crash if re-entered. Sleepable bpf programs have to use preallocated hash maps, because the lifetime of the map elements is not protected by rcu_read_lock/unlock. This restriction can be lifted in the future as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-6-alexei.starovoitov@gmail.com
2022-09-05bpf: Convert hash map to bpf_mem_alloc.Alexei Starovoitov
Convert bpf hash map to use bpf memory allocator. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-3-alexei.starovoitov@gmail.com
2022-09-05bpf: Introduce any context BPF specific memory allocator.Alexei Starovoitov
Tracing BPF programs can attach to kprobe and fentry. Hence they run in an unknown context where calling plain kmalloc() might not be safe. Front-end kmalloc() with a minimal per-cpu cache of free elements. Refill this cache asynchronously from irq_work. BPF programs always run with migration disabled. It's safe to allocate from the cache of the current cpu with irqs disabled. Freeing is always done into the bucket of the current cpu as well. irq_work trims extra free elements from buckets with kfree and refills them with kmalloc, so the global kmalloc logic takes care of freeing objects allocated by one cpu and freed on another. struct bpf_mem_alloc supports two modes: - When size != 0 create a kmem_cache and bpf_mem_cache for each cpu. This is the typical bpf hash map use case when all elements have equal size. - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on kmalloc/kfree. Max allocation size is 4096 in this case. This is the bpf_dynptr and bpf_kptr use case. bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree. bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free. The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com
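A hedged sketch of how an in-kernel user of the allocator might look in fixed-size mode (struct my_elem is hypothetical, and later patches in this series extend the init signature, so treat this as illustrative):

    struct bpf_mem_alloc ma;
    struct my_elem *elem;
    int err;

    /* size != 0: one kmem_cache/bpf_mem_cache pair per cpu for this size */
    err = bpf_mem_alloc_init(&ma, sizeof(struct my_elem));
    if (err)
            return err;

    /* any-context allocation and free, usable from tracing programs */
    elem = bpf_mem_cache_alloc(&ma);
    if (elem)
            bpf_mem_cache_free(&ma, elem);

    bpf_mem_alloc_destroy(&ma);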
2022-09-05IB/core: Fix a nested dead lock as part of ODP flowYishai Hadas
Fix a nested deadlock in the ODP flow by using mmput_async(). From the call trace [1] below we can see that calling mmput() once we have the umem_odp->umem_mutex locked, as required by ib_umem_odp_map_dma_and_lock(), might trigger in the same task the exit_mmap()->__mmu_notifier_release()->mlx5_ib_invalidate_range() path, which may deadlock when trying to lock the same mutex. Moving to mmput_async() solves the problem, as the above exit_mmap() flow will be called in another task and will be executed once the lock becomes available. [1] [64843.077665] task:kworker/u133:2 state:D stack: 0 pid:80906 ppid: 2 flags:0x00004000 [64843.077672] Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib] [64843.077719] Call Trace: [64843.077722] <TASK> [64843.077724] __schedule+0x23d/0x590 [64843.077729] schedule+0x4e/0xb0 [64843.077735] schedule_preempt_disabled+0xe/0x10 [64843.077740] __mutex_lock.constprop.0+0x263/0x490 [64843.077747] __mutex_lock_slowpath+0x13/0x20 [64843.077752] mutex_lock+0x34/0x40 [64843.077758] mlx5_ib_invalidate_range+0x48/0x270 [mlx5_ib] [64843.077808] __mmu_notifier_release+0x1a4/0x200 [64843.077816] exit_mmap+0x1bc/0x200 [64843.077822] ? walk_page_range+0x9c/0x120 [64843.077828] ? __cond_resched+0x1a/0x50 [64843.077833] ? mutex_lock+0x13/0x40 [64843.077839] ? uprobe_clear_state+0xac/0x120 [64843.077860] mmput+0x5f/0x140 [64843.077867] ib_umem_odp_map_dma_and_lock+0x21b/0x580 [ib_core] [64843.077931] pagefault_real_mr+0x9a/0x140 [mlx5_ib] [64843.077962] pagefault_mr+0xb4/0x550 [mlx5_ib] [64843.077992] pagefault_single_data_segment.constprop.0+0x2ac/0x560 [mlx5_ib] [64843.078022] mlx5_ib_eqe_pf_action+0x528/0x780 [mlx5_ib] [64843.078051] process_one_work+0x22b/0x3d0 [64843.078059] worker_thread+0x53/0x410 [64843.078065] ? process_one_work+0x3d0/0x3d0 [64843.078073] kthread+0x12a/0x150 [64843.078079] ? set_kthread_struct+0x50/0x50 [64843.078085] ret_from_fork+0x22/0x30 [64843.078093] </TASK> Fixes: 36f30e486dce ("IB/core: Improve ODP to use hmm_range_fault()") Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/74d93541ea533ef7daec6f126deb1072500aeb16.1661251841.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
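The change itself is a small substitution in the ODP page-fault path (surrounding code abbreviated):

    /* before: the final mmput() may run exit_mmap() ->
     * mlx5_ib_invalidate_range() in this task while umem_odp->umem_mutex
     * is already held, deadlocking on the same mutex
     */
    mmput(mm);

    /* after: defer the final mmput work to another task, so any
     * notifier-release path runs once the mutex is free
     */
    mmput_async(mm);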
2022-09-05sched/debug: fix dentry leak in update_sched_domain_debugfsGreg Kroah-Hartman
Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup()) leaks a dentry and with a hotplug stress test, the machine eventually runs out of memory. Fix this up by using the newly created debugfs_lookup_and_remove() call instead which properly handles the dentry reference counting logic. Cc: Major Chen <major.chen@samsung.com> Cc: stable <stable@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Reported-by: Kuyo Chang <kuyo.chang@mediatek.com> Tested-by: Kuyo Chang <kuyo.chang@mediatek.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
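A sketch of the substitution in update_sched_domain_debugfs() (variable names are illustrative); debugfs_lookup_and_remove() drops the extra dentry reference that the open-coded pattern leaked:

    /* before: debugfs_lookup() takes a reference that is never put */
    debugfs_remove(debugfs_lookup(buf, sd_dentry));

    /* after: lookup and removal in one call, reference counted correctly */
    debugfs_lookup_and_remove(buf, sd_dentry);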
2022-09-05Merge 6.0-rc4 into tty-nextGreg Kroah-Hartman
We need the tty/serial fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-04cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity ruleWaiman Long
Currently, changes in "cpuset.cpus" of a partition root are not allowed if they violate the sibling cpu exclusivity rule when the check is done in the validate_change() function. That is inconsistent with the other cpuset changes that are always allowed but may make a partition invalid. Update the cpuset code to allow a cpumask change even if it violates the sibling cpu exclusivity rule, but invalidate the partition instead, just like the other changes. However, other sibling partitions with conflicting cpumasks will also be invalidated in order not to violate the exclusivity rule. This behavior is specific to this partition rule violation. Note that a previous commit has made the sibling cpu exclusivity rule check the last check of validate_change(). So if -EINVAL is returned, we can be sure that a sibling cpu exclusivity rule violation is the only rule that is broken. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Relocate a code block in validate_change()Waiman Long
This patch moves down the exclusive cpu and memory check in validate_change(). There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Show invalid partition reason stringWaiman Long
There are a number of different reasons which can cause a partition to become invalid. A user seeing an invalid partition may not know exactly why. To help users get a better understanding of the underlying reason, the cpuset.cpus.partition control file, when read, will now report the reason why a partition became invalid. When a partition does become invalid, reading the control file will show "root invalid (<reason>)" where <reason> is a string that describes why the partition is invalid. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Add a new isolated cpus.partition typeWaiman Long
Cpuset v1 uses the sched_load_balance control file to determine if load balancing should be enabled. Cpuset v2 gets rid of sched_load_balance as its use may require disabling load balancing at cgroup root. For workloads that require very low latency like DPDK, the latency jitters caused by periodic load balancing may exceed the desired latency limit. When cpuset v2 is in use, the only way to avoid this latency cost is to use the "isolcpus=" kernel boot option to isolate a set of CPUs. After the kernel boot, however, there is no way to add or remove CPUs from this isolated set. For workloads that are more dynamic in nature, that means users have to provision enough CPUs for the worst case situation resulting in excess idle CPUs. To address this issue for cpuset v2, a new cpuset.cpus.partition type "isolated" is added which allows the creation of a cpuset partition without load balancing. This will allow system administrators to dynamically adjust the size of isolated partition to the current need of the workload without rebooting the system. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Relax constraints to partition & cpus changesWaiman Long
Currently, enabling a partition root is only allowed if all the constraints of a valid partition are satisfied. Even changes to "cpuset.cpus" may not be allowed in some cases. Moreover, there are limits to changes made to a parent cpuset if it is a valid partition root. This is contrary to the general cgroup v2 philosophy. This patch relaxes the constraints of changing the state of "cpuset.cpus" and "cpuset.cpus.partition". Now all valid changes ("member" or "root") to "cpuset.cpus.partition" are allowed even if there are child cpusets underneath it. Trying to make a cpuset a partition root, however, will cause its state to become invalid if the following constraints of a valid partition root are not satisfied. 1) The "cpuset.cpus" is non-empty and exclusive. 2) The parent cpuset is a valid partition root. 3) The "cpuset.cpus" overlaps parent's "cpuset.cpus". Similarly, almost all changes to "cpuset.cpus" are allowed with the exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the exclusivity rule will still apply. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effectiveWaiman Long
Currently, a partition root cannot have an empty "cpuset.cpus.effective". As a result, a parent partition root cannot distribute out all its CPUs to child partitions with no CPUs left. However, in most cases there shouldn't be any tasks associated with intermediate nodes of the default hierarchy. So the current rule is too restrictive and can waste valuable CPU resources. To address this issue, we are now allowing a partition to have an empty "cpuset.cpus.effective" as long as it has no tasks. Since cpuset is threaded, the no-internal-process rule does not apply. So it is possible to have tasks in a partition root with child sub-partitions even though that should be uncommon. A parent partition with no tasks can now have all its CPUs distributed out to its child partitions. The top cpuset always has some housekeeping tasks running and so its list of effective cpus can't be empty. Once a partition with an empty "cpuset.cpus.effective" is formed, no new task can be moved into it until "cpuset.cpus.effective" becomes non-empty. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Miscellaneous cleanups & add helper functionsWaiman Long
The partition root state (PRS) macro names do not currently match the external names. Change them to match the external names and add helper functions to read or change the state. Shorten the cpuset argument of update_parent_subparts_cpumask() to cs to match other cpuset functions. Remove the new_prs argument from notify_partition_change() as the cs->partition_root_state has already been set to new_prs before it is called. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Enable update_tasks_cpumask() on top_cpusetWaiman Long
Previously, update_tasks_cpumask() was not supposed to be called with the top cpuset. With a cpuset partition that takes CPUs away from the top cpuset, adjusting the cpus_mask of the tasks in the top cpuset is necessary. Percpu kthreads, however, are ignored. Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup: add pids.peak interface for pids controllerJosh Don
pids.peak tracks the high watermark of usage for number of pids. This helps give a better baseline on which to set pids.max. Polling pids.current isn't really feasible, since it would potentially miss short-lived spikes. This interface is analogous to memory.peak. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup: Remove data-race around cgrp_dfl_visibleTejun Heo
There's a seemingly harmless data-race around cgrp_dfl_visible detected by kernel concurrency sanitizer. Let's remove it by throwing WRITE/READ_ONCE at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Cc: Gabriel Ryan <gabe@cs.columbia.edu> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/netdev/20220819072256.fn7ctciefy4fc4cu@wittgenstein/
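The fix amounts to annotating the racy accesses; a minimal sketch of the pattern (the writer/reader locations are examples, not an exhaustive list):

    /* writer, e.g. when the default hierarchy is first mounted */
    WRITE_ONCE(cgrp_dfl_visible, true);

    /* reader, e.g. when generating /proc/<pid>/cgroup output */
    if (READ_ONCE(cgrp_dfl_visible)) {
            /* include the default-hierarchy entry */
    }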
2022-09-01bpf: Only add BTF IDs for socket security hooks when CONFIG_SECURITY_NETWORK is onHou Tao
When CONFIG_SECURITY_NETWORK is disabled, there will be build warnings from resolve_btfids: WARN: resolve_btfids: unresolved symbol bpf_lsm_socket_socketpair ...... WARN: resolve_btfids: unresolved symbol bpf_lsm_inet_conn_established Fix it by wrapping these BTF ID definitions in CONFIG_SECURITY_NETWORK. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Fixes: 9113d7e48e91 ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220901065126.3856297-1-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
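A sketch of the guard (the in-tree hook list is longer; the two symbols shown are the ones named in the warnings above):

    #ifdef CONFIG_SECURITY_NETWORK
    BTF_ID(func, bpf_lsm_socket_socketpair)
    BTF_ID(func, bpf_lsm_inet_conn_established)
    #endif /* CONFIG_SECURITY_NETWORK */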
2022-09-01audit_init_parent(): constify pathAl Viro
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-09-01bpf: Remove useless else ifJiapeng Chong
The assignment in the else and else if branches is the same, so the else if is redundant; remove it and add a comment to keep the code readable. ./kernel/bpf/cgroup_iter.c:81:6-8: WARNING: possible condition with no effect (if == else). Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2016 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20220831021618.86770-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
tools/testing/selftests/net/.gitignore sort the net-next version and use it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-01bpf: Use this_cpu_{inc_return|dec} for prog->activeHou Tao
Both __this_cpu_inc_return() and __this_cpu_dec() are not preemption safe, and migrate_disable() no longer disables preemption, so the update of prog->active is not atomic and, in theory, recursion prevention may not work under a fully preemptible kernel. Fix this by using the preemption-safe and IRQ-safe variants. Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
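A sketch of the substitution in the trampoline enter/exit helpers (prog->active is a per-cpu counter; the surrounding code is abbreviated):

    /* before: not preemption safe once migrate_disable() stopped
     * disabling preemption
     */
    if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1))
            return 0;       /* recursion detected, skip running the prog */
    /* ... run the prog ... */
    __this_cpu_dec(*(prog->active));

    /* after: this_cpu_* variants are preemption- and IRQ-safe */
    if (unlikely(this_cpu_inc_return(*(prog->active)) != 1))
            return 0;
    /* ... run the prog ... */
    this_cpu_dec(*(prog->active));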
2022-09-01bpf: Use this_cpu_{inc|dec|inc_return} for bpf_task_storage_busyHou Tao
Now that migrate_disable() does not disable preemption, and since on some architectures (e.g. arm64) __this_cpu_{inc|dec|inc_return} are neither preemption-safe nor IRQ-safe, on a fully preemptible kernel concurrent lookups or updates on the same task local storage and on the same CPU may leave bpf_task_storage_busy imbalanced, and bpf_task_storage_trylock() on that cpu will then always fail. Fix it by using this_cpu_{inc|dec|inc_return} when manipulating bpf_task_storage_busy. Fixes: bc235cdb423a ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-01Merge branches 'doc.2022.08.31b', 'fixes.2022.08.31b', 'kvfree.2022.08.31b', 'nocb.2022.09.01a', 'poll.2022.08.31b', 'poll-srcu.2022.08.31b' and 'tasks.2022.08.31b' into HEADPaul E. McKenney
doc.2022.08.31b: Documentation updates fixes.2022.08.31b: Miscellaneous fixes kvfree.2022.08.31b: kvfree_rcu() updates nocb.2022.09.01a: NOCB CPU updates poll.2022.08.31b: Full-oldstate RCU polling grace-period API poll-srcu.2022.08.31b: Polled SRCU grace-period updates tasks.2022.08.31b: Tasks RCU updates
2022-09-01rcutorture: Use the barrier operation specified by cur_opsZqiang
The rcutorture_oom_notify() function unconditionally invokes rcu_barrier(), which is OK when the rcutorture.torture_type value is "rcu", but unhelpful otherwise. The purpose of these barrier calls is to wait for all outstanding callback-flooding callbacks to be invoked before cleaning up their data. Using the wrong barrier function therefore risks arbitrary memory corruption. Thus, this commit changes these rcu_barrier() calls into cur_ops->cb_barrier() to make things work when torturing non-vanilla flavors of RCU. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
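A sketch of the substitution in the OOM notifier (cur_ops is rcutorture's per-flavor operations table; surrounding code abbreviated):

    /* before: only correct when torturing the vanilla "rcu" flavor */
    rcu_barrier();

    /* after: wait for the flavor under test to finish its callbacks */
    cur_ops->cb_barrier();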