path: root/kernel
Age | Commit message | Author
2020-02-20 | timer: Use hlist_unhashed_lockless() in timer_pending() | Eric Dumazet
The timer_pending() function is mostly used in lockless contexts, so without proper annotations, KCSAN might detect a data-race [1]. Using hlist_unhashed_lockless() instead of hand-coding it seems appropriate (as suggested by Paul E. McKenney).
[1] BUG: KCSAN: data-race in del_timer / detach_if_pending
write to 0xffff88808697d870 of 8 bytes by task 10 on cpu 0:
 __hlist_del include/linux/list.h:764 [inline]
 detach_timer kernel/time/timer.c:815 [inline]
 detach_if_pending+0xcd/0x2d0 kernel/time/timer.c:832
 try_to_del_timer_sync+0x60/0xb0 kernel/time/timer.c:1226
 del_timer_sync+0x6b/0xa0 kernel/time/timer.c:1365
 schedule_timeout+0x2d2/0x6e0 kernel/time/timer.c:1896
 rcu_gp_fqs_loop+0x37c/0x580 kernel/rcu/tree.c:1639
 rcu_gp_kthread+0x143/0x230 kernel/rcu/tree.c:1799
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
read to 0xffff88808697d870 of 8 bytes by task 12060 on cpu 1:
 del_timer+0x3b/0xb0 kernel/time/timer.c:1198
 sk_stop_timer+0x25/0x60 net/core/sock.c:2845
 inet_csk_clear_xmit_timers+0x69/0xa0 net/ipv4/inet_connection_sock.c:523
 tcp_clear_xmit_timers include/net/tcp.h:606 [inline]
 tcp_v4_destroy_sock+0xa3/0x3f0 net/ipv4/tcp_ipv4.c:2096
 inet_csk_destroy_sock+0xf4/0x250 net/ipv4/inet_connection_sock.c:836
 tcp_close+0x6f3/0x970 net/ipv4/tcp.c:2497
 inet_release+0x86/0x100 net/ipv4/af_inet.c:427
 __sock_release+0x85/0x160 net/socket.c:590
 sock_close+0x24/0x30 net/socket.c:1268
 __fput+0x1e1/0x520 fs/file_table.c:280
 ____fput+0x1f/0x30 fs/file_table.c:313
 task_work_run+0xf6/0x130 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_usermode_loop+0x2b4/0x2c0 arch/x86/entry/common.c:163
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 12060 Comm: syz-executor.5 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ paulmck: Pulled in Eric's later amendments. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
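The lockless check described in this entry can be sketched in userspace roughly as follows. This is only an illustration, assuming GCC or Clang for `__typeof__`; every `sketch_`/`SKETCH_` name is a hypothetical stand-in and not the kernel's own definition.

```c
#include <stddef.h>

/* Userspace approximation of a marked load: exactly one volatile access. */
#define SKETCH_READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for a hash-list node. */
struct sketch_hlist_node {
	struct sketch_hlist_node *next;
	struct sketch_hlist_node **pprev; /* NULL while the node is on no list */
};

/* Hypothetical stand-in for a timer that embeds a list node. */
struct sketch_timer {
	struct sketch_hlist_node entry;
};

/* Lockless "is this node off-list?" test using a single marked load. */
static int sketch_hlist_unhashed_lockless(const struct sketch_hlist_node *h)
{
	return !SKETCH_READ_ONCE(h->pprev);
}

/* A timer counts as pending while its node is linked into some list. */
static int sketch_timer_pending(const struct sketch_timer *t)
{
	return !sketch_hlist_unhashed_lockless(&t->entry);
}
```

The point of routing the check through one helper is that the annotation lives in a single place instead of being hand-coded at each call site.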
2020-02-20 | rcu: Add *_ONCE() to rcu_node ->boost_kthread_status | Paul E. McKenney
The rcu_node structure's ->boost_kthread_status field is accessed locklessly, so this commit causes all updates to use WRITE_ONCE() and all reads to use READ_ONCE(). This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Add *_ONCE() to rcu_data ->rcu_forced_tick | Paul E. McKenney
The rcu_data structure's ->rcu_forced_tick field is read locklessly, so this commit adds WRITE_ONCE() to all updates and READ_ONCE() to all lockless reads. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Add READ_ONCE() to rcu_data ->gpwrap | Paul E. McKenney
The rcu_data structure's ->gpwrap field is read locklessly, and so this commit adds the required READ_ONCE() to a pair of loads in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Fix typos in file-header comments | SeongJae Park
Convert to plural and add a note that this is for Tree RCU. Signed-off-by: SeongJae Park <sjpark@amazon.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Add *_ONCE() for grace-period progress indicators | Paul E. McKenney
The various RCU structures' ->gp_seq, ->gp_seq_needed, ->gp_req_activity, and ->gp_activity fields are read locklessly, so they must be updated with WRITE_ONCE() and, when read locklessly, with READ_ONCE(). This commit makes these changes. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
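The READ_ONCE()/WRITE_ONCE() pairing that this and the neighbouring entries describe can be approximated in userspace as below. This is a sketch under stated assumptions: the macros are simplified volatile-cast stand-ins (GCC or Clang `__typeof__`), and `sketch_rcu_state` with its single counter is hypothetical.

```c
/* Simplified userspace stand-ins for marked accesses, for illustration only. */
#define SKETCH_READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical structure holding one progress counter. */
struct sketch_rcu_state {
	unsigned long gp_seq;
};

/* Updater: marks the store so that it pairs with lockless readers. */
static void sketch_advance_gp(struct sketch_rcu_state *s)
{
	SKETCH_WRITE_ONCE(s->gp_seq, s->gp_seq + 1);
}

/* Lockless reader: asks the compiler for exactly one load of the field. */
static unsigned long sketch_peek_gp(const struct sketch_rcu_state *s)
{
	return SKETCH_READ_ONCE(s->gp_seq);
}
```

Marking both sides also documents, at the access itself, that the field is shared without a lock.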
2020-02-20 | rcu: Add READ_ONCE() to rcu_segcblist ->tails[] | Paul E. McKenney
The rcu_segcblist structure's ->tails[] array entries are read locklessly, so this commit adds the READ_ONCE() to a load in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | locking/rtmutex: rcu: Add WRITE_ONCE() to rt_mutex ->owner | Paul E. McKenney
The rt_mutex structure's ->owner field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org>
2020-02-20 | rcu: Add WRITE_ONCE() to rcu_node ->qsmaskinitnext | Paul E. McKenney
The rcu_node structure's ->qsmaskinitnext field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely for systems not doing incessant CPU-hotplug operations. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Add WRITE_ONCE() to rcu_state ->gp_req_activity | Paul E. McKenney
The rcu_state structure's ->gp_req_activity field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Add READ_ONCE() to rcu_node ->gp_seq | Paul E. McKenney
The rcu_node structure's ->gp_seq field is read locklessly, so this commit adds the READ_ONCE() to several loads in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Not appropriate for backporting because this affects only tracing and warnings. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Add WRITE_ONCE to rcu_node ->exp_seq_rq store | Paul E. McKenney
The rcu_node structure's ->exp_seq_rq field is read locklessly, so this commit adds the WRITE_ONCE() to a store in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Add WRITE_ONCE() to rcu_node ->qsmask update | Paul E. McKenney
The rcu_node structure's ->qsmask field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Provide debug symbols and line numbers in KCSAN runs | Paul E. McKenney
This commit adds "-g -fno-omit-frame-pointer" to ease interpretation of KCSAN output, but only for CONFIG_KCSAN=y kernels. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Fix exp_funnel_lock()/rcu_exp_wait_wake() datarace | Paul E. McKenney
The rcu_node structure's ->exp_seq_rq field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocation where it was missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | rcu: Warn on for_each_leaf_node_cpu_mask() from non-leaf | Paul E. McKenney
The for_each_leaf_node_cpu_mask() and for_each_leaf_node_possible_cpu() macros must be invoked only on leaf rcu_node structures. Failing to abide by this restriction can result in infinite loops on systems with more than 64 CPUs (or for more than 32 CPUs on 32-bit systems). This commit therefore adds WARN_ON_ONCE() calls to make misuse of these two macros easier to debug. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 | bootconfig: Set CONFIG_BOOT_CONFIG=n by default | Masami Hiramatsu
Set CONFIG_BOOT_CONFIG=n by default. This also warns user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given in the kernel command line. Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2 Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 | tracing: Clear trace_state when starting trace | Masami Hiramatsu
Clear the trace_state data structure when starting a trace in the internal __synth_event_trace_start() function. Currently trace_state is initialized only in the synth_event_trace_start() API, but the trace_state instances in synth_event_trace() and synth_event_trace_array() are on the stack without initialization. This means those APIs will see wrong parameters and will skip the closing process in __synth_event_trace_end() because trace_state->disabled may be !0. Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2 Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
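The hazard of an uninitialized on-stack state, and the fix of clearing it in the common start routine, can be sketched as follows. The structure layout and function name here are hypothetical; only the idea of zeroing the state at the single shared entry point comes from the entry above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a per-trace state object kept on the stack. */
struct sketch_trace_state {
	bool disabled;  /* if this holds stack garbage, later steps get skipped */
	int  cur_field;
};

/*
 * Common start routine: clearing the state here covers every caller,
 * including those that declare the state on their own stack.
 */
static void sketch_trace_start(struct sketch_trace_state *st)
{
	memset(st, 0, sizeof(*st));
}
```

Zeroing in the shared helper, instead of in each public wrapper, means a newly added caller cannot forget the initialization.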
2020-02-20 | tracing: Disable trace_printk() on post poned tests | Steven Rostedt (VMware)
The tracing selftests check various aspects of the tracing infrastructure, and one is filtering. If trace_printk() is active during a selftest, it can cause the filtering to fail, which will disable that part of the trace. To keep the selftests from failing because of trace_printk() calls, trace_printk() checks the variable tracing_selftest_running, and if set, it does not write to the tracing buffer. As some tracers were registered earlier in boot, the selftest they triggered would fail because not all the infrastructure was set up for the full selftest. Thus, some of the tests were postponed to when their infrastructure was ready (namely file system code). The postpone code did not set the tracing_selftest_running variable, and could fail if a trace_printk() was added and executed during their run. Cc: stable@vger.kernel.org Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 | tracing: Have synthetic event test use raw_smp_processor_id() | Steven Rostedt (VMware)
The test code that tests synthetic event creation pushes in as one of its test fields the current CPU using "smp_processor_id()". As this is just something to see if the value is correctly passed in, and the actual CPU used does not matter, use raw_smp_processor_id(); otherwise, with debug preemption enabled, a warning happens because smp_processor_id() is called without preemption disabled. Link: http://lkml.kernel.org/r/20200220162950.35162579@gandalf.local.home Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 | tracing: Fix number printing bug in print_synth_event() | Tom Zanussi
Fix a varargs-related bug in print_synth_event() which resulted in strange output and oopses on 32-bit x86 systems. The problem is that trace_seq_printf() expects the varargs to match the format string, but print_synth_event() was always passing u64 values regardless. This results in unspecified behavior when unpacking with va_arg() in trace_seq_printf(). Add a function that takes the size into account when calling trace_seq_printf(). Before: modprobe-1731 [003] .... 919.039758: gen_synth_test: next_pid_field=777(null)next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=3(null)my_string_field=thneed my_int_field=598(null) After: insmod-1136 [001] .... 36.634590: gen_synth_test: next_pid_field=777 next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=1 my_string_field=thneed my_int_field=598 Link: http://lkml.kernel.org/r/a9b59eb515dbbd7d4abe53b347dccf7a8e285657.1581720155.git.zanussi@kernel.org Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
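The idea of taking the field size into account before handing a value to a printf-style function can be sketched like this. It is a userspace illustration with snprintf() standing in for the trace output routine; the function name and the fixed unsigned formats are assumptions made for the example.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Values are stored as 64-bit quantities, but each one is narrowed to the
 * field's declared size so the argument type matches the format specifier.
 */
static int sketch_print_num_val(char *buf, size_t len, size_t field_size,
				uint64_t val)
{
	switch (field_size) {
	case 1:
		return snprintf(buf, len, "%" PRIu8, (uint8_t)val);
	case 2:
		return snprintf(buf, len, "%" PRIu16, (uint16_t)val);
	case 4:
		return snprintf(buf, len, "%" PRIu32, (uint32_t)val);
	default:
		return snprintf(buf, len, "%" PRIu64, val);
	}
}
```

Passing a 64-bit argument where the format expects a 32-bit one shifts every later argument on ABIs such as 32-bit x86, which is consistent with the stray "(null)" strings in the "Before" output.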
2020-02-20 | tracing: Check that number of vals matches number of synth event fields | Tom Zanussi
Commit 7276531d4036 ("tracing: Consolidate trace() functions") inadvertently dropped the synth_event_trace() and synth_event_trace_array() checks that verify the number of values passed in matches the number of fields in the synthetic event being traced, so add them back. Link: http://lkml.kernel.org/r/32819cac708714693669e0dfe10fe9d935e94a16.1581720155.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 | tracing: Make synth_event trace functions endian-correct | Tom Zanussi
synth_event_trace(), synth_event_trace_array() and __synth_event_add_val() write directly into the trace buffer and need to take endianness into account, like trace_event_raw_event_synth() does. Link: http://lkml.kernel.org/r/2011354355e405af9c9d28abba430d1f5ff7771a.1581720155.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 | tracing: Make sure synth_event_trace() example always uses u64 | Tom Zanussi
synth_event_trace() is the varargs version of synth_event_trace_array(), which takes an array of u64, as do synth_event_add_val() et al. To not only be consistent with those, but also to address the fact that synth_event_trace() expects every arg to be of the same type since it doesn't also pass in e.g. a format string, the caller needs to make sure all args are of the same type, u64. u64 is used because it needs to accommodate the largest type available in synthetic events, which is u64. This fixes the bug reported by the kernel test robot/Rong Chen. Link: https://lore.kernel.org/lkml/20200212113444.GS12867@shao2-debian/ Link: http://lkml.kernel.org/r/894c4e955558b521210ee0642ba194a9e603354c.1581720155.git.zanussi@kernel.org Fixes: 9fe41efaca084 ("tracing: Add synth event generation test module") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
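Why a format-less varargs function needs every argument to have one agreed type can be sketched as follows. The function below is a hypothetical userspace example, not the tracing API: it reads each argument as a 64-bit value, so callers must cast every argument accordingly.

```c
#include <stdarg.h>
#include <stdint.h>

/*
 * With no format string, the callee cannot learn each argument's type,
 * so it reads every one as uint64_t. Callers must pass exactly that type.
 */
static uint64_t sketch_sum_vals(unsigned int n_vals, ...)
{
	va_list args;
	uint64_t sum = 0;
	unsigned int i;

	va_start(args, n_vals);
	for (i = 0; i < n_vals; i++)
		sum += va_arg(args, uint64_t);
	va_end(args);

	return sum;
}
```

A call would look like `sketch_sum_vals(2, (uint64_t)777, (uint64_t)598)`; passing a plain `int` there would make the va_arg() read undefined.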
2020-02-20 | sched/fair: Remove wake_cap() | Morten Rasmussen
Capacity-awareness in the wake-up path previously involved disabling wake_affine in certain scenarios. We have just made select_idle_sibling() capacity-aware, so this isn't needed anymore. Remove wake_cap() entirely. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> [Changelog tweaks] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [Changelog tweaks] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200206191957.12325-5-valentin.schneider@arm.com
2020-02-20 | sched/core: Remove for_each_lower_domain() | Valentin Schneider
The last remaining user of this macro has just been removed, get rid of it. Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lkml.kernel.org/r/20200206191957.12325-4-valentin.schneider@arm.com
2020-02-20 | sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems | Morten Rasmussen
SD_BALANCE_WAKE was previously added to lower sched_domain levels on asymmetric CPU capacity systems by commit: 9ee1cda5ee25 ("sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems") to enable the use of find_idlest_cpu() and friends to find an appropriate CPU for tasks. That responsibility has now been shifted to select_idle_sibling() and friends, and hence the flag can be removed. Note that this causes asymmetric CPU capacity systems to no longer enter the slow wakeup path (find_idlest_cpu()) on wakeups - only on execs and forks (which is aligned with all other mainline topologies). Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> [Changelog tweaks] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lkml.kernel.org/r/20200206191957.12325-3-valentin.schneider@arm.com
2020-02-20 | sched/fair: Add asymmetric CPU capacity wakeup scan | Morten Rasmussen
Issue
=====

On asymmetric CPU capacity topologies, we currently rely on wake_cap() to drive select_task_rq_fair() towards either:

- its slow-path (find_idlest_cpu()) if either the previous or current (waking) CPU has too little capacity for the waking task
- its fast-path (select_idle_sibling()) otherwise

Commit:

  3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

points out that this relies on the assumption that "[...]the CPU capacities within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are homogeneous".

This assumption no longer holds on newer generations of big.LITTLE systems (DynamIQ), which can accommodate CPUs of different compute capacity within a single LLC domain. To hopefully paint a better picture, a regular big.LITTLE topology would look like this:

  +---------+ +---------+
  |   L2    | |   L2    |
  +----+----+ +----+----+
  |CPU0|CPU1| |CPU2|CPU3|
  +----+----+ +----+----+
      ^^^         ^^^
    LITTLEs      bigs

which would result in the following scheduler topology:

  DIE [         ] <- sd_asym_cpucapacity
  MC  [   ] [   ] <- sd_llc
       0 1   2 3

Conversely, a DynamIQ topology could look like:

  +-------------------+
  |        L3         |
  +----+----+----+----+
  | L2 | L2 | L2 | L2 |
  +----+----+----+----+
  |CPU0|CPU1|CPU2|CPU3|
  +----+----+----+----+
     ^^^^^     ^^^^^
    LITTLEs    bigs

which would result in the following scheduler topology:

  MC [       ] <- sd_llc, sd_asym_cpucapacity
      0 1 2 3

What this means is that, on DynamIQ systems, we could pass the wake_cap() test (IOW presume the waking task fits on the CPU capacities of some LLC domain), thus go through select_idle_sibling(). This function operates on an LLC domain, which here spans both bigs and LITTLEs, so it could very well pick a CPU of too small capacity for the task, despite there being fitting idle CPUs - it very much depends on the CPU iteration order, on which we have absolutely no guarantees capacity-wise.

Implementation
==============

Introduce yet another select_idle_sibling() helper function that takes CPU capacity into account. The policy is to pick the first idle CPU which is big enough for the task (task_util * margin < cpu_capacity). If no idle CPU is big enough, we pick the idle one with the highest capacity.

Unlike other select_idle_sibling() helpers, this one operates on the sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all known CPU capacities in the system. As such, this will work for both "legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain).

Note that this limits the scope of select_idle_sibling() to select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain will not be scanned, and no further heuristic will be applied.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-2-valentin.schneider@arm.com
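The selection policy stated in this entry (first idle CPU that fits, otherwise the idle CPU with the highest capacity) can be sketched in userspace as below. The types, the function names and the roughly 20% margin expressed as 1280/1024 are assumptions made for illustration; the entry itself only says "task_util * margin < cpu_capacity".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot used only by this sketch. */
struct sketch_cpu {
	bool idle;
	unsigned long capacity;
};

/* "task_util * margin < cpu_capacity", with an assumed margin of 1280/1024. */
static bool sketch_fits(unsigned long task_util, unsigned long capacity)
{
	return task_util * 1280 < capacity * 1024;
}

/*
 * Return the index of the first idle CPU big enough for the task. If none
 * fits, return the idle CPU with the highest capacity, or -1 if no CPU in
 * the span is idle.
 */
static int sketch_select_idle_capacity(const struct sketch_cpu *cpus, size_t n,
				       unsigned long task_util)
{
	unsigned long best_cap = 0;
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (sketch_fits(task_util, cpus[i].capacity))
			return (int)i;
		if (cpus[i].capacity > best_cap) {
			best_cap = cpus[i].capacity;
			best = (int)i;
		}
	}
	return best;
}
```

Because the scan covers every capacity level in one pass, the result no longer depends on whether bigs and LITTLEs share a cache domain.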
2020-02-20 | sched/core: Remove duplicate assignment in sched_tick_remote() | Scott Wood
A redundant "curr = rq->curr" was added; remove it. Fixes: ebc0f83c78a2 ("timers/nohz: Update NOHZ load in remote tick") Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1580776558-12882-1-git-send-email-swood@redhat.com
2020-02-20 | PM / hibernate: fix typo "reserverd_size" -> "reserved_size" | Alexandre Belloni
Fix a mistake in a variable name in a comment. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-02-19 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | David S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2020-02-19 The following pull-request contains BPF updates for your *net* tree. We've added 10 non-merge commits during the last 10 day(s) which contain a total of 10 files changed, 93 insertions(+), 31 deletions(-). The main changes are: 1) batched bpf hashtab fixes from Brian and Yonghong. 2) various selftests and libbpf fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 | bpf: Fix a potential deadlock with bpf_map_do_batch | Yonghong Song
Commit 057996380a42 ("bpf: Add batch ops to all htab bpf map") added lookup_and_delete batch operation for hash table. The current implementation has bpf_lru_push_free() inside the bucket lock, which may cause a deadlock. syzbot reports:
-> #2 (&htab->buckets[i].lock#2){....}:
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
 htab_lru_map_delete_node+0xce/0x2f0 kernel/bpf/hashtab.c:593
 __bpf_lru_list_shrink_inactive kernel/bpf/bpf_lru_list.c:220 [inline]
 __bpf_lru_list_shrink+0xf9/0x470 kernel/bpf/bpf_lru_list.c:266
 bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:340 [inline]
 bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
 bpf_lru_pop_free+0x87c/0x1670 kernel/bpf/bpf_lru_list.c:499
 prealloc_lru_pop+0x2c/0xa0 kernel/bpf/hashtab.c:132
 __htab_lru_percpu_map_update_elem+0x67e/0xa90 kernel/bpf/hashtab.c:1069
 bpf_percpu_hash_update+0x16e/0x210 kernel/bpf/hashtab.c:1585
 bpf_map_update_value.isra.0+0x2d7/0x8e0 kernel/bpf/syscall.c:181
 generic_map_update_batch+0x41f/0x610 kernel/bpf/syscall.c:1319
 bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
 __do_sys_bpf+0x9b7/0x41e0 kernel/bpf/syscall.c:3460
 __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
 __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (&loc_l->lock){....}:
 check_prev_add kernel/locking/lockdep.c:2475 [inline]
 check_prevs_add kernel/locking/lockdep.c:2580 [inline]
 validate_chain kernel/locking/lockdep.c:2970 [inline]
 __lock_acquire+0x2596/0x4a00 kernel/locking/lockdep.c:3954
 lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4484
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
 bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:516 [inline]
 bpf_lru_push_free+0x250/0x5b0 kernel/bpf/bpf_lru_list.c:555
 __htab_map_lookup_and_delete_batch+0x8d4/0x1540 kernel/bpf/hashtab.c:1374
 htab_lru_map_lookup_and_delete_batch+0x34/0x40 kernel/bpf/hashtab.c:1491
 bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
 __do_sys_bpf+0x1f7d/0x41e0 kernel/bpf/syscall.c:3456
 __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
 __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
Possible unsafe locking scenario:
       CPU0                                CPU2
       ----                                ----
  lock(&htab->buckets[i].lock#2);
                                           lock(&l->lock);
                                           lock(&htab->buckets[i].lock#2);
  lock(&loc_l->lock);
 *** DEADLOCK ***
To fix the issue, for htab_lru_map_lookup_and_delete_batch() in CPU0, let us do bpf_lru_push_free() out of the htab bucket lock. This can avoid the above deadlock scenario.
Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
Reported-by: syzbot+a38ff3d9356388f2fb83@syzkaller.appspotmail.com
Reported-by: syzbot+122b5421d14e68f29cd1@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Brian Vazquez <brianvv@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200219234757.3544014-1-yhs@fb.com
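The shape of the fix, collecting deleted nodes on a private list while the bucket lock is held and only releasing them after the lock is dropped, can be sketched single-threaded as follows. The boolean `locked` flag merely models the bucket lock, and all names are hypothetical; this is not the BPF hash table code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_node {
	struct sketch_node *next;
};

struct sketch_bucket {
	bool locked;              /* models "bucket lock is held" */
	struct sketch_node *head;
};

static unsigned int sketch_freed;

/* Models a release path that takes another lock; must run unlocked. */
static void sketch_push_free(const struct sketch_bucket *b, struct sketch_node *n)
{
	(void)n;
	assert(!b->locked);       /* would be the lock inversion otherwise */
	sketch_freed++;
}

static unsigned int sketch_delete_batch(struct sketch_bucket *b)
{
	struct sketch_node *to_free = NULL, *n;
	unsigned int count = 0;

	b->locked = true;                 /* take the bucket lock */
	while ((n = b->head) != NULL) {
		b->head = n->next;
		n->next = to_free;        /* defer: chain onto a private list */
		to_free = n;
	}
	b->locked = false;                /* drop the bucket lock */

	while ((n = to_free) != NULL) {   /* release nodes with no lock held */
		to_free = n->next;
		sketch_push_free(b, n);
		count++;
	}
	return count;
}
```

Deferring the release keeps the two locks from ever being held together on this path, which removes one edge of the cycle shown in the report.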
2020-02-19 | bpf: Do not grab the bucket spinlock by default on htab batch ops | Brian Vazquez
Grabbing the spinlock for every bucket, even if it's empty, was causing a significant performance cost when traversing htab maps that have only a few entries. This patch addresses the issue by first checking the bucket_cnt; only if the bucket has some entries do we grab the spinlock and proceed with the batching. Tested with a htab of size 50K and different numbers of populated entries.
Before:
  Benchmark              Time(ns)      CPU(ns)
  ---------------------------------------------
  BM_DumpHashMap/1        2759655      2752033
  BM_DumpHashMap/10       2933722      2930825
  BM_DumpHashMap/200      3171680      3170265
  BM_DumpHashMap/500      3639607      3635511
  BM_DumpHashMap/1000     4369008      4364981
  BM_DumpHashMap/5k      11171919     11134028
  BM_DumpHashMap/20k     69150080     69033496
  BM_DumpHashMap/39k    190501036    190226162
After:
  Benchmark              Time(ns)      CPU(ns)
  ---------------------------------------------
  BM_DumpHashMap/1         202707       200109
  BM_DumpHashMap/10        213441       210569
  BM_DumpHashMap/200       478641       472350
  BM_DumpHashMap/500       980061       967102
  BM_DumpHashMap/1000     1863835      1839575
  BM_DumpHashMap/5k       8961836      8902540
  BM_DumpHashMap/20k     69761497     69322756
  BM_DumpHashMap/39k    187437830    186551111
Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200218172552.215077-1-brianvv@google.com
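The "peek first, lock only if non-empty" pattern from this entry can be sketched as below. The lock is modelled by a counter so the example stays single-threaded and self-contained; the names are hypothetical, and the sketch leaves out the read-side protection a real lockless peek would need.

```c
#include <stddef.h>

struct sketch_elem {
	struct sketch_elem *next;
};

struct sketch_bucket {
	struct sketch_elem *head;
	unsigned int lock_acquisitions; /* counts how often the lock was taken */
};

static unsigned int sketch_count(const struct sketch_elem *e)
{
	unsigned int n = 0;

	for (; e != NULL; e = e->next)
		n++;
	return n;
}

/* Returns the number of entries that would be dumped from this bucket. */
static unsigned int sketch_dump_bucket(struct sketch_bucket *b)
{
	unsigned int cnt = sketch_count(b->head); /* cheap peek, no lock */

	if (cnt == 0)
		return 0;                 /* empty bucket: skip the lock entirely */

	b->lock_acquisitions++;           /* stand-in for taking the bucket lock */
	cnt = sketch_count(b->head);      /* recount: contents may have changed */
	/* ... entries would be copied out here ... */
	/* stand-in for dropping the bucket lock */
	return cnt;
}
```

The benchmark numbers above are consistent with this: the gain is largest when almost all buckets are empty and fades as the table fills.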
2020-02-19 | s390: remove obsolete ieee_emulation_warnings | Stephen Kitt
s390 math emulation was removed with commit 5a79859ae0f3 ("s390: remove 31 bit support"), rendering ieee_emulation_warnings useless. The code still built because it was protected by CONFIG_MATHEMU, which was no longer selectable. This patch removes the sysctl_ieee_emulation_warnings declaration and the sysctl entry declaration. Link: https://lkml.kernel.org/r/20200214172628.3598516-1-steve@sk2.org Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Stephen Kitt <steve@sk2.org> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-02-18 | Merge tag 'dma-mapping-5.6' of git://git.infradead.org/users/hch/dma-mapping | Linus Torvalds
Pull dma-mapping fixes from Christoph Hellwig:
 - give command line cma= precedence over the CONFIG_ option (Nicolas Saenz Julienne)
 - always allow 32-bit DMA, even for weirdly placed ZONE_DMA
 - improve the debug printks when memory is not addressable, to help find problems with swiotlb initialization
* tag 'dma-mapping-5.6' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: improve DMA mask overflow reporting
  dma-direct: improve swiotlb error reporting
  dma-direct: relax addressability checks in dma_direct_supported
  dma-contiguous: CMA: give precedence to cmdline
2020-02-17 | bpf, offload: Replace bitwise AND by logical AND in bpf_prog_offload_info_fill | Johannes Krude
This if guards whether user-space wants a copy of the offload-jited bytecode and whether this bytecode exists. Erroneously doing a bitwise AND instead of a logical AND on the user- and kernel-space buffer sizes can lead to no data being copied to user-space, especially when the user-space size is a power of two and bigger than the kernel-space buffer. Fixes: fcfb126defda ("bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info") Signed-off-by: Johannes Krude <johannes@krude.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20200212193227.GA3769@phlox.h.transitiv.net
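The difference between the two operators is easy to show with concrete sizes. The following sketch uses hypothetical function names and example lengths; it only illustrates why a bitwise AND of two non-zero sizes can still evaluate to zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Buggy guard: tests whether the two lengths share any set bit. */
static bool sketch_should_copy_buggy(uint32_t user_len, uint32_t kernel_len)
{
	return (user_len & kernel_len) != 0;
}

/* Fixed guard: tests whether both lengths are non-zero. */
static bool sketch_should_copy_fixed(uint32_t user_len, uint32_t kernel_len)
{
	return user_len && kernel_len;
}
```

For example, with a user buffer of 64 bytes and 40 bytes of data, 64 & 40 is 0 (binary 1000000 and 0101000 share no bits), so the buggy guard skips the copy although both sizes are non-zero.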
2020-02-15 | Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull scheduler fixes from Ingo Molnar:
"Misc fixes all over the place:
 - Fix NUMA over-balancing between lightly loaded nodes. This is fallout of the big load-balancer rewrite.
 - Fix the NOHZ remote loadavg update logic, which fixes anomalies like reported 150 loadavg on mostly idle CPUs.
 - Fix XFS performance/scalability
 - Fix throttled groups unbound task-execution bug
 - Fix PSI procfs boundary condition
 - Fix the cpu.uclamp.{min,max} cgroup configuration write checks
 - Fix DocBook annotations
 - Fix RCU annotations
 - Fix overly CPU-intensive housekeeper CPU logic loop on large CPU counts"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix kernel-doc warning in attach_entity_load_avg()
  sched/core: Annotate curr pointer in rq with __rcu
  sched/psi: Fix OOB write when writing 0 bytes to PSI files
  sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
  sched/fair: Prevent unlimited runtime on throttled group
  sched/nohz: Optimize get_nohz_timer_target()
  sched/uclamp: Reject negative values in cpu_uclamp_write()
  sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
  timers/nohz: Update NOHZ load in remote tick
  sched/core: Don't skip remote tick for idle CPUs
2020-02-14 | Merge tag 'pm-5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds
Pull power management fixes from Rafael Wysocki:
"Fix three issues related to the handling of wakeup events signaled through the ACPI SCI while suspended to idle (Rafael Wysocki) and unexport an internal cpufreq variable (Yangtao Li)"
* tag 'pm-5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: PM: s2idle: Prevent spurious SCIs from waking up the system
  ACPICA: Introduce acpi_any_gpe_status_set()
  ACPI: PM: s2idle: Avoid possible race related to the EC GPE
  ACPI: EC: Fix flushing of pending work
  cpufreq: Make cpufreq_global_kobject static
2020-02-14 | PM: QoS: Make CPU latency QoS depend on CONFIG_CPU_IDLE | Rafael J. Wysocki
Because cpuidle is the only user of the effective constraint coming from the CPU latency QoS, add #ifdef CONFIG_CPU_IDLE around that code to avoid building it unnecessarily. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-14 | PM: QoS: Update file information comments | Rafael J. Wysocki
Update the file information comments in include/linux/pm_qos.h and kernel/power/qos.c by adding titles along with copyright and authors information to them and changing the qos.c description to better reflect its contents (outdated information is dropped from it in particular). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-14 | PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY and rename related functions | Rafael J. Wysocki
Drop the PM QoS classes enum including PM_QOS_CPU_DMA_LATENCY, drop the wrappers around pm_qos_request(), pm_qos_request_active(), and pm_qos_add/update/remove_request() introduced previously, rename these functions, respectively, to cpu_latency_qos_limit(), cpu_latency_qos_request_active(), and cpu_latency_qos_add/update/remove_request(), and update their kerneldoc comments. [While at it, drop some useless comments from these functions.] No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-14  genirq/proc: Reject invalid affinity masks (again)  Thomas Gleixner

Qian Cai reported that the WARN_ON() in the x86/msi affinity setting code, which catches cases where the affinity setting is not done on the CPU which is the current target of the interrupt, triggers during CPU hotplug stress testing.

It turns out that the warning which was added with the commit addressing the MSI affinity race unearthed yet another long standing bug.

If user space writes a bogus affinity mask, i.e. one that contains no online CPUs, then the kernel calls irq_select_affinity_usr(). This was introduced for ALPHA in

  eee45269b0f5 ("[PATCH] Alpha: convert to generic irq framework (generic part)")

and subsequently made available for all architectures in

  18404756765c ("genirq: Expose default irq affinity mask (take 3)")

which introduced the circumvention of the affinity setting restrictions for interrupts which cannot be moved in process context.

The whole exercise is bogus in various aspects:

  1) If the interrupt is already started up then there is absolutely no point to honour a bogus interrupt affinity setting from user space. The interrupt is already assigned to an online CPU and it does not make any sense to reassign it to some other randomly chosen online CPU.

  2) If the interrupt is not yet started up then there is no point either. A subsequent startup of the interrupt will invoke irq_setup_affinity() anyway which will choose a valid target CPU.

So the only correct solution is to just return -EINVAL in case user space wrote an affinity mask which does not contain any online CPUs, except for ALPHA which has its own magic sauce for this.

Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/878sl8xdbm.fsf@nanos.tec.linutronix.de
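The rule the commit settles on (reject a mask with no online CPUs instead of silently picking another CPU) can be modelled in user space roughly as follows. This is a sketch under loud assumptions: demo_cpumask_t and demo_write_affinity() are hypothetical, CPU masks are reduced to a single 64-bit word, and the ALPHA special case is left out.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t demo_cpumask_t;

/*
 * Apply a user-supplied affinity mask.
 * Returns 0 on success, or -EINVAL if the mask names no online CPU,
 * in which case the previously applied mask is left untouched.
 */
int demo_write_affinity(demo_cpumask_t requested, demo_cpumask_t online,
			demo_cpumask_t *applied)
{
	if ((requested & online) == 0)
		return -EINVAL;	/* bogus mask: do not pick a random CPU */

	*applied = requested;
	return 0;
}
```

With CPUs 0 and 1 online (mask 0x3), a request for CPU 3 only (0x8) fails with -EINVAL, while a request for CPUs 1 and 2 (0x6) is accepted because it intersects the online set.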
2020-02-13  PM: QoS: Adjust pm_qos_request() signature and reorder pm_qos.h  Rafael J. Wysocki

Change the return type of pm_qos_request() to be the same as the one of pm_qos_read_value() called by it internally and stop exporting it to modules (because its only caller, cpuidle, is not modular).

Also move the pm_qos_read_value() header away from the CPU latency QoS API function headers in pm_qos.h (because it technically does not belong to that API).

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13  PM: QoS: Simplify definitions of CPU latency QoS trace events  Rafael J. Wysocki

Modify the definitions of the CPU latency QoS trace events to take one argument (since PM_QOS_CPU_DMA_LATENCY is always passed as the pm_qos_class argument to them) and update the documentation of them accordingly (while at it, make it explicitly mention CPU latency QoS and relocate it after the device PM QoS trace events documentation).

The names and output format of the trace events do not change to preserve user space compatibility.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13  PM: QoS: Rename things related to the CPU latency QoS  Rafael J. Wysocki

First, rename PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE to PM_QOS_CPU_LATENCY_DEFAULT_VALUE and update all of the code referring to it accordingly.

Next, rename cpu_dma_constraints to cpu_latency_constraints, move the definition of it closer to the functions referring to it and update all of them accordingly. [While at it, add a comment to mark the start of the code related to the CPU latency QoS.]

Finally, rename the pm_qos_power_*() family of functions and pm_qos_power_fops to cpu_latency_qos_*() and cpu_latency_qos_fops, respectively, and update the definition of cpu_latency_qos_miscdev. [While at it, update the miscdev interface code start comment.]

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13  PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY notifier chain  Rafael J. Wysocki

Notice that pm_qos_remove_notifier() is not used at all and the only caller of pm_qos_add_notifier() is the cpuidle core, which only needs the PM_QOS_CPU_DMA_LATENCY notifier to invoke wake_up_all_idle_cpus() upon changes of the PM_QOS_CPU_DMA_LATENCY target value.

First, to ensure that wake_up_all_idle_cpus() will be called whenever the PM_QOS_CPU_DMA_LATENCY target value changes, modify the pm_qos_add/update/remove_request() family of functions to check if the effective constraint for the PM_QOS_CPU_DMA_LATENCY has changed and call wake_up_all_idle_cpus() directly in that case.

Next, drop the PM_QOS_CPU_DMA_LATENCY notifier from cpuidle as it is not necessary any more.

Finally, drop both pm_qos_add_notifier() and pm_qos_remove_notifier(), as they have no callers now, along with cpu_dma_lat_notifier which is only used by them.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
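The "check whether the effective constraint changed, then wake directly" pattern from this commit might look like the following in a simplified user-space model. All names here (demo_effective(), demo_update_request(), demo_wake_up_all_idle_cpus(), demo_wakeups) are hypothetical, and the sketch assumes the effective constraint is the minimum of all requests, which fits a latency limit but is not spelled out in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NO_CONSTRAINT INT32_MAX

/* Counts how often the (hypothetical) wake-up hook ran. */
int demo_wakeups;

/* Stand-in for wake_up_all_idle_cpus(): only records the call. */
void demo_wake_up_all_idle_cpus(void)
{
	demo_wakeups++;
}

/* Effective constraint: the smallest requested latency, if any. */
int32_t demo_effective(const int32_t *requests, size_t n)
{
	int32_t min = DEMO_NO_CONSTRAINT;

	for (size_t i = 0; i < n; i++)
		if (requests[i] < min)
			min = requests[i];
	return min;
}

/* Update one request and wake idle CPUs only if the effective value moved. */
void demo_update_request(int32_t *requests, size_t n, size_t idx, int32_t value)
{
	int32_t before = demo_effective(requests, n);

	requests[idx] = value;
	if (demo_effective(requests, n) != before)
		demo_wake_up_all_idle_cpus();
}
```

The point of the design is that the caller that changes a request already knows whether the aggregate changed, so a notifier chain with a single subscriber adds nothing.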
2020-02-13  PM: QoS: Redefine struct pm_qos_request and drop struct pm_qos_object  Rafael J. Wysocki

First, change the definition of struct pm_qos_request so that it contains a struct pm_qos_constraints pointer (called "qos") instead of a PM QoS class number (in preparation for dropping the PM QoS classes concept altogether going forward) and move its definition (along with the definition of struct pm_qos_flags_request that does not change) after the definition of struct pm_qos_constraints.

Next, drop the definition of struct pm_qos_object and the null_pm_qos and cpu_dma_pm_qos variables of that type along with pm_qos_array[] holding pointers to them and change the code to refer to the pm_qos_constraints structure directly or to use the new qos pointer in struct pm_qos_request for that instead of going through pm_qos_array[] to access it. Also update kerneldoc comments that mention pm_qos_class to refer to PM_QOS_CPU_DMA_LATENCY directly instead.

Finally, drop register_pm_qos_misc(), introduce cpu_latency_qos_miscdev (with the name field set to "cpu_dma_latency") to implement the CPU latency QoS interface in /dev/ and register it directly from pm_qos_power_init().

After these changes the notion of PM QoS classes remains only in the API (in the form of redundant function parameters that are ignored) and in the definitions of PM QoS trace events.

While at it, some redundant local variables are dropped etc.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
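Because the misc device keeps the name "cpu_dma_latency", user space reaches the interface through /dev/cpu_dma_latency. A minimal sketch of a client follows. It assumes the commonly documented behaviour of that device (a 32-bit latency value in microseconds is written, and the request lasts as long as the file descriptor stays open); the helper names demo_hold_cpu_latency() and demo_release_cpu_latency() are hypothetical, and the path is a parameter so the helper can be exercised without root privileges.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Open the device at `path` and write a 32-bit latency limit to it.
 * Returns the open descriptor on success (keep it open to hold the
 * request) or -1 on failure.
 */
int demo_hold_cpu_latency(const char *path, int32_t max_latency_us)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;

	if (write(fd, &max_latency_us, sizeof(max_latency_us)) !=
	    (ssize_t)sizeof(max_latency_us)) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Closing the descriptor is what drops the request. */
void demo_release_cpu_latency(int fd)
{
	close(fd);
}
```

A caller would typically pass "/dev/cpu_dma_latency", keep the returned descriptor for the duration of the latency-sensitive work, and then release it.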
2020-02-13  PM: QoS: Clean up misc device file operations  Rafael J. Wysocki

Reorder the code to avoid using extra function header declarations for the pm_qos_power_*() family of functions and drop those declarations.

Also clean up the internals of those functions to consolidate checks, avoid using redundant local variables and similar.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13  PM: QoS: Drop iterations over global QoS classes  Rafael J. Wysocki

After commit c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class in use is PM_QOS_CPU_DMA_LATENCY, so it does not really make sense to iterate over global QoS classes anywhere, since there is only one.

Remove iterations over global QoS classes from the code and use PM_QOS_CPU_DMA_LATENCY as the target class directly where needed.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13  PM: QoS: Clean up pm_qos_read_value() and pm_qos_get/set_value()  Rafael J. Wysocki

Move the definition of pm_qos_read_value() before the one of pm_qos_get_value() and add a kerneldoc comment to it (as it is not static).

Also replace the BUG() in pm_qos_get_value() with WARN() (to prevent the kernel from crashing if an unknown PM QoS type is used by mistake) and drop the comment next to it that is not necessary any more.

Additionally, drop the unnecessary inline modifier from the header of pm_qos_set_value().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
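The BUG() to WARN() change amounts to "complain and fall back to a safe value" instead of "halt". A user-space analogue, with fprintf() standing in for WARN() and with hypothetical names throughout (enum demo_qos_type, demo_get_value()), could be sketched like this:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical aggregation types for the sketch. */
enum demo_qos_type { DEMO_QOS_MIN, DEMO_QOS_MAX };

/*
 * Aggregate `n` request values according to `type`.
 * An unknown type logs a warning and returns `fallback`
 * rather than aborting the program.
 */
int32_t demo_get_value(int type, const int32_t *vals, size_t n,
		       int32_t fallback)
{
	int32_t result;

	if (n == 0)
		return fallback;

	result = vals[0];
	switch (type) {
	case DEMO_QOS_MIN:
		for (size_t i = 1; i < n; i++)
			if (vals[i] < result)
				result = vals[i];
		return result;
	case DEMO_QOS_MAX:
		for (size_t i = 1; i < n; i++)
			if (vals[i] > result)
				result = vals[i];
		return result;
	default:
		/* Analogue of WARN(): report the problem, keep running. */
		fprintf(stderr, "warning: unknown QoS type %d in %s\n",
			type, __func__);
		return fallback;
	}
}
```

A caller passing a mistaken type value gets the fallback and a diagnostic on stderr, so the mistake is visible without taking the whole system down.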