path: root/kernel
Age | Commit message | Author
2019-02-03audit: remove audit_context when CONFIG_AUDIT and not AUDITSYSCALLRichard Guy Briggs
Remove audit_context from struct task_struct and struct audit_buffer when CONFIG_AUDIT is enabled but CONFIG_AUDITSYSCALL is not. Also, audit_log_name() (and supporting inode and fcaps functions) should have been put back in auditsc.c when soft and hard link logging was normalized since it is only used by syscall auditing. See github issue https://github.com/linux-audit/audit-kernel/issues/105 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-02-03Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cpu hotplug fixes from Thomas Gleixner: "Two fixes for the cpu hotplug machinery: - Replace the overly clever 'SMT disabled by BIOS' detection logic as it breaks KVM scenarios and prevents speculation control updates when the Hyperthreads are brought online late after boot. - Remove a redundant invocation of the speculation control update function" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM x86/speculation: Remove redundant arch_smt_update() invocation
2019-02-03Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A pile of perf updates: - Fix broken sanity check in the /proc/sys/kernel/perf_cpu_time_max_percent write handler - Cure a perf script crash which was caused by an uninitialized data structure - Highlight the hottest instruction in perf top and not a random one - Cure yet another clang issue when building perf python - Handle topology entries with no CPU correctly in the tools - Handle perf data which contains both tracepoints and performance counter entries correctly. - Add a missing NULL pointer check in perf ordered_events_free()" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf script: Fix crash when processing recorded stat data perf top: Fix wrong hottest instruction highlighted perf tools: Handle TOPOLOGY headers with no CPU perf python: Remove -fstack-clash-protection when building with some clang versions perf core: Fix perf_proc_update_handler() bug perf script: Fix crash with printing mixed trace point and other events perf ordered_events: Fix crash in ordered_events__free
2019-02-01psi: fix aggregation idle shut-offJohannes Weiner
psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good. Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!) Fix this by exempting aggregation workers from psi's clock waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if in psi we see a worker going to sleep after aggregating psi data, we will not reschedule the aggregation work item. What if the worker is also executing other items before or after? Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity. If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight. Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
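For illustration, the check described above roughly takes the following shape in kernel/sched/psi.c; this is a sketch reconstructed from the description, so treat identifiers such as psi_update_work and wq_worker_last_func() as illustrative rather than authoritative:

  /* When the task going to sleep is a kworker whose last work item was the
   * psi aggregator itself, don't wake the clock / reschedule the work. */
  if (unlikely((clear & TSK_RUNNING) &&
               (task->flags & PF_WQ_WORKER) &&
               wq_worker_last_func(task) == psi_update_work))
      wake_clock = false;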
2019-02-01kernel/exit.c: release ptraced tasks before zap_pid_ns_processesAndrei Vagin
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then zap_pid_ns_processes() waits on all tasks in a current pidns, and only then are tasks from the dead list released. zap_pid_ns_processes() can get stuck on waiting tasks from the dead list. In this case, we will have one unkillable process with one or more dead children. Thanks to Oleg for the advice to release tasks in find_child_reaper(). Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()") Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2019-01-31 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) disable preemption in sender side of socket filters, from Alexei. 2) fix two potential deadlocks in syscall bpf lookup and prog_register, from Martin and Alexei. 3) fix BTF to allow typedef on func_proto, from Yonghong. 4) two bpftool fixes, from Jiri and Paolo. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01bpf: introduce BPF_F_LOCK flagAlexei Starovoitov
Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands and for map_update() helper function. In all these cases take a lock of existing element (which was provided in BTF description) before copying (in or out) the rest of map value. Implementation details that are part of uapi: Array: The array map takes the element lock for lookup/update. Hash: hash map also takes the lock for lookup/update and tries to avoid the bucket lock. If old element exists it takes the element lock and updates the element in place. If element doesn't exist it allocates new one and inserts into hash table while holding the bucket lock. In rare case the hashmap has to take both the bucket lock and the element lock to update old value in place. Cgroup local storage: It is similar to array. update in place and lookup are done with lock taken. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
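A minimal userspace sketch of the flag, assuming a libbpf new enough to expose bpf_map_lookup_elem_flags(); the map, key and value layout are hypothetical, and the bpf_spin_lock field must be described in the map's BTF for the kernel to accept BPF_F_LOCK:

  #include <linux/bpf.h>
  #include <bpf/bpf.h>

  struct hash_elem {
      int cnt;
      struct bpf_spin_lock lock;      /* declared in the map's BTF */
  };

  static int bump_cnt_locked(int map_fd, __u32 key)
  {
      struct hash_elem val = {};

      /* copy the whole value out while the kernel holds val.lock */
      if (bpf_map_lookup_elem_flags(map_fd, &key, &val, BPF_F_LOCK))
          return -1;
      val.cnt++;
      /* copy it back under the element's bpf_spin_lock */
      return bpf_map_update_elem(map_fd, &key, &val, BPF_F_LOCK);
  }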
2019-02-01bpf: add support for bpf_spin_lock to cgroup local storageAlexei Starovoitov
Allow 'struct bpf_spin_lock' to reside inside cgroup local storage. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01bpf: introduce bpf_spin_lockAlexei Starovoitov
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let bpf program serialize access to other variables.

Example:

  struct hash_elem {
      int cnt;
      struct bpf_spin_lock lock;
  };
  struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
  if (val) {
      bpf_spin_lock(&val->lock);
      val->cnt++;
      bpf_spin_unlock(&val->lock);
  }

Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can cause deadlocks.
- only one 'struct bpf_spin_lock' is allowed per map element. It drastically simplifies implementation yet allows bpf program to use any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be a struct and have 'struct bpf_spin_lock anyname;' field at the top level. Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet. bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to insufficient preemption checks

Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs. Hence bpf_spin_lock is equivalent to spin_lock_irqsave. Other solutions to avoid nested bpf_spin_lock are possible, like making sure that all networking progs run with softirq disabled. spin_lock_irqsave is the simplest and doesn't add overhead to the programs that don't use it.
- arch_spinlock_t is used when it's implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via extra flag during map_create, but specifying it via BTF is cleaner. It provides introspection for map key/value and reduces user mistakes.

Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper to request kernel to grab bpf_spin_lock before rewriting the value. That will serialize access to map elements.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01dma-debug: add dumping facility via debugfsCorentin Labbe
While debugging a DMA mapping leak, I needed to access debug_dma_dump_mappings(), ideally from user space. This patch adds a /sys/kernel/debug/dma-api/dump file which contains all current DMA mappings. Signed-off-by: Corentin Labbe <clabbe@baylibre.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-01dma: debug: no need to check return value of debugfs_create functionsGreg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Also delete the variables for the file dentries for the debugfs entries as they are never used at all once they are created. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Robin Murphy <robin.murphy@arm.com> [hch: moved dma_debug_dent to function scope and renamed it] Signed-off-by: Christoph Hellwig <hch@lst.de>
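The pattern being removed looks roughly like the following illustrative before/after (not the exact dma-debug hunk; file name and parent dentry are examples):

  /* before: the return value was checked even though nothing useful
   * can be done when debugfs creation fails */
  struct dentry *d;

  d = debugfs_create_u32("error_count", 0444, dma_debug_dent, &error_count);
  if (!d) {
      pr_err("can not create debugfs 'error_count'\n");
      return -ENOMEM;
  }

  /* after: just create the file; a debugfs failure is not fatal and the
   * dentry is never used again */
  debugfs_create_u32("error_count", 0444, dma_debug_dent, &error_count);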
2019-02-01dma-mapping: remove the default map_resource implementationChristoph Hellwig
Instead provide a proper implementation in the direct mapping code, and also wire it up for arm and powerpc, leaving an error return for all the IOMMU or virtual mapping instances for which we'd have to wire up an actual implementation Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2019-02-01pipe: stop using ->can_mergeJann Horn
Al Viro pointed out that since there is only one pipe buffer type to which new data can be appended, it isn't necessary to have a ->can_merge field in struct pipe_buf_operations, we can just check for a magic type. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-31audit: remove unused actx param from audit_rule_matchRichard Guy Briggs
The audit_rule_match() struct audit_context *actx parameter is not used by any in-tree consumers (selinux, apparmor, integrity, smack). The audit context is an internal audit structure that should only be accessed by audit accessor functions. It was part of commit 03d37d25e0f9 ("LSM/Audit: Introduce generic Audit LSM hooks") but appears to have never been used. Remove it. Please see the github issue https://github.com/linux-audit/audit-kernel/issues/107 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: fixed the referenced commit title] Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-01-31bpf: Fix syscall's stackmap lookup potential deadlockMartin KaFai Lau
The map_lookup_elem syscall used to not acquire a spinlock, in order to optimize the reader. That was true until commit 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation"). The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy(). bpf_stackmap_copy() may find the elem no longer needed after the copy is done. If that is the case, pcpu_freelist_push() saves this elem for reuse later. This push requires a spinlock. If a tracing bpf_prog got run in the middle of the syscall's map_lookup_elem(stackmap) and this tracing bpf_prog is calling bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's spinlock, it may end up in a deadlock situation as reported by Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/ The situation is the same as the syscall's map_update_elem() which needs to acquire the pcpu_freelist's spinlock and could race with tracing bpf_prog. Hence, this patch fixes it by protecting bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active) to prevent tracing bpf_prog from running. A later commit, f1a2e44a3aec ("bpf: add queue and stack maps"), also acquires a spinlock in the syscall's map_lookup_elem and races with tracing bpf_prog similarly. Hence, this patch is forward looking and protects the majority of the map lookups. bpf_map_offload_lookup_elem() is the exception since it is for network bpf_prog only (i.e. never called by tracing bpf_prog). Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
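The shape of the fix, as described above, is to bracket the syscall-side copy with the bpf_prog_active counter; a sketch (exact placement in kernel/bpf/syscall.c may differ):

  int err;

  /* prevent tracing bpf_progs, which check bpf_prog_active, from running
   * on this CPU while we may push the element back to the freelist */
  preempt_disable();
  this_cpu_inc(bpf_prog_active);

  err = bpf_stackmap_copy(map, key, value);

  this_cpu_dec(bpf_prog_active);
  preempt_enable();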
2019-01-31bpf: fix potential deadlock in bpf_prog_registerAlexei Starovoitov
Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex: [ 13.007000] WARNING: possible circular locking dependency detected [ 13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted [ 13.008124] ------------------------------------------------------ [ 13.008624] test_progs/246 is trying to acquire lock: [ 13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300 [ 13.009770] [ 13.009770] but task is already holding lock: [ 13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60 [ 13.010877] [ 13.010877] which lock already depends on the new lock. [ 13.010877] [ 13.011532] [ 13.011532] the existing dependency chain (in reverse order) is: [ 13.012129] [ 13.012129] -> #4 (bpf_event_mutex){+.+.}: [ 13.012582] perf_event_query_prog_array+0x9b/0x130 [ 13.013016] _perf_ioctl+0x3aa/0x830 [ 13.013354] perf_ioctl+0x2e/0x50 [ 13.013668] do_vfs_ioctl+0x8f/0x6a0 [ 13.014003] ksys_ioctl+0x70/0x80 [ 13.014320] __x64_sys_ioctl+0x16/0x20 [ 13.014668] do_syscall_64+0x4a/0x180 [ 13.015007] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.015469] [ 13.015469] -> #3 (&cpuctx_mutex){+.+.}: [ 13.015910] perf_event_init_cpu+0x5a/0x90 [ 13.016291] perf_event_init+0x1b2/0x1de [ 13.016654] start_kernel+0x2b8/0x42a [ 13.016995] secondary_startup_64+0xa4/0xb0 [ 13.017382] [ 13.017382] -> #2 (pmus_lock){+.+.}: [ 13.017794] perf_event_init_cpu+0x21/0x90 [ 13.018172] cpuhp_invoke_callback+0xb3/0x960 [ 13.018573] _cpu_up+0xa7/0x140 [ 13.018871] do_cpu_up+0xa4/0xc0 [ 13.019178] smp_init+0xcd/0xd2 [ 13.019483] kernel_init_freeable+0x123/0x24f [ 13.019878] kernel_init+0xa/0x110 [ 13.020201] ret_from_fork+0x24/0x30 [ 13.020541] [ 13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}: [ 13.021051] static_key_slow_inc+0xe/0x20 [ 13.021424] tracepoint_probe_register_prio+0x28c/0x300 [ 13.021891] perf_trace_event_init+0x11f/0x250 [ 13.022297] perf_trace_init+0x6b/0xa0 [ 13.022644] perf_tp_event_init+0x25/0x40 [ 13.023011] perf_try_init_event+0x6b/0x90 [ 13.023386] perf_event_alloc+0x9a8/0xc40 [ 13.023754] __do_sys_perf_event_open+0x1dd/0xd30 [ 13.024173] do_syscall_64+0x4a/0x180 [ 13.024519] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.024968] [ 13.024968] -> #0 (tracepoints_mutex){+.+.}: [ 13.025434] __mutex_lock+0x86/0x970 [ 13.025764] tracepoint_probe_register_prio+0x2d/0x300 [ 13.026215] bpf_probe_register+0x40/0x60 [ 13.026584] bpf_raw_tracepoint_open.isra.34+0xa4/0x130 [ 13.027042] __do_sys_bpf+0x94f/0x1a90 [ 13.027389] do_syscall_64+0x4a/0x180 [ 13.027727] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.028171] [ 13.028171] other info that might help us debug this: [ 13.028171] [ 13.028807] Chain exists of: [ 13.028807] tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex [ 13.028807] [ 13.029666] Possible unsafe locking scenario: [ 13.029666] [ 13.030140] CPU0 CPU1 [ 13.030510] ---- ---- [ 13.030875] lock(bpf_event_mutex); [ 13.031166] lock(&cpuctx_mutex); [ 13.031645] lock(bpf_event_mutex); [ 13.032135] lock(tracepoints_mutex); [ 13.032441] [ 13.032441] *** DEADLOCK *** [ 13.032441] [ 13.032911] 1 lock held by test_progs/246: [ 13.033239] #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60 [ 13.033909] [ 13.033909] stack backtrace: [ 13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477 [ 13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 [ 13.035657] Call Trace: [ 13.035859] dump_stack+0x5f/0x8b 
[ 13.036130] print_circular_bug.isra.37+0x1ce/0x1db [ 13.036526] __lock_acquire+0x1158/0x1350 [ 13.036852] ? lock_acquire+0x98/0x190 [ 13.037154] lock_acquire+0x98/0x190 [ 13.037447] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.037876] __mutex_lock+0x86/0x970 [ 13.038167] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.038600] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.039028] ? __mutex_lock+0x86/0x970 [ 13.039337] ? __mutex_lock+0x24a/0x970 [ 13.039649] ? bpf_probe_register+0x1d/0x60 [ 13.039992] ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10 [ 13.040478] ? tracepoint_probe_register_prio+0x2d/0x300 [ 13.040906] tracepoint_probe_register_prio+0x2d/0x300 [ 13.041325] bpf_probe_register+0x40/0x60 [ 13.041649] bpf_raw_tracepoint_open.isra.34+0xa4/0x130 [ 13.042068] ? __might_fault+0x3e/0x90 [ 13.042374] __do_sys_bpf+0x94f/0x1a90 [ 13.042678] do_syscall_64+0x4a/0x180 [ 13.042975] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 13.043382] RIP: 0033:0x7f23b10a07f9 [ 13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [ 13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9 [ 13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011 [ 13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10 [ 13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 [ 13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000 Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister() there is no need to take bpf_event_mutex too. bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs. bpf_raw_tracepoints don't need to take this mutex. Fixes: c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT") Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-31bpf: fix lockdep false positive in percpu_freelistAlexei Starovoitov
Lockdep warns about false positive: [ 12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40 [ 12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past: [ 12.493275] (&rq->lock){-.-.} [ 12.493276] [ 12.493276] [ 12.493276] and interrupts could create inverse lock ordering between them. [ 12.493276] [ 12.494435] [ 12.494435] other info that might help us debug this: [ 12.494979] Possible interrupt unsafe locking scenario: [ 12.494979] [ 12.495518] CPU0 CPU1 [ 12.495879] ---- ---- [ 12.496243] lock(&head->lock); [ 12.496502] local_irq_disable(); [ 12.496969] lock(&rq->lock); [ 12.497431] lock(&head->lock); [ 12.497890] <Interrupt> [ 12.498104] lock(&rq->lock); [ 12.498368] [ 12.498368] *** DEADLOCK *** [ 12.498368] [ 12.498837] 1 lock held by dd/276: [ 12.499110] #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240 [ 12.499747] [ 12.499747] the shortest dependencies between 2nd lock and 1st lock: [ 12.500389] -> (&rq->lock){-.-.} { [ 12.500669] IN-HARDIRQ-W at: [ 12.500934] _raw_spin_lock+0x2f/0x40 [ 12.501373] scheduler_tick+0x4c/0xf0 [ 12.501812] update_process_times+0x40/0x50 [ 12.502294] tick_periodic+0x27/0xb0 [ 12.502723] tick_handle_periodic+0x1f/0x60 [ 12.503203] timer_interrupt+0x11/0x20 [ 12.503651] __handle_irq_event_percpu+0x43/0x2c0 [ 12.504167] handle_irq_event_percpu+0x20/0x50 [ 12.504674] handle_irq_event+0x37/0x60 [ 12.505139] handle_level_irq+0xa7/0x120 [ 12.505601] handle_irq+0xa1/0x150 [ 12.506018] do_IRQ+0x77/0x140 [ 12.506411] ret_from_intr+0x0/0x1d [ 12.506834] _raw_spin_unlock_irqrestore+0x53/0x60 [ 12.507362] __setup_irq+0x481/0x730 [ 12.507789] setup_irq+0x49/0x80 [ 12.508195] hpet_time_init+0x21/0x32 [ 12.508644] x86_late_time_init+0xb/0x16 [ 12.509106] start_kernel+0x390/0x42a [ 12.509554] secondary_startup_64+0xa4/0xb0 [ 12.510034] IN-SOFTIRQ-W at: [ 12.510305] _raw_spin_lock+0x2f/0x40 [ 12.510772] try_to_wake_up+0x1c7/0x4e0 [ 12.511220] swake_up_locked+0x20/0x40 [ 12.511657] swake_up_one+0x1a/0x30 [ 12.512070] rcu_process_callbacks+0xc5/0x650 [ 12.512553] __do_softirq+0xe6/0x47b [ 12.512978] irq_exit+0xc3/0xd0 [ 12.513372] smp_apic_timer_interrupt+0xa9/0x250 [ 12.513876] apic_timer_interrupt+0xf/0x20 [ 12.514343] default_idle+0x1c/0x170 [ 12.514765] do_idle+0x199/0x240 [ 12.515159] cpu_startup_entry+0x19/0x20 [ 12.515614] start_kernel+0x422/0x42a [ 12.516045] secondary_startup_64+0xa4/0xb0 [ 12.516521] INITIAL USE at: [ 12.516774] _raw_spin_lock_irqsave+0x38/0x50 [ 12.517258] rq_attach_root+0x16/0xd0 [ 12.517685] sched_init+0x2f2/0x3eb [ 12.518096] start_kernel+0x1fb/0x42a [ 12.518525] secondary_startup_64+0xa4/0xb0 [ 12.518986] } [ 12.519132] ... key at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8 [ 12.519649] ... acquired at: [ 12.519892] pcpu_freelist_pop+0x7b/0xd0 [ 12.520221] bpf_get_stackid+0x1d2/0x4d0 [ 12.520563] ___bpf_prog_run+0x8b4/0x11a0 [ 12.520887] [ 12.521008] -> (&head->lock){+...} { [ 12.521292] HARDIRQ-ON-W at: [ 12.521539] _raw_spin_lock+0x2f/0x40 [ 12.521950] pcpu_freelist_push+0x2a/0x40 [ 12.522396] bpf_get_stackid+0x494/0x4d0 [ 12.522828] ___bpf_prog_run+0x8b4/0x11a0 [ 12.523296] INITIAL USE at: [ 12.523537] _raw_spin_lock+0x2f/0x40 [ 12.523944] pcpu_freelist_populate+0xc0/0x120 [ 12.524417] htab_map_alloc+0x405/0x500 [ 12.524835] __do_sys_bpf+0x1a3/0x1a90 [ 12.525253] do_syscall_64+0x4a/0x180 [ 12.525659] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 12.526167] } [ 12.526311] ... key at: [<ffffffff838f7668>] __key.13130+0x0/0x8 [ 12.526812] ... 
acquired at: [ 12.527047] __lock_acquire+0x521/0x1350 [ 12.527371] lock_acquire+0x98/0x190 [ 12.527680] _raw_spin_lock+0x2f/0x40 [ 12.527994] pcpu_freelist_push+0x2a/0x40 [ 12.528325] bpf_get_stackid+0x494/0x4d0 [ 12.528645] ___bpf_prog_run+0x8b4/0x11a0 [ 12.528970] [ 12.529092] [ 12.529092] stack backtrace: [ 12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475 [ 12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 [ 12.530750] Call Trace: [ 12.530948] dump_stack+0x5f/0x8b [ 12.531248] check_usage_backwards+0x10c/0x120 [ 12.531598] ? ___bpf_prog_run+0x8b4/0x11a0 [ 12.531935] ? mark_lock+0x382/0x560 [ 12.532229] mark_lock+0x382/0x560 [ 12.532496] ? print_shortest_lock_dependencies+0x180/0x180 [ 12.532928] __lock_acquire+0x521/0x1350 [ 12.533271] ? find_get_entry+0x17f/0x2e0 [ 12.533586] ? find_get_entry+0x19c/0x2e0 [ 12.533902] ? lock_acquire+0x98/0x190 [ 12.534196] lock_acquire+0x98/0x190 [ 12.534482] ? pcpu_freelist_push+0x2a/0x40 [ 12.534810] _raw_spin_lock+0x2f/0x40 [ 12.535099] ? pcpu_freelist_push+0x2a/0x40 [ 12.535432] pcpu_freelist_push+0x2a/0x40 [ 12.535750] bpf_get_stackid+0x494/0x4d0 [ 12.536062] ___bpf_prog_run+0x8b4/0x11a0 It has been explained that is a false positive here: https://lkml.org/lkml/2018/7/25/756 Recap: - stackmap uses pcpu_freelist - The lock in pcpu_freelist is a percpu lock - stackmap is only used by tracing bpf_prog - A tracing bpf_prog cannot be run if another bpf_prog has already been running (ensured by the percpu bpf_prog_active counter). Eric pointed out that this lockdep splats stops other legit lockdep splats in selftests/bpf/test_progs.c. Fix this by calling local_irq_save/restore for stackmap. Another false positive had also been worked around by calling local_irq_save in commit 89ad2fa3f043 ("bpf: fix lockdep splat"). That commit added unnecessary irq_save/restore to fast path of bpf hash map. irqs are already disabled at that point, since htab is holding per bucket spin_lock with irqsave. Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop function w/o irqsave and convert pcpu_freelist_push/pop to irqsave to be used elsewhere (right now only in stackmap). It stops lockdep false positive in stackmap with a bit of acceptable overhead. Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
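Per the description above, the resulting split in kernel/bpf/percpu_freelist.c looks roughly like this (a sketch, not the verbatim patch):

  /* raw variant: callers such as htab already run with IRQs disabled */
  void __pcpu_freelist_push(struct pcpu_freelist *s,
                            struct pcpu_freelist_node *node)
  {
      struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);

      raw_spin_lock(&head->lock);
      node->next = head->first;
      head->first = node;
      raw_spin_unlock(&head->lock);
  }

  /* irqsave variant: used from stackmap, where a tracing prog may already
   * hold rq->lock when the freelist lock is taken */
  void pcpu_freelist_push(struct pcpu_freelist *s,
                          struct pcpu_freelist_node *node)
  {
      unsigned long flags;

      local_irq_save(flags);
      __pcpu_freelist_push(s, node);
      local_irq_restore(flags);
  }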
2019-01-31bpf: run bpf programs with preemption disabledAlexei Starovoitov
Disabled preemption is necessary for proper access to per-cpu maps from BPF programs. But the sender side of socket filters didn't have preemption disabled: unix_dgram_sendmsg->sk_filter->sk_filter_trim_cap->bpf_prog_run_save_cb->BPF_PROG_RUN and a combination of af_packet with a tun device didn't disable it either: tpacket_snd->packet_direct_xmit->packet_pick_tx_queue->ndo_select_queue-> tun_select_queue->tun_ebpf_select_queue->bpf_prog_run_clear_cb->BPF_PROG_RUN Disable preemption before executing BPF programs (both classic and extended). Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
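Concretely, the cb-saving wrapper named in the call chain above ends up bracketing the program run with preemption disabled; a sketch reconstructed from the description (details in include/linux/filter.h may differ):

  static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog,
                                         struct sk_buff *skb)
  {
      u32 res;

      /* per-cpu maps need a stable CPU for the whole program run */
      preempt_disable();
      res = __bpf_prog_run_save_cb(prog, skb);
      preempt_enable();
      return res;
  }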
2019-01-31cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix ↵Oleg Nesterov
the accounting

The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which needs pids_free() to uncharge the pid. However, ->free() is called from __put_task_struct()->cgroup_free() and this is too late. Even the trivial program which does

  for (;;) {
      int pid = fork();
      assert(pid >= 0);
      if (pid)
          wait(NULL);
      else
          exit(0);
  }

can run out of limits because release_task()->call_rcu(delayed_put_task_struct) implies an RCU gp after the task/pid goes away and before the final put().

Test-case:

  mkdir -p /tmp/CG
  mount -t cgroup2 none /tmp/CG
  echo '+pids' > /tmp/CG/cgroup.subtree_control
  mkdir /tmp/CG/PID
  echo 2 > /tmp/CG/PID/pids.max
  perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
  echo $! > /tmp/CG/PID/cgroup.procs

Without this patch the forking process fails soon after migration.

Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite into the new helper, cgroup_release(), called by release_task() which actually frees the pid(s).

Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-01-31async: Add support for queueing on specific NUMA nodeAlexander Duyck
Introduce four new variants of the async_schedule_ functions that allow scheduling on a specific NUMA node. The first two functions are async_schedule_near and async_schedule_near_domain, which end up mapping to async_schedule and async_schedule_domain, but provide NUMA node specific functionality. They replace the original functions, which were moved to inline function definitions that call the new functions while passing NUMA_NO_NODE. The second two functions are async_schedule_dev and async_schedule_dev_domain, which provide NUMA specific functionality when passing a device as the data member and that device has a NUMA node other than NUMA_NO_NODE. The main motivation behind this is to address the need to be able to schedule device specific init work on specific NUMA nodes in order to improve performance of memory initialization. I have seen a significant improvement in initialization time for persistent memory as a result of this approach. In the case of 3TB of memory on a single node the initialization time in the worst case went from 36s down to about 26s for a 10s improvement. As such the data shows a general benefit for affinitizing the async work to the node local to the device. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
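A hypothetical driver-side sketch of the new API (my_dev_init and start_async_init are made up; the async_schedule_dev/async_schedule_near names come from the commit above):

  #include <linux/async.h>
  #include <linux/device.h>

  static void my_dev_init(void *data, async_cookie_t cookie)
  {
      struct device *dev = data;

      /* slow, preferably node-local, initialization work */
      dev_info(dev, "async init running near node %d\n", dev_to_node(dev));
  }

  static void start_async_init(struct device *dev)
  {
      /* run the init work on a CPU close to the device's NUMA node;
       * async_schedule_near(my_dev_init, dev, dev_to_node(dev)) is the
       * equivalent spelling when there is no struct device at hand */
      async_schedule_dev(my_dev_init, dev);
  }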
2019-01-31workqueue: Provide queue_work_node to queue work near a given NUMA nodeAlexander Duyck
Provide a new function, queue_work_node, which is meant to schedule work on a "random" CPU of the requested NUMA node. The main motivation for this is to help assist asynchronous init to better improve boot times for devices that are local to a specific node. For now we just default to the first CPU that is in the intersection of the cpumask of the node and the online cpumask. The only exception is if the CPU is local to the node we will just use the current CPU. This should work for our purposes as we are currently only using this for unbound work so the CPU will be translated to a node anyway instead of being directly used. As we are only using the first CPU to represent the NUMA node for now I am limiting the scope of the function so that it can only be used with unbound workqueues. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
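A minimal, hypothetical caller (the work item and device are made up; per the text above, only unbound workqueues such as system_unbound_wq are accepted):

  #include <linux/workqueue.h>
  #include <linux/device.h>

  static void my_work_fn(struct work_struct *work)
  {
      /* runs on a CPU of the requested node, or the local CPU if already there */
  }

  static DECLARE_WORK(my_work, my_work_fn);

  static void kick_node_local_work(struct device *dev)
  {
      queue_work_node(dev_to_node(dev), system_unbound_wq, &my_work);
  }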
2019-01-31relay: check return of create_buf_file() properlyGreg Kroah-Hartman
If create_buf_file() returns an error, don't try to reference it later as a valid dentry pointer. This problem was exposed when debugfs started to return errors instead of just NULL for some calls when they do not succeed properly. Also, the check for WARN_ON(dentry) was just wrong :) Reported-by: Kees Cook <keescook@chromium.org> Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
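The shape of the fix is to treat the callback's return value as a possible ERR_PTR() instead of assuming NULL-or-valid; an illustrative sketch of the kernel/relay.c call site (not the verbatim hunk):

  dentry = chan->cb->create_buf_file(tmpname, chan->parent, S_IRUSR,
                                     buf, &chan->is_global);
  if (IS_ERR(dentry))          /* debugfs can now hand back error values */
      goto free_name;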
2019-01-31bpf, cgroups: clean up kerneldoc warningsValdis Kletnieks
Building with W=1 reveals some bitrot: CC kernel/bpf/cgroup.o kernel/bpf/cgroup.c:238: warning: Function parameter or member 'flags' not described in '__cgroup_bpf_attach' kernel/bpf/cgroup.c:367: warning: Function parameter or member 'unused_flags' not described in '__cgroup_bpf_detach' Add a kerneldoc line for 'flags'. Fixing the warning for 'unused_flags' is best approached by removing the unused parameter on the function call. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-31bpf: fix bitrotted kerneldocValdis Kletnieks
Over the years, the function signature has changed, but the kerneldoc block hasn't. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-30audit: ignore fcaps on umountRichard Guy Briggs
Don't fetch fcaps when umount2 is called to avoid a process hang while it waits for the missing resource to (possibly never) re-appear. Note the comment above user_path_mountpoint_at(): * A umount is a special case for path walking. We're not actually interested * in the inode in this situation, and ESTALE errors can be a problem. We * simply want track down the dentry and vfsmount attached at the mountpoint * and avoid revalidating the last component. This can happen on ceph, cifs, 9p, lustre, fuse (gluster) or NFS. Please see the github issue tracker https://github.com/linux-audit/audit-kernel/issues/100 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: merge fuzz in audit_log_fcaps()] Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-01-30cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVMJosh Poimboeuf
With the following commit: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") ... the hotplug code attempted to detect when SMT was disabled by BIOS, in which case it reported SMT as permanently disabled. However, that code broke a virt hotplug scenario, where the guest is booted with only primary CPU threads, and a sibling is brought online later. The problem is that there doesn't seem to be a way to reliably distinguish between the HW "SMT disabled by BIOS" case and the virt "sibling not yet brought online" case. So the above-mentioned commit was a bit misguided, as it permanently disabled SMT for both cases, preventing future virt sibling hotplugs. Going back and reviewing the original problems which were attempted to be solved by that commit, when SMT was disabled in BIOS: 1) /sys/devices/system/cpu/smt/control showed "on" instead of "notsupported"; and 2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning. I'd propose that we instead consider #1 above to not actually be a problem. Because, at least in the virt case, it's possible that SMT wasn't disabled by BIOS and a sibling thread could be brought online later. So it makes sense to just always default the smt control to "on" to allow for that possibility (assuming cpuid indicates that the CPU supports SMT). The real problem is #2, which has a simple fix: change vmx_vm_init() to query the actual current SMT state -- i.e., whether any siblings are currently online -- instead of looking at the SMT "control" sysfs value. So fix it by: a) reverting the original "fix" and its followup fix: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") bc2d8d262cba ("cpu/hotplug: Fix SMT supported evaluation") and b) changing vmx_vm_init() to query the actual current SMT state -- instead of the sysfs control value -- to determine whether the L1TF warning is needed. This also requires the 'sched_smt_present' variable to exported, instead of 'cpu_smt_control'. Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Joe Mario <jmario@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
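The KVM side of the change then boils down to asking the scheduler whether sibling threads are actually online, rather than reading the sysfs control state; roughly (a sketch, not the exact vmx.c hunk):

  /* vmx_vm_init(): warn only if SMT siblings are really online */
  if (sched_smt_active())
      pr_warn_once(L1TF_MSG_SMT);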
2019-01-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2019-01-29bpf: btf: allow typedef func_protoYonghong Song
Current implementation does not allow typedef func_proto. But it is actually allowed.

  -bash-4.4$ cat t.c
  typedef int (f) (int);
  f *g;
  -bash-4.4$ clang -O2 -g -c -target bpf t.c -Xclang -target-feature -Xclang +dwarfris
  -bash-4.4$ pahole -JV t.o
  File t.o:
  [1] PTR (anon) type_id=2
  [2] TYPEDEF f type_id=3
  [3] FUNC_PROTO (anon) return=4 args=(4 (anon))
  [4] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
  -bash-4.4$

This patch relaxes the btf verifier to allow such (typedef func_proto) patterns.

Fixes: 2667a2626f4d ("bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-29x86/speculation: Remove redundant arch_smt_update() invocationZhenzhong Duan
With commit a74cfffb03b7 ("x86/speculation: Rework SMT state change"), arch_smt_update() is invoked from each individual CPU hotplug function. Therefore the extra arch_smt_update() call in the sysfs SMT control is redundant. Fixes: a74cfffb03b7 ("x86/speculation: Rework SMT state change") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <konrad.wilk@oracle.com> Cc: <dwmw@amazon.co.uk> Cc: <bp@suse.de> Cc: <srinivas.eeda@oracle.com> Cc: <peterz@infradead.org> Cc: <hpa@zytor.com> Link: https://lkml.kernel.org/r/e2e064f2-e8ef-42ca-bf4f-76b612964752@default
2019-01-29Merge branch 'devx-async' into k.o/for-nextJason Gunthorpe
Yishai Hadas says: Enable DEVX asynchronous query commands This series enables querying a DEVX object in an asynchronous mode. The userspace application won't block when calling the firmware and it will be able to get the response back once that it will be ready. To enable the above functionality: - DEVX asynchronous command completion FD object was introduced. - The applicable file operations were implemented to enable using it by the user application. - Query asynchronous method was added to the DEVX object, it will call the firmware asynchronously and manages the response on the given input FD. - Hot unplug support was added for the FD to work properly upon unbind/disassociate. - mlx5 core fence for asynchronous commands was implemented and used to prevent racing upon unbind/disassociate. This branch is based on mlx5-next & v5.0-rc2 due to dependencies, from git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux * branch 'devx-async': IB/mlx5: Implement DEVX hot unplug for async command FD IB/mlx5: Implement the file ops of DEVX async command FD IB/mlx5: Introduce async DEVX obj query API IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-29futex: No need to check return value of debugfs_create functionsGreg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190122152151.16139-40-gregkh@linuxfoundation.org
2019-01-29timers: Mark expected switch fall-throughsGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where fall through is indeed expected. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Stephen Boyd <sboyd@kernel.org> Link: https://lkml.kernel.org/r/20190123081413.GA3949@embeddedor
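The annotation is simply a comment that the fall-through checker recognizes as intentional; a generic, hypothetical example (STATE_*, setup() and run() are made up):

  static void step(int state)
  {
      switch (state) {
      case STATE_PREPARE:
          setup();
          /* fall through */    /* silences -Wimplicit-fallthrough */
      case STATE_RUN:
          run();
          break;
      default:
          break;
      }
  }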
2019-01-29timekeeping/debug: No need to check return value of debugfs_create functionsGreg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Stephen Boyd <sboyd@kernel.org> Link: https://lkml.kernel.org/r/20190122152151.16139-43-gregkh@linuxfoundation.org
2019-01-29genirq/debugfs: No need to check return value of debugfs_create functionsGreg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20190122152151.16139-50-gregkh@linuxfoundation.org
2019-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-01-29 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Teach verifier dead code removal, this also allows for optimizing / removing conditional branches around dead code and to shrink the resulting image. Code store constrained architectures like nfp would have hard time doing this at JIT level, from Jakub. 2) Add JMP32 instructions to BPF ISA in order to allow for optimizing code generation for 32-bit sub-registers. Evaluation shows that this can result in code reduction of ~5-20% compared to 64 bit-only code generation. Also add implementation for most JITs, from Jiong. 3) Add support for __int128 types in BTF which is also needed for vmlinux's BTF conversion to work, from Yonghong. 4) Add a new command to bpftool in order to dump a list of BPF-related parameters from the system or for a specific network device e.g. in terms of available prog/map types or helper functions, from Quentin. 5) Add AF_XDP sock_diag interface for querying sockets from user space which provides information about the RX/TX/fill/completion rings, umem, memory usage etc, from Björn. 6) Add skb context access for skb_shared_info->gso_segs field, from Eric. 7) Add support for testing flow dissector BPF programs by extending existing BPF_PROG_TEST_RUN infrastructure, from Stanislav. 8) Split BPF kselftest's test_verifier into various subgroups of tests in order better deal with merge conflicts in this area, from Jakub. 9) Add support for queue/stack manipulations in bpftool, from Stanislav. 10) Document BTF, from Yonghong. 11) Dump supported ELF section names in libbpf on program load failure, from Taeung. 12) Silence a false positive compiler warning in verifier's BTF handling, from Peter. 13) Fix help string in bpftool's feature probing, from Prashant. 14) Remove duplicate includes in BPF kselftests, from Yue. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single regression fix to address the unintended breakage of posix cpu timers. This is caused by a new sanity check in the common code, which fails for posix cpu timers under certain conditions because the posix cpu timer code never updates the variable which is checked" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Unbreak timer rearming
2019-01-27Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "A small series of fixes which all address possible missed wakeups: - Document and fix the wakeup ordering of wake_q - Add the missing barrier in rcuwait_wake_up(), which was documented in the comment but missing in the code - Fix the possible missed wakeups in the rwsem and futex code" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Fix (possible) missed wakeup futex: Fix (possible) missed wakeup sched/wake_q: Fix wakeup ordering for wake_q sched/wake_q: Document wake_q_add() sched/wait: Fix rcuwait_wake_up() ordering
2019-01-27Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A small set of fixes for the interrupt subsystem: - Fix a double increment in the irq descriptor allocator which resulted in a sanity check only being done for every second affinity mask - Add a missing device tree translation in the stm32-exti driver. Without that the interrupt association is completely wrong. - Initialize the mutex in the GIC-V3 MBI driver - Fix the alignment for aliasing devices in the GIC-V3-ITS driver so multi MSI allocations work correctly - Ensure that the initial affinity of a interrupt is not empty at startup time. - Drop bogus include in the madera irq chip driver - Fix KernelDoc regression" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size genirq/irqdesc: Fix double increment in alloc_descs() genirq: Fix the kerneldoc comment for struct irq_affinity_desc irqchip/madera: Drop GPIO includes irqchip/gic-v3-mbi: Fix uninitialized mbi_lock irqchip/stm32-exti: Add domain translate function genirq: Make sure the initial affinity is not empty
2019-01-27sched/fair: Fix unnecessary increase of balance intervalVincent Guittot
In case of active balancing, we increase the balance interval to cover pinned tasks cases not covered by all_pinned logic. Nevertheless, the active migration triggered by asym packing should be treated as the normal unbalanced case and reset the interval to the default value, otherwise active migration for asym_packing can be easily delayed for hundreds of ms because of this pinned task detection mechanism. The same happens to other conditions tested in need_active_balance() like misfit task and when the capacity of src_cpu is reduced compared to dst_cpu (see comments in need_active_balance() for details). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: valentin.schneider@arm.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-27sched/fair: Fix rounding bug for asym packingVincent Guittot
When check_asym_packing() is triggered, the imbalance is set to: busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE But busiest_stat.avg_load equals: sgs->group_load * SCHED_CAPACITY_SCALE / sgs->group_capacity These divisions can generate a rounding that will make imbalance slightly lower than the weighted load of the cfs_rq. But this is enough to skip the rq in find_busiest_queue() and prevents asym migration from happening. Directly set imbalance to busiest's sgs->group_load to remove the rounding. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: valentin.schneider@arm.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-27sched/fair: Trigger asym_packing during idle load balanceVincent Guittot
Newly idle load balancing is not always triggered when a CPU becomes idle. This prevents the scheduler from getting a chance to migrate the task for asym packing. Enable active migration during idle load balance too. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: valentin.schneider@arm.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-27sched/fair: Robustify CFS-bandwidth timer lockingPeter Zijlstra
Traditionally hrtimer callbacks were run with IRQs disabled, but with the introduction of HRTIMER_MODE_SOFT it is possible they run from SoftIRQ context, which does _NOT_ have IRQs disabled. Allow for the CFS bandwidth timers (period_timer and slack_timer) to be run from SoftIRQ context; this entails removing the assumption that IRQs are already disabled from the locking. While mainline doesn't strictly need this, -RT forces all timers not explicitly marked with MODE_HARD into MODE_SOFT and trips over this. And marking these timers as MODE_HARD doesn't make sense as they're not required for RT operation and can potentially be quite expensive. Reported-by: Tom Putzeys <tom.putzeys@be.atlascopco.com> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190107125231.GE14122@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
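In practice the bandwidth-timer callbacks switch from the plain lock to the irqsave variant; an illustrative before/after, where do_period_work() stands in for the real callback body:

  unsigned long flags;

  /* before: assumed the hrtimer callback ran with IRQs already disabled */
  raw_spin_lock(&cfs_b->lock);
  do_period_work(cfs_b);
  raw_spin_unlock(&cfs_b->lock);

  /* after: also correct when the timer fires from SoftIRQ context (-RT) */
  raw_spin_lock_irqsave(&cfs_b->lock, flags);
  do_period_work(cfs_b);
  raw_spin_unlock_irqrestore(&cfs_b->lock, flags);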
2019-01-27sched/core: Give DCE a fighting chancePeter Zijlstra
All that fancy new Energy-Aware scheduling foo is hidden behind a static_key, which is awesome if you have the stuff enabled in your config. However, when you lack all the prerequisites it doesn't make any sense to pretend we'll ever actually run this, so provide a little more clue to the compiler so it can more aggressively delete the code.

     text    data     bss     dec     hex  filename
    50297     976      96   51369    c8a9  defconfig-build/kernel/sched/fair.o
    49227     944      96   50267    c45b  defconfig-build/kernel/sched/fair.o

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-27sched/topology: Introduce a sysctl for Energy Aware SchedulingQuentin Perret
In its current state, Energy Aware Scheduling (EAS) starts automatically on asymmetric platforms having an Energy Model (EM). However, there are users who want to have an EM (for thermal management for example), but don't want EAS with it. In order to let users disable EAS explicitly, introduce a new sysctl called 'sched_energy_aware'. It is enabled by default so that EAS can start automatically on platforms where it makes sense. Flipping it to 0 rebuilds the scheduling domains and disables EAS. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adharmap@codeaurora.org Cc: chris.redpath@arm.com Cc: currojerez@riseup.net Cc: dietmar.eggemann@arm.com Cc: edubezval@gmail.com Cc: gregkh@linuxfoundation.org Cc: javi.merino@kernel.org Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: pkondeti@codeaurora.org Cc: rjw@rjwysocki.net Cc: skannan@codeaurora.org Cc: smuckle@google.com Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Cc: tkjos@google.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: https://lkml.kernel.org/r/20181203095628.11858-11-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-01-26bpf: JIT blinds support JMP32Jiong Wang
This patch adds JIT blinding support for JMP32. Like BPF_JMP_REG/IMM, JMP32 versions are needed for building raw bpf insns. They are added to both include/linux/filter.h and tools/include/linux/filter.h. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
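With the new macros in include/linux/filter.h, a raw-insn snippet can express a 32-bit conditional jump directly; a small hypothetical example (the surrounding function exists only to hold the array):

  #include <linux/filter.h>

  static void build_jmp32_sample(void)
  {
      struct bpf_insn prog[] = {
          BPF_MOV64_IMM(BPF_REG_0, 0),
          /* compare only the low 32 bits of r1 against 7 */
          BPF_JMP32_IMM(BPF_JEQ, BPF_REG_1, 7, 1),
          BPF_MOV64_IMM(BPF_REG_0, 1),
          BPF_EXIT_INSN(),
      };

      (void)prog;    /* sketch only; normally handed to the loader */
  }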
2019-01-26bpf: interpreter support for JMP32Jiong Wang
This patch implements interpreting new JMP32 instructions. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-26bpf: disassembler support JMP32Jiong Wang
This patch teaches the disassembler about JMP32. There are two places to update: - Class 0x6 is now used by BPF_JMP32, not "unused". - BPF_JMP32 needs to show comparison operands properly. The disassemble format is to add an extra "(32)" before the operands if it is a sub-register. A better disassemble format for both JMP32 and ALU32 would just show the register prefix as "w" instead of "r"; this is the format used by the LLVM assembler. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-26bpf: verifier support JMP32Jiong Wang
This patch teaches the verifier about the new BPF_JMP32 instruction class. The verifier needs to treat it similarly to the existing BPF_JMP class: a BPF_JMP32 insn needs to go through all the checks that have been done on BPF_JMP. Also, the verifier does runtime optimizations based on the extra info a conditional jump instruction can offer, especially when the comparison is between a constant and a register, in which case the value range of the register can be improved based on the comparison result. This code is updated accordingly. Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-26bpf: refactor verifier min/max code for condition jumpJiong Wang
The current min/max code does both signed and unsigned comparisons against the input argument "val", which is "u64", and there is explicit type casting when the comparison is signed. As we will need slightly more complex type casting when JMP32 is introduced, it is better to hoist the signed type casting. This makes the code cleaner with ignorable runtime overhead. Also, the code for J*GE/GT/LT/LE and JEQ/JNE is very similar, so this patch combines them. The main purpose of this refactor is to make sure the min/max code will still be readable and with minimum code duplication after JMP32 is introduced. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-25rcuperf: Stop abusing IS_ENABLED()Paul E. McKenney
The ever-evolving IS_ENABLED() macro is intended for CONFIG_* Kconfig options, but rcuperf currently uses it for the decidedly non-CONFIG_* MODULE macro. In the spirit of not inviting trouble, this commit substitutes tried-and-true #ifdef. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org>
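The substitution is the obvious one; an illustrative (not verbatim) before/after, where do_builtin_only_setup() is a made-up stand-in for the rcuperf code in question:

  /* before: IS_ENABLED() happens to work because -DMODULE defines the
   * macro to 1, but IS_ENABLED() is meant for CONFIG_* symbols */
  if (!IS_ENABLED(MODULE))
      do_builtin_only_setup();

  /* after: a plain preprocessor test says what is meant */
  #ifndef MODULE
  do_builtin_only_setup();
  #endif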