summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-05-31sched/core: Require cpu_active() in select_task_rq(), for user tasksPaul Burton
select_task_rq() is used in a few paths to select the CPU upon which a thread should be run - for example it is used by try_to_wake_up() & by fork or exec balancing. As-is it allows use of any online CPU that is present in the task's cpus_allowed mask. This presents a problem because there is a period whilst CPUs are brought online where a CPU is marked online, but is not yet fully initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state < CPUHP_ONLINE. Usually we don't run any user tasks during this window, but there are corner cases where this can happen. An example observed is: - Some user task A, running on CPU X, forks to create task B. - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's task_struct::cpu field to X. - CPU X is offlined. - Task A, currently somewhere between the __set_task_cpu() in copy_process() and the call to wake_up_new_task(), is migrated to CPU Y by migrate_tasks() when CPU X is offlined. - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The scheduler is now active on CPU X, but there are no user tasks on the runqueue. - Task A runs on CPU Y & reaches wake_up_new_task(). This calls select_task_rq() with cpu=X, taken from task B's task_struct, and select_task_rq() allows CPU X to be returned. - Task A enqueues task B on CPU X's runqueue, via activate_task() & enqueue_task(). - CPU X now has a user task on its runqueue before it has reached the CPUHP_ONLINE state. In most cases, the user tasks that schedule on the newly onlined CPU have no idea that anything went wrong, but one case observed to be problematic is if the task goes on to invoke the sched_setaffinity syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state before the CPU that brought it online calls stop_machine_unpark(). This means that for a portion of the window of time between CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct cpu_stopper has its enabled field set to false. If a user thread is executed on the CPU during this window and it invokes sched_setaffinity with a CPU mask that does not include the CPU it's running on, then when __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke migration_cpu_stop() and perform the actual migration away from the CPU it will simply return -ENOENT rather than calling migration_cpu_stop(). We then return from the sched_setaffinity syscall back to the user task that is now running on a CPU which it just asked not to run on, and which is not present in its cpus_allowed mask. This patch resolves the problem by having select_task_rq() enforce that user tasks run on CPUs that are active - the same requirement that select_fallback_rq() already enforces. This should ensure that newly onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to schedule user tasks, and also implies that bringup_wait_for_ap() will have called stop_machine_unpark() which resolves the sched_setaffinity issue above. I haven't yet investigated them, but it may be of interest to review whether any of the actions performed by hotplug states between CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended effects on user tasks that might schedule before they are reached, which might widen the scope of the problem from just affecting the behaviour of sched_setaffinity. Signed-off-by: Paul Burton <paul.burton@mips.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-31sched/core: Fix rules for running on online && !active CPUsPeter Zijlstra
As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules for running on an online && !active CPU are stricter than just being a kthread, you need to be a per-cpu kthread. If you're not strictly per-CPU, you have better CPUs to run on and don't need the partially booted one to get your work done. The exception is to allow smpboot threads to bootstrap the CPU itself and get kernel 'services' initialized before we allow userspace on it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs") Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-30bpf: devmap: remove redundant assignment of dev = devColin Ian King
The assignment dev = dev is redundant and should be removed. Detected by CoverityScan, CID#1469486 ("Evaluation order violation") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-30media: rc: introduce BPF_PROG_LIRC_MODE2Sean Young
Add support for BPF_PROG_LIRC_MODE2. This type of BPF program can call rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report that the last key should be repeated. The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; the target_fd must be the /dev/lircN device. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-30bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not foundSean Young
This makes is it possible for bpf prog detach to return -ENOENT. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-30Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU fix from Paul E. McKenney: "This additional v4.18 pull request contains a single commit that fell through the cracks: Provide early rcu_cpu_starting() callback for the benefit of the x86/mtrr code, which needs RCU to be available on incoming CPUs earlier than has been the case in the past." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-29tracing: Allow histogram triggers to access ftrace internal eventsSteven Rostedt (VMware)
Now that trace_marker can have triggers, including a histogram triggers, the onmatch() and onmax() access the trace event. To do so, the search routine to find the event file needs to use the raw __find_event_file() that does not filter out ftrace events. Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29tracing: Have zero size length in filter logic be full stringSteven Rostedt (VMware)
As strings in trace events may not have a nul terminating character, the filter string compares use the defined string length for the field for the compares. The trace_marker records data slightly different than do normal events. It's size is zero, meaning that the string is the rest of the array, and that the string also ends with '\0'. If the size is zero, assume that the string is nul terminated and read the string in question as is. Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29tracing: Add trigger file for trace_markers tracefs/ftrace/printSteven Rostedt (VMware)
Allow writing to the trace_markers file initiate triggers defined in tracefs/ftrace/print/trigger file. This will allow of user space to trigger the same type of triggers (including histograms) that the trace events use. Had to create a ftrace_event_register() function that will become the trace_marker print event's reg() function. This is required because of how triggers are enabled: event_trigger_write() { event_trigger_regex_write() { trigger_process_regex() { for p in trigger_commands { p->func(); /* trigger_snapshot_cmd->func */ event_trigger_callback() { cmd_ops->reg() /* register_trigger() */ { trace_event_trigger_enable_disable() { trace_event_enable_disable() { call->class->reg(); Without the reg() function, the trigger code will call a NULL pointer and crash the system. Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Clark Williams <williams@redhat.com> Cc: Karim Yaghmour <karim.yaghmour@opersys.com> Cc: Brendan Gregg <bgregg@netflix.com> Suggested-by: Joel Fernandes <joelaf@google.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29Merge tag 'trace-v4.17-rc4-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "While writing selftests for a new feature, I triggered two existing bugs that deal with triggers and instances. - a generic trigger bug where the triggers are not removed from a linked list properly when deleting an instance. - a bug specific to snapshots, where the snapshot is done in the top level buffer, when it is supposed to snapshot the buffer associated to the instance the snapshot trigger exists in" * tag 'trace-v4.17-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Make the snapshot trigger work with instances tracing: Fix crash when freeing instances with event triggers
2018-05-29tracing: Do not show filter file for ftrace internal eventsSteven Rostedt (VMware)
The filter file in the ftrace internal events, like in /sys/kernel/tracing/events/ftrace/function/filter is not attached to any functionality. Do not create them as they are meaningless. In the future, if an ftrace internal event gets filter functionality, then it will need to create it directly. Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29tracing: Add brackets in ftrace event dynamic arraysSteven Rostedt (VMware)
The dynamic arrays defined for ftrace internal events, such as the buf field for trace_marker (ftrace/print) did not have brackets which makes the filter code not accept it as a string. This is not currently an issues because the filter code doesn't do anything for these events, but they will in the future, and this needs to be fixed for when it does. Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29tracing: Have event_trace_init() called by trace_init_tracefs()Steven Rostedt (VMware)
Instead of having both trace_init_tracefs() and event_trace_init() be called by fs_initcall() routines, have event_trace_init() called directly by trace_init_tracefs(). This will guarantee order of how the events are created with respect to the rest of the ftrace infrastructure. This is needed to be able to assoctiate event files with ftrace internal events, such as the trace_marker. Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29tracing: Add __find_event_file() to find event files without restrictionsSteven Rostedt (VMware)
By adding the function __find_event_file() that can search for files without restrictions, such as if the event associated with the file has a reg function, or if it has the "ignore" flag set, the files that are associated to ftrace internal events (like trace_marker and function events) can be found and used. find_event_file() still returns a "filtered" file, as most callers need a valid trace event file. One created by the trace_events.h macros and not one created for parsing ftrace specific events. Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-29tracing: Do not reference event data in post call triggersSteven Rostedt (VMware)
Trace event triggers can be called before or after the event has been committed. If it has been called after the commit, there's a possibility that the event no longer exists. Currently, the two post callers is the trigger to disable tracing (traceoff) and the one that will record a stack dump (stacktrace). Neither of them reference the trace event entry record, as that would lead to a race condition that could pass in corrupted data. To prevent any other users of the post data triggers from using the trace event record, pass in NULL to the post call trigger functions for the event record, as they should never need to use them in the first place. This does not fix any bug, but prevents bugs from happening by new post call trigger users. Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-28PM / QoS: Drop redundant declaration of pm_qos_get_value()Rafael J. Wysocki
The extra forward declaration of pm_qos_get_value() is redundant, so drop it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-28tracepoints: Fix the descriptions of tracepoint_probe_register{_prio}Lee, Chun-Yi
The description of tracepoint_probe_register duplicates with tracepoint_probe_register_prio. This patch cleans up the descriptions. Link: http://lkml.kernel.org/r/20170616082643.7311-1-jlee@suse.com Signed-off-by: "Lee, Chun-Yi" <jlee@suse.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-28tracing: Make the snapshot trigger work with instancesSteven Rostedt (VMware)
The snapshot trigger currently only affects the main ring buffer, even when it is used by the instances. This can be confusing as the snapshot trigger is listed in the instance. > # cd /sys/kernel/tracing > # mkdir instances/foo > # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger > # echo top buffer > trace_marker > # echo foo buffer > instances/foo/trace_marker > # touch /tmp/bar > # chown rostedt /tmp/bar > # cat instances/foo/snapshot # tracer: nop # # # * Snapshot is freed * # # Snapshot commands: # echo 0 > snapshot : Clears and frees snapshot buffer # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated. # Takes a snapshot of the main buffer. # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free) # (Doesn't have to be '2' works with any number that # is not a '0' or '1') > # cat snapshot # tracer: nop # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | bash-1189 [000] .... 111.488323: tracing_mark_write: top buffer Not only did the snapshot occur in the top level buffer, but the instance snapshot buffer should have been allocated, and it is still free. Cc: stable@vger.kernel.org Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-28bpf: Hooks for sys_sendmsgAndrey Ignatov
In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg. It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that provides access to socket itlself (properties like family, type, protocol) and user-passed `struct sockaddr *` so that BPF program can override destination IP and port for system calls such as sendto(2) or sendmsg(2) and/or assign source IP to the socket. The hooks are implemented as two new attach types: `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and UDPv6 correspondingly. UDPv4 and UDPv6 separate attach types for same reason as sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. The difference with already existing hooks is sys_sendmsg are implemented only for unconnected UDP. For TCP it doesn't make sense to change user-provided `struct sockaddr *` at sendto(2)/sendmsg(2) time since socket either was already connected and has source/destination set or wasn't connected and call to sendto(2)/sendmsg(2) would lead to ENOTCONN anyway. Connected UDP is already handled by sys_connect hooks that can override source/destination at connect time and use fast-path later, i.e. these hooks don't affect UDP fast-path. Rewriting source IP is implemented differently than that in sys_connect hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to just bind socket to desired local IP address since source IP can be set on per-packet basis by using ancillary data (cmsg(3)). So no matter if socket is bound or not, source IP has to be rewritten on every call to sys_sendmsg. To do so two new fields are added to UAPI `struct bpf_sock_addr`; * `msg_src_ip4` to set source IPv4 for UDPv4; * `msg_src_ip6` to set source IPv6 for UDPv6. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-28bpf: avoid -Wmaybe-uninitialized warningArnd Bergmann
The stack_map_get_build_id_offset() function is too long for gcc to track whether 'work' may or may not be initialized at the end of it, leading to a false-positive warning: kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset': kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this function [-Werror=maybe-uninitialized] This removes the 'in_nmi_ctx' flag and uses the state of that variable itself to see if it got initialized. Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-28bpf: btf: avoid -Wreturn-type warningArnd Bergmann
gcc warns about a noreturn function possibly returning in some configurations: kernel/bpf/btf.c: In function 'env_type_is_resolve_sink': kernel/bpf/btf.c:729:1: error: control reaches end of non-void function [-Werror=return-type] Using BUG() instead of BUG_ON() avoids that warning and otherwise does the exact same thing. Fixes: eb3f595dab40 ("bpf: btf: Validate type reference") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-27tracing: Fix crash when freeing instances with event triggersSteven Rostedt (VMware)
If a instance has an event trigger enabled when it is freed, it could cause an access of free memory. Here's the case that crashes: # cd /sys/kernel/tracing # mkdir instances/foo # echo snapshot > instances/foo/events/initcall/initcall_start/trigger # rmdir instances/foo Would produce: general protection fault: 0000 [#1] PREEMPT SMP PTI Modules linked in: tun bridge ... CPU: 5 PID: 6203 Comm: rmdir Tainted: G W 4.17.0-rc4-test+ #933 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:clear_event_triggers+0x3b/0x70 RSP: 0018:ffffc90003783de0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0 RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016 R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0 R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c FS: 00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0 Call Trace: event_trace_del_tracer+0x2a/0xc5 instance_rmdir+0x15c/0x200 tracefs_syscall_rmdir+0x52/0x90 vfs_rmdir+0xdb/0x160 do_rmdir+0x16d/0x1c0 __x64_sys_rmdir+0x17/0x20 do_syscall_64+0x55/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe This was due to the call the clears out the triggers when an instance is being deleted not removing the trigger from the link list. Cc: stable@vger.kernel.org Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-27PM / hibernate: Fix oops at snapshot_write()Tetsuo Handa
syzbot is reporting NULL pointer dereference at snapshot_write() [1]. This is because data->handle is zero-cleared by ioctl(SNAPSHOT_FREE). Fix this by checking data_of(data->handle) != NULL before using it. [1] https://syzkaller.appspot.com/bug?id=828a3c71bd344a6de8b6a31233d51a72099f27fd Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+ae590932da6e45d6564d@syzkaller.appspotmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-27PM / wakeup: Make s2idle_lock a RAW_SPINLOCKSebastian Andrzej Siewior
The `s2idle_lock' is acquired during suspend while interrupts are disabled even on RT. The lock is acquired for short sections only. Make it a RAW lock which avoids "sleeping while atomic" warnings on RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-27PM / s2idle: Make s2idle_wait_head swait basedSebastian Andrzej Siewior
s2idle_wait_head is used during s2idle with interrupts disabled even on RT. There is no "custom" wake up function so swait could be used instead which is also lower weight compared to the wait_queue. Make s2idle_wait_head a swait_queue_head. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-27PM / suspend: Prevent might sleep splatsThomas Gleixner
timekeeping suspend/resume calls read_persistent_clock() which takes rtc_lock. That results in might sleep warnings because at that point we run with interrupts disabled. We cannot convert rtc_lock to a raw spinlock as that would trigger other might sleep warnings. As a workaround we disable the might sleep warnings by setting system_state to SYSTEM_SUSPEND before calling sysdev_suspend() and restoring it to SYSTEM_RUNNING afer sysdev_resume(). There is no lock contention because hibernate / suspend to RAM is single-CPU at this point. In s2idle's case the system_state is set to SYSTEM_SUSPEND before timekeeping_suspend() which is invoked by the last CPU. In the resume case it set back to SYSTEM_RUNNING after timekeeping_resume() which is invoked by the first CPU in the resume case. The other CPUs will block on tick_freeze_lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: cover s2idle in tick_freeze() / tick_unfreeze()] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Lots of easy overlapping changes in the confict resolutions here. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Three fixes for scheduler and kthread code: - allow calling kthread_park() on an already parked thread - restore the sched_pi_setprio() tracepoint behaviour - clarify the unclear string for the scheduling domain debug output" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched, tracing: Fix trace_sched_pi_setprio() for deboosting kthread: Allow kthread_park() on a parked kthread sched/topology: Clarify root domain(s) debug string
2018-05-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "16 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: kasan: fix memory hotplug during boot kasan: free allocated shadow memory on MEM_CANCEL_ONLINE checkpatch: fix macro argument precedence test init/main.c: include <linux/mem_encrypt.h> kernel/sys.c: fix potential Spectre v1 issue mm/memory_hotplug: fix leftover use of struct page during hotplug proc: fix smaps and meminfo alignment mm: do not warn on offline nodes unless the specific node is explicitly requested mm, memory_hotplug: make has_unmovable_pages more robust mm/kasan: don't vfree() nonexistent vm_area MAINTAINERS: change hugetlbfs maintainer and update files ipc/shm: fix shmat() nil address after round-down when remapping Revert "ipc/shm: Fix shmat mmap nil-page protection" idr: fix invalid ptr dereference on item delete ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" mm: fix nr_rotate_swap leak in swapon() error case
2018-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Let's begin the holiday weekend with some networking fixes: 1) Whoops need to restrict cfg80211 wiphy names even more to 64 bytes. From Eric Biggers. 2) Fix flags being ignored when using kernel_connect() with SCTP, from Xin Long. 3) Use after free in DCCP, from Alexey Kodanev. 4) Need to check rhltable_init() return value in ipmr code, from Eric Dumazet. 5) XDP handling fixes in virtio_net from Jason Wang. 6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu. 7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack Morgenstein. 8) Prevent out-of-bounds speculation using indexes in BPF, from Daniel Borkmann. 9) Fix regression added by AF_PACKET link layer cure, from Willem de Bruijn. 10) Correct ENIC dma mask, from Govindarajulu Varadarajan. 11) Missing config options for PMTU tests, from Stefano Brivio" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits) ibmvnic: Fix partial success login retries selftests/net: Add missing config options for PMTU tests mlx4_core: allocate ICM memory in page size chunks enic: set DMA mask to 47 bit ppp: remove the PPPIOCDETACH ioctl ipv4: remove warning in ip_recv_error net : sched: cls_api: deal with egdev path only if needed vhost: synchronize IOTLB message with dev cleanup packet: fix reserve calculation net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands net/mlx5e: When RXFCS is set, add FCS data into checksum calculation bpf: properly enforce index mask to prevent out-of-bounds speculation net/mlx4: Fix irq-unsafe spinlock usage net: phy: broadcom: Fix bcm_write_exp() net: phy: broadcom: Fix auxiliary control register reads net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message ibmvnic: Only do H_EOI for mobility events tuntap: correctly set SOCKWQ_ASYNC_NOSPACE virtio-net: fix leaking page for gso packet during mergeable XDP ...
2018-05-25kernel/sys.c: fix potential Spectre v1 issueGustavo A. R. Silva
`resource' can be controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap) Fix this by sanitizing *resource* before using it to index current->signal->rlim Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-05-24 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix a bug in the original fix to prevent out of bounds speculation when multiple tail call maps from different branches or calls end up at the same tail call helper invocation, from Daniel. 2) Two selftest fixes, one in reuseport_bpf_numa where test is skipped in case of missing numa support and another one to update kernel config to properly support xdp_meta.sh test, from Anders. ... Would be great if you have a chance to merge net into net-next after that. The verifier fix would be needed later as a dependency in bpf-next for upcomig work there. When you do the merge there's a trivial conflict on BPF side with 849fa50662fb ("bpf/verifier: refine retval R0 state for bpf_get_stack helper"): Resolution is to keep both functions, the do_refine_retval_range() and record_func_map(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25Merge back PM core material for v4.18.Rafael J. Wysocki
2018-05-25locking/rwsem: Simplify the is-owner-spinnable checksOleg Nesterov
Add the trivial owner_on_cpu() helper for rwsem_can_spin_on_owner() and rwsem_spin_on_owner(), it also allows to make rwsem_can_spin_on_owner() a bit more clear. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Theodore Y. Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180518165534.GA22348@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25perf/core: Wire up compat PERF_EVENT_IOC_QUERY_BPF, ↵Eugene Syromiatnikov
PERF_EVENT_IOC_MODIFY_ATTRIBUTES Since pointer size is different in compat, and switching in _perf_ioctl is done using exact ioctl numbers, all new ioctl numbers that use pointer should be added to perf_compat_ioctl for _IOC_SIZE fixup before passing to perf_ioctl routine (this shouldn't be needed if semantics of the size argument of _IO* macros was honored). Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20180521123420.GA24291@asgard.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25perf/core: Fix bad use of igrab()Song Liu
As Miklos reported and suggested: "This pattern repeats two times in trace_uprobe.c and in kernel/events/core.c as well: ret = kern_path(filename, LOOKUP_FOLLOW, &path); if (ret) goto fail_address_parse; inode = igrab(d_inode(path.dentry)); path_put(&path); And it's wrong. You can only hold a reference to the inode if you have an active ref to the superblock as well (which is normally through path.mnt) or holding s_umount. This way unmounting the containing filesystem while the tracepoint is active will give you the "VFS: Busy inodes after unmount..." message and a crash when the inode is finally put. Solution: store path instead of inode." This patch fixes the issue in kernel/event/core.c. Reviewed-and-tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <kernel-team@fb.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 375637bc5249 ("perf/core: Introduce address range filtering") Link: http://lkml.kernel.org/r/20180418062907.3210386-2-songliubraving@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25perf/core: Fix group scheduling with mixed hw and sw eventsSong Liu
When hw and sw events are mixed in the same group, they are all attached to the hw perf_event_context. This sometimes requires moving group of perf_event to a different context. We found a bug in how the kernel handles this, for example if we do: perf stat -e '{faults,ref-cycles,faults}' -I 1000 1.005591180 1,297 faults 1.005591180 457,476,576 ref-cycles 1.005591180 <not supported> faults First, sw event "faults" is attached to the sw context, and becomes the group leader. Then, hw event "ref-cycles" is attached, so both events are moved to the hw context. Last, another sw "faults" tries to attach, but it fails because of mismatch between the new target ctx (from sw pmu) and the group_leader's ctx (hw context, same as ref-cycles). The broken condition is: group_leader is sw event; group_leader is on hw context; add a sw event to the group. Fix this scenario by checking group_leader's context (instead of just event type). If group_leader is on hw context, use the ->pmu of this context to look up context for the new event. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <kernel-team@fb.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: b04243ef7006 ("perf: Complete software pmu grouping") Link: http://lkml.kernel.org/r/20180503194716.162815-1-songliubraving@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25sched/fair: Update util_est before updating schedutilPatrick Bellasi
When a task is enqueued the estimated utilization of a CPU is updated to better support the selection of the required frequency. However, schedutil is (implicitly) updated by update_load_avg() which always happens before util_est_{en,de}queue(), thus potentially introducing a latency between estimated utilization updates and frequency selections. Let's update util_est at the beginning of enqueue_task_fair(), which will ensure that all schedutil updates will see the most updated estimated utilization value for a CPU. Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Fixes: 7f65ea42eb00 ("sched/fair: Add util_est on top of PELT") Link: http://lkml.kernel.org/r/20180524141023.13765-3-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25sched/cpufreq: Modify aggregate utilization to always include blocked FAIR ↵Patrick Bellasi
utilization Since the refactoring introduced by: commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support") we aggregate FAIR utilization only if this class has runnable tasks. This was mainly due to avoid the risk to stay on an high frequency just because of the blocked utilization of a CPU not being properly decayed while the CPU was idle. However, since: commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle") the FAIR blocked utilization is properly decayed also for IDLE CPUs. This allows us to use the FAIR blocked utilization as a safe mechanism to gracefully reduce the frequency only if no FAIR tasks show up on a CPU for a reasonable period of time. Moreover, we also reduce the frequency drops of CPUs running periodic tasks which, depending on the task periodicity and the time required for a frequency switch, was increasing the chances to introduce some undesirable performance variations. Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Link: http://lkml.kernel.org/r/20180524141023.13765-2-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25kthread: Allow kthread_park() on a parked kthreadPeter Zijlstra
The following commit: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue") added a WARN() in the case where we call kthread_park() on an already parked thread, because the old code wasn't doing the right thing there and it wasn't at all clear that would happen. It turns out, this does in fact happen, so we have to deal with it. Instead of potentially returning early, also wait for the completion. This does however mean we have to use complete_all() and re-initialize the completion on re-use. Reported-by: LKP <lkp@01.org> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel test robot <lkp@intel.com> Cc: wfg@linux.intel.com Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue") Link: http://lkml.kernel.org/r/20180504091142.GI12235@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25sched/topology: Clarify root domain(s) debug stringJuri Lelli
When scheduler debug is enabled, building scheduling domains outputs information about how the domains are laid out and to which root domain each CPU (or sets of CPUs) belongs, e.g.: CPU0 attaching sched-domain(s): domain-0: span=0-5 level=MC groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 } CPU1 attaching sched-domain(s): domain-0: span=0-5 level=MC groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 } [...] span: 0-5 (max cpu_capacity = 1024) The fact that latest line refers to CPUs 0-5 root domain doesn't however look immediately obvious to me: one might wonder why span 0-5 is reported "again". Make it more clear by adding "root domain" to it, as to end with the following: CPU0 attaching sched-domain(s): domain-0: span=0-5 level=MC groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 } CPU1 attaching sched-domain(s): domain-0: span=0-5 level=MC groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 } [...] root domain span: 0-5 (max cpu_capacity = 1024) Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180524152936.17611-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-05-24 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc). 2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers. 3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit. 4) Jiong Wang adds support for indirect and arithmetic shifts to NFP 5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible. 6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions. 7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT. 8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events. 9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24xdp/trace: extend tracepoint in devmap with an errJesper Dangaard Brouer
Extending tracepoint xdp:xdp_devmap_xmit in devmap with an err code allow people to easier identify the reason behind the ndo_xdp_xmit call to a given driver is failing. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24xdp: change ndo_xdp_xmit API to support bulkingJesper Dangaard Brouer
This patch change the API for ndo_xdp_xmit to support bulking xdp_frames. When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown. Most of the slowdown is caused by DMA API indirect function calls, but also the net_device->ndo_xdp_xmit() call. Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed performance improved: for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps With frames avail as a bulk inside the driver ndo_xdp_xmit call, further optimizations are possible, like bulk DMA-mapping for TX. Testing without CONFIG_RETPOLINE show the same performance for physical NIC drivers. The virtual NIC driver tun sees a huge performance boost, as it can avoid doing per frame producer locking, but instead amortize the locking cost over the bulk. V2: Fix compile errors reported by kbuild test robot <lkp@intel.com> V4: Isolated ndo, driver changes and callers. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24xdp: introduce xdp_return_frame_rx_napiJesper Dangaard Brouer
When sending an xdp_frame through xdp_do_redirect call, then error cases can happen where the xdp_frame needs to be dropped, and returning an -errno code isn't sufficient/possible any-longer (e.g. for cpumap case). This is already fully supported, by simply calling xdp_return_frame. This patch is an optimization, which provides xdp_return_frame_rx_napi, which is a faster variant for these error cases. It take advantage of the protection provided by XDP RX running under NAPI protection. This change is mostly relevant for drivers using the page_pool allocator as it can take advantage of this. (Tested with mlx5). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24xdp: add tracepoint for devmap like cpumap haveJesper Dangaard Brouer
Notice how this allow us get XDP statistic without affecting the XDP performance, as tracepoint is no-longer activated on a per packet basis. V5: Spotted by John Fastabend. Fix 'sent' also counted 'drops' in this patch, a later patch corrected this, but it was a mistake in this intermediate step. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24bpf: devmap prepare xdp frames for bulkingJesper Dangaard Brouer
Like cpumap create queue for xdp frames that will be bulked. For now, this patch simply invoke ndo_xdp_xmit foreach frame. This happens, either when the map flush operation is envoked, or when the limit DEV_MAP_BULK_SIZE is reached. V5: Avoid memleak on error path in dev_map_update_elem() Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24bpf: devmap introduce dev_map_enqueueJesper Dangaard Brouer
Functionality is the same, but the ndo_xdp_xmit call is now simply invoked from inside the devmap.c code. V2: Fix compile issue reported by kbuild test robot <lkp@intel.com> V5: Cleanups requested by Daniel - Newlines before func definition - Use BUILD_BUG_ON checks - Remove unnecessary use return value store in dev_map_enqueue Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>