path: root/kernel
2019-02-13  Merge tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull tracing fix from Steven Rostedt: "This fixes kprobes/uprobes dynamic processing of strings, where it processes the args but does not update the remaining length of the buffer that the string arguments will be placed in. It constantly passes in the total size of the buffer instead of the remaining size. This could cause issues if the strings are larger than the max size of an event, which could cause the strings to be written beyond what was reserved on the buffer."

* tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: probeevent: Correctly update remaining space in dynamic area
2019-02-13  dma-mapping: remove an incorrect __iomem annotation  (Christoph Hellwig)
memremap returns a regular void pointer, not an __iomem one. Signed-off-by: Christoph Hellwig <hch@lst.de>
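To illustrate the distinction (standard kernel APIs, shown as a hedged aside, not code from the patch; phys_addr and size assumed defined elsewhere):

    #include <linux/io.h>

    /* memremap() hands back ordinary memory: a plain pointer is correct. */
    void *buf = memremap(phys_addr, size, MEMREMAP_WB);

    /* ioremap() hands back MMIO space: the pointer carries __iomem and
     * must be accessed via readl()/writel() and friends. */
    void __iomem *regs = ioremap(phys_addr, size);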
2019-02-13  dma-mapping: add a kconfig symbol for arch_teardown_dma_ops availability  (Christoph Hellwig)
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
2019-02-13  dma-mapping: add a kconfig symbol for arch_setup_dma_ops availability  (Christoph Hellwig)
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Paul Burton <paul.burton@mips.com> # MIPS Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
2019-02-13  dma-mapping: move debug configuration options to kernel/dma  (Andy Shevchenko)
This is a follow-up to commit cf65a0f6f6ff ("dma-mapping: move all DMA mapping code to kernel/dma"), which moved the source code of the DMA API to the kernel/dma folder. Since there are no files left in lib that require the DMA API debugging options, move those options to kernel/dma as well. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-13  signal: Restore the stop PTRACE_EVENT_EXIT  (Eric W. Biederman)
In the middle of do_exit() there is a call to "ptrace_event(PTRACE_EVENT_EXIT, code);". That call places the process in TASK_TRACED, aka "(TASK_WAKEKILL | __TASK_TRACED)", and waits for the debugger to release the task or for SIGKILL to be delivered. Skipping past dequeue_signal when we know a fatal signal has already been delivered resulted in SIGKILL remaining pending and TIF_SIGPENDING remaining set. This in turn caused the scheduler to not sleep in PTRACE_EVENT_EXIT, as it figured a fatal signal was pending. This also caused ptrace_freeze_traced in ptrace_check_attach to fail, because it left a per-thread SIGKILL pending, which is what fatal_signal_pending tests for. This difference in signal state caused strace to report: strace: Exit of unknown pid NNNNN ignored. Therefore update the signal handling state like dequeue_signal would when removing a per-thread SIGKILL, by removing SIGKILL from the per-thread signal mask and clearing TIF_SIGPENDING. Acked-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Ivan Delalande <colona@arista.com> Cc: stable@vger.kernel.org Fixes: 35634ffa1751 ("signal: Always notice exiting tasks") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-02-13  genirq: introduce irq_chip_mask_ack_parent()  (Linus Walleij)
The hierarchical irqchip never before ran into a situation where the parent is not "simple", i.e. does not implement .irq_ack() and .irq_mask() like most, but the qcom-pm8xxx.c happens to implement only .irq_mask_ack(). Since we want to make ssbi-gpio a hierarchical child of this irqchip, it must *also* only implement .irq_mask_ack() and call down to the parent, and for this we of course need irq_chip_mask_ack_parent(). Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Brian Masney <masneyb@onstation.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
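For reference, a minimal sketch of what such a helper plausibly looks like, modeled on the existing irq_chip_*_parent() helpers (an illustration, not the patch itself):

    #include <linux/irq.h>

    void irq_chip_mask_ack_parent(struct irq_data *data)
    {
            /* Delegate the combined mask+ack to the parent domain's chip. */
            data = data->parent_data;
            data->chip->irq_mask_ack(data);
    }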
2019-02-13  genirq: introduce irq_domain_translate_twocell  (Brian Masney)
Add a new function irq_domain_translate_twocell() that is to be used as the translate function in struct irq_domain_ops for the v2 IRQ API. This patch also changes irq_domain_xlate_twocell() from the v1 IRQ API to call irq_domain_translate_twocell() in the v2 IRQ API. This required changes to of_phandle_args_to_fwspec()'s arguments so that it can be called from multiple places. Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Brian Masney <masneyb@onstation.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-02-13  Merge branch 'rcu-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu  (Ingo Molnar)
Pull the latest RCU tree from Paul E. McKenney:

 - Additional cleanups after RCU flavor consolidation
 - Grace-period forward-progress cleanups and improvements
 - Documentation updates
 - Miscellaneous fixes
 - spin_is_locked() conversions to lockdep
 - SPDX changes to RCU source and header files
 - SRCU updates
 - Torture-test updates, including nolibc updates and moving nolibc to tools/include

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-13  sched/fair: Use non-atomic cpumask_{set,clear}_cpu()  (Viresh Kumar)
The cpumasks updated here are not subject to concurrency and using atomic bitops for them is pointless and expensive. Use the non-atomic variants instead. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/2e2a10f84b9049a81eef94ed6d5989447c21e34a.1549963617.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
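A hedged illustration of the difference (not the patch itself):

    #include <linux/cpumask.h>

    /* The non-atomic variants avoid LOCK-prefixed instructions and are safe
     * when, as here, no other context updates the mask concurrently. */
    static void mark_cpu(struct cpumask *mask, int cpu, bool concurrent)
    {
            if (concurrent)
                    cpumask_set_cpu(cpu, mask);     /* atomic set_bit() */
            else
                    __cpumask_set_cpu(cpu, mask);   /* plain __set_bit() */
    }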
2019-02-13  kprobes: Prohibit probing on lockdep functions  (Masami Hiramatsu)
Some lockdep functions can be involved in breakpoint handling, and probing on those functions can cause a breakpoint recursion. Prohibit probing on those functions by adding them to the kprobes blacklist. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/154998810578.31052.1680977921449292812.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
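Blacklisting a function generally looks like this (generic illustration; the function name here is made up):

    #include <linux/kprobes.h>

    static int breakpoint_path_helper(void)         /* hypothetical function */
    {
            /* ... runs inside the int3/breakpoint handling path ... */
            return 0;
    }
    /* Record the symbol in the kprobes blacklist so it cannot be probed. */
    NOKPROBE_SYMBOL(breakpoint_path_helper);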
2019-02-13  kprobes: Prohibit probing on RCU debug routine  (Masami Hiramatsu)
Since kprobes itself depends on RCU, probing on RCU debug routines can cause recursive breakpoint bugs. Prohibit probing on RCU debug routines:

    int3
      -> do_int3()
        -> ist_enter()
          -> RCU_LOCKDEP_WARN()
            -> debug_lockdep_rcu_enabled()
              -> int3

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/154998807741.31052.11229157537816341591.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-13  kprobes: Prohibit probing on hardirq tracers  (Masami Hiramatsu)
Since kprobes breakpoint handling involves the hardirq tracer, probing these functions causes a breakpoint recursion problem. Prohibit probing on those functions. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/154998802073.31052.17255044712514564153.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-13  kprobes: Search non-suffixed symbol in blacklist  (Masami Hiramatsu)
Newer GCC versions can generate different instances of a function with suffixed symbols (e.g. .constprop, .part, etc.) if the function is optimized so that only a part of it remains. In this case, it is not enough to check the entry in the kprobe blacklist, because it records only the non-suffixed symbol address. To fix this issue, search for the non-suffixed symbol in the blacklist if the given address is within a symbol that has a suffix. Note that this can cause false positives if a kprobe-safe function is optimized into a suffixed instance and has the same name as a blacklisted symbol. But I would like to choose a fail-safe design for this issue. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/154998799234.31052.6136378903570418008.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
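A hedged sketch of the lookup described, with names assumed rather than taken from the patch (blacklist_contains_name() is a hypothetical helper):

    #include <linux/kallsyms.h>
    #include <linux/string.h>

    static bool within_blacklist_nosuffix(unsigned long addr)
    {
            char sym[KSYM_NAME_LEN];
            char *p;

            if (lookup_symbol_name(addr, sym))      /* resolve containing symbol */
                    return false;
            p = strchr(sym, '.');                   /* "foo.constprop.0" -> "foo" */
            if (p)
                    *p = '\0';
            return blacklist_contains_name(sym);    /* hypothetical lookup */
    }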
2019-02-13  x86/kprobes: Prohibit probing on functions before kprobe_int3_handler()  (Masami Hiramatsu)
Prohibit probing on the functions called before kprobe_int3_handler() in do_int3(). More specifically, ftrace_int3_handler(), poke_int3_handler(), and ist_enter(). And since rcu_nmi_enter() is called by ist_enter(), it also should be marked as NOKPROBE_SYMBOL. Since those are handled before kprobe_int3_handler(), probing those functions can cause a breakpoint recursion and crash the kernel. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/154998793571.31052.11301258949601150994.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-13  perf/core: Fix impossible ring-buffer sizes warning  (Ingo Molnar)
The following commit: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes") results in perf recording failures with larger mmap areas:

    root@skl:/tmp# perf record -g -a
    failed to mmap with 12 (Cannot allocate memory)

The root cause is that the following condition is buggy:

    if (order_base_2(size) >= MAX_ORDER)
            goto fail;

The problem is that @size is in bytes and MAX_ORDER is in pages, so the right test is:

    if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
            goto fail;

Fix it. Reported-by: "Jin, Yao" <yao.jin@linux.intel.com> Bisected-by: Borislav Petkov <bp@alien8.de> Analyzed-by: Peter Zijlstra <peterz@infradead.org> Cc: Julien Thierry <julien.thierry@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Fixes: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-12  audit: mark expected switch fall-through  (Gustavo A. R. Silva)
In preparation for enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning:

    kernel/auditfilter.c: In function ‘audit_krule_to_data’:
    kernel/auditfilter.c:668:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
       if (krule->pflags & AUDIT_LOGINUID_LEGACY && !f->val) {
          ^
    kernel/auditfilter.c:674:3: note: here
       default:
       ^~~~~~~

Warning level 3 was used: -Wimplicit-fallthrough=3. Notice that, in this particular case, the code comment is modified in accordance with what GCC is expecting to find. This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-02-12  swiotlb: checking whether swiotlb buffer is full with io_tlb_used  (Dongli Zhang)
This patch uses io_tlb_used to help check whether the swiotlb buffer is full. io_tlb_used is no longer used only for debugfs; it is also used to help optimize swiotlb_tbl_map_single(). Suggested-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2019-02-12  swiotlb: add debugfs to track swiotlb buffer usage  (Dongli Zhang)
The device driver will not be able to do dma operations once the swiotlb buffer is full, either because the driver is using too many IO TLB blocks in flight, or because there is a memory leak in the device driver. Exporting the swiotlb buffer usage via debugfs helps the user estimate the swiotlb buffer size to pre-allocate, or analyze a device driver memory leak. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
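The debugfs wiring for something like this plausibly looks as follows (a sketch; the file names and the io_tlb_nslabs/io_tlb_used globals are assumptions based on the description):

    #include <linux/debugfs.h>
    #include <linux/init.h>

    static int __init swiotlb_create_debugfs(void)
    {
            struct dentry *d = debugfs_create_dir("swiotlb", NULL);

            /* Total slabs vs. slabs currently in use. */
            debugfs_create_ulong("io_tlb_nslabs", 0400, d, &io_tlb_nslabs);
            debugfs_create_ulong("io_tlb_used", 0400, d, &io_tlb_used);
            return 0;
    }
    late_initcall(swiotlb_create_debugfs);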
2019-02-12  swiotlb: fix comment on swiotlb_bounce()  (Dongli Zhang)
Fix the comment: swiotlb_bounce() is used to copy from the original dma location to the swiotlb buffer during swiotlb_tbl_map_single(), and to copy from the swiotlb buffer back to the original dma location during swiotlb_tbl_unmap_single(). Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2019-02-12  bpf: offload: add priv field for drivers  (Jakub Kicinski)
Currently bpf_offload_dev does not have any priv pointer, forcing the drivers to work backwards from the netdev in program metadata. This is not great given that programs are conceptually associated with the offload device, and it means one or two unnecessary dereferences. Add a priv pointer to bpf_offload_dev. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
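In outline, the change amounts to storing and exposing an opaque pointer, roughly (a sketch of the pattern, not the exact diff):

    struct bpf_offload_dev {
            /* ... existing fields ... */
            void *priv;                     /* new: driver-private context */
    };

    void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev)
    {
            /* Driver gets its context back without netdev back-references. */
            return offdev->priv;
    }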
2019-02-11  tracing: probeevent: Correctly update remaining space in dynamic area  (Andreas Ziegler)
Commit 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") improved the string fetching mechanism by returning the number of required bytes after copying the argument to the dynamic area. However, this return value is now only used to increment the pointer inside the dynamic area but misses updating the 'maxlen' variable which indicates the remaining space in the dynamic area. This means that fetch_store_string() always reads the *total* size of the dynamic area from the data_loc pointer instead of the *remaining* size (and passes it along to strncpy_from_{user,unsafe}) even if we're already about to copy data into the middle of the dynamic area. Link: http://lkml.kernel.org/r/20190206190013.16405-1-andreas.ziegler@fau.de Cc: Ingo Molnar <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
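The shape of the fix, per the description above (a hedged sketch; variable names are assumed from context, not copied from the patch):

    /* Copy one dynamic argument; 'ret' is the number of bytes consumed. */
    ret = fetch_store_string(val, dyndata, base);
    if (ret > 0) {
            dyndata += ret;         /* advance the write position (already done) */
            maxlen  -= ret;         /* the missing piece: shrink remaining space */
    }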
2019-02-11  tracing: Change the function format to display function names by perf  (Changbin Du)
Here is an example for this change.

    $ sudo perf record -e 'ftrace:function' --filter='ip==schedule'
    $ sudo perf report

The output of perf before this patch:

    # Samples: 100  of event 'ftrace:function'
    # Event count (approx.): 100
    #
    # Overhead  Trace output
    # ........  ......................................
    #
        51.00%  ffffffff81f6aaa0 <-- ffffffff81158e8d
        29.00%  ffffffff81f6aaa0 <-- ffffffff8116ccb2
         8.00%  ffffffff81f6aaa0 <-- ffffffff81f6f2ed
         4.00%  ffffffff81f6aaa0 <-- ffffffff811628db
         4.00%  ffffffff81f6aaa0 <-- ffffffff81f6ec5b
         2.00%  ffffffff81f6aaa0 <-- ffffffff81f6f21a
         1.00%  ffffffff81f6aaa0 <-- ffffffff811b04af
         1.00%  ffffffff81f6aaa0 <-- ffffffff8143ce17

After this patch:

    # Samples: 36  of event 'ftrace:function'
    # Event count (approx.): 36
    #
    # Overhead  Trace output
    # ........  ............................................
    #
        38.89%  schedule <-- schedule_hrtimeout_range_clock
        27.78%  schedule <-- worker_thread
        13.89%  schedule <-- schedule_timeout
        11.11%  schedule <-- smpboot_thread_fn
         5.56%  schedule <-- rcu_gp_kthread
         2.78%  schedule <-- exit_to_usermode_loop

Link: http://lkml.kernel.org/r/20190209161919.32350-1-changbin.du@gmail.com Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-02-11  bpf: fix lockdep false positive in stackmap  (Alexei Starovoitov)
Lockdep warns about a false positive:

    [   11.211460] ------------[ cut here ]------------
    [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
    [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
    [   11.213134] Modules linked in:
    [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
    [   11.223508] Call Trace:
    [   11.223705]  <IRQ>
    [   11.223874]  ? __local_bh_enable+0x7a/0x80
    [   11.224199]  up_read+0x1c/0xa0
    [   11.224446]  do_up_read+0x12/0x20
    [   11.224713]  irq_work_run_list+0x43/0x70
    [   11.225030]  irq_work_run+0x26/0x50
    [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
    [   11.225662]  irq_work_interrupt+0xf/0x20

The rw_semaphore is released in a different task than the task that locked it, which is expected behavior here. Fix the warning with up_read_non_owner() and an rwsem_release() annotation. Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
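The locking pattern being annotated, roughly (a sketch under the assumption that the semaphore is acquired in one context and released from irq_work in another; not the exact patch):

    /* Task A (e.g. the NMI path) takes the lock and queues the release: */
    down_read(&mm->mmap_sem);

    /* Later, in irq_work context -- a different "owner" task: */
    up_read_non_owner(&mm->mmap_sem);   /* skips lockdep's owner check */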
2019-02-11  perf/x86: Add check_period PMU callback  (Jiri Olsa)
Vince (and later on Ravi) reported crashes in the BTS code during fuzzing with the following backtrace:

    general protection fault: 0000 [#1] SMP PTI
    ...
    RIP: 0010:perf_prepare_sample+0x8f/0x510
    ...
    Call Trace:
     <IRQ>
     ? intel_pmu_drain_bts_buffer+0x194/0x230
     intel_pmu_drain_bts_buffer+0x160/0x230
     ? tick_nohz_irq_exit+0x31/0x40
     ? smp_call_function_single_interrupt+0x48/0xe0
     ? call_function_single_interrupt+0xf/0x20
     ? call_function_single_interrupt+0xa/0x20
     ? x86_schedule_events+0x1a0/0x2f0
     ? x86_pmu_commit_txn+0xb4/0x100
     ? find_busiest_group+0x47/0x5d0
     ? perf_event_set_state.part.42+0x12/0x50
     ? perf_mux_hrtimer_restart+0x40/0xb0
     intel_pmu_disable_event+0xae/0x100
     ? intel_pmu_disable_event+0xae/0x100
     x86_pmu_stop+0x7a/0xb0
     x86_pmu_del+0x57/0x120
     event_sched_out.isra.101+0x83/0x180
     group_sched_out.part.103+0x57/0xe0
     ctx_sched_out+0x188/0x240
     ctx_resched+0xa8/0xd0
     __perf_event_enable+0x193/0x1e0
     event_function+0x8e/0xc0
     remote_function+0x41/0x50
     flush_smp_call_function_queue+0x68/0x100
     generic_smp_call_function_single_interrupt+0x13/0x30
     smp_call_function_single_interrupt+0x3e/0xe0
     call_function_single_interrupt+0xf/0x20
     </IRQ>

The reason is that while the event init code does several checks for BTS events and prevents several unwanted config bits for BTS events (like precise_ip), PERF_EVENT_IOC_PERIOD allows creating a BTS event without those checks being done. The following sequence will cause the crash: if we create an 'almost' BTS event with precise_ip and callchains, and then change it into a BTS event, it will crash the perf_prepare_sample() function, because precise_ip events are expected to come in with callchain data initialized, but that's not the case for the intel_pmu_drain_bts_buffer() caller. Add a check_period callback that is called before the period is changed via PERF_EVENT_IOC_PERIOD; it denies the change if the event would become a BTS event. Also add the limit_period check. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  futex: Convert futex_pi_state.refcount to refcount_t  (Elena Reshetova)
atomic_t variables are currently used to implement reference counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further increments aren't allowed
 - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to the newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situations and be exploitable. The variable futex_pi_state.refcount is used as a pure reference counter. Convert it to refcount_t and fix up the operations. Important note for maintainers: some functions from the refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For futex_pi_state.refcount it might make a difference in the following places:

 - get_pi_state() and exit_pi_state_list(): the increment in refcount_inc_not_zero() only guarantees a control dependency on success vs. the fully ordered atomic counterpart
 - put_pi_state(): the decrement in refcount_dec_and_test() provides RELEASE ordering and ACQUIRE ordering on success vs. the fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: dvhart@infradead.org Link: http://lkml.kernel.org/r/1549369467-3505-1-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
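The conversion follows a mechanical pattern; illustratively (free_pi_state() here is a stand-in name, not necessarily the real helper):

    /* Before: plain atomic_t used as a reference count. */
    atomic_set(&pi_state->refcount, 1);
    atomic_inc(&pi_state->refcount);
    if (atomic_dec_and_test(&pi_state->refcount))
            free_pi_state(pi_state);

    /* After: refcount_t, which saturates and warns on overflow/underflow. */
    refcount_set(&pi_state->refcount, 1);
    refcount_inc(&pi_state->refcount);
    if (refcount_dec_and_test(&pi_state->refcount))
            free_pi_state(pi_state);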
2019-02-11  Merge 5.0-rc6 into driver-core-next  (Greg Kroah-Hartman)
We need the debugfs fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-11  sched/fair: Remove unused 'sd' parameter from select_idle_smt()  (Viresh Kumar)
The 'sd' parameter isn't getting used in select_idle_smt(), drop it. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/f91c5e118183e79d4a982e9ac4ce5e47948f6c1b.1549536337.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block  (Valentin Schneider)
The comment block for that function lists the heuristics for triggering a nohz kick, but the most recent ones (blocked load updates, misfit) aren't included, and some of them (LLC nohz logic, asym packing) are no longer in sync with the code. The conditions are either simple enough or properly commented, so get rid of that list instead of letting it grow. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190117153411.2390-4-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  sched/fair: Explain LLC nohz kick condition  (Valentin Schneider)
Provide a comment explaining the LLC related nohz kick in nohz_balancer_kick(). Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190117153411.2390-3-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  sched/fair: Simplify nohz_balancer_kick()  (Valentin Schneider)
Calling 'nohz_balance_exit_idle(rq)' will always clear 'rq->cpu' from 'nohz.idle_cpus_mask' if it is set. Since it is called at the top of 'nohz_balancer_kick()', 'rq->cpu' will never be set in 'nohz.idle_cpus_mask' if it is accessed in the rest of the function. Combine the 'sched_domain_span()' with 'nohz.idle_cpus_mask' and drop the '(i == cpu)' check since 'rq->cpu' will never be iterated over. While at it, clean up a condition alignment. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: morten.rasmussen@arm.com Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/20190117153411.2390-2-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  sched/topology: Fix percpu data types in struct sd_data & struct s_data  (Luc Van Oostenryck)
The percpu members of struct sd_data and s_data are declared as:

    struct ... ** __percpu member;

So their type is: __percpu pointer to pointer to struct ... But looking at how they're used, their type should be: pointer to __percpu pointer to struct ..., and they should thus be declared as:

    struct ... * __percpu *member;

So fix the placement of '__percpu' in the definition of these structures. This addresses a bunch of Sparse warnings like:

    warning: incorrect type in initializer (different address spaces)
      expected void const [noderef] <asn:3> *__vpp_verify
      got struct sched_domain **

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument  (Dietmar Eggemann)
Since commit d03266910a53 ("sched/fair: Fix task group initialization") the utilization of a sched entity representing a task group is no longer initialized to any other value than 0. So post_init_entity_util_avg() is only used for tasks, not for sched_entities. Make this clear by calling it with a task_struct pointer argument, which also eliminates the entity_is_task(se) if condition in the fork path, and get rid of the stale comment in remove_entity_load_avg() accordingly. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190122162501.12000-1-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  sched/fair: Fix O(nr_cgroups) in the load balancing path  (Vincent Guittot)
This re-applies the commit reverted here: commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c") I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply: commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  sched/fair: Optimize update_blocked_averages()  (Vincent Guittot)
Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child ordering of the list when it will be added back. In order to remove an empty and fully decayed cfs_rq, we must remove its children too, so they will be added back in the right order next time. With a normal decay of PELT, a parent will be empty and fully decayed if all children are empty and fully decayed too. In such a case, we just have to ensure that the whole branch will be added when a new task is enqueued. This is the default behavior since: commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list") In case of throttling, the PELT of a throttled cfs_rq will not be updated whereas the parent's will. This breaks the assumption made above unless we remove the children of a cfs_rq that is throttled. Then, they will be added back when unthrottled and a sched_entity will be enqueued. As throttled cfs_rq are now removed from the list, we can remove the associated test in update_blocked_averages(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11  Merge tag 'v5.0-rc6' into sched/core, to pick up fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-10  bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock  (Martin KaFai Lau)
This patch adds a helper function BPF_FUNC_tcp_sock; it is currently available for cg_skb and sched_(cls|act):

    struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);

    int cg_skb_foo(struct __sk_buff *skb)
    {
            struct bpf_tcp_sock *tp;
            struct bpf_sock *sk;
            __u32 snd_cwnd;

            sk = skb->sk;
            if (!sk)
                    return 1;
            tp = bpf_tcp_sock(sk);
            if (!tp)
                    return 1;
            snd_cwnd = tp->snd_cwnd;
            /* ... */
            return 1;
    }

A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide read-only access. bpf_tcp_sock has all the existing tcp_sock fields that have already been exposed by bpf_sock_ops, i.e. no new tcp_sock fields are exposed in bpf.h. This helper returns a pointer to the tcp_sock. If it is not a tcp_sock or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it returns NULL. Hence, the caller needs to check for NULL before accessing it. The current use case is to expose members from tcp_sock to allow a cg_skb_bpf_prog to provide per-cgroup traffic policing/shaping. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10  bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper  (Martin KaFai Lau)
In the kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)" before accessing the fields in sock. For example, in __netdev_pick_tx:

    static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
                                struct net_device *sb_dev)
    {
            /* ... */
            struct sock *sk = skb->sk;

            if (queue_index != new_index && sk &&
                sk_fullsock(sk) &&
                rcu_access_pointer(sk->sk_dst_cache))
                    sk_tx_queue_set(sk, new_index);
            /* ... */
            return queue_index;
    }

This patch adds a "struct bpf_sock *sk" pointer to "struct __sk_buff", where a few of the convert_ctx_access() in filter.c have already been accessing the skb->sk sock_common's fields, e.g. sock_ops_convert_ctx_access(). "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier. Some of the fields in "bpf_sock" will not be directly accessible through the "__sk_buff->sk" pointer. It is limited by the new "bpf_sock_common_is_valid_access()", e.g. the existing "type", "protocol", "mark" and "priority" in bpf_sock are not allowed. The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)" can be used to get a sk with all accessible fields in "bpf_sock". This helper is added to both cg_skb and sched_(cls|act).

    int cg_skb_foo(struct __sk_buff *skb)
    {
            struct bpf_sock *sk;

            sk = skb->sk;
            if (!sk)
                    return 1;
            sk = bpf_sk_fullsock(sk);
            if (!sk)
                    return 1;
            if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
                    return 1;
            /* some_traffic_shaping(); */
            return 1;
    }

(1) The sk is read only.
(2) There is no new "struct bpf_sock_common" introduced.
(3) Future kernel sock members could be added to bpf_sock only, instead of being repeatedly added at multiple places like currently in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md, etc.
(4) After "sk = skb->sk", the reg holding sk is of type PTR_TO_SOCK_COMMON_OR_NULL.
(5) After bpf_sk_fullsock(), the return type will be PTR_TO_SOCKET_OR_NULL, which is the same as the return type of bpf_sk_lookup_xxx(). However, bpf_sk_fullsock() does not take a refcnt. The acquire_reference_state() is only depending on the return type now. To avoid it, a new is_acquire_function() is checked before calling acquire_reference_state().
(6) The WARN_ON in "release_reference_state()" is no longer an internal verifier bug. When reg->id is not found in state->refs[], it means the bpf_prog does something wrong like "bpf_sk_release(bpf_sk_fullsock(skb->sk))", where the reference has never been acquired by calling "bpf_sk_fullsock(skb->sk)". A -EINVAL and a verbose message are returned instead of the WARN_ON. A test is added to test_verifier in a later patch. Since the WARN_ON in "release_reference_state()" is no longer needed, "__release_reference_state()" is folded into "release_reference_state()" as well.

Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10  bpf: Fix narrow load on a bpf_sock returned from sk_lookup()  (Martin KaFai Lau)
By adding this test to test_verifier:

    {
        "reference tracking: access sk->src_ip4 (narrow load)",
        .insns = {
            BPF_SK_LOOKUP,
            BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
            BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
            BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0,
                        offsetof(struct bpf_sock, src_ip4) + 2),
            BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
            BPF_EMIT_CALL(BPF_FUNC_sk_release),
            BPF_EXIT_INSN(),
        },
        .prog_type = BPF_PROG_TYPE_SCHED_CLS,
        .result = ACCEPT,
    },

The above test loads 2 bytes from sk->src_ip4, where sk is obtained by bpf_sk_lookup_tcp(). It hits an internal verifier error from convert_ctx_accesses():

    [root@arch-fb-vm1 bpf]# ./test_verifier 665 665
    Failed to load prog 'Invalid argument'!
    0: (b7) r2 = 0
    1: (63) *(u32 *)(r10 -8) = r2
    2: (7b) *(u64 *)(r10 -16) = r2
    3: (7b) *(u64 *)(r10 -24) = r2
    4: (7b) *(u64 *)(r10 -32) = r2
    5: (7b) *(u64 *)(r10 -40) = r2
    6: (7b) *(u64 *)(r10 -48) = r2
    7: (bf) r2 = r10
    8: (07) r2 += -48
    9: (b7) r3 = 36
    10: (b7) r4 = 0
    11: (b7) r5 = 0
    12: (85) call bpf_sk_lookup_tcp#84
    13: (bf) r6 = r0
    14: (15) if r0 == 0x0 goto pc+3
     R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
    15: (69) r2 = *(u16 *)(r0 +26)
    16: (bf) r1 = r6
    17: (85) call bpf_sk_release#86
    18: (95) exit
    from 14 to 18: safe
    processed 20 insns (limit 131072), stack depth 48
    bpf verifier is misconfigured
    Summary: 0 PASSED, 0 SKIPPED, 1 FAILED

bpf_sock_is_valid_access() expects that src_ip4 can be narrowly loaded (meaning any 1 or 2 bytes of src_ip4 can be loaded) by marking info->ctx_field_size. However, this marked ctx_field_size is not used. This patch fixes it. Due to the recent refactoring in test_verifier, this new test will be added to the bpf-next branch (together with the bpf_tcp_sock patchset) to avoid a merge conflict. Fixes: c64b7983288e ("bpf: Add PTR_TO_SOCKET verifier type") Cc: Joe Stringer <joe@wand.net.nz> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10  softirq: Don't skip softirq execution when softirq thread is parking  (Matthias Kaehlcke)
When a CPU is unplugged the kernel threads of this CPU are parked (see smpboot_park_threads()). kthread_park() is used to mark each thread as parked and wake it up, so it can complete the process of parking itself (see smpboot_thread_fn()). If local softirqs are pending on interrupt exit, invoke_softirq() is called to process the softirqs; however, it skips processing when the softirq kernel thread of the local CPU is scheduled to run. The softirq kthread is one of the threads that is parked when a CPU is unplugged. Parking the kthread wakes it up, but only to complete the parking process, not to process the pending softirqs. Hence processing of softirqs at the end of an interrupt is skipped, but not done elsewhere, which can result in warnings about pending softirqs when a CPU is unplugged:

    /sys/devices/system/cpu # echo 0 > cpu4/online
    [ ... ] NOHZ: local_softirq_pending 02
    [ ... ] NOHZ: local_softirq_pending 202
    [ ... ] CPU4: shutdown
    [ ... ] psci: CPU4 killed.

Don't skip processing of softirqs at the end of an interrupt when the softirq thread of the CPU is parking. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Douglas Anderson <dianders@chromium.org> Cc: Stephen Boyd <swboyd@chromium.org> Link: https://lkml.kernel.org/r/20190128234625.78241-3-mka@chromium.org
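The fix plausibly amounts to one extra condition in the "is ksoftirqd running" check, building on __kthread_should_park() from the next entry below (a sketch, not the exact diff):

    static bool ksoftirqd_running(unsigned long pending)
    {
            struct task_struct *tsk = __this_cpu_read(ksoftirqd);

            if (pending & SOFTIRQ_NOW_MASK)
                    return false;
            /* A parking ksoftirqd is RUNNING but will not process softirqs,
             * so don't treat it as running here (the added check). */
            return tsk && (tsk->state == TASK_RUNNING) &&
                   !__kthread_should_park(tsk);
    }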
2019-02-10  kthread: Add __kthread_should_park()  (Matthias Kaehlcke)
kthread_should_park() is used to check if the calling kthread ('current') should park, but there is no function to check whether an arbitrary kthread should be parked. The latter is required to plug a CPU hotplug race vs. a parking ksoftirqd thread. The new __kthread_should_park() receives a task_struct as parameter to check if the corresponding kernel thread should be parked. Call __kthread_should_park() from kthread_should_park() to avoid code duplication. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Douglas Anderson <dianders@chromium.org> Cc: Stephen Boyd <swboyd@chromium.org> Link: https://lkml.kernel.org/r/20190128234625.78241-2-mka@chromium.org
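A sketch of the split described (mirroring the commit message; internal details assumed):

    bool __kthread_should_park(struct task_struct *k)
    {
            /* Same test as before, but against an arbitrary task. */
            return test_bit(KTHREAD_SHOULD_PARK, &to_kthread(k)->flags);
    }

    bool kthread_should_park(void)
    {
            /* The old entry point becomes a thin wrapper over the helper. */
            return __kthread_should_park(current);
    }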
2019-02-10  genirq: Avoid summation loops for /proc/stat  (Thomas Gleixner)
Waiman reported that on large systems with a large number of interrupts the readout of /proc/stat takes a long time to sum up the interrupt statistics. In principle this is not a problem, but for unknown reasons some enterprise quality software reads /proc/stat with a high frequency. The reason for this is that interrupt statistics are accounted per cpu, so the /proc/stat logic has to sum up the interrupt stats for each interrupt. This can be largely avoided for interrupts which are not marked as 'PER_CPU' interrupts by simply adding a per interrupt summation counter which is incremented along with the per interrupt per cpu counter. The PER_CPU interrupts need to avoid that and use only per cpu accounting, because they share the interrupt number and the interrupt descriptor, and concurrent updates would conflict or require unwanted synchronization. Reported-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-fsdevel@vger.kernel.org Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Randy Dunlap <rdunlap@infradead.org> Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de
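In outline, per the description (a hedged sketch; the tot_count field name is an assumption):

    /* Hot path: bump the per-CPU count and, additionally, a per-IRQ total. */
    static inline void kstat_incr_irqs_this_cpu(struct irq_desc *desc)
    {
            __this_cpu_inc(*desc->kstat_irqs);
            __this_cpu_inc(kstat.irqs_sum);
            desc->tot_count++;              /* new summation counter */
    }

    /* The /proc/stat readout for a non-PER_CPU interrupt then becomes a
     * single read of desc->tot_count instead of a for_each_possible_cpu()
     * summation loop. */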
2019-02-10  Merge tag 'y2038-new-syscalls' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038  (Thomas Gleixner)
Pull y2038 - time64 system calls from Arnd Bergmann: This series finally gets us to the point of having system calls with 64-bit time_t on all architectures, after a long time of incremental preparation patches. There was actually one conversion that I missed during the summer, i.e. Deepa's timex series, which I have now updated based on the 5.0-rc1 changes and review comments. The following system calls are now added on all 32-bit architectures using the same system call numbers:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceiv_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

Each one of these corresponds directly to an existing system call that includes a 'struct timespec' argument, or a structure containing a timespec or (in the case of clock_adjtime) a timeval. Not included here are new versions of getitimer/setitimer and getrusage/waitid, which are planned for the future but only needed to make a consistent API rather than for correct operation beyond y2038. These four system calls are based on 'timeval', and it has not been finally decided what the replacement kernel interface will use instead. So far, I have done a lot of build testing across most architectures, which has found a number of bugs. Runtime testing so far included testing LTP on 32-bit ARM with the existing system calls, to ensure we do not regress for existing binaries, and a test with a 32-bit x86 build of LTP against a modified version of the musl C library that has been adapted to the new system call interface [3]. This library can be used for testing on all architectures supported by musl-1.1.21, but it is not how the support is getting integrated into the official musl release. Official musl support is planned but will require more invasive changes to the library. Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/ Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/ Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
2019-02-10  Merge tag 'y2038-syscall-cleanup' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038  (Thomas Gleixner)
Pull preparatory work for y2038 changes from Arnd Bergmann: System call unification and cleanup. The system call tables have diverged a bit over the years, and a number of the recent additions never made it into all architectures, for one reason or another. This is an attempt to clean it up as far as we can without breaking compatibility, doing a number of steps:

 - Add system calls that have not yet been integrated into all architectures but that we definitely want there. This includes {,f}statfs64() and get{eg,eu,g,p,u,pp}id() on alpha, which have been missing traditionally.
 - The s390 compat syscall handling is cleaned up to be more like what we do on other architectures, while keeping the 31-bit pointer extension. This was merged as a shared branch by the s390 maintainers and is included here in order to base the other patches on top.
 - Add the separate ipc syscalls on all architectures that traditionally only had sys_ipc(). This version is done without support for IPC_OLD, which we have in sys_ipc. The new semtimedop_time64 syscall will only be added here, not in sys_ipc.
 - Add syscall numbers for a couple of syscalls that we probably don't need everywhere, in particular pkey_* and rseq, for the purpose of symmetry: if it's in asm-generic/unistd.h, it makes sense to have it everywhere. I expect that any future system calls will get assigned on all platforms together, even when they appear to be specific to a single architecture.
 - Prepare for having the same system call numbers for any future calls. In combination with the generated tables, this hopefully makes it easier to add new calls across all architectures together.

All of the above are technically separate from the y2038 work, but are done as preparation before we add the new 64-bit time_t system calls everywhere, providing a common baseline set of system calls. I expect that glibc and other libraries that want to use 64-bit time_t will require linux-5.1 kernel headers for building in the future, and at a much later point may also require linux-5.1 or a later version as the minimum kernel at runtime. Having a common baseline then allows the removal of many architecture or kernel version specific workarounds.
2019-02-10  genirq/affinity: Move allocation of 'node_to_cpumask' to irq_build_affinity_masks()  (Ming Lei)
'node_to_cpumask' is just a temporary variable for irq_build_affinity_masks(), so move it into irq_build_affinity_masks(). No functional change. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Link: https://lkml.kernel.org/r/20190125095347.17950-2-ming.lei@redhat.com
2019-02-10  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Ingo Molnar: "A couple of kernel side fixes:

 - Fix the Intel uncore driver on certain hardware configurations
 - Fix a CPU hotplug related memory allocation bug
 - Remove a spurious WARN()

... plus also a handful of perf tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf script python: Add Python3 support to tests/attr.py
    perf trace: Support multiple "vfs_getname" probes
    perf symbols: Filter out hidden symbols from labels
    perf symbols: Add fallback definitions for GELF_ST_VISIBILITY()
    tools headers uapi: Sync linux/in.h copy from the kernel sources
    perf clang: Do not use 'return std::move(something)'
    perf mem/c2c: Fix perf_mem_events to support powerpc
    perf tests evsel-tp-sched: Fix bitwise operator
    perf/core: Don't WARN() for impossible ring-buffer sizes
    perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu()
    perf/x86/intel/uncore: Add Node ID mask
2019-02-10  Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking fixes from Ingo Molnar: "An rtmutex (PI-futex) deadlock scenario fix, plus a locking documentation fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Handle early deadlock return correctly
    futex: Fix barrier comment
2019-02-09  bpf: Fix narrow load on a bpf_sock returned from sk_lookup()  (Martin KaFai Lau)
By adding this test to test_verifier:

    {
        "reference tracking: access sk->src_ip4 (narrow load)",
        .insns = {
            BPF_SK_LOOKUP,
            BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
            BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
            BPF_LDX_MEM(BPF_H, BPF_REG_2, BPF_REG_0,
                        offsetof(struct bpf_sock, src_ip4) + 2),
            BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
            BPF_EMIT_CALL(BPF_FUNC_sk_release),
            BPF_EXIT_INSN(),
        },
        .prog_type = BPF_PROG_TYPE_SCHED_CLS,
        .result = ACCEPT,
    },

The above test loads 2 bytes from sk->src_ip4, where sk is obtained by bpf_sk_lookup_tcp(). It hits an internal verifier error from convert_ctx_accesses():

    [root@arch-fb-vm1 bpf]# ./test_verifier 665 665
    Failed to load prog 'Invalid argument'!
    0: (b7) r2 = 0
    1: (63) *(u32 *)(r10 -8) = r2
    2: (7b) *(u64 *)(r10 -16) = r2
    3: (7b) *(u64 *)(r10 -24) = r2
    4: (7b) *(u64 *)(r10 -32) = r2
    5: (7b) *(u64 *)(r10 -40) = r2
    6: (7b) *(u64 *)(r10 -48) = r2
    7: (bf) r2 = r10
    8: (07) r2 += -48
    9: (b7) r3 = 36
    10: (b7) r4 = 0
    11: (b7) r5 = 0
    12: (85) call bpf_sk_lookup_tcp#84
    13: (bf) r6 = r0
    14: (15) if r0 == 0x0 goto pc+3
     R0=sock(id=1,off=0,imm=0) R6=sock(id=1,off=0,imm=0) R10=fp0,call_-1 fp-8=????0000 fp-16=0000mmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm refs=1
    15: (69) r2 = *(u16 *)(r0 +26)
    16: (bf) r1 = r6
    17: (85) call bpf_sk_release#86
    18: (95) exit
    from 14 to 18: safe
    processed 20 insns (limit 131072), stack depth 48
    bpf verifier is misconfigured
    Summary: 0 PASSED, 0 SKIPPED, 1 FAILED

bpf_sock_is_valid_access() expects that src_ip4 can be narrowly loaded (meaning any 1 or 2 bytes of src_ip4 can be loaded) by marking info->ctx_field_size. However, this marked ctx_field_size is not used. This patch fixes it. Due to the recent refactoring in test_verifier, this new test will be added to the bpf-next branch (together with the bpf_tcp_sock patchset) to avoid a merge conflict. Fixes: c64b7983288e ("bpf: Add PTR_TO_SOCKET verifier type") Cc: Joe Stringer <joe@wand.net.nz> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-09  Merge branches 'doc.2019.01.26a', 'fixes.2019.01.26a', 'sil.2019.01.26a', 'spdx.2019.02.09a', 'srcu.2019.01.26a' and 'torture.2019.01.26a' into HEAD  (Paul E. McKenney)
doc.2019.01.26a: Documentation updates.
fixes.2019.01.26a: Miscellaneous fixes.
sil.2019.01.26a: Removal of a few more spin_is_locked() instances.
spdx.2019.02.09a: Add SPDX identifiers to RCU files.
srcu.2019.01.26a: SRCU updates.
torture.2019.01.26a: Torture-test updates.
2019-02-09  locking/locktorture: Convert to SPDX license identifier  (Paul E. McKenney)
Replace the license boilerplate with an SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>