summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2019-08-28posix-cpu-timers: Sample task times once in expiry checkThomas Gleixner
Sampling the task times twice does not make sense. Do it once. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.639878168@linutronix.de
2019-08-28posix-cpu-timers: Get rid of pointer indirectionThomas Gleixner
Now that the sample functions have no return value anymore, the result can simply be returned instead of using pointer indirection. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.535079278@linutronix.de
2019-08-28posix-cpu-timers: Simplify sample functionsThomas Gleixner
All callers hand in a valdiated clock id. Remove the return value which was unchecked in most places anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.430475832@linutronix.de
2019-08-28posix-cpu-timers: Remove pointless return value checkThomas Gleixner
set_process_cpu_timer() checks already whether the clock id is valid. No point in checking the return value of the sample function. That allows to simplify the sample function later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.339725769@linutronix.de
2019-08-28posix-cpu-timers: Use clock ID in posix_cpu_timer_rearm()Thomas Gleixner
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it as argument to the sample functions. That allows to simplify them once all callers are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.245357769@linutronix.de
2019-08-28posix-cpu-timers: Use clock ID in posix_cpu_timer_get()Thomas Gleixner
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it as argument to the sample functions. That allows to simplify them once all callers are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.155487201@linutronix.de
2019-08-28posix-cpu-timers: Use clock ID in posix_cpu_timer_set()Thomas Gleixner
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it as argument to the sample functions. That allows to simplify them once all callers are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192920.050770464@linutronix.de
2019-08-28posix-cpu-timers: Consolidate thread group sample codeThomas Gleixner
cpu_clock_sample_group() and cpu_timer_sample_group() are almost the same. Before the rename one called thread_group_cputimer() and the other thread_group_cputime(). Really intuitive function names. Consolidate the functions and also avoid the thread traversal when the thread group's accounting is already active. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.960966884@linutronix.de
2019-08-28posix-cpu-timers: Rename thread_group_cputimer() and make it staticThomas Gleixner
thread_group_cputimer() is a complete misnomer. The function does two things: - For arming process wide timers it makes sure that the atomic time storage is up to date. If no cpu timer is armed yet, then the atomic time storage is not updated by the scheduler for performance reasons. In that case a full summing up of all threads needs to be done and the update needs to be enabled. - Samples the current time into the caller supplied storage. Rename it to thread_group_start_cputime(), make it static and fixup the callsite. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.869350319@linutronix.de
2019-08-28posix-cpu-timers: Sample directly in timer checkThomas Gleixner
The thread group accounting is active, otherwise the expiry function would not be running. Sample the thread group time directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.780348088@linutronix.de
2019-08-28itimers: Use quick sample functionThomas Gleixner
get_itimer() locks sighand lock and checks whether the timer is already expired. If it is not expired then the thread group cputime accounting is already enabled. Use the sampling function not the one which is meant for starting a timer. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.689713638@linutronix.de
2019-08-28posix-cpu-timers: Provide quick sample function for itimerThomas Gleixner
get_itimer() needs a sample of the current thread group cputime. It invokes thread_group_cputimer() - which is a misnomer. That function also starts eventually the group cputime accouting which is bogus because the accounting is already active when a timer is armed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.599658199@linutronix.de
2019-08-28posix-cpu-timers: Use common permission check in posix_cpu_timer_create()Thomas Gleixner
Yet another copy of the same thing gone... Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.505833418@linutronix.de
2019-08-28posix-cpu-timers: Use common permission check in posix_cpu_clock_get()Thomas Gleixner
Replace the next slightly different copy of permission checks. That also removes the necessarity to check the return value of the sample functions because the clock id is already validated. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192919.414813172@linutronix.de
2019-08-28posix-cpu-timers: Provide task validation functionsThomas Gleixner
The code contains three slightly different copies of validating whether a given clock resolves to a valid task and whether the current caller has permissions to access it. Create central functions. Replace check_clock() as a first step and rename it to something sensible. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190821192919.326097175@linutronix.de
2019-08-28perf: Allow normal events to output AUX dataAlexander Shishkin
In some cases, ordinary (non-AUX) events can generate data for AUX events. For example, PEBS events can come out as records in the Intel PT stream instead of their usual DS records, if configured to do so. One requirement for such events is to consistently schedule together, to ensure that the data from the "AUX output" events isn't lost while their corresponding AUX event is not scheduled. We use grouping to provide this guarantee: an "AUX output" event can be added to a group where an AUX event is a group leader, and provided that the former supports writing to the latter. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: kan.liang@linux.intel.com Link: https://lkml.kernel.org/r/20190806084606.4021-2-alexander.shishkin@linux.intel.com
2019-08-28sched/cpufreq: Align trace event behavior of fast switchingDouglas RAILLARD
Fast switching path only emits an event for the CPU of interest, whereas the regular path emits an event for all the CPUs that had their frequency changed, i.e. all the CPUs sharing the same policy. With the current behavior, looking at cpu_frequency event for a given CPU that is using the fast switching path will not give the correct frequency signal. Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-28bpf: introduce verifier internal test flagAlexei Starovoitov
Introduce BPF_F_TEST_STATE_FREQ flag to stress test parentage chain and state pruning. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Minor conflict in r8169, bug fix had two versions in net and net-next, take the net-next hunks. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Use 32-bit index for tails calls in s390 bpf JIT, from Ilya Leoshkevich. 2) Fix missed EPOLLOUT events in TCP, from Eric Dumazet. Same fix for SMC from Jason Baron. 3) ipv6_mc_may_pull() should return 0 for malformed packets, not -EINVAL. From Stefano Brivio. 4) Don't forget to unpin umem xdp pages in error path of xdp_umem_reg(). From Ivan Khoronzhuk. 5) Fix sta object leak in mac80211, from Johannes Berg. 6) Fix regression by not configuring PHYLINK on CPU port of bcm_sf2 switches. From Florian Fainelli. 7) Revert DMA sync removal from r8169 which was causing regressions on some MIPS Loongson platforms. From Heiner Kallweit. 8) Use after free in flow dissector, from Jakub Sitnicki. 9) Fix NULL derefs of net devices during ICMP processing across collect_md tunnels, from Hangbin Liu. 10) proto_register() memory leaks, from Zhang Lin. 11) Set NLM_F_MULTI flag in multipart netlink messages consistently, from John Fastabend. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits) r8152: Set memory to all 0xFFs on failed reg reads openvswitch: Fix conntrack cache with timeout ipv4: mpls: fix mpls_xmit for iptunnel nexthop: Fix nexthop_num_path for blackhole nexthops net: rds: add service level support in rds-info net: route dump netlink NLM_F_MULTI flag missing s390/qeth: reject oversized SNMP requests sock: fix potential memory leak in proto_register() MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode ipv4/icmp: fix rt dst dev null pointer dereference openvswitch: Fix log message in ovs conntrack bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0 bpf: fix use after free in prog symbol exposure bpf: fix precision tracking in presence of bpf2bpf calls flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH Revert "r8169: remove not needed call to dma_sync_single_for_device" ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev net/ncsi: Fix the payload copying for the request coming from Netlink qed: Add cleanup in qed_slowpath_start() ...
2019-08-27kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the ↵Marc Zyngier
first symbol An arm64 kernel configured with CONFIG_KPROBES=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set CONFIG_KALLSYMS_BASE_RELATIVE=y reports the following kprobe failure: [ 0.032677] kprobes: failed to populate blacklist: -22 [ 0.033376] Please take care of using kprobes. It appears that kprobe fails to retrieve the symbol at address 0xffff000010081000, despite this symbol being in System.map: ffff000010081000 T __exception_text_start This symbol is part of the first group of aliases in the kallsyms_offsets array (symbol names generated using ugly hacks in scripts/kallsyms.c): kallsyms_offsets: .long 0x1000 // do_undefinstr .long 0x1000 // efi_header_end .long 0x1000 // _stext .long 0x1000 // __exception_text_start .long 0x12b0 // do_cp15instr Looking at the implementation of get_symbol_pos(), it returns the lowest index for aliasing symbols. In this case, it return 0. But kallsyms_lookup_size_offset() considers 0 as a failure, which is obviously wrong (there is definitely a valid symbol living there). In turn, the kprobe blacklisting stops abruptly, hence the original error. A CONFIG_KALLSYMS_ALL kernel wouldn't fail as there is always some random symbols at the beginning of this array, which are never looked up via kallsyms_lookup_size_offset. Fix it by considering that get_symbol_pos() is always successful (which is consistent with the other uses of this function). Fixes: ffc5089196446 ("[PATCH] Create kallsyms_lookup_size_offset()") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
2019-08-27genirq/affinity: Spread vectors on node according to nr_cpu ratioMing Lei
Now __irq_build_affinity_masks() spreads vectors evenly per node, but there is a case that not all vectors have been spread when each numa node has a different number of CPUs which triggers the warning in the spreading code. Improve the spreading algorithm by - assigning vectors according to the ratio of the number of CPUs on a node to the number of remaining CPUs. - running the assignment from smaller nodes to bigger nodes to guarantee that every active node gets allocated at least one vector. This ensures that all vectors are spread out. Asided of that the spread becomes more fair if the nodes have different number of CPUs. For example, on the following machine: CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 ... NUMA node0 CPU(s): 0,1,3,5-9,11,13-15 NUMA node1 CPU(s): 2,4,10,12 When a driver requests to allocate 8 vectors, the following spread results: irq 31, cpu list 2,4 irq 32, cpu list 10,12 irq 33, cpu list 0-1 irq 34, cpu list 3,5 irq 35, cpu list 6-7 irq 36, cpu list 8-9 irq 37, cpu list 11,13 irq 38, cpu list 14-15 So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original algorithm assigned 4 vectors on each node which was unfair versus Node 0. [ tglx: Massaged changelog ] Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-27genirq/affinity: Improve __irq_build_affinity_masks()Ming Lei
One invariant of __irq_build_affinity_masks() is that all CPUs in the specified masks (cpu_mask AND node_to_cpumask for each node) should be covered during the spread. Even though all requested vectors have been reached, it's still required to spread vectors among remained CPUs. A similar policy has been taken in case of 'numvecs <= nodes' already. So remove the following check inside the loop: if (done >= numvecs) break; Meantime assign at least 1 vector for remaining nodes if 'numvecs' vectors have been handled already. Also, if the specified cpumask for one numa node is empty, simply do not spread vectors on this node. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190816022849.14075-2-ming.lei@redhat.com
2019-08-26bpf: handle 32-bit zext during constant blindingNaveen N. Rao
Since BPF constant blinding is performed after the verifier pass, the ALU32 instructions inserted for doubleword immediate loads don't have a corresponding zext instruction. This is causing a kernel oops on powerpc and can be reproduced by running 'test_cgroup_storage' with bpf_jit_harden=2. Fix this by emitting BPF_ZEXT during constant blinding if prog->aux->verifier_zext is set. Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result") Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-25Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping fix from Thomas Gleixner: "A single fix for a regression caused by the generic VDSO implementation where a math overflow causes CLOCK_BOOTTIME to become a random number generator" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
2019-08-25Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "Handle the worker management in situations where a task is scheduled out on a PI lock contention correctly and schedule a new worker if possible" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Schedule new worker even if PI-blocked
2019-08-25Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two small fixes for kprobes and perf: - Prevent a deadlock in kprobe_optimizer() causes by reverse lock ordering - Fix a comment typo" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kprobes: Fix potential deadlock in kprobe_optimizer() perf/x86: Fix typo in comment
2019-08-25Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "A single fix for a imbalanced kobject operation in the irq decriptor code which was unearthed by the new warnings in the kobject code" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Properly pair kobject_del() with kobject_add()
2019-08-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Mergr misc fixes from Andrew Morton: "11 fixes" Mostly VM fixes, one psi polling fix, and one parisc build fix. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y mm/zsmalloc.c: fix race condition in zs_destroy_pool mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely mm, page_owner: handle THP splits correctly userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx psi: get poll_work to run when calling poll syscall next time mm: memcontrol: flush percpu vmevents before releasing memcg mm: memcontrol: flush percpu vmstats before releasing memcg parisc: fix compilation errrors mm, page_alloc: move_freepages should not examine struct page of reserved memory mm/z3fold.c: fix race between migration and destruction
2019-08-24Merge tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds
Pull dma-mapping fixes from Christoph Hellwig: "Two fixes for regressions in this merge window: - select the Kconfig symbols for the noncoherent dma arch helpers on arm if swiotlb is selected, not just for LPAE to not break then Xen build, that uses swiotlb indirectly through swiotlb-xen - fix the page allocator fallback in dma_alloc_contiguous if the CMA allocation fails" * tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping: dma-direct: fix zone selection after an unaddressable CMA allocation arm: select the dma-noncoherent symbols for all swiotlb builds
2019-08-24psi: get poll_work to run when calling poll syscall next timeJason Xing
Only when calling the poll syscall the first time can user receive POLLPRI correctly. After that, user always fails to acquire the event signal. Reproduce case: 1. Get the monitor code in Documentation/accounting/psi.txt 2. Run it, and wait for the event triggered. 3. Kill and restart the process. The question is why we can end up with poll_scheduled = 1 but the work not running (which would reset it to 0). And the answer is because the scheduling side sees group->poll_kworker under RCU protection and then schedules it, but here we cancel the work and destroy the worker. The cancel needs to pair with resetting the poll_scheduled flag. Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-08-24 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei. 2) Fix a use-after-free in prog symbol exposure, from Daniel. 3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya. 4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan. 5) Fix a potential use-after-free on flow dissector detach, from Jakub. 6) Fix bpftool to close prog fd after showing metadata, from Quentin. 7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24bpf: fix use after free in prog symbol exposureDaniel Borkmann
syzkaller managed to trigger the warning in bpf_jit_free() which checks via bpf_prog_kallsyms_verify_off() for potentially unlinked JITed BPF progs in kallsyms, and subsequently trips over GPF when walking kallsyms entries: [...] 8021q: adding VLAN 0 to HW filter on device batadv0 8021q: adding VLAN 0 to HW filter on device batadv0 WARNING: CPU: 0 PID: 9869 at kernel/bpf/core.c:810 bpf_jit_free+0x1e8/0x2a0 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events bpf_prog_free_deferred Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x113/0x167 lib/dump_stack.c:113 panic+0x212/0x40b kernel/panic.c:214 __warn.cold.8+0x1b/0x38 kernel/panic.c:571 report_bug+0x1a4/0x200 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271 do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973 RIP: 0010:bpf_jit_free+0x1e8/0x2a0 Code: 02 4c 89 e2 83 e2 07 38 d0 7f 08 84 c0 0f 85 86 00 00 00 48 ba 00 02 00 00 00 00 ad de 0f b6 43 02 49 39 d6 0f 84 5f fe ff ff <0f> 0b e9 58 fe ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 RSP: 0018:ffff888092f67cd8 EFLAGS: 00010202 RAX: 0000000000000007 RBX: ffffc90001947000 RCX: ffffffff816e9d88 RDX: dead000000000200 RSI: 0000000000000008 RDI: ffff88808769f7f0 RBP: ffff888092f67d00 R08: fffffbfff1394059 R09: fffffbfff1394058 R10: fffffbfff1394058 R11: ffffffff89ca02c7 R12: ffffc90001947002 R13: ffffc90001947020 R14: ffffffff881eca80 R15: ffff88808769f7e8 BUG: unable to handle kernel paging request at fffffbfff400d000 #PF error: [normal kernel read fault] PGD 21ffee067 P4D 21ffee067 PUD 21ffed067 PMD 9f942067 PTE 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events bpf_prog_free_deferred RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:495 [inline] RIP: 0010:bpf_tree_comp kernel/bpf/core.c:558 [inline] RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline] RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline] RIP: 0010:bpf_prog_kallsyms_find+0x107/0x2e0 kernel/bpf/core.c:632 Code: 00 f0 ff ff 44 38 c8 7f 08 84 c0 0f 85 fa 00 00 00 41 f6 45 02 01 75 02 0f 0b 48 39 da 0f 82 92 00 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 74 08 3c 03 0f 8e 45 01 00 00 8b 03 48 c1 e0 [...] Upon further debugging, it turns out that whenever we trigger this issue, the kallsyms removal in bpf_prog_ksym_node_del() was /skipped/ but yet bpf_jit_free() reported that the entry is /in use/. Problem is that symbol exposure via bpf_prog_kallsyms_add() but also perf_event_bpf_event() were done /after/ bpf_prog_new_fd(). Once the fd is exposed to the public, a parallel close request came in right before we attempted to do the bpf_prog_kallsyms_add(). Given at this time the prog reference count is one, we start to rip everything underneath us via bpf_prog_release() -> bpf_prog_put(). The memory is eventually released via deferred free, so we're seeing that bpf_jit_free() has a kallsym entry because we added it from bpf_prog_load() but /after/ bpf_prog_put() from the remote CPU. Therefore, move both notifications /before/ we install the fd. The issue was never seen between bpf_prog_alloc_id() and bpf_prog_new_fd() because upon bpf_prog_get_fd_by_id() we'll take another reference to the BPF prog, so we're still holding the original reference from the bpf_prog_load(). Fixes: 6ee52e2a3fe4 ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT") Fixes: 74451e66d516 ("bpf: make jited programs visible in traces") Reported-by: syzbot+bd3bba6ff3fcea7a6ec6@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Song Liu <songliubraving@fb.com>
2019-08-24bpf: fix precision tracking in presence of bpf2bpf callsAlexei Starovoitov
While adding extra tests for precision tracking and extra infra to adjust verifier heuristics the existing test "calls: cross frame pruning - liveness propagation" started to fail. The root cause is the same as described in verifer.c comment: * Also if parent's curframe > frame where backtracking started, * the verifier need to mark registers in both frames, otherwise callees * may incorrectly prune callers. This is similar to * commit 7640ead93924 ("bpf: verifier: make sure callees don't prune with caller differences") * For now backtracking falls back into conservative marking. Turned out though that returning -ENOTSUPP from backtrack_insn() and doing mark_all_scalars_precise() in the current parentage chain is not enough. Depending on how is_state_visited() heuristic is creating parentage chain it's possible that callee will incorrectly prune caller. Fix the issue by setting precise=true earlier and more aggressively. Before this fix the precision tracking _within_ functions that don't do bpf2bpf calls would still work. Whereas now precision tracking is completely disabled when bpf2bpf calls are present anywhere in the program. No difference in cilium tests (they don't have bpf2bpf calls). No difference in test_progs though some of them have bpf2bpf calls, but precision tracking wasn't effective there. Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-23Merge tag 'modules-for-v5.3-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules fixes from Jessica Yu: "Fix BUG_ON() being triggered in frob_text() due to non-page-aligned module sections" * tag 'modules-for-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: modules: page-align module section allocations only for arches supporting strict module rwx modules: always page-align module section allocations
2019-08-23timekeeping/vsyscall: Prevent math overflow in BOOTTIME updateThomas Gleixner
The VDSO update for CLOCK_BOOTTIME has a overflow issue as it shifts the nanoseconds based boot time offset left by the clocksource shift. That overflows once the boot time offset becomes large enough. As a consequence CLOCK_BOOTTIME in the VDSO becomes a random number causing applications to misbehave. Fix it by storing a timespec64 representation of the offset when boot time is adjusted and add that to the MONOTONIC base time value in the vdso data page. Using the timespec64 representation avoids a 64bit division in the update code. Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation") Reported-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Chris Clayton <chris2553@googlemail.com> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908221257580.1983@nanos.tec.linutronix.de
2019-08-22Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU and LKMM changes from Paul E. McKenney: - A few more RCU flavor consolidation cleanups. - Miscellaneous fixes. - Updates to RCU's list-traversal macros improving lockdep usability. - Torture-test updates. - Forward-progress improvements for no-CBs CPUs: Avoid ignoring incoming callbacks during grace-period waits. - Forward-progress improvements for no-CBs CPUs: Use ->cblist structure to take advantage of others' grace periods. - Also added a small commit that avoids needlessly inflicting scheduler-clock ticks on callback-offloaded CPUs. - Forward-progress improvements for no-CBs CPUs: Reduce contention on ->nocb_lock guarding ->cblist. - Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass list to further reduce contention on ->nocb_lock guarding ->cblist. - LKMM updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-21Merge branch 'odp_fixes' into hmm.gitJason Gunthorpe
From rdma.git Jason Gunthorpe says: ==================== This is a collection of general cleanups for ODP to clarify some of the flows around umem creation and use of the interval tree. ==================== The branch is based on v5.3-rc5 due to dependencies, and is being taken into hmm.git due to dependencies in the next patches. * odp_fixes: RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr RDMA/mlx5: Use ib_umem_start instead of umem.address RDMA/core: Make invalidate_range a device operation RDMA/odp: Use kvcalloc for the dma_list and page_list RDMA/odp: Check for overflow when computing the umem_odp end RDMA/odp: Provide ib_umem_odp_release() to undo the allocs RDMA/odp: Split creating a umem_odp from ib_umem_get RDMA/odp: Make the three ways to create a umem_odp clear RMDA/odp: Consolidate umem_odp initialization RDMA/odp: Make it clearer when a umem is an implicit ODP umem RDMA/odp: Iterate over the whole rbtree directly RDMA/odp: Use the common interval tree library instead of generic RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21posix-cpu-timers: Remove tsk argument from run_posix_cpu_timers()Thomas Gleixner
It's always current. Don't give people wrong ideas. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190819143801.945469967@linutronix.de
2019-08-21posix-cpu-timers: Sanitize bogus WARNONSThomas Gleixner
Warning when p == NULL and then proceeding and dereferencing p does not make any sense as the kernel will crash with a NULL pointer dereference right away. Bailing out when p == NULL and returning an error code does not cure the underlying problem which caused p to be NULL. Though it might allow to do proper debugging. Same applies to the clock id check in set_process_cpu_timer(). Clean them up and make them return without trying to do further damage. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190819143801.846497772@linutronix.de
2019-08-21bpf: clarify description for CONFIG_BPF_EVENTSPeter Wu
PERF_EVENT_IOC_SET_BPF supports uprobes since v4.3, and tracepoints since v4.7 via commit 04a22fae4cbc ("tracing, perf: Implement BPF programs attached to uprobes"), and commit 98b5c2c65c29 ("perf, bpf: allow bpf programs attach to tracepoints") respectively. Signed-off-by: Peter Wu <peter@lekensteyn.nl> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-21hrtimer: Don't take expiry_lock when timer is currently migratedJulien Grall
migration_base is used as a placeholder when an hrtimer is migrated to a different CPU. In the case that hrtimer_cancel_wait_running() hits a timer which is currently migrated it would pointlessly acquire the expiry lock of the migration base, which is even not initialized. Surely it could be initialized, but there is absolutely no point in acquiring this lock because the timer is guaranteed not to run it's callback for which the caller waits to finish on that base. So it would just do the inc/lock/dec/unlock dance for nothing. As the base switch is short and non-preemptible, there is no issue when the wait function returns immediately. The timer base and base->cpu_base cannot be NULL in the code path which is invoking that, so just replace those checks with a check whether base is migration base. [ tglx: Updated from RT patch. Massaged changelog. Added comment. ] Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190821092409.13225-4-julien.grall@arm.com
2019-08-21hrtimer: Protect lockless access to timer->baseJulien Grall
The update to timer->base is protected by the base->cpu_base->lock(). However, hrtimer_cancel_wait_running() does access it lockless. So the compiler is allowed to refetch timer->base which can cause havoc when the timer base is changed concurrently. Use READ_ONCE() to prevent this. [ tglx: Adapted from a RT patch ] Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190821092409.13225-2-julien.grall@arm.com
2019-08-21extable: Add function to search only kernel exception tableSantosh Sivaraj
Certain architecture specific operating modes (e.g., in powerpc machine check handler that is unable to access vmalloc memory), the search_exception_tables cannot be called because it also searches the module exception tables if entry is not found in the kernel exception table. Signed-off-by: Santosh Sivaraj <santosh@fossix.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190820081352.8641-5-santosh@fossix.org
2019-08-21modules: page-align module section allocations only for arches supporting ↵He Zhe
strict module rwx We should keep the case of "#define debug_align(X) (X)" for all arches without CONFIG_HAS_STRICT_MODULE_RWX ability, which would save people, who are sensitive to system size, a lot of memory when using modules, especially for embedded systems. This is also the intention of the original #ifdef... statement and still valid for now. Note that this still keeps the effect of the fix of the following commit, 38f054d549a8 ("modules: always page-align module section allocations"), since when CONFIG_ARCH_HAS_STRICT_MODULE_RWX is enabled, module pages are aligned. Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-08-21PM: QoS: Get rid of unused flagsAmit Kucheria
The network_latency and network_throughput flags for PM-QoS have not found much use in drivers or in userspace since they were introduced. Commit 4a733ef1bea7 ("mac80211: remove PM-QoS listener") removed the only user PM_QOS_NETWORK_LATENCY in the kernel a while ago and there don't seem to be any userspace tools using the character device files either. PM_QOS_MEMORY_BANDWIDTH was never even added to the trace events. Remove all the flags except cpu_dma_latency. Signed-off-by: Amit Kucheria <amit.kucheria@linaro.org> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21PM / wakeup: Show wakeup sources stats in sysfsTri Vo
Add an ID and a device pointer to 'struct wakeup_source'. Use them to to expose wakeup sources statistics in sysfs under /sys/class/wakeup/wakeup<ID>/*. Co-developed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Co-developed-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Tri Vo <trong@android.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21PM / wakeup: Use wakeup_source_register() in wakelock.cTri Vo
kernel/power/wakelock.c duplicates wakeup source creation and registration code from drivers/base/power/wakeup.c. Change struct wakelock's wakeup source to a pointer and use wakeup_source_register() function to create and register said wakeup source. Use wakeup_source_unregister() on cleanup path. Signed-off-by: Tri Vo <trong@android.com> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21dma-direct: fix zone selection after an unaddressable CMA allocationChristoph Hellwig
The new dma_alloc_contiguous hides if we allocate CMA or regular pages, and thus fails to retry a ZONE_NORMAL allocation if the CMA allocation succeeds but isn't addressable. That means we either fail outright or dip into a small zone that might not succeed either. Thanks to Hillf Danton for debugging this issue. Fixes: b1d2dc009dec ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers") Reported-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
2019-08-20posix-cpu-timers: Fixup stale commentThomas Gleixner
The comment above cleanup_timers() is outdated. The timers are only removed from the task/process list heads but not modified in any other way. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190819143801.747233612@linutronix.de