path: root/kernel
2020-01-20 irqdomain: Fix a memory leak in irq_domain_push_irq() (Kevin Hao)
Fix a memory leak reported by kmemleak:

    unreferenced object 0xffff000bc6f50e80 (size 128):
      comm "kworker/23:2", pid 201, jiffies 4294894947 (age 942.132s)
      hex dump (first 32 bytes):
        00 00 00 00 41 00 00 00 86 c0 03 00 00 00 00 00  ....A...........
        00 a0 b2 c6 0b 00 ff ff 40 51 fd 10 00 80 ff ff  ........@Q......
      backtrace:
        [<00000000e62d2240>] kmem_cache_alloc_trace+0x1a4/0x320
        [<00000000279143c9>] irq_domain_push_irq+0x7c/0x188
        [<00000000d9f4c154>] thunderx_gpio_probe+0x3ac/0x438
        [<00000000fd09ec22>] pci_device_probe+0xe4/0x198
        [<00000000d43eca75>] really_probe+0xdc/0x320
        [<00000000d3ebab09>] driver_probe_device+0x5c/0xf0
        [<000000005b3ecaa0>] __device_attach_driver+0x88/0xc0
        [<000000004e5915f5>] bus_for_each_drv+0x7c/0xc8
        [<0000000079d4db41>] __device_attach+0xe4/0x140
        [<00000000883bbda9>] device_initial_probe+0x18/0x20
        [<000000003be59ef6>] bus_probe_device+0x98/0xa0
        [<0000000039b03d3f>] deferred_probe_work_func+0x74/0xa8
        [<00000000870934ce>] process_one_work+0x1c8/0x470
        [<00000000e3cce570>] worker_thread+0x1f8/0x428
        [<000000005d64975e>] kthread+0xfc/0x128
        [<00000000f0eaa764>] ret_from_fork+0x10/0x18

Fixes: 495c38d3001f ("irqdomain: Add irq_domain_{push,pop}_irq() functions") Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200120043547.22271-1-haokexin@gmail.com
2020-01-20 module: avoid setting info->name early in case we can fall back to info->mod->name (Jessica Yu)
In setup_load_info(), info->name (which contains the name of the module, mostly used for early logging purposes before the module gets set up) gets unconditionally assigned if .modinfo is missing, despite the fact that there is an if (!info->name) check near the end of the function. Avoid assigning a placeholder string to info->name if .modinfo doesn't exist, so that we can fall back to info->mod->name later on. Fixes: 5fdc7db6448a ("module: setup load info before module_sig_check()") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-01-20 genirq: Introduce irq_domain_translate_onecell (Yash Shah)
Add a new function irq_domain_translate_onecell() that is to be used as the translate function in struct irq_domain_ops. Signed-off-by: Yash Shah <yash.shah@sifive.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1575976274-13487-2-git-send-email-yash.shah@sifive.com
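For reference, a onecell translate callback follows the same convention as the long-standing irq_domain_xlate_onecell(): the single fwspec parameter carries the hwirq number and no trigger type. A minimal sketch of that shape, based on the commit description rather than the verbatim upstream code:

    int irq_domain_translate_onecell(struct irq_domain *d,
                                     struct irq_fwspec *fwspec,
                                     unsigned long *out_hwirq,
                                     unsigned int *out_type)
    {
            if (WARN_ON(fwspec->param_count < 1))
                    return -EINVAL;
            /* One cell: the only parameter is the hwirq number. */
            *out_hwirq = fwspec->param[0];
            *out_type = IRQ_TYPE_NONE;
            return 0;
    }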
2020-01-20 Merge tag 'v5.5-rc7' into perf/core, to pick up fixes (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-20 sched/fair: Define sched_idle_cpu() only for SMP configurations (Viresh Kumar)
sched_idle_cpu() isn't used in non-SMP configurations, and with a recent change we have started getting the following warning:

    kernel/sched/fair.c:5221:12: warning: ‘sched_idle_cpu’ defined but not used [-Wunused-function]

Fix that by defining sched_idle_cpu() only for SMP configurations. Fixes: 323af6deaf70 ("sched/fair: Load balance aggressively for SCHED_IDLE CPUs") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/f0554f590687478b33914a4aff9f0e6a62886d44.1579499907.git.viresh.kumar@linaro.org
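The shape of such a fix is simply to move the helper under the existing CONFIG_SMP guard; an illustrative sketch (the helper body shown follows the description of the earlier patch, not the verbatim diff):

    #ifdef CONFIG_SMP
    /* Only the SMP load-balancing paths reference this helper. */
    static int sched_idle_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
                            rq->nr_running);
    }
    #endif /* CONFIG_SMP */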
2020-01-19 Revert "um: Enable CONFIG_CONSTRUCTORS" (Johannes Berg)
This reverts commit 786b2384bf1c ("um: Enable CONFIG_CONSTRUCTORS"). There are two issues with this commit, uncovered by Anton in tests on some (Debian) systems:

 1) I completely forgot to call any constructors if CONFIG_CONSTRUCTORS isn't set. Don't recall now if it just wasn't needed on my system, or if I never tested this case.
 2) With that fixed, it works - with CONFIG_CONSTRUCTORS *unset*. If I set CONFIG_CONSTRUCTORS, it fails again, which isn't totally unexpected since whatever wanted to run is likely to have to run before the kernel init etc. that calls the constructors in this case.

Basically, some constructors that gcc emits (libc has?) need to run very early during init; the failure mode otherwise was that the ptrace fork test already failed:

    ----------------------
    $ ./linux mem=512M
    Core dump limits :
            soft - 0
            hard - NONE
    Checking that ptrace can change system call numbers...check_ptrace : child exited with exitcode 6, while expecting 0; status 0x67f
    Aborted
    ----------------------

Thinking more about this, it's clear that we simply cannot support CONFIG_CONSTRUCTORS in UML. All the cases we need now (gcov, kasan) involve not the use of __attribute__((constructor)), but instead some constructor code/entry generated by gcc. Therefore, we cannot distinguish between kernel constructors and system constructors. Thus, revert this commit. Cc: stable@vger.kernel.org [5.4+] Fixes: 786b2384bf1c ("um: Enable CONFIG_CONSTRUCTORS") Reported-by: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Anton Ivanov <anton.ivanov@cambridgegreys.co.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-19 Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net (David S. Miller)
2020-01-19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (Linus Torvalds)
Pull networking fixes from David Miller:

 1) Fix non-blocking connect() in x25, from Martin Schiller.
 2) Fix spurious decryption errors in kTLS, from Jakub Kicinski.
 3) Netfilter use-after-free in mtype_destroy(), from Cong Wang.
 4) Limit size of TSO packets properly in lan78xx driver, from Eric Dumazet.
 5) r8152 probe needs an endpoint sanity check, from Johan Hovold.
 6) Prevent looping in tcp_bpf_unhash() during sockmap/tls free, from John Fastabend.
 7) hns3 needs short frames padded on transmit, from Yunsheng Lin.
 8) Fix netfilter ICMP header corruption, from Eyal Birger.
 9) Fix soft lockup when low on memory in hns3, from Yonglong Liu.
10) Fix NTUPLE firmware command failures in bnxt_en, from Michael Chan.
11) Fix memory leak in act_ctinfo, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
  cxgb4: reject overlapped queues in TC-MQPRIO offload
  cxgb4: fix Tx multi channel port rate limit
  net: sched: act_ctinfo: fix memory leak
  bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal.
  bnxt_en: Fix ipv6 RFS filter matching logic.
  bnxt_en: Fix NTUPLE firmware command failures.
  net: systemport: Fixed queue mapping in internal ring map
  net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec
  net: dsa: sja1105: Don't error out on disabled ports with no phy-mode
  net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset
  net: hns: fix soft lockup when there is not enough memory
  net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()
  net/sched: act_ife: initalize ife->metalist earlier
  netfilter: nat: fix ICMP header corruption on ICMP errors
  net: wan: lapbether.c: Use built-in RCU list checking
  netfilter: nf_tables: fix flowtable list del corruption
  netfilter: nf_tables: fix memory leak in nf_tables_parse_netdev_hooks()
  netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
  netfilter: nft_tunnel: ERSPAN_VERSION must not be null
  netfilter: nft_tunnel: fix null-attribute check
  ...
2020-01-18 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull timer fixes from Ingo Molnar: "Three fixes: fix link failure on Alpha, fix a Sparse warning and annotate/robustify a lockless access in the NOHZ code"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/sched: Annotate lockless access to last_jiffies_update
  lib/vdso: Make __cvdso_clock_getres() static
  time/posix-stubs: Provide compat itimer supoprt for alpha
2020-01-18 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull cpu/SMT fix from Ingo Molnar: "Fix a build bug on CONFIG_HOTPLUG_SMT=y && !CONFIG_SYSFS kernels"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/SMT: Fix x86 link error without CONFIG_SYSFS
2020-01-18 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull perf fixes from Ingo Molnar: "Tooling fixes, three Intel uncore driver fixes, plus an AUX events fix uncovered by the perf fuzzer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Remove PCIe3 unit for SNR
  perf/x86/intel/uncore: Fix missing marker for snr_uncore_imc_freerunning_events
  perf/x86/intel/uncore: Add PCI ID of IMC for Xeon E3 V5 Family
  perf: Correctly handle failed perf_get_aux_event()
  perf hists: Fix variable name's inconsistency in hists__for_each() macro
  perf map: Set kmap->kmaps backpointer for main kernel map chunks
  perf report: Fix incorrectly added dimensions as switch perf data file
  tools lib traceevent: Fix memory leakage in filter_event
2020-01-18 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull locking fixes from Ingo Molnar: "Three fixes:

 - Fix an rwsem spin-on-owner crash, introduced in v5.4
 - Fix a lockdep bug when running out of stack_trace entries, introduced in v5.4
 - Docbook fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Fix kernel crash when spinning on RWSEM_OWNER_UNKNOWN
  futex: Fix kernel-doc notation warning
  locking/lockdep: Fix buffer overrun problem in stack_trace[]
2020-01-18 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull rseq fixes from Ingo Molnar: "Two rseq bugfixes:

 - CLONE_VM !CLONE_THREAD didn't work properly; the kernel would end up corrupting the TLS of the parent. Technically a change in the ABI, but the previous behavior couldn't reasonably have been relied on by applications, so this looks like a valid exception to the ABI rule.
 - Make the RSEQ_FLAG_UNREGISTER ABI behavior consistent with the handling of other flags. This is not thought to impact any applications either"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq: Unregister rseq for clone CLONE_VM
  rseq: Reject unknown flags on rseq unregister
2020-01-18 Merge tag 'for-linus-2020-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux (Linus Torvalds)
Pull thread fixes from Christian Brauner: "Here is an urgent fix for ptrace_may_access() permission checking. Commit 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") introduced the ability to opt out of audit messages for accesses to various proc files since they are not violations of policy. While doing so it switched the check from ns_capable() to has_ns_capability{_noaudit}(). That means it switched from checking the subjective credentials (ktask->cred) of the task to using the objective credentials (ktask->real_cred). This appears to be wrong. ptrace_has_cap() is currently only used in ptrace_may_access(), and is used to check whether the calling task (subject) has the CAP_SYS_PTRACE capability in the provided user namespace to operate on the target task (object). According to the cred.h comments this means the subjective credentials of the calling task need to be used. With this fix we switch ptrace_has_cap() to use security_capable() and thus back to using the subjective credentials. As one example where this might be particularly problematic, Jann pointed out that in combination with the upcoming IORING_OP_OPENAT{2} feature, this bug might allow unprivileged users to bypass the capability checks while asynchronously opening files like /proc/*/mem, because the capability checks for this would be performed against kernel credentials. To illustrate on the former point about this being exploitable: When io_uring creates a new context it records the subjective credentials of the caller. Later on, when it starts to do work it creates a kernel thread and registers a callback. The callback runs with kernel creds for ktask->real_cred and ktask->cred. To prevent this from becoming a full-blown 0-day io_uring will call override_cred() and override ktask->cred with the subjective credentials of the creator of the io_uring instance. With ptrace_has_cap() currently looking at ktask->real_cred this override will be ineffective and the caller will be able to open arbitrary proc files as mentioned above. Luckily, this is currently not exploitable but would be so once IORING_OP_OPENAT{2} land in v5.6. Let's fix it now. To minimize potential regressions I successfully ran the criu testsuite. criu makes heavy use of ptrace() and extensively hits ptrace_may_access() codepaths and has a good chance of detecting any regressions. Additionally, I successfully ran the ptrace and seccomp kernel tests"

* tag 'for-linus-2020-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
2020-01-18 ptrace: reintroduce usage of subjective credentials in ptrace_has_cap() (Christian Brauner)
Commit 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") introduced the ability to opt out of audit messages for accesses to various proc files since they are not violations of policy. While doing so it somehow switched the check from ns_capable() to has_ns_capability{_noaudit}(). That means it switched from checking the subjective credentials of the task to using the objective credentials. This is wrong, since ptrace_has_cap() is currently only used in ptrace_may_access(), and is used to check whether the calling task (subject) has the CAP_SYS_PTRACE capability in the provided user namespace to operate on the target task (object). According to the cred.h comments this would mean the subjective credentials of the calling task need to be used. This switches ptrace_has_cap() to use security_capable(). Because we only call ptrace_has_cap() in ptrace_may_access() and in there we already have a stable reference to the calling task's creds under rcu_read_lock() there's no need to go through another series of dereferences and rcu locking done in ns_capable{_noaudit}(). As one example where this might be particularly problematic, Jann pointed out that in combination with the upcoming IORING_OP_OPENAT feature, this bug might allow unprivileged users to bypass the capability checks while asynchronously opening files like /proc/*/mem, because the capability checks for this would be performed against kernel credentials. To illustrate on the former point about this being exploitable: When io_uring creates a new context it records the subjective credentials of the caller. Later on, when it starts to do work it creates a kernel thread and registers a callback. The callback runs with kernel creds for ktask->real_cred and ktask->cred. To prevent this from becoming a full-blown 0-day io_uring will call override_cred() and override ktask->cred with the subjective credentials of the creator of the io_uring instance. With ptrace_has_cap() currently looking at ktask->real_cred this override will be ineffective and the caller will be able to open arbitrary proc files as mentioned above. Luckily, this is currently not exploitable but will turn into a 0-day once IORING_OP_OPENAT{2} land in v5.6. Fix it now! Cc: Oleg Nesterov <oleg@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Jann Horn <jannh@google.com> Fixes: 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
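Sketched from the description above (details may differ from the final patch), the fixed helper takes the caller's subjective credentials and checks them via the LSM hook:

    static bool ptrace_has_cap(const struct cred *cred,
                               struct user_namespace *ns, unsigned int mode)
    {
            int ret;

            /* security_capable() checks the passed-in (subjective) creds. */
            if (mode & PTRACE_MODE_NOAUDIT)
                    ret = security_capable(cred, ns, CAP_SYS_PTRACE,
                                           CAP_OPT_NOAUDIT);
            else
                    ret = security_capable(cred, ns, CAP_SYS_PTRACE,
                                           CAP_OPT_NONE);

            return ret == 0;
    }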
2020-01-17 lib/vdso: Update coarse timekeeper unconditionally (Thomas Gleixner)
The low resolution parts of the VDSO, i.e.: clock_gettime(CLOCK_*_COARSE), clock_getres(), time() can be used even if there is no VDSO capable clocksource. But if an architecture opts out of the VDSO data update then this information becomes stale. This affects ARM when there is no architected timer available. The lack of update causes userspace to use stale data forever. Make the update of the low resolution parts unconditional and only skip the update of the high resolution parts if the architecture requests it. Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200114185946.765577901@linutronix.de
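In pseudo-form, and using the clarified semantics introduced by the following entry, the resulting update path looks like this (illustrative sketch only; update_coarse() and update_hres() are hypothetical stand-ins for the corresponding blocks of the generic update_vsyscall()):

    void update_vsyscall(struct timekeeper *tk)
    {
            struct vdso_data *vdata = __arch_get_k_vdso_data();

            /* Coarse clocks and time(): always kept up to date. */
            update_coarse(vdata, tk);

            /* High resolution parts: only if the arch wants them updated. */
            if (__arch_update_vdso_data())
                    update_hres(vdata, tk);

            __arch_sync_vdso_data(vdata);
    }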
2020-01-17 lib/vdso: Make __arch_update_vdso_data() logic understandable (Thomas Gleixner)
The function name suggests that this is a boolean checking whether the architecture asks for an update of the VDSO data, but it works the other way round. To spare further confusion invert the logic. Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200114185946.656652824@linutronix.de
2020-01-17 perf: Correctly handle failed perf_get_aux_event() (Mark Rutland)
Vince reports a worrying issue:

| so I was tracking down some odd behavior in the perf_fuzzer which turns
| out to be because perf_event_open() sometimes returns 0 (indicating a file
| descriptor of 0) even though as far as I can tell stdin is still open.

... and further the cause:

| error is triggered if aux_sample_size has non-zero value.
|
| seems to be this line in kernel/events/core.c:
|
| if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader))
|         goto err_locked;
|
| (note, err is never set)

This seems to be a thinko in commit ab43762ef010967e ("perf: Allow normal events to output AUX data") ... and we should probably return -EINVAL here, as this should only happen when the new event is mis-configured or does not have a compatible aux_event group leader. Fixes: ab43762ef010967e ("perf: Allow normal events to output AUX data") Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Tested-by: Vince Weaver <vincent.weaver@maine.edu>
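The fix is presumably the obvious one-liner: set the error code before taking the bail-out path quoted above.

    if (perf_need_aux_event(event) &&
        !perf_get_aux_event(event, group_leader)) {
            err = -EINVAL;
            goto err_locked;
    }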
2020-01-17 watchdog/softlockup: Enforce that timestamp is valid on boot (Thomas Gleixner)
Robert reported that during boot the watchdog timestamp is set to 0 for one second which is the indicator for a watchdog reset. The reason for this is that the timestamp is in seconds and the time is taken from sched clock and divided by ~1e9. sched clock starts at 0 which means that for the first second during boot the watchdog timestamp is 0, i.e. reset. Use ULONG_MAX as the reset indicator value so the watchdog works correctly right from the start. ULONG_MAX would only conflict with a real timestamp if the system reaches an uptime of 136 years on 32bit and almost eternity on 64bit. Reported-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/87o8v3uuzl.fsf@nanos.tec.linutronix.de
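Conceptually, the change replaces the "0 means reset" convention with a sentinel no real boot-relative timestamp can reach; a sketch (softlockup_mark_reset() and update_touch_ts() are hypothetical names for where the reset is requested and consumed):

    #define SOFTLOCKUP_RESET        ULONG_MAX

    static void softlockup_mark_reset(void)
    {
            /* hypothetical: request a reset of this CPU's timestamp */
            __this_cpu_write(watchdog_touch_ts, SOFTLOCKUP_RESET);
    }

    /* ... and in the watchdog timer function: */
    if (touch_ts == SOFTLOCKUP_RESET) {
            /* A reset was requested: re-arm and skip the lockup check. */
            update_touch_ts();      /* hypothetical helper */
            return HRTIMER_RESTART;
    }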
2020-01-17 locking/osq: Use optimized spinning loop for arm64 (Waiman Long)
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire) using wfe for spinlock that can boost performance of sibling threads by putting the current cpu to a wait state that is broken only when the monitored variable changes or an external event happens. OSQ has a more complicated spinning loop. Besides the lock value, it also checks for need_resched() and vcpu_is_preempted(). The check for need_resched() is not a problem as it is only set by the tick interrupt handler. That will be detected by the spinning cpu right after iret. The vcpu_is_preempted() check, however, is a problem as changes to the preempt state of the previous node will not affect the wait state. For ARM64, vcpu_is_preempted is not currently defined and so is a no-op. Will has indicated that he is planning to para-virtualize wfe instead of defining vcpu_is_preempted for PV support. So just add a comment in arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted() should not be defined as suggested. On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking microbenchmark was run for 10s with and without the patch. The performance numbers before the patch were:

    Running locktest with mutex [runtime = 10s, load = 1]
    Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
    Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s

After the patch, the numbers were:

    Running locktest with mutex [runtime = 10s, load = 1]
    Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
    Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s

So there was about 20% performance improvement. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/20200113150735.21956-1-longman@redhat.com
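On the code level, the idea is to fold all the exit conditions into the event-based wait primitive; roughly (a sketch, assuming arm64's smp_cond_load_relaxed() is implemented with wfe):

    /*
     * Wait for the lock to be handed to us, but break out of the wait
     * state when we need to reschedule or when the previous node's vCPU
     * gets preempted.  If the final load returns 0, the lock was not
     * handed over: fall through to the unqueue path.
     */
    if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
                              vcpu_is_preempted(node_cpu(node->prev))))
            return true;

    /* ... unqueue ... */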
2020-01-17 locking/qspinlock: Fix inaccessible URL of MCS lock paper (Waiman Long)
It turns out that the URL of the MCS lock paper listed in the source code is no longer accessible. I did get questions about where the paper was. This patch updates the URL to BZ 206115, which contains a copy of the paper from https://www.cs.rochester.edu/u/scott/papers/1991_TOCS_synch.pdf Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/20200107174914.4187-1-longman@redhat.com
2020-01-17 locking/lockdep: Fix lockdep_stats indentation problem (Waiman Long)
It was found that two lines in the output of /proc/lockdep_stats have an indentation problem:

    # cat /proc/lockdep_stats
    :
     in-process chains:                   25057
     stack-trace entries:                137827 [max: 524288]
     number of stack traces:        7973
     number of stack hash chains:   6355
     combined max dependencies:      1356414598
     hardirq-safe locks:                     57
     hardirq-unsafe locks:                 1286
    :

All the numbers displayed in /proc/lockdep_stats except the two stack trace numbers are formatted with a field width of 11. To properly align all the numbers, a field width of 11 is now added to the two stack trace numbers. Fixes: 8c779229d0f4 ("locking/lockdep: Report more stack trace statistics") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lkml.kernel.org/r/20191211213139.29934-1-longman@redhat.com
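I.e. the two seq_printf() calls gain the same 11-character field width used by the other counters; an illustrative sketch (the counter helper names are assumed from the commit named in the Fixes: tag):

    seq_printf(m, " number of stack traces:        %11llu\n",
               lockdep_stack_trace_count());
    seq_printf(m, " number of stack hash chains:   %11llu\n",
               lockdep_stack_hash_count());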
2020-01-17 locking/rwsem: Fix kernel crash when spinning on RWSEM_OWNER_UNKNOWN (Waiman Long)
The commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner") will allow a recently woken up waiting writer to spin on the owner. Unfortunately, if the owner happens to be RWSEM_OWNER_UNKNOWN, the code will incorrectly spin on it leading to a kernel crash. This is fixed by passing the proper non-spinnable bits to rwsem_spin_on_owner() so that RWSEM_OWNER_UNKNOWN will be treated as a non-spinnable target. Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner") Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200115154336.8679-1-longman@redhat.com
2020-01-17 sched/topology: Assert non-NUMA topology masks don't (partially) overlap (Valentin Schneider)
topology.c::get_group() relies on the assumption that non-NUMA domains do not partially overlap. Zeng Tao pointed out in [1] that such topology descriptions, while completely bogus, can end up being exposed to the scheduler. In his example (8 CPUs, 2-node system), we end up with:

    MC span for CPU3 == 3-7
    MC span for CPU4 == 4-7

The first pass through get_group(3, sdd@MC) will result in the following sched_group list:

    3 -> 4 -> 5 -> 6 -> 7
    ^                  /
    `-----------------'

And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

    3 -> 4 -> 5 -> 6 -> 7
         ^            /
         `-----------'

which will completely break things like 'while (sg != sd->groups)' when using CPU3's base sched_domain. There already are some architecture-specific checks in place, such as x86's arch/x86/kernel/smpboot.c::topology_sane(), but this is something we can detect in the core scheduler, so it seems worthwhile to do so. Warn and abort the construction of the sched domains if such a broken topology description is detected. Note that this is somewhat expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be gated under SCHED_DEBUG if deemed necessary.

Testing
=======

Dietmar managed to reproduce this using the following qemu incantation:

    $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
      -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
      cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
      node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be needed if '-smp cores=X, sockets=Y' would work with qemu):

    8<---
    @@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
                    if (cpuid_topo->package_id != cpu_topo->package_id)
                            continue;

    +               if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
    +                       continue;
    +
                    cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
                    cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
    8<---

[1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com Reported-by: Zeng Tao <prime.zeng@hisilicon.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
2020-01-17 idle: fix spelling mistake "iterrupts" -> "interrupts" (Hewenliang)
There is a spelling mistake in the comments of cpuidle_idle_call(). Fix it. Signed-off-by: Hewenliang <hewenliang4@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20200110025604.34373-1-hewenliang4@huawei.com
2020-01-17 sched/fair: Remove redundant call to cpufreq_update_util() (Vincent Guittot)
With commit bef69dd87828 ("sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()") update_load_avg() has become the central point for calling cpufreq (not including the update of blocked load). This change makes it possible to further reduce the number of calls to cpufreq_update_util() and to remove the last redundant ones. With update_load_avg(), we are now sure that cpufreq_update_util() will be called after every task attachment to a cfs_rq and especially after propagating this event down to the util_avg of the root cfs_rq, which is the level that is used by cpufreq governors like schedutil to set the frequency of a CPU. The SCHED_CPUFREQ_MIGRATION flag forces an early call to cpufreq when the migration happens in a cgroup whereas the util_avg of the root cfs_rq is not yet updated, and this call is duplicated with the one that happens immediately after when the migration event reaches the root cfs_rq. The dedicated flag SCHED_CPUFREQ_MIGRATION is now useless and can be removed. The interface of attach_entity_load_avg() can also be simplified accordingly. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/1579083620-24943-1-git-send-email-vincent.guittot@linaro.org
2020-01-17 sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled (Wang Long)
When CONFIG_PSI_DEFAULT_DISABLED is set to N or the command line sets psi=0, I think we should not create /proc/pressure and /proc/pressure/{io|memory|cpu}. In the future, users may determine whether the psi feature is enabled by checking for the existence of the /proc/pressure dir or the /proc/pressure/{io|memory|cpu} files. Signed-off-by: Wang Long <w@laoqinren.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1576672698-32504-1-git-send-email-w@laoqinren.net
2020-01-17 sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP (Peng Liu)
commit bf475ce0a3dd ("sched/fair: Add per-CPU min capacity to sched_group_capacity") introduced per-cpu min_capacity. commit e3d6d0cb66f2 ("sched/fair: Add sched_group per-CPU max capacity") introduced per-cpu max_capacity. In the SD_OVERLAP case, the local variable 'capacity' represents the sum of CPU capacity of all CPUs in the first sched group (sg) of the sched domain (sd). It is erroneously used to calculate sg's min and max CPU capacity. To fix this use capacity_of(cpu) instead of 'capacity'. The code which achieves this via cpu_rq(cpu)->sd->groups->sgc->capacity (for rq->sd != NULL) can be removed since it delivers the same value as capacity_of(cpu) which is currently only used for the (!rq->sd) case (see update_cpu_capacity()). An sg of the lowest sd (rq->sd or sd->child == NULL) represents a single CPU (and hence sg->sgc->capacity == capacity_of(cpu)). Signed-off-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20200104130828.GA7718@iZj6chx1xj0e0buvshuecpZ
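With the fix, the SD_OVERLAP path aggregates per-CPU capacity along these lines (a sketch of the change as described):

    for_each_cpu(cpu, sched_group_span(sdg)) {
            unsigned long cpu_cap = capacity_of(cpu);

            capacity += cpu_cap;
            min_capacity = min(cpu_cap, min_capacity);
            max_capacity = max(cpu_cap, max_capacity);
    }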
2020-01-17 sched/fair: calculate delta runnable load only when it's needed (Peng Wang)
Move the calculation of delta_sum/delta_avg to where it is actually needed. Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20200103114400.17668-1-rocking@linux.alibaba.com
2020-01-17 sched/cputime: move rq parameter in irqtime_account_process_tick (Alex Shi)
irqtime_account_process_tick() is always called in interrupt context, and every caller gets and passes the parameter rq = this_rq(). This is unnecessary and increases the code size a little bit. It is better to move the rq lookup into irqtime_account_process_tick() itself.

                base          with this patch
    cputime.o   578792 bytes  577888 bytes

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1577959674-255537-1-git-send-email-alex.shi@linux.alibaba.com
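A sketch of the signature change being described (illustrative, not the verbatim diff):

    /* Before: every caller computed rq = this_rq() and passed it in. */
    static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
                                             struct rq *rq, int ticks);

    /* After: the helper, always running in interrupt context, fetches it. */
    static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
                                             int ticks)
    {
            struct rq *rq = this_rq();
            /* ... accounting as before ... */
    }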
2020-01-17 stop_machine: Make stop_cpus() static (Yangtao Li)
The function stop_cpus() is only used internally by stop_machine to stop multiple CPUs. Make it static. Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191228161912.24082-1-tiny.windzz@gmail.com
2020-01-17 sched/debug: Reset watchdog on all CPUs while processing sysrq-t (Wei Li)
Lengthy output of sysrq-t may take a lot of time on a slow serial console with lots of processes and CPUs. So we need to reset the NMI watchdog to avoid spurious lockup messages, and we also reset the softlockup watchdogs on all other CPUs, since another CPU might be blocked waiting for us to process an IPI or stop_machine. Add this to sysrq_sched_debug_show(), as was already done in show_state_filter(). Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com
2020-01-17 sched/core: Fix size of rq::uclamp initialization (Li Guanglei)
rq::uclamp is an array of struct uclamp_rq; make sure we clear the whole thing. Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting") Signed-off-by: Li Guanglei <guanglei.li@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
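The classic pattern behind such a bug, sketched (rq->uclamp holds one uclamp_rq per clamp id):

    /* Buggy: sizeof(struct uclamp_rq) clears only the first element. */
    memset(&rq->uclamp, 0, sizeof(struct uclamp_rq));

    /* Fixed: sizeof on the member itself covers the whole array. */
    memset(&rq->uclamp, 0, sizeof(rq->uclamp));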
2020-01-17 sched/uclamp: Fix a bug in propagating uclamp value in new cgroups (Qais Yousef)
When a new cgroup is created, the effective uclamp value wasn't updated with a call to cpu_util_update_eff() that looks at the hierarchy and updates to the most restrictive values. Fix it by ensuring to call cpu_util_update_eff() when a new cgroup becomes online. Without this change, the newly created cgroup uses the default root_task_group uclamp values, which is 1024 for both uclamp_{min, max}, which will cause the rq to be clamped to max, hence cause the system to run at max frequency. The problem was observed on Ubuntu server and was reproduced on Debian and Buildroot rootfs. By default, Ubuntu and Debian create a cpu controller cgroup hierarchy and add all tasks to it - which creates enough noise to keep the rq uclamp value at max most of the time. Imitating this behavior makes the problem visible in Buildroot too, which otherwise looks fine since it's a minimal userspace. Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps") Reported-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Doug Smythies <dsmythies@telus.net> Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
2020-01-17 sched/fair: Load balance aggressively for SCHED_IDLE CPUs (Viresh Kumar)
The fair scheduler performs periodic load balance on every CPU to check if it can pull some tasks from other busy CPUs. The duration of this periodic load balance is set to sd->balance_interval for the idle CPUs and is calculated by multiplying the sd->balance_interval with the sd->busy_factor (set to 32 by default) for the busy CPUs. The multiplication is done for busy CPUs to avoid doing load balance too often and rather spend more time executing the actual tasks. While that is the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE CPU instead of other busy or idle CPUs. The same reasoning should be applied to the load balancer as well to make it migrate tasks more aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling latency of the migrated (SCHED_OTHER) tasks. This patch makes minimal changes to the fair scheduler to do the next load balance soon after the last non SCHED_IDLE task is dequeued from a runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is ignored while calculating the balance_interval for such CPUs. This is done to avoid delaying the periodic load balance by a few hundred milliseconds for SCHED_IDLE CPUs. This is tested on the ARM64 Hikey620 platform (octa-core) with the help of rt-app, and it is verified, using kernel traces, that the newly SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE and pulls tasks from other busy CPUs. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/e485827eb8fe7db0943d6f3f6e0f5a4a70272781.1578471925.git.viresh.kumar@linaro.org
2020-01-17 sched/fair : Improve update_sd_pick_busiest for spare capacity case (Vincent Guittot)
Similarly to calculate_imbalance() and find_busiest_group(), using the number of idle CPUs when there is only 1 CPU in the group is not efficient, because we can't make a difference between a CPU running 1 task and a CPU running dozens of small tasks competing for the same CPU but not enough to overload it. More generally speaking, we should use the number of running tasks when there is the same number of idle CPUs in a group instead of blindly selecting the 1st one. When the groups have spare capacity and the same number of idle CPUs, we compare the number of running tasks to select the busiest group. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1576839893-26930-1-git-send-email-vincent.guittot@linaro.org
2020-01-17 watchdog: Remove soft_lockup_hrtimer_cnt and related code (Jisheng Zhang)
After commit 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work"), the percpu soft_lockup_hrtimer_cnt is not used any more, so remove it and related code. Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191218131720.4146aea2@xhacker.debian
2020-01-17 tracing: Initialize ret in syscall_enter_define_fields() (Steven Rostedt (VMware))
If syscall_enter_define_fields() is called on a system call with no arguments, the return code variable "ret" will never get initialized. Initialize it to zero. Fixes: 04ae87a52074e ("ftrace: Rework event_create_dir()") Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/0FA8C6E3-D9F5-416D-A1B0-5E4CD583A101@lca.pw
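I.e., a sketch of the described one-liner in context (the loop shape is assumed from the function's purpose, not copied from the tree):

    static int syscall_enter_define_fields(struct trace_event_call *call)
    {
            struct syscall_metadata *meta = call->data;
            int ret = 0;    /* a zero-argument syscall never enters the loop */
            int i, offset = offsetof(struct syscall_trace_enter, args);

            for (i = 0; i < meta->nb_args; i++) {
                    ret = trace_define_field(call, meta->types[i], meta->args[i],
                                             offset, sizeof(unsigned long), 0,
                                             FILTER_OTHER);
                    if (ret)
                            break;
                    offset += sizeof(unsigned long);
            }

            return ret;
    }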
2020-01-16 bpf: Remove set but not used variable 'first_key' (YueHaibing)
    kernel/bpf/syscall.c: In function generic_map_lookup_batch:
    kernel/bpf/syscall.c:1339:7: warning: variable first_key set but not used [-Wunused-but-set-variable]

It is never used, so remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Brian Vazquez <brianvv@google.com> Link: https://lore.kernel.org/bpf/20200116145300.59056-1-yuehaibing@huawei.com
2020-01-16 devmap: Adjust tracepoint for map-less queue flush (Jesper Dangaard Brouer)
Now that we don't have a reference to a devmap when flushing the device bulk queue, let's change the devmap_xmit tracepoint to remove the map_id and map_index fields entirely. Rearrange the fields so 'drops' and 'sent' stay in the same position in the tracepoint struct, to make it possible for the xdp_monitor utility to read both the old and the new format. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/157918768613.1458396.9165902403373826572.stgit@toke.dk
2020-01-16 xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths (Toke Høiland-Jørgensen)
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device, we can re-use the bulking for the non-map version of the bpf_redirect() helper. This is a simple matter of having xdp_do_redirect_slow() queue the frame on the bulk queue instead of sending it out with __bpf_tx_xdp(). Unfortunately we can't make the bpf_redirect() helper return an error if the ifindex doesn't exist (as bpf_redirect_map() does), because we don't have a reference to the network namespace of the ingress device at the time the helper is called. So we have to leave it as-is and keep the device lookup in xdp_do_redirect_slow(). This leaves less reason to have the non-map redirect code in a separate function, so we get rid of the xdp_do_redirect_slow() function entirely. This does lose us the tracepoint disambiguation, but fortunately the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint entry structures. This means both can contain a map index, so we can just amend the tracepoint definitions so we always emit the xdp_redirect(_err) tracepoints, but with the map ID only populated if a map is present. This means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep the definitions around in case someone is still listening for them. With this change, the performance of the xdp_redirect sample program goes from 5Mpps to 8.4Mpps (a 68% increase). Since the flush functions are no longer map-specific, rename the flush() functions to drop _map from their names. One of the renamed functions is the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To keep from having to update all drivers, use a #define to keep the old name working, and only update the virtual drivers in this patch. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
2020-01-16 xdp: Move devmap bulk queue into struct net_device (Toke Høiland-Jørgensen)
Commit 96360004b862 ("xdp: Make devmap flush_list common for all map instances"), changed devmap flushing to be a global operation instead of a per-map operation. However, the queue structure used for bulking was still allocated as part of the containing map. This patch moves the devmap bulk queue into struct net_device. The motivation for this is reusing it for the non-map variant of XDP_REDIRECT, which will be changed in a subsequent commit. To avoid other fields of struct net_device moving to different cache lines, we also move a couple of other members around. We defer the actual allocation of the bulk queue structure until the NETDEV_REGISTER notification in devmap.c. This makes it possible to check for ndo_xdp_xmit support before allocating the structure, which is not possible at the time struct net_device is allocated. However, we keep the freeing in free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER. Because of this change, we lose the reference back to the map that originated the redirect, so change the tracepoint to always return 0 as the map ID and index. Otherwise no functional change is intended with this patch. After this patch, the relevant part of struct net_device looks like this, according to pahole:

    /* --- cacheline 14 boundary (896 bytes) --- */
    struct netdev_queue *       _tx __attribute__((__aligned__(64))); /*  896    8 */
    unsigned int                num_tx_queues;        /*  904    4 */
    unsigned int                real_num_tx_queues;   /*  908    4 */
    struct Qdisc *              qdisc;                /*  912    8 */
    unsigned int                tx_queue_len;         /*  920    4 */
    spinlock_t                  tx_global_lock;       /*  924    4 */
    struct xdp_dev_bulk_queue * xdp_bulkq;            /*  928    8 */
    struct xps_dev_maps *       xps_cpus_map;         /*  936    8 */
    struct xps_dev_maps *       xps_rxqs_map;         /*  944    8 */
    struct mini_Qdisc *         miniq_egress;         /*  952    8 */
    /* --- cacheline 15 boundary (960 bytes) --- */
    struct hlist_head           qdisc_hash[16];       /*  960  128 */
    /* --- cacheline 17 boundary (1088 bytes) --- */
    struct timer_list           watchdog_timer;       /* 1088   40 */
    /* XXX last struct has 4 bytes of padding */
    int                         watchdog_timeo;       /* 1128    4 */
    /* XXX 4 bytes hole, try to pack */
    struct list_head            todo_list;            /* 1136   16 */
    /* --- cacheline 18 boundary (1152 bytes) --- */

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/157918768397.1458396.12673224324627072349.stgit@toke.dk
2020-01-16 PM: hibernate: fix crashes with init_on_free=1 (Alexander Potapenko)
Upon resuming from hibernation, free pages may contain stale data from the kernel that initiated the resume. This breaks the invariant inflicted by init_on_free=1 that freed pages must be zeroed. To deal with this problem, make clear_free_pages() also clear the free pages when init_on_free is enabled. Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Reported-by: Johannes Stezenbach <js@sig21.net> Signed-off-by: Alexander Potapenko <glider@google.com> Cc: 5.3+ <stable@vger.kernel.org> # 5.3+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
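Conceptually, the gating condition in the hibernation code becomes (a sketch; clear_pages_in_bitmap() is a hypothetical stand-in for the existing page-clearing loop):

    /* Clear free pages if either page poisoning or init_on_free demands it. */
    if (page_poisoning_enabled() || want_init_on_free())
            clear_pages_in_bitmap(free_pages_map);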
2020-01-16 PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior (Jonas Meurer)
The sysfs attribute `/sys/power/sync_on_suspend` controls whether or not filesystems are synced by the kernel before system suspend. Correspondingly, the behaviour of the build-time switch CONFIG_SUSPEND_SKIP_SYNC is slightly changed: it now defines the run-time default for the new sysfs attribute `/sys/power/sync_on_suspend`. The run-time attribute is added because the existing corresponding build-time Kconfig flag (`CONFIG_SUSPEND_SKIP_SYNC`) is not flexible enough. E.g. Linux distributions that provide pre-compiled kernels usually want to stick with the default (sync filesystems before suspend) but under special conditions this needs to be changed. One example for such a special condition is user-space handling of suspending block devices (e.g. using `cryptsetup luksSuspend` or `dmsetup suspend`) before system suspend. The kernel trying to sync filesystems after the underlying block device already got suspended obviously leads to deadlocks. Be aware that you have to take care of the filesystem sync yourself before suspending the system in those scenarios. Signed-off-by: Jonas Meurer <jonas@freesources.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
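From user space the attribute is toggled like any other sysfs file; for example (a hypothetical helper, assuming a kernel that carries this attribute):

    #include <fcntl.h>
    #include <unistd.h>

    /* Disable the pre-suspend sync, e.g. before 'cryptsetup luksSuspend'. */
    static int set_sync_on_suspend(int enabled)
    {
            int fd = open("/sys/power/sync_on_suspend", O_WRONLY);
            ssize_t n;

            if (fd < 0)
                    return -1;
            n = write(fd, enabled ? "1" : "0", 1);
            close(fd);
            return n == 1 ? 0 : -1;
    }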
2020-01-16 watchdog/softlockup: Remove obsolete check of last reported task (Petr Mladek)
commit 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work") ensures that the watchdog is reliably touched during a task switch. As a result, the check for an unnoticed task switch is no longer needed. Remove the relevant code, which effectively reverts commit b1a8de1f5343 ("softlockup: make detector be aware of task switch of processes hogging cpu"). Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20191024114928.15377-2-pmladek@suse.com
2020-01-16 tracing: Allow trace_printk() to nest in other tracing code (Steven Rostedt (VMware))
trace_printk() is used to debug the kernel, which includes the tracing infrastructure. But because it writes to the ring buffer, and so does much of the tracing infrastructure, the ring buffer's recursive detection will drop writes to the ring buffer that are in the same context as the current write (it allows interrupts to write when normal context is writing, but won't let normal context write while normal context is writing). This can cause confusion, making it look like the code where the trace_printk() exists is never hit. To solve this, up the recursive nesting of the ring buffer when trace_printk() is called, before it writes to the buffer itself. Note, this does make it dangerous to use trace_printk() in the ring buffer code itself, because this basically disables the recursion protection of trace_printk() buffer writes. But as trace_printk() is only used for debugging, and if this does occur, the developer will see the cause real quick (recursive blowing up of the stack). Thus the developer can deal with that. But having trace_printk() silently ignored is a much bigger problem, and disabling recursive protection is a small price to pay to fix it. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-16 watchdog: Remove soft_lockup_hrtimer_cnt and related code (Jisheng Zhang)
After commit 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work"), the percpu soft_lockup_hrtimer_cnt is not used any more, so remove it and related code. Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20191218131720.4146aea2@xhacker.debian
2020-01-16 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (David S. Miller)
Daniel Borkmann says:

====================
pull-request: bpf 2020-01-15

The following pull-request contains BPF updates for your *net* tree. We've added 12 non-merge commits during the last 9 day(s) which contain a total of 13 files changed, 95 insertions(+), 43 deletions(-).

The main changes are:

1) Fix refcount leak for TCP time wait and request sockets for socket lookup related BPF helpers, from Lorenz Bauer.

2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.

3) Batch of several sockmap and related TLS fixes found while operating more complex BPF programs with Cilium and OpenSSL, from John Fastabend.

4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue() to avoid purging data upon teardown, from Lingpeng Chen.

5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly dump a BPF map's value with BTF, from Martin KaFai Lau.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 bpf: Add batch ops to all htab bpf map (Yonghong Song)
htab can't use the generic batch support due to some problematic behaviours inherent to the data structure, i.e. while iterating the bpf map a concurrent program might delete the next entry that the batch was about to use; in that case there's no easy solution to retrieve the next entry, and the issue has been discussed multiple times (see [1] and [2]). The only way the map can be traversed without the problem previously exposed is by making sure that the map is traversed in entire buckets. This commit implements those strict requirements for htab; the implementation follows the same interaction as the generic support, with some exceptions:

 - If the keys/values buffers are not big enough to traverse a bucket, ENOSPC will be returned.
 - out_batch contains the value of the next bucket in the iteration, not the next key, but this is transparent for the user since the user should never use out_batch for anything other than bpf batch syscalls.

This commit implements BPF_MAP_LOOKUP_BATCH and adds support for the new command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete batch ops it is possible to use the generic implementations. [1] https://lore.kernel.org/bpf/20190724165803.87470-1-brianvv@google.com/ [2] https://lore.kernel.org/bpf/20190906225434.3635421-1-yhs@fb.com/ Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200115184308.162644-6-brianvv@google.com
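From user space the new commands are reachable via the bpf() syscall; with the contemporaneous libbpf wrapper, the first lookup step looks roughly like this (a sketch; buffer sizing and the key/value types are illustrative assumptions):

    #define BATCH_SIZE 64                     /* hypothetical sizing */
    __u32 keys[BATCH_SIZE], values[BATCH_SIZE];
    __u32 out_batch, count = BATCH_SIZE;
    DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts);

    /* in_batch == NULL starts the iteration at the first bucket. */
    int err = bpf_map_lookup_batch(map_fd, NULL, &out_batch,
                                   keys, values, &count, &opts);
    /* ENOSPC: the buffers cannot hold one full bucket; retry with larger ones. */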
2020-01-15 bpf: Add lookup and update batch ops to arraymap (Brian Vazquez)
This adds the generic batch ops functionality to bpf arraymap; note that since deletion is not a valid operation for arraymap, only batch lookup and update are added. Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200115184308.162644-5-brianvv@google.com