summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-06-22futex: Prepare futex_lock_pi() for runtime clock selectionThomas Gleixner
futex_lock_pi() is the only futex operation which cannot select the clock for timeouts (CLOCK_MONOTONIC/CLOCK_REALTIME). That's inconsistent and there is no particular reason why this cannot be supported. This was overlooked when CLOCK_REALTIME_FLAG was introduced and unfortunately not reported when the inconsistency was discovered in glibc. Prepare the function and enforce the CLOCK_REALTIME_FLAG on FUTEX_LOCK_PI so that a new FUTEX_LOCK_PI2 can implement it correctly. Reported-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210422194705.338657741@linutronix.de
2021-06-22lockdep: Fix wait-type for empty stackPeter Zijlstra
Even the very first lock can violate the wait-context check, consider the various IRQ contexts. Fixes: de8f5e4f2dc1 ("lockdep: Introduce wait-type checks") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Joerg Roedel <jroedel@suse.de> Link: https://lore.kernel.org/r/20210617190313.256987481@infradead.org
2021-06-22lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage()Boqun Feng
In the step #3 of check_irq_usage(), we seach backwards to find a lock whose usage conflicts the usage of @target_entry1 on safe/unsafe. However, we should only keep the irq-unsafe usage of @target_entry1 into consideration, because it could be a case where a lock is hardirq-unsafe but soft-safe, and in check_irq_usage() we find it because its hardirq-unsafe could result into a hardirq-safe-unsafe deadlock, but currently since we don't filter out the other usage bits, so we may find a lock dependency path softirq-unsafe -> softirq-safe, which in fact doesn't cause a deadlock. And this may cause misleading lockdep splats. Fix this by only keeping LOCKF_ENABLED_IRQ_ALL bits when we try the backwards search. Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210618170110.3699115-4-boqun.feng@gmail.com
2021-06-22locking/lockdep: Remove the unnecessary trace savingBoqun Feng
In print_bad_irq_dependency(), save_trace() is called to set the ->trace for @prev_root as the current call trace, however @prev_root corresponds to the the held lock, which may not be acquired in current call trace, therefore it's wrong to use save_trace() to set ->trace of @prev_root. Moreover, with our adjustment of printing backwards dependency path, the ->trace of @prev_root is unncessary, so remove it. Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210618170110.3699115-3-boqun.feng@gmail.com
2021-06-22locking/lockdep: Fix the dep path printing for backwards BFSBoqun Feng
We use the same code to print backwards lock dependency path as the forwards lock dependency path, and this could result into incorrect printing because for a backwards lock_list ->trace is not the call trace where the lock of ->class is acquired. Fix this by introducing a separate function on printing the backwards dependency path. Also add a few comments about the printing while we are at it. Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210618170110.3699115-2-boqun.feng@gmail.com
2021-06-22sched/uclamp: Fix uclamp_tg_restrict()Qais Yousef
Now cpu.uclamp.min acts as a protection, we need to make sure that the uclamp request of the task is within the allowed range of the cgroup, that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and tg->uclamp[UCLAMP_MAX]. As reported by Xuewen [1] we can have some corner cases where there's inversion between uclamp requested by task (p) and the uclamp values of the taskgroup it's attached to (tg). Following table demonstrates 2 corner cases: | p | tg | effective -----------+-----+------+----------- CASE 1 -----------+-----+------+----------- uclamp_min | 60% | 0% | 60% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- CASE 2 -----------+-----+------+----------- uclamp_min | 0% | 30% | 30% -----------+-----+------+----------- uclamp_max | 20% | 50% | 20% -----------+-----+------+----------- With this fix we get: | p | tg | effective -----------+-----+------+----------- CASE 1 -----------+-----+------+----------- uclamp_min | 60% | 0% | 50% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- CASE 2 -----------+-----+------+----------- uclamp_min | 0% | 30% | 30% -----------+-----+------+----------- uclamp_max | 20% | 50% | 30% -----------+-----+------+----------- Additionally uclamp_update_active_tasks() must now unconditionally update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for instance could have an impact on the effective UCLAMP_MIN of the tasks. | p | tg | effective -----------+-----+------+----------- old -----------+-----+------+----------- uclamp_min | 60% | 0% | 50% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- *new* -----------+-----+------+----------- uclamp_min | 60% | 0% | *60%* -----------+-----+------+----------- uclamp_max | 80% |*70%* | *70%* -----------+-----+------+----------- [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/ Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min") Reported-by: Xuewen Yan <xuewen.yan94@gmail.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
2021-06-22sched/rt: Fix Deadline utilization tracking during policy changeVincent Donnefort
DL keeps track of the utilization on a per-rq basis with the structure avg_dl. This utilization is updated during task_tick_dl(), put_prev_task_dl() and set_next_task_dl(). However, when the current running task changes its policy, set_next_task_dl() which would usually take care of updating the utilization when the rq starts running DL tasks, will not see a such change, leaving the avg_dl structure outdated. When that very same task will be dequeued later, put_prev_task_dl() will then update the utilization, based on a wrong last_update_time, leading to a huge spike in the DL utilization signal. The signal would eventually recover from this issue after few ms. Even if no DL tasks are run, avg_dl is also updated in __update_blocked_others(). But as the CPU capacity depends partly on the avg_dl, this issue has nonetheless a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to DL. Fixes: 3727e0e ("sched/dl: Add dl_rq utilization tracking") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/1624271872-211872-3-git-send-email-vincent.donnefort@arm.com
2021-06-22sched/rt: Fix RT utilization tracking during policy changeVincent Donnefort
RT keeps track of the utilization on a per-rq basis with the structure avg_rt. This utilization is updated during task_tick_rt(), put_prev_task_rt() and set_next_task_rt(). However, when the current running task changes its policy, set_next_task_rt() which would usually take care of updating the utilization when the rq starts running RT tasks, will not see a such change, leaving the avg_rt structure outdated. When that very same task will be dequeued later, put_prev_task_rt() will then update the utilization, based on a wrong last_update_time, leading to a huge spike in the RT utilization signal. The signal would eventually recover from this issue after few ms. Even if no RT tasks are run, avg_rt is also updated in __update_blocked_others(). But as the CPU capacity depends partly on the avg_rt, this issue has nonetheless a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to RT. Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
2021-06-22clockevents: Add missing parameter documentationBaokun Li
Add the missing documentation for the @cpu parameter of tick_cleanup_dead_cpu(). Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210608024305.2750999-1-libaokun1@huawei.com
2021-06-22signal: Prevent sigqueue caching after task got releasedThomas Gleixner
syzbot reported a memory leak related to sigqueue caching. The assumption that a task cannot cache a sigqueue after the signal handler has been dropped and exit_task_sigqueue_cache() has been invoked turns out to be wrong. Such a task can still invoke release_task(other_task), which cleans up the signals of 'other_task' and ends up in sigqueue_cache_or_free(), which in turn will cache the signal because task->sigqueue_cache is NULL. That's obviously bogus because nothing will free the cached signal of that task anymore, so the cached item is leaked. This happens when e.g. the last non-leader thread exits and reaps the zombie leader. Prevent this by setting tsk::sigqueue_cache to an error pointer value in exit_task_sigqueue_cache() which forces any subsequent invocation of sigqueue_cache_or_free() from that task to hand the sigqueue back to the kmemcache. Add comments to all relevant places. Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct") Reported-by: syzbot+0bac5fec63d4f399ba98@syzkaller.appspotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/878s32g6j5.ffs@nanos.tec.linutronix.de
2021-06-22bpf: Fix regression on BPF_OBJ_GET with non-O_RDWR flagsMaciej Żenczykowski
This reverts commit d37300ed1821 ("bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET"). It breaks Android userspace which expects to be able to fetch programs with just read permissions. See: https://cs.android.com/android/platform/superproject/+/master:frameworks/libs/net/common/native/bpf_syscall_wrappers/include/BpfSyscallWrappers.h;drc=7005c764be23d31fa1d69e826b4a2f6689a8c81e;l=124 Side-note: another option to fix it would be to extend bpf_prog_new_fd() and to pass in used file mode flags in the same way as we do for maps via bpf_map_new_fd(). Meaning, they'd end up in anon_inode_getfd() and thus would be retained for prog fd operations with bpf() syscall. Right now these flags are not checked with progs since they are immutable for their lifetime (as opposed to maps which can be updated from user space). In future this could potentially change with new features, but at that point it's still fine to do the bpf_prog_new_fd() extension when needed. For a simple stable fix, a revert is less churn. Fixes: d37300ed1821 ("bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET") Signed-off-by: Maciej Żenczykowski <maze@google.com> [ Daniel: added side-note to commit message ] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Lorenz Bauer <lmb@cloudflare.com> Acked-by: Greg Kroah-Hartman <gregkh@google.com> Link: https://lore.kernel.org/bpf/20210618105526.265003-1-zenczykowski@gmail.com
2021-06-22sched/fair: Ensure that the CFS parent is added after unthrottlingRik van Riel
Ensure that a CFS parent will be in the list whenever one of its children is also in the list. A warning on rq->tmp_alone_branch != &rq->leaf_cfs_rq_list has been reported while running LTP test cfs_bandwidth01. Odin Ugedal found the root cause: $ tree /sys/fs/cgroup/ltp/ -d --charset=ascii /sys/fs/cgroup/ltp/ |-- drain `-- test-6851 `-- level2 |-- level3a | |-- worker1 | `-- worker2 `-- level3b `-- worker3 Timeline (ish): - worker3 gets throttled - level3b is decayed, since it has no more load - level2 get throttled - worker3 get unthrottled - level2 get unthrottled - worker3 is added to list - level3b is not added to list, since nr_running==0 and is decayed [ Vincent Guittot: Rebased and updated to fix for the reported warning. ] Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Acked-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210621174330.11258-1-vincent.guittot@linaro.org
2021-06-22locking/lockdep: Improve noinstr vs errorsPeter Zijlstra
Better handle the failure paths. vmlinux.o: warning: objtool: debug_locks_off()+0x23: call to console_verbose() leaves .noinstr.text section vmlinux.o: warning: objtool: debug_locks_off()+0x19: call to __kasan_check_write() leaves .noinstr.text section debug_locks_off+0x19/0x40: instrument_atomic_write at include/linux/instrumented.h:86 (inlined by) __debug_locks_off at include/linux/debug_locks.h:17 (inlined by) debug_locks_off at lib/debug_locks.c:41 Fixes: 6eebad1ad303 ("lockdep: __always_inline more for noinstr") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210621120120.784404944@infradead.org
2021-06-22printk: fix cpu lock orderingJohn Ogness
The cpu lock implementation uses a full memory barrier to take the lock, but no memory barriers when releasing the lock. This means that changes performed by a lock owner may not be seen by the next lock owner. This may have been "good enough" for use by dump_stack() as a serialization mechanism, but it is not enough to provide proper protection for a critical section. Correct this problem by using acquire/release memory barriers for lock/unlock, respectively. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210617095051.4808-3-john.ogness@linutronix.de
2021-06-22lib/dump_stack: move cpu lock to printk.cJohn Ogness
dump_stack() implements its own cpu-reentrant spinning lock to best-effort serialize stack traces in the printk log. However, there are other functions (such as show_regs()) that can also benefit from this serialization. Move the cpu-reentrant spinning lock (cpu lock) into new helper functions printk_cpu_lock_irqsave()/printk_cpu_unlock_irqrestore() so that it is available for others as well. For !CONFIG_SMP the cpu lock is a NOP. Note that having multiple cpu locks in the system can easily lead to deadlock. Code needing a cpu lock should use the printk cpu lock, since the printk cpu lock could be acquired from any code and any context. Also note that it is not necessary for a cpu lock to disable interrupts. However, in upcoming work this cpu lock will be used for emergency tasks (for example, atomic consoles during kernel crashes) and any interruptions while holding the cpu lock should be avoided if possible. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> [pmladek@suse.com: Backported on top of 5.13-rc1.] Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210617095051.4808-2-john.ogness@linutronix.de
2021-06-22dma-debug: report -EEXIST errors in add_dma_entryHamza Mahfooz
Since, overlapping mappings are not supported by the DMA API we should report an error if active_cacheline_insert returns -EEXIST. Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-22dma-mapping: remove a trailing spaceZhen Lei
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-21swiotlb: manipulate orig_addr when tlb_addr has offsetBumyong Lee
in case of driver wants to sync part of ranges with offset, swiotlb_tbl_sync_single() copies from orig_addr base to tlb_addr with offset and ends up with data mismatch. It was removed from "swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single", but said logic has to be added back in. From Linus's email: "That commit which the removed the offset calculation entirely, because the old (unsigned long)tlb_addr & (IO_TLB_SIZE - 1) was wrong, but instead of removing it, I think it should have just fixed it to be (tlb_addr - mem->start) & (IO_TLB_SIZE - 1); instead. That way the slot offset always matches the slot index calculation." (Unfortunatly that broke NVMe). The use-case that drivers are hitting is as follow: 1. Get dma_addr_t from dma_map_single() dma_addr_t tlb_addr = dma_map_single(dev, vaddr, vsize, DMA_TO_DEVICE); |<---------------vsize------------->| +-----------------------------------+ | | original buffer +-----------------------------------+ vaddr swiotlb_align_offset |<----->|<---------------vsize------------->| +-------+-----------------------------------+ | | | swiotlb buffer +-------+-----------------------------------+ tlb_addr 2. Do something 3. Sync dma_addr_t through dma_sync_single_for_device(..) dma_sync_single_for_device(dev, tlb_addr + offset, size, DMA_TO_DEVICE); Error case. Copy data to original buffer but it is from base addr (instead of base addr + offset) in original buffer: swiotlb_align_offset |<----->|<- offset ->|<- size ->| +-------+-----------------------------------+ | | |##########| | swiotlb buffer +-------+-----------------------------------+ tlb_addr |<- size ->| +-----------------------------------+ |##########| | original buffer +-----------------------------------+ vaddr The fix is to copy the data to the original buffer and take into account the offset, like so: swiotlb_align_offset |<----->|<- offset ->|<- size ->| +-------+-----------------------------------+ | | |##########| | swiotlb buffer +-------+-----------------------------------+ tlb_addr |<- offset ->|<- size ->| +-----------------------------------+ | |##########| | original buffer +-----------------------------------+ vaddr [One fix which was Linus's that made more sense to as it created a symmetry would break NVMe. The reason for that is the: unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1); would come up with the proper offset, but it would lose the alignment (which this patch contains).] Fixes: 16fc3cef33a0 ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single") Signed-off-by: Bumyong Lee <bumyong.lee@samsung.com> Signed-off-by: Chanho Park <chanho61.park@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Reported-by: Horia Geantă <horia.geanta@nxp.com> Tested-by: Horia Geantă <horia.geanta@nxp.com> CC: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-06-21reboot: Add hardware protection power-offMatti Vaittinen
There can be few cases when we need to shut-down the system in order to protect the hardware. Currently this is done at least by the thermal core when temperature raises over certain limit. Some PMICs can also generate interrupts for example for over-current or over-voltage, voltage drops, short-circuit, ... etc. On some systems these are a sign of hardware failure and only thing to do is try to protect the rest of the hardware by shutting down the system. Add shut-down logic which can be used by all subsystems instead of implementing the shutdown in each subsystem. The logic is stolen from thermal_core with difference of using atomic_t instead of a mutex in order to allow calls directly from IRQ context and changing the WARN() to pr_emerg() as discussed here: https://lore.kernel.org/lkml/YJuPwAZroVZ%2Fw633@alley/ and here: https://lore.kernel.org/linux-iommu/20210331093104.383705-4-geert+renesas@glider.be/ Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://lore.kernel.org/r/e83ec1ca9408f90c857ea9dcdc57b14d9037b03f.1622628333.git.matti.vaittinen@fi.rohmeurope.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-06-21cpu/hotplug: Cure the cpusets trainwreckThomas Gleixner
Alexey and Joshua tried to solve a cpusets related hotplug problem which is user space visible and results in unexpected behaviour for some time after a CPU has been plugged in and the corresponding uevent was delivered. cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a workqueue. This is done because the cpusets code has already a lock nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or waiting for the work to finish with cpu_hotplug_lock held can and will deadlock because that results in the reverse lock order. As a consequence the uevent can be delivered before cpusets have consistent state which means that a user space invocation of sched_setaffinity() to move a task to the plugged CPU fails up to the point where the scheduled work has been processed. The same is true for CPU unplug, but that does not create user observable failure (yet). It's still inconsistent to claim that an operation is finished before it actually is and that's the real issue at hand. uevents just make it reliably observable. Obviously the problem should be fixed in cpusets/cgroups, but untangling that is pretty much impossible because according to the changelog of the commit which introduced this 8 years ago: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()") the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and the whole code is built around that. So bite the bullet and invoke the relevant cpuset function, which waits for the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and only when tasks are not frozen by suspend/hibernate because that would obviously wait forever. Waiting there with cpu_add_remove_lock, which is protecting the present and possible CPU maps, held is not a problem at all because neither work queues nor cpusets/cgroups have any lockchains related to that lock. Waiting in the hotplug machinery is not problematic either because there are already state callbacks which wait for hardware queues to drain. It makes the operations slightly slower, but hotplug is slow anyway. This ensures that state is consistent before returning from a hotplug up/down operation. It's still inconsistent during the operation, but that's a different story. Add a large comment which explains why this is done and why this is not a dump ground for the hack of the day to work around half thought out locking schemes. Document also the implications vs. hotplug operations and serialization or the lack of it. Thanks to Alexy and Joshua for analyzing why this temporary sched_setaffinity() failure happened. Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()") Reported-by: Alexey Klimov <aklimov@redhat.com> Reported-by: Joshua Baker <jobaker@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Alexey Klimov <aklimov@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
2021-06-20Merge tag 'sched_urgent_for_v5.13_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "A single fix to restore fairness between control groups with equal priority" * tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Correctly insert cfs_rq's to list on unthrottle
2021-06-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-18Merge tag 'net-5.13-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.13-rc7, including fixes from wireless, bpf, bluetooth, netfilter and can. Current release - regressions: - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class() to fix modifying offloaded qdiscs - lantiq: net: fix duplicated skb in rx descriptor ring - rtnetlink: fix regression in bridge VLAN configuration, empty info is not an error, bot-generated "fix" was not needed - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem creation Current release - new code bugs: - ethtool: fix NULL pointer dereference during module EEPROM dump via the new netlink API - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue should not be visible to the stack - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs - mlx5e: verify dev is present in get devlink port ndo, avoid a panic Previous releases - regressions: - neighbour: allow NUD_NOARP entries to be force GCed - further fixes for fallout from reorg of WiFi locking (staging: rtl8723bs, mac80211, cfg80211) - skbuff: fix incorrect msg_zerocopy copy notifications - mac80211: fix NULL ptr deref for injected rate info - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs Previous releases - always broken: - bpf: more speculative execution fixes - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local - udp: fix race between close() and udp_abort() resulting in a panic - fix out of bounds when parsing TCP options before packets are validated (in netfilter: synproxy, tc: sch_cake and mptcp) - mptcp: improve operation under memory pressure, add missing wake-ups - mptcp: fix double-lock/soft lookup in subflow_error_report() - bridge: fix races (null pointer deref and UAF) in vlan tunnel egress - ena: fix DMA mapping function issues in XDP - rds: fix memory leak in rds_recvmsg Misc: - vrf: allow larger MTUs - icmp: don't send out ICMP messages with a source address of 0.0.0.0 - cdc_ncm: switch to eth%d interface naming" * tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits) net: ethernet: fix potential use-after-free in ec_bhf_remove selftests/net: Add icmp.sh for testing ICMP dummy address responses icmp: don't send out ICMP messages with a source address of 0.0.0.0 net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY net: ll_temac: Fix TX BD buffer overwrite net: ll_temac: Add memory-barriers for TX BD access net: ll_temac: Make sure to free skb when it is completely used MAINTAINERS: add Guvenc as SMC maintainer bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path bnxt_en: Fix TQM fastpath ring backing store computation bnxt_en: Rediscover PHY capabilities after firmware reset cxgb4: fix wrong shift. mac80211: handle various extensible elements correctly mac80211: reset profile_periodicity/ema_ap cfg80211: avoid double free of PMSR request cfg80211: make certificate generation more robust mac80211: minstrel_ht: fix sample time check net: qed: Fix memcpy() overflow of qed_dcbx_params() net: cdc_eem: fix tx fixup skb leak net: hamradio: fix memory leak in mkiss_close ...
2021-06-18Merge tag 'trace-v5.13-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Have recordmcount check for valid st_shndx otherwise some archs may have invalid references for the mcount location. - Two fixes done for mapping pids to task names. Traces were not showing the names of tasks when they should have. - Fix to trace_clock_global() to prevent it from going backwards * tag 'trace-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Do no increment trace_clock_global() by one tracing: Do not stop recording comms if the trace file is being read tracing: Do not stop recording cmdlines when tracing is off recordmcount: Correct st_shndx handling
2021-06-18Merge tag 'printk-for-5.13-fixup' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fixup from Petr Mladek: "Fix misplaced EXPORT_SYMBOL(vsprintf)" * tag 'printk-for-5.13-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: Move EXPORT_SYMBOL() closer to vprintk definition
2021-06-18Merge tag 'pm-5.13-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Remove recently added frequency invariance support from the CPPC cpufreq driver, because it has turned out to be problematic and it cannot be fixed properly on time for 5.13 (Viresh Kumar)" * tag 'pm-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: Revert "cpufreq: CPPC: Add support for frequency invariance"
2021-06-18tracing: Do no increment trace_clock_global() by oneSteven Rostedt (VMware)
The trace_clock_global() tries to make sure the events between CPUs is somewhat in order. A global value is used and updated by the latest read of a clock. If one CPU is ahead by a little, and is read by another CPU, a lock is taken, and if the timestamp of the other CPU is behind, it will simply use the other CPUs timestamp. The lock is also only taken with a "trylock" due to tracing, and strange recursions can happen. The lock is not taken at all in NMI context. In the case where the lock is not able to be taken, the non synced timestamp is returned. But it will not be less than the saved global timestamp. The problem arises because when the time goes "backwards" the time returned is the saved timestamp plus 1. If the lock is not taken, and the plus one to the timestamp is returned, there's a small race that can cause the time to go backwards! CPU0 CPU1 ---- ---- trace_clock_global() { ts = clock() [ 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] <interrupted by NMI> trace_clock_global() { ts = clock() [ 999 ] if (ts < global_ts) ts = global_ts + 1 [ 1001 ] trylock(clock_lock) [ fail ] return ts [ 1001] } unlock(clock_lock); return ts; [ 1000 ] } trace_clock_global() { ts = clock() [ 1000 ] if (ts < global_ts) [ false 1000 == 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] unlock(clock_lock) return ts; [ 1000 ] } The above case shows to reads of trace_clock_global() on the same CPU, but the second read returns one less than the first read. That is, time when backwards, and this is not what is allowed by trace_clock_global(). This was triggered by heavy tracing and the ring buffer checker that tests for the clock going backwards: Ring buffer clock went backwards: 20613921464 -> 20613921463 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0 Modules linked in: [..] [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463 [20613915818] PAGE TIME STAMP [20613915818] delta:0 [20613915819] delta:1 [20613916035] delta:216 [20613916465] delta:430 [20613916575] delta:110 [20613916749] delta:174 [20613917248] delta:499 [20613917333] delta:85 [20613917775] delta:442 [20613917921] delta:146 [20613918321] delta:400 [20613918568] delta:247 [20613918768] delta:200 [20613919306] delta:538 [20613919353] delta:47 [20613919980] delta:627 [20613920296] delta:316 [20613920571] delta:275 [20613920862] delta:291 [20613921152] delta:290 [20613921464] delta:312 [20613921464] delta:0 TIME EXTEND [20613921464] delta:0 This happened more than once, and always for an off by one result. It also started happening after commit aafe104aa9096 was added. Cc: stable@vger.kernel.org Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-18tracing: Do not stop recording comms if the trace file is being readSteven Rostedt (VMware)
A while ago, when the "trace" file was opened, tracing was stopped, and code was added to stop recording the comms to saved_cmdlines, for mapping of the pids to the task name. Code has been added that only records the comm if a trace event occurred, and there's no reason to not trace it if the trace file is opened. Cc: stable@vger.kernel.org Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-18tracing: Do not stop recording cmdlines when tracing is offSteven Rostedt (VMware)
The saved_cmdlines is used to map pids to the task name, such that the output of the tracing does not just show pids, but also gives a human readable name for the task. If the name is not mapped, the output looks like this: <...>-1316 [005] ...2 132.044039: ... Instead of this: gnome-shell-1316 [005] ...2 132.044039: ... The names are updated when tracing is running, but are skipped if tracing is stopped. Unfortunately, this stops the recording of the names if the top level tracer is stopped, and not if there's other tracers active. The recording of a name only happens when a new event is written into a ring buffer, so there is no need to test if tracing is on or not. If tracing is off, then no event is written and no need to test if tracing is off or not. Remove the check, as it hides the names of tasks for events in the instance buffers. Cc: stable@vger.kernel.org Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-18sched: Change task_struct::statePeter Zijlstra
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18sched,timer: Use __set_current_state()Peter Zijlstra
There's an existing helper for setting TASK_RUNNING; must've gotten lost last time we did this cleanup. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.409696194@infradead.org
2021-06-18sched: Add get_current_state()Peter Zijlstra
Remove yet another few p->state accesses. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org
2021-06-18sched,perf,kvm: Fix preemption conditionPeter Zijlstra
When ran from the sched-out path (preempt_notifier or perf_event), p->state is irrelevant to determine preemption. You can get preempted with !task_is_running() just fine. The right indicator for preemption is if the task is still on the runqueue in the sched-out path. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20210611082838.285099381@infradead.org
2021-06-18sched: Introduce task_is_running()Peter Zijlstra
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
2021-06-18sched: Unbreak wakeupsPeter Zijlstra
Remove broken task->state references and let wake_up_process() DTRT. The anti-pattern in these patches breaks the ordering of ->state vs COND as described in the comment near set_current_state() and can lead to missed wakeups: (OoO load, observes RUNNING)<-. for (;;) { | t->state = UNINTERRUPTIBLE; | smp_mb(); ,-----> | (observes !COND) | / if (COND) ---------' | COND = 1; break; `- if (t->state != RUNNING) wake_up_process(t); // not done schedule(); // forever waiting } t->state = TASK_RUNNING; Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.160855222@infradead.org
2021-06-18Merge branch 'sched/urgent' into sched/core, to resolve conflictsIngo Molnar
This commit in sched/urgent moved the cfs_rq_is_decayed() function: a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and this fresh commit in sched/core modified it in the old location: 9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are") Merge the two variants. Conflicts: kernel/sched/fair.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-17tracing: Have ftrace_dump_on_oops kernel parameter take numbersSteven Rostedt (VMware)
The kernel parameter for ftrace_dump_on_oops can take a single assignment. That is, it can be: ftrace_dump_on_oops or ftrace_dump_on_oops=orig_cpu But the content in the sysctl file is a number. 0 for disabled 1 for ftrace_dump_on_oops (all CPUs) 2 for ftrace_dump_on_oops (orig CPU) Allow the kernel command line to take a number as well to match the sysctl numbers. That is: ftrace_dump_on_oops=1 is the same as ftrace_dump_on_oops and ftrace_dump_on_oops=2 is the same as ftrace_dump_on_oops=orig_cpu Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-17 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 25 day(s) which contain a total of 148 files changed, 4779 insertions(+), 1248 deletions(-). The main changes are: 1) BPF infrastructure to migrate TCP child sockets from a listener to another in the same reuseport group/map, from Kuniyuki Iwashima. 2) Add a provably sound, faster and more precise algorithm for tnum_mul() as noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan. 3) Streamline error reporting changes in libbpf as planned out in the 'libbpf: the road to v1.0' effort, from Andrii Nakryiko. 4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu. 5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek. 6) Support new LLVM relocations in libbpf to make them more linker friendly, also add a doc to describe the BPF backend relocations, from Yonghong Song. 7) Silence long standing KUBSAN complaints on register-based shifts in interpreter, from Daniel Borkmann and Eric Biggers. 8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when target arch cannot be determined, from Lorenz Bauer. 9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson. 10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi. 11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF programs to bpf_helpers.h header, from Florent Revest. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17tracing: Add tp_printk_stop_on_boot optionSteven Rostedt (VMware)
Add a kernel command line option that disables printing of events to console at late_initcall_sync(). This is useful when needing to see specific events written to console on boot up, but not wanting it when user space starts, as user space may make the console so noisy that the system becomes inoperable. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-17sched/fair: Age the average idle timePeter Zijlstra
This is a partial forward-port of Peter Ziljstra's work first posted at: https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/ Currently select_idle_cpu()'s proportional scheme uses the average idle time *for when we are idle*, that is temporally challenged. When a CPU is not at all idle, we'll happily continue using whatever value we did see when the CPU goes idle. To fix this, introduce a separate average idle and age it (the existing value still makes sense for things like new-idle balancing, which happens when we do go idle). The overall goal is to not spend more time scanning for idle CPUs than we're idle for. Otherwise we're inhibiting work. This means that we need to consider the cost over all the wake-ups between consecutive idle periods. To track this, the scan cost is subtracted from the estimated average idle time. The impact of this patch is related to workloads that have domains that are fully busy or overloaded. Without the patch, the scan depth may be too high because a CPU is not reaching idle. Due to the nature of the patch, this is a regression magnet. It potentially wins when domains are almost fully busy or overloaded -- at that point searches are likely to fail but idle is not being aged as CPUs are active so search depth is too large and useless. It will potentially show regressions when there are idle CPUs and a deep search is beneficial. This tbench result on a 2-socket broadwell machine partially illustates the problem 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Hmean 1 445.02 ( 0.00%) 451.36 * 1.42%* Hmean 2 830.69 ( 0.00%) 846.03 * 1.85%* Hmean 4 1350.80 ( 0.00%) 1505.56 * 11.46%* Hmean 8 2888.88 ( 0.00%) 2586.40 * -10.47%* Hmean 16 5248.18 ( 0.00%) 5305.26 * 1.09%* Hmean 32 8914.03 ( 0.00%) 9191.35 * 3.11%* Hmean 64 10663.10 ( 0.00%) 10192.65 * -4.41%* Hmean 128 18043.89 ( 0.00%) 18478.92 * 2.41%* Hmean 256 16530.89 ( 0.00%) 17637.16 * 6.69%* Hmean 320 16451.13 ( 0.00%) 17270.97 * 4.98%* Note that 8 was a regression point where a deeper search would have helped but it gains for high thread counts when searches are useless. Hackbench is a more extreme example although not perfect as the tasks idle rapidly hackbench-process-pipes 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Amean 1 0.3950 ( 0.00%) 0.3887 ( 1.60%) Amean 4 0.9450 ( 0.00%) 0.9677 ( -2.40%) Amean 7 1.4737 ( 0.00%) 1.4890 ( -1.04%) Amean 12 2.3507 ( 0.00%) 2.3360 * 0.62%* Amean 21 4.0807 ( 0.00%) 4.0993 * -0.46%* Amean 30 5.6820 ( 0.00%) 5.7510 * -1.21%* Amean 48 8.7913 ( 0.00%) 8.7383 ( 0.60%) Amean 79 14.3880 ( 0.00%) 13.9343 * 3.15%* Amean 110 21.2233 ( 0.00%) 19.4263 * 8.47%* Amean 141 28.2930 ( 0.00%) 25.1003 * 11.28%* Amean 172 34.7570 ( 0.00%) 30.7527 * 11.52%* Amean 203 41.0083 ( 0.00%) 36.4267 * 11.17%* Amean 234 47.7133 ( 0.00%) 42.0623 * 11.84%* Amean 265 53.0353 ( 0.00%) 47.7720 * 9.92%* Amean 296 60.0170 ( 0.00%) 53.4273 * 10.98%* Stddev 1 0.0052 ( 0.00%) 0.0025 ( 51.57%) Stddev 4 0.0357 ( 0.00%) 0.0370 ( -3.75%) Stddev 7 0.0190 ( 0.00%) 0.0298 ( -56.64%) Stddev 12 0.0064 ( 0.00%) 0.0095 ( -48.38%) Stddev 21 0.0065 ( 0.00%) 0.0097 ( -49.28%) Stddev 30 0.0185 ( 0.00%) 0.0295 ( -59.54%) Stddev 48 0.0559 ( 0.00%) 0.0168 ( 69.92%) Stddev 79 0.1559 ( 0.00%) 0.0278 ( 82.17%) Stddev 110 1.1728 ( 0.00%) 0.0532 ( 95.47%) Stddev 141 0.7867 ( 0.00%) 0.0968 ( 87.69%) Stddev 172 1.0255 ( 0.00%) 0.0420 ( 95.91%) Stddev 203 0.8106 ( 0.00%) 0.1384 ( 82.92%) Stddev 234 1.1949 ( 0.00%) 0.1328 ( 88.89%) Stddev 265 0.9231 ( 0.00%) 0.0820 ( 91.11%) Stddev 296 1.0456 ( 0.00%) 0.1327 ( 87.31%) Again, higher thread counts benefit and the standard deviation shows that results are also a lot more stable when the idle time is aged. The patch potentially matters when a socket was multiple LLCs as the maximum search depth is lower. However, some of the test results were suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and other results were not dramatically different to other mcahines. Given the nature of the patch, Peter's full series is not being forward ported as each part should stand on its own. Preferably they would be merged at different times to reduce the risk of false bisections. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
2021-06-17sched/cpufreq: Consider reduced CPU capacity in energy calculationLukasz Luba
Energy Aware Scheduling (EAS) needs to predict the decisions made by SchedUtil. The map_util_freq() exists to do that. There are corner cases where the max allowed frequency might be reduced (due to thermal). SchedUtil as a CPUFreq governor, is aware of that but EAS is not. This patch aims to address it. SchedUtil stores the maximum allowed frequency in 'sugov_policy::next_freq' field. EAS has to predict that value, which is the real used frequency. That value is made after a call to cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits. In the existing code EAS is not able to predict that real frequency. This leads to energy estimation errors. To avoid wrong energy estimation in EAS (due to frequency miss prediction) make sure that the step which calculates Performance Domain frequency, is also aware of the allowed CPU capacity. Furthermore, modify map_util_freq() to not extend the frequency value. Instead, use map_util_perf() to extend the util value in both places: SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity. In the end, we achieve the same desirable behavior for both subsystems and alignment in regards to the real CPU frequency. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part) Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
2021-06-17sched/fair: Take thermal pressure into account while estimating energyLukasz Luba
Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests made by the SchedUtil governor to properly estimate energy used in the future. It has to take into account CPUs utilization and forecast Performance Domain (PD) frequency. There is a corner case when the max allowed frequency might be reduced due to thermal. SchedUtil is aware of that reduced frequency, so it should be taken into account also in EAS estimations. SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping to 'policy::max'. SchedUtil is responsible to respect that upper limit while setting the frequency through CPUFreq drivers. This effective frequency is stored internally in 'sugov_policy::next_freq' and EAS has to predict that value. In the existing code the raw value of arch_scale_cpu_capacity() is used for clamping the returned CPU utilization from effective_cpu_util(). This patch fixes issue with too big single CPU utilization, by introducing clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU capacity reduced by thermal pressure raw value. Thanks to knowledge about allowed CPU capacity, we don't get too big value for a single CPU utilization, which is then added to the util sum. The util sum is used as a source of information for estimating whole PD energy. To avoid wrong energy estimation in EAS (due to capped frequency), make sure that the calculation of util sum is aware of allowed CPU capacity. This thermal pressure might be visible in scenarios where the CPUs are not heavily loaded, but some other component (like GPU) drastically reduced available power budget and increased the SoC temperature. Thus, we still use EAS for task placement and CPUs are not over-utilized. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210614191128.22735-1-lukasz.luba@arm.com
2021-06-17sched/fair: Return early from update_tg_cfs_load() if delta == 0Dietmar Eggemann
In case the _avg delta is 0 there is no need to update se's _avg (level n) nor cfs_rq's _avg (level n-1). These values stay the same. Since cfs_rq's _avg isn't changed, i.e. no load is propagated down, cfs_rq's _sum should stay the same as well. So bail out after se's _sum has been updated. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210601083616.804229-1-dietmar.eggemann@arm.com
2021-06-17sched/pelt: Check that *_avg are null when *_sum areVincent Guittot
Check that we never break the rule that pelt's avg values are null if pelt's sum are. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210601155328.19487-1-vincent.guittot@linaro.org
2021-06-17bpf: Fix up register-based shifts in interpreter to silence KUBSANDaniel Borkmann
syzbot reported a shift-out-of-bounds that KUBSAN observed in the interpreter: [...] UBSAN: shift-out-of-bounds in kernel/bpf/core.c:1420:2 shift exponent 255 is too large for 64-bit type 'long long unsigned int' CPU: 1 PID: 11097 Comm: syz-executor.4 Not tainted 5.12.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327 ___bpf_prog_run.cold+0x19/0x56c kernel/bpf/core.c:1420 __bpf_prog_run32+0x8f/0xd0 kernel/bpf/core.c:1735 bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline] bpf_prog_run_pin_on_cpu include/linux/filter.h:624 [inline] bpf_prog_run_clear_cb include/linux/filter.h:755 [inline] run_filter+0x1a1/0x470 net/packet/af_packet.c:2031 packet_rcv+0x313/0x13e0 net/packet/af_packet.c:2104 dev_queue_xmit_nit+0x7c2/0xa90 net/core/dev.c:2387 xmit_one net/core/dev.c:3588 [inline] dev_hard_start_xmit+0xad/0x920 net/core/dev.c:3609 __dev_queue_xmit+0x2121/0x2e00 net/core/dev.c:4182 __bpf_tx_skb net/core/filter.c:2116 [inline] __bpf_redirect_no_mac net/core/filter.c:2141 [inline] __bpf_redirect+0x548/0xc80 net/core/filter.c:2164 ____bpf_clone_redirect net/core/filter.c:2448 [inline] bpf_clone_redirect+0x2ae/0x420 net/core/filter.c:2420 ___bpf_prog_run+0x34e1/0x77d0 kernel/bpf/core.c:1523 __bpf_prog_run512+0x99/0xe0 kernel/bpf/core.c:1737 bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline] bpf_test_run+0x3ed/0xc50 net/bpf/test_run.c:50 bpf_prog_test_run_skb+0xabc/0x1c50 net/bpf/test_run.c:582 bpf_prog_test_run kernel/bpf/syscall.c:3127 [inline] __do_sys_bpf+0x1ea9/0x4f00 kernel/bpf/syscall.c:4406 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae [...] Generally speaking, KUBSAN reports from the kernel should be fixed. However, in case of BPF, this particular report caused concerns since the large shift is not wrong from BPF point of view, just undefined. In the verifier, K-based shifts that are >= {64,32} (depending on the bitwidth of the instruction) are already rejected. The register-based cases were not given their content might not be known at verification time. Ideas such as verifier instruction rewrite with an additional AND instruction for the source register were brought up, but regularly rejected due to the additional runtime overhead they incur. As Edward Cree rightly put it: Shifts by more than insn bitness are legal in the BPF ISA; they are implementation-defined behaviour [of the underlying architecture], rather than UB, and have been made legal for performance reasons. Each of the JIT backends compiles the BPF shift operations to machine instructions which produce implementation-defined results in such a case; the resulting contents of the register may be arbitrary but program behaviour as a whole remains defined. Guard checks in the fast path (i.e. affecting JITted code) will thus not be accepted. The case of division by zero is not truly analogous here, as division instructions on many of the JIT-targeted architectures will raise a machine exception / fault on division by zero, whereas (to the best of my knowledge) none will do so on an out-of-bounds shift. Given the KUBSAN report only affects the BPF interpreter, but not JITs, one solution is to add the ANDs with 63 or 31 into ___bpf_prog_run(). That would make the shifts defined, and thus shuts up KUBSAN, and the compiler would optimize out the AND on any CPU that interprets the shift amounts modulo the width anyway (e.g., confirmed from disassembly that on x86-64 and arm64 the generated interpreter code is the same before and after this fix). The BPF interpreter is slow path, and most likely compiled out anyway as distros select BPF_JIT_ALWAYS_ON to avoid speculative execution of BPF instructions by the interpreter. Given the main argument was to avoid sacrificing performance, the fact that the AND is optimized away from compiler for mainstream archs helps as well as a solution moving forward. Also add a comment on LSH/RSH/ARSH translation for JIT authors to provide guidance when they see the ___bpf_prog_run() interpreter code and use it as a model for a new JIT backend. Reported-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com Reported-by: Kurt Manucredo <fuzzybritches0@gmail.com> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Co-developed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com Cc: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/bpf/0000000000008f912605bd30d5d7@google.com Link: https://lore.kernel.org/bpf/bac16d8d-c174-bdc4-91bd-bfa62b410190@gmail.com
2021-06-16bpf: Fix typo in kernel/bpf/bpf_lsm.cShuyi Cheng
Fix s/sleeable/sleepable/ typo in a comment. Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1623809076-97907-1-git-send-email-chengshuyi@linux.alibaba.com
2021-06-16crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfoPingfan Liu
As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the formula: #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate PAGES_PER_SECTION in makedumpfile just like kernel. Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g. recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce SECTION_SIZE_BITS"). But user space wants a stable interface to get this info. Such info is impossible to be deduced from a crashdump vmcore. Hence append SECTION_SIZE_BITS to vmcoreinfo. Link: https://lkml.kernel.org/r/20210608103359.84907-1-kernelfans@gmail.com Link: http://lists.infradead.org/pipermail/kexec/2021-June/022676.html Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Bhupesh Sharma <bhupesh.sharma@linaro.org> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: Dave Young <dyoung@redhat.com> Cc: Boris Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: James Morse <james.morse@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Anderson <anderson@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16printk: Move EXPORT_SYMBOL() closer to vprintk definitionPunit Agrawal
Commit 28e1745b9fa2 ("printk: rename vprintk_func to vprintk") while improving readability by removing vprintk indirection, inadvertently placed the EXPORT_SYMBOL() for the newly renamed function at the end of the file. For reader sanity, and as is convention move the EXPORT_SYMBOL() declaration just after the end of the function. Fixes: 28e1745b9fa2 ("printk: rename vprintk_func to vprintk") Signed-off-by: Punit Agrawal <punitagrawal@gmail.com> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210614235635.887365-1-punitagrawal@gmail.com
2021-06-15bpf: Support socket migration by eBPF.Kuniyuki Iwashima
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, we run it for socket migration if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or net.ipv4.tcp_migrate_req is enabled. Currently, the expected_attach_type is not enforced for the BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type(). Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to select a new listener based on the child socket. migrating_sk varies depending on if it is migrating a request in the accept queue or during 3WHS. - accept_queue : sock (ESTABLISHED/SYN_RECV) - 3WHS : request_sock (NEW_SYN_RECV) In the eBPF program, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fallbacks to the random selection - SK_DROP, cancel the migration. There is a noteworthy point. We select a listening socket in three places, but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. On the other hand, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
2021-06-14bpf: Fix leakage under speculation on mispredicted branchesDaniel Borkmann
The verifier only enumerates valid control-flow paths and skips paths that are unreachable in the non-speculative domain. And so it can miss issues under speculative execution on mispredicted branches. For example, a type confusion has been demonstrated with the following crafted program: // r0 = pointer to a map array entry // r6 = pointer to readable stack slot // r9 = scalar controlled by attacker 1: r0 = *(u64 *)(r0) // cache miss 2: if r0 != 0x0 goto line 4 3: r6 = r9 4: if r0 != 0x1 goto line 6 5: r9 = *(u8 *)(r6) 6: // leak r9 Since line 3 runs iff r0 == 0 and line 5 runs iff r0 == 1, the verifier concludes that the pointer dereference on line 5 is safe. But: if the attacker trains both the branches to fall-through, such that the following is speculatively executed ... r6 = r9 r9 = *(u8 *)(r6) // leak r9 ... then the program will dereference an attacker-controlled value and could leak its content under speculative execution via side-channel. This requires to mistrain the branch predictor, which can be rather tricky, because the branches are mutually exclusive. However such training can be done at congruent addresses in user space using different branches that are not mutually exclusive. That is, by training branches in user space ... A: if r0 != 0x0 goto line C B: ... C: if r0 != 0x0 goto line D D: ... ... such that addresses A and C collide to the same CPU branch prediction entries in the PHT (pattern history table) as those of the BPF program's lines 2 and 4, respectively. A non-privileged attacker could simply brute force such collisions in the PHT until observing the attack succeeding. Alternative methods to mistrain the branch predictor are also possible that avoid brute forcing the collisions in the PHT. A reliable attack has been demonstrated, for example, using the following crafted program: // r0 = pointer to a [control] map array entry // r7 = *(u64 *)(r0 + 0), training/attack phase // r8 = *(u64 *)(r0 + 8), oob address // [...] // r0 = pointer to a [data] map array entry 1: if r7 == 0x3 goto line 3 2: r8 = r0 // crafted sequence of conditional jumps to separate the conditional // branch in line 193 from the current execution flow 3: if r0 != 0x0 goto line 5 4: if r0 == 0x0 goto exit 5: if r0 != 0x0 goto line 7 6: if r0 == 0x0 goto exit [...] 187: if r0 != 0x0 goto line 189 188: if r0 == 0x0 goto exit // load any slowly-loaded value (due to cache miss in phase 3) ... 189: r3 = *(u64 *)(r0 + 0x1200) // ... and turn it into known zero for verifier, while preserving slowly- // loaded dependency when executing: 190: r3 &= 1 191: r3 &= 2 // speculatively bypassed phase dependency 192: r7 += r3 193: if r7 == 0x3 goto exit 194: r4 = *(u8 *)(r8 + 0) // leak r4 As can be seen, in training phase (phase != 0x3), the condition in line 1 turns into false and therefore r8 with the oob address is overridden with the valid map value address, which in line 194 we can read out without issues. However, in attack phase, line 2 is skipped, and due to the cache miss in line 189 where the map value is (zeroed and later) added to the phase register, the condition in line 193 takes the fall-through path due to prior branch predictor training, where under speculation, it'll load the byte at oob address r8 (unknown scalar type at that point) which could then be leaked via side-channel. One way to mitigate these is to 'branch off' an unreachable path, meaning, the current verification path keeps following the is_branch_taken() path and we push the other branch to the verification stack. Given this is unreachable from the non-speculative domain, this branch's vstate is explicitly marked as speculative. This is needed for two reasons: i) if this path is solely seen from speculative execution, then we later on still want the dead code elimination to kick in in order to sanitize these instructions with jmp-1s, and ii) to ensure that paths walked in the non-speculative domain are not pruned from earlier walks of paths walked in the speculative domain. Additionally, for robustness, we mark the registers which have been part of the conditional as unknown in the speculative path given there should be no assumptions made on their content. The fix in here mitigates type confusion attacks described earlier due to i) all code paths in the BPF program being explored and ii) existing verifier logic already ensuring that given memory access instruction references one specific data structure. An alternative to this fix that has also been looked at in this scope was to mark aux->alu_state at the jump instruction with a BPF_JMP_TAKEN state as well as direction encoding (always-goto, always-fallthrough, unknown), such that mixing of different always-* directions themselves as well as mixing of always-* with unknown directions would cause a program rejection by the verifier, e.g. programs with constructs like 'if ([...]) { x = 0; } else { x = 1; }' with subsequent 'if (x == 1) { [...] }'. For unprivileged, this would result in only single direction always-* taken paths, and unknown taken paths being allowed, such that the former could be patched from a conditional jump to an unconditional jump (ja). Compared to this approach here, it would have two downsides: i) valid programs that otherwise are not performing any pointer arithmetic, etc, would potentially be rejected/broken, and ii) we are required to turn off path pruning for unprivileged, where both can be avoided in this work through pushing the invalid branch to the verification stack. The issue was originally discovered by Adam and Ofek, and later independently discovered and reported as a result of Benedict and Piotr's research work. Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Reported-by: Adam Morrison <mad@cs.tau.ac.il> Reported-by: Ofek Kirzner <ofekkir@gmail.com> Reported-by: Benedict Schlueter <benedict.schlueter@rub.de> Reported-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>