summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2019-10-27Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A small set of fixes for time(keeping): - Add a missing include to prevent compiler warnings. - Make the VDSO implementation of clock_getres() POSIX compliant again. A recent change dropped the NULL pointer guard which is required as NULL is a valid pointer value for this function. - Fix two function documentation typos" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Fix two trivial comments timers/sched_clock: Include local timekeeping.h for missing declarations lib/vdso: Make clock_getres() POSIX compliant again
2019-10-27Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A set of perf fixes: kernel: - Unbreak the tracking of auxiliary buffer allocations which got imbalanced causing recource limit failures. - Fix the fallout of splitting of ToPA entries which missed to shift the base entry PA correctly. - Use the correct context to lookup the AUX event when unmapping the associated AUX buffer so the event can be stopped and the buffer reference dropped. tools: - Fix buildiid-cache mode setting in copyfile_mode_ns() when copying /proc/kcore - Fix freeing id arrays in the event list so the correct event is closed. - Sync sched.h anc kvm.h headers with the kernel sources. - Link jvmti against tools/lib/ctype.o to have weak strlcpy(). - Fix multiple memory and file descriptor leaks, found by coverity in perf annotate. - Fix leaks in error handling paths in 'perf c2c', 'perf kmem', found by a static analysis tool" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/aux: Fix AUX output stopping perf/aux: Fix tracking of auxiliary trace buffer allocation perf/x86/intel/pt: Fix base for single entry topa perf kmem: Fix memory leak in compact_gfp_flags() tools headers UAPI: Sync sched.h with the kernel tools headers kvm: Sync kvm.h headers with the kernel sources tools headers kvm: Sync kvm headers with the kernel sources tools headers kvm: Sync kvm headers with the kernel sources perf c2c: Fix memory leak in build_cl_output() perf tools: Fix mode setting in copyfile_mode_ns() perf annotate: Fix multiple memory and file descriptor leaks perf tools: Fix resource leak of closedir() on the error paths perf evlist: Fix fix for freed id arrays perf jvmti: Link against tools/lib/ctype.h to have weak strlcpy()
2019-10-24Merge tag 'pm-5.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix problems related to frequency limits management in cpufreq that were introduced during the 5.3 cycle (when PM QoS had started to be used for that), fix a few issues in the OPP (operating performance points) library code and fix up the recently added haltpoll cpuidle driver. The cpufreq changes are somewhat bigger that I would like them to be at this stage of the cycle, but the problems fixed by them include crashes on boot and shutdown in some cases (among other things) and in my view it is better to address the root of the issue right away. Specifics: - Using device PM QoS of CPU devices for managing frequency limits in cpufreq does not work, so introduce frequency QoS (based on the original low-level PM QoS) for this purpose, switch cpufreq and related code over to using it and fix a race involving deferred updates of frequency limits on top of that (Rafael Wysocki, Sudeep Holla). - Avoid calling regulator_enable()/disable() from the OPP framework to avoid side-effects on boot-enabled regulators that may change their initial voltage due to performing initial voltage balancing without all restrictions from the consumers (Marek Szyprowski). - Avoid a kref management issue in the OPP library code and drop an incorrectly added lockdep_assert_held() from it (Viresh Kumar). - Make the recently added haltpoll cpuidle driver take the 'idle=' override into account as appropriate (Zhenzhong Duan)" * tag 'pm-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: opp: Reinitialize the list_kref before adding the static OPPs again cpufreq: Cancel policy update work scheduled before freeing cpuidle: haltpoll: Take 'idle=' override into account opp: core: Revert "add regulators enable and disable" PM: QoS: Drop frequency QoS types from device PM QoS cpufreq: Use per-policy frequency QoS PM: QoS: Introduce frequency QoS opp: of: drop incorrect lockdep_assert_held()
2019-10-23Merge tag 'trace-v5.4-rc3-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Two minor fixes: - A race in perf trace initialization (missing mutexes) - Minor fix to represent gfp_t in synthetic events as properly signed" * tag 'trace-v5.4-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix race in perf_trace_buf initialization tracing: Fix "gfp_t" format for synthetic events
2019-10-23posix-cpu-timers: Fix two trivial commentsYi Wang
Recent changes modified the function arguments of thread_group_sample_cputime() and task_cputimers_expired(), but forgot to update the comments. Fix it up. [ tglx: Changed the argument name of task_cputimers_expired() as the pointer points to an array of samples. ] Fixes: b7be4ef1365d ("posix-cpu-timers: Switch thread group sampling to array") Fixes: 001f7971433a ("posix-cpu-timers: Make expiry checks array based") Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1571643852-21848-1-git-send-email-wang.yi59@zte.com.cn
2019-10-23timers/sched_clock: Include local timekeeping.h for missing declarationsBen Dooks (Codethink)
Include the timekeeping.h header to get the declaration of the sched_clock_{suspend,resume} functions. Fixes the following sparse warnings: kernel/time/sched_clock.c:275:5: warning: symbol 'sched_clock_suspend' was not declared. Should it be static? kernel/time/sched_clock.c:286:6: warning: symbol 'sched_clock_resume' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191022131226.11465-1-ben.dooks@codethink.co.uk
2019-10-22perf/aux: Fix AUX output stoppingAlexander Shishkin
Commit: 8a58ddae2379 ("perf/core: Fix exclusive events' grouping") allows CAP_EXCLUSIVE events to be grouped with other events. Since all of those also happen to be AUX events (which is not the case the other way around, because arch/s390), this changes the rules for stopping the output: the AUX event may not be on its PMU's context any more, if it's grouped with a HW event, in which case it will be on that HW event's context instead. If that's the case, munmap() of the AUX buffer can't find and stop the AUX event, potentially leaving the last reference with the atomic context, which will then end up freeing the AUX buffer. This will then trip warnings: Fix this by using the context's PMU context when looking for events to stop, instead of the event's PMU context. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20191022073940.61814-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21tracing: Fix race in perf_trace_buf initializationPrateek Sood
A race condition exists while initialiazing perf_trace_buf from perf_trace_init() and perf_kprobe_init(). CPU0 CPU1 perf_trace_init() mutex_lock(&event_mutex) perf_trace_event_init() perf_trace_event_reg() total_ref_count == 0 buf = alloc_percpu() perf_trace_buf[i] = buf tp_event->class->reg() //fails perf_kprobe_init() goto fail perf_trace_event_init() perf_trace_event_reg() fail: total_ref_count == 0 total_ref_count == 0 buf = alloc_percpu() perf_trace_buf[i] = buf tp_event->class->reg() total_ref_count++ free_percpu(perf_trace_buf[i]) perf_trace_buf[i] = NULL Any subsequent call to perf_trace_event_reg() will observe total_ref_count > 0, causing the perf_trace_buf to be always NULL. This can result in perf_trace_buf getting accessed from perf_trace_buf_alloc() without being initialized. Acquiring event_mutex in perf_kprobe_init() before calling perf_trace_event_init() should fix this race. The race caused the following bug: Unable to handle kernel paging request at virtual address 0000003106f2003c Mem abort info: ESR = 0x96000045 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000045 CM = 0, WnR = 1 user pgtable: 4k pages, 39-bit VAs, pgdp = ffffffc034b9b000 [0000003106f2003c] pgd=0000000000000000, pud=0000000000000000 Internal error: Oops: 96000045 [#1] PREEMPT SMP Process syz-executor (pid: 18393, stack limit = 0xffffffc093190000) pstate: 80400005 (Nzcv daif +PAN -UAO) pc : __memset+0x20/0x1ac lr : memset+0x3c/0x50 sp : ffffffc09319fc50 __memset+0x20/0x1ac perf_trace_buf_alloc+0x140/0x1a0 perf_trace_sys_enter+0x158/0x310 syscall_trace_enter+0x348/0x7c0 el0_svc_common+0x11c/0x368 el0_svc_handler+0x12c/0x198 el0_svc+0x8/0xc Ramdumps showed the following: total_ref_count = 3 perf_trace_buf = ( 0x0 -> NULL, 0x0 -> NULL, 0x0 -> NULL, 0x0 -> NULL) Link: http://lkml.kernel.org/r/1571120245-4186-1-git-send-email-prsood@codeaurora.org Cc: stable@vger.kernel.org Fixes: e12f03d7031a9 ("perf/core: Implement the 'perf_kprobe' PMU") Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-21perf/aux: Fix tracking of auxiliary trace buffer allocationThomas Richter
The following commit from the v5.4 merge window: d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()") ... breaks auxiliary trace buffer tracking. If I run command 'perf record -e rbd000' to record samples and saving them in the **auxiliary** trace buffer then the value of 'locked_vm' becomes negative after all trace buffers have been allocated and released: During allocation the values increase: [52.250027] perf_mmap user->locked_vm:0x87 pinned_vm:0x0 ret:0 [52.250115] perf_mmap user->locked_vm:0x107 pinned_vm:0x0 ret:0 [52.250251] perf_mmap user->locked_vm:0x188 pinned_vm:0x0 ret:0 [52.250326] perf_mmap user->locked_vm:0x208 pinned_vm:0x0 ret:0 [52.250441] perf_mmap user->locked_vm:0x289 pinned_vm:0x0 ret:0 [52.250498] perf_mmap user->locked_vm:0x309 pinned_vm:0x0 ret:0 [52.250613] perf_mmap user->locked_vm:0x38a pinned_vm:0x0 ret:0 [52.250715] perf_mmap user->locked_vm:0x408 pinned_vm:0x2 ret:0 [52.250834] perf_mmap user->locked_vm:0x408 pinned_vm:0x83 ret:0 [52.250915] perf_mmap user->locked_vm:0x408 pinned_vm:0x103 ret:0 [52.251061] perf_mmap user->locked_vm:0x408 pinned_vm:0x184 ret:0 [52.251146] perf_mmap user->locked_vm:0x408 pinned_vm:0x204 ret:0 [52.251299] perf_mmap user->locked_vm:0x408 pinned_vm:0x285 ret:0 [52.251383] perf_mmap user->locked_vm:0x408 pinned_vm:0x305 ret:0 [52.251544] perf_mmap user->locked_vm:0x408 pinned_vm:0x386 ret:0 [52.251634] perf_mmap user->locked_vm:0x408 pinned_vm:0x406 ret:0 [52.253018] perf_mmap user->locked_vm:0x408 pinned_vm:0x487 ret:0 [52.253197] perf_mmap user->locked_vm:0x408 pinned_vm:0x508 ret:0 [52.253374] perf_mmap user->locked_vm:0x408 pinned_vm:0x589 ret:0 [52.253550] perf_mmap user->locked_vm:0x408 pinned_vm:0x60a ret:0 [52.253726] perf_mmap user->locked_vm:0x408 pinned_vm:0x68b ret:0 [52.253903] perf_mmap user->locked_vm:0x408 pinned_vm:0x70c ret:0 [52.254084] perf_mmap user->locked_vm:0x408 pinned_vm:0x78d ret:0 [52.254263] perf_mmap user->locked_vm:0x408 pinned_vm:0x80e ret:0 The value of user->locked_vm increases to a limit then the memory is tracked by pinned_vm. During deallocation the size is subtracted from pinned_vm until it hits a limit. Then a larger value is subtracted from locked_vm leading to a large number (because of type unsigned): [64.267797] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x78d [64.267826] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x70c [64.267848] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x68b [64.267869] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x60a [64.267891] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x589 [64.267911] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x508 [64.267933] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x487 [64.267952] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x406 [64.268883] perf_mmap_close mmap_user->locked_vm:0x307 pinned_vm:0x406 [64.269117] perf_mmap_close mmap_user->locked_vm:0x206 pinned_vm:0x406 [64.269433] perf_mmap_close mmap_user->locked_vm:0x105 pinned_vm:0x406 [64.269536] perf_mmap_close mmap_user->locked_vm:0x4 pinned_vm:0x404 [64.269797] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff84 pinned_vm:0x303 [64.270105] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff04 pinned_vm:0x202 [64.270374] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe84 pinned_vm:0x101 [64.270628] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe04 pinned_vm:0x0 This value sticks for the user until system is rebooted, causing follow-on system calls using locked_vm resource limit to fail. Note: There is no issue using the normal trace buffer. In fact the issue is in perf_mmap_close(). During allocation auxiliary trace buffer memory is either traced as 'extra' and added to 'pinned_vm' or trace as 'user_extra' and added to 'locked_vm'. This applies for normal trace buffers and auxiliary trace buffer. However in function perf_mmap_close() all auxiliary trace buffer is subtraced from 'locked_vm' and never from 'pinned_vm'. This breaks the ballance. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: gor@linux.ibm.com Cc: hechaol@fb.com Cc: heiko.carstens@de.ibm.com Cc: linux-perf-users@vger.kernel.org Cc: songliubraving@fb.com Fixes: d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()") Link: https://lkml.kernel.org/r/20191021083354.67868-1-tmricht@linux.ibm.com [ Minor readability edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21PM: QoS: Introduce frequency QoSRafael J. Wysocki
Introduce frequency QoS, based on the "raw" low-level PM QoS, to represent min and max frequency requests and aggregate constraints. The min and max frequency requests are to be represented by struct freq_qos_request objects and the aggregate constraints are to be represented by struct freq_constraints objects. The latter are expected to be initialized with the help of freq_constraints_init(). The freq_qos_read_value() helper is defined to retrieve the aggregate constraints values from a given struct freq_constraints object and there are the freq_qos_add_request(), freq_qos_update_request() and freq_qos_remove_request() helpers to manipulate the min and max frequency requests. It is assumed that the the helpers will not run concurrently with each other for the same struct freq_qos_request object, so if that may be the case, their uses must ensure proper synchronization between them (e.g. through locking). In addition, freq_qos_add_notifier() and freq_qos_remove_notifier() are provided to add and remove notifiers that will trigger on aggregate constraint changes to and from a given struct freq_constraints object, respectively. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-10-20Merge tag 'kbuild-fixes-v5.4-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild fixes from Masahiro Yamada: - fix a bashism of setlocalversion - do not use the too new --sort option of tar * tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kheaders: substituting --sort in archive creation scripts: setlocalversion: fix a bashism kbuild: update comment about KBUILD_ALLDIRS
2019-10-20Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull hrtimer fixlet from Thomas Gleixner: "A single commit annotating the lockcless access to timer->base with READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Annotate lockless access to timer->base
2019-10-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull stop-machine fix from Thomas Gleixner: "A single fix, amending stop machine with WRITE/READ_ONCE() to address the fallout of KCSAN" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: stop_machine: Avoid potential race behaviour
2019-10-19kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe registerSong Liu
Attaching uprobe to text section in THP splits the PMD mapped page table into PTE mapped entries. On uprobe detach, we would like to regroup PMD mapped page table entry to regain performance benefit of THP. However, the regroup is broken For perf_event based trace_uprobe. This is because perf_event based trace_uprobe calls uprobe_unregister twice on close: first in TRACE_REG_PERF_CLOSE, then in TRACE_REG_PERF_UNREGISTER. The second call will split the PMD mapped page table entry, which is not the desired behavior. Fix this by only use FOLL_SPLIT_PMD for uprobe register case. Add a WARN() to confirm uprobe unregister never work on huge pages, and abort the operation when this WARN() triggers. Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-18tracing: Fix "gfp_t" format for synthetic eventsZhengjun Xing
In the format of synthetic events, the "gfp_t" is shown as "signed:1", but in fact the "gfp_t" is "unsigned", should be shown as "signed:0". The issue can be reproduced by the following commands: echo 'memlatency u64 lat; unsigned int order; gfp_t gfp_flags; int migratetype' > /sys/kernel/debug/tracing/synthetic_events cat /sys/kernel/debug/tracing/events/synthetic/memlatency/format name: memlatency ID: 2233 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:u64 lat; offset:8; size:8; signed:0; field:unsigned int order; offset:16; size:4; signed:0; field:gfp_t gfp_flags; offset:24; size:4; signed:1; field:int migratetype; offset:32; size:4; signed:1; print fmt: "lat=%llu, order=%u, gfp_flags=%x, migratetype=%d", REC->lat, REC->order, REC->gfp_flags, REC->migratetype Link: http://lkml.kernel.org/r/20191018012034.6404-1-zhengjun.xing@linux.intel.com Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-18Merge tag 'pm-5.4-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These include a fix for a recent regression in the ACPI CPU performance scaling code, a PCI device power management fix, a system shutdown fix related to cpufreq, a removal of an ACPI suspend-to-idle blacklist entry and a build warning fix. Specifics: - Fix possible NULL pointer dereference in the ACPI processor scaling initialization code introduced by a recent cpufreq update (Rafael Wysocki). - Fix possible deadlock due to suspending cpufreq too late during system shutdown (Rafael Wysocki). - Make the PCI device system resume code path be more consistent with its PM-runtime counterpart to fix an issue with missing delay on transitions from D3cold to D0 during system resume from suspend-to-idle on some systems (Rafael Wysocki). - Drop Dell XPS13 9360 from the LPS0 Idle _DSM blacklist to make it use suspend-to-idle by default (Mario Limonciello). - Fix build warning in the core system suspend support code (Ben Dooks)" * tag 'pm-5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: processor: Avoid NULL pointer dereferences at init time PCI: PM: Fix pci_power_up() PM: sleep: include <linux/pm_runtime.h> for pm_wq cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist
2019-10-18Merge branches 'pm-cpufreq' and 'pm-sleep'Rafael J. Wysocki
* pm-cpufreq: ACPI: processor: Avoid NULL pointer dereferences at init time cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown * pm-sleep: PM: sleep: include <linux/pm_runtime.h> for pm_wq ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist
2019-10-17stop_machine: Avoid potential race behaviourMark Rutland
Both multi_cpu_stop() and set_state() access multi_stop_data::state racily using plain accesses. These are subject to compiler transformations which could break the intended behaviour of the code, and this situation is detected by KCSAN on both arm64 and x86 (splats below). Improve matters by using READ_ONCE() and WRITE_ONCE() to ensure that the compiler cannot elide, replay, or tear loads and stores. In multi_cpu_stop() the two loads of multi_stop_data::state are expected to be a consistent value, so snapshot the value into a temporary variable to ensure this. The state transitions are serialized by atomic manipulation of multi_stop_data::num_threads, and other fields in multi_stop_data are not modified while subject to concurrent reads. KCSAN splat on arm64: | BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0 | | write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3: | set_state+0x80/0xb0 | multi_cpu_stop+0x16c/0x198 | cpu_stopper_thread+0x170/0x298 | smpboot_thread_fn+0x40c/0x560 | kthread+0x1a8/0x1b0 | ret_from_fork+0x10/0x18 | | read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1: | multi_cpu_stop+0xa8/0x198 | cpu_stopper_thread+0x170/0x298 | smpboot_thread_fn+0x40c/0x560 | kthread+0x1a8/0x1b0 | ret_from_fork+0x10/0x18 | | Reported by Kernel Concurrency Sanitizer on: | CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3 | Hardware name: linux,dummy-virt (DT) KCSAN splat on x86: | write to 0xffffb0bac0013e18 of 4 bytes by task 19 on cpu 2: | set_state kernel/stop_machine.c:170 [inline] | ack_state kernel/stop_machine.c:177 [inline] | multi_cpu_stop+0x1a4/0x220 kernel/stop_machine.c:227 | cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516 | smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165 | kthread+0x1b5/0x200 kernel/kthread.c:255 | ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352 | | read to 0xffffb0bac0013e18 of 4 bytes by task 44 on cpu 7: | multi_cpu_stop+0xb4/0x220 kernel/stop_machine.c:213 | cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516 | smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165 | kthread+0x1b5/0x200 kernel/kthread.c:255 | ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352 | | Reported by Kernel Concurrency Sanitizer on: | CPU: 7 PID: 44 Comm: migration/7 Not tainted 5.3.0+ #1 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20191007104536.27276-1-mark.rutland@arm.com
2019-10-17kheaders: substituting --sort in archive creationDmitry Goldin
The option --sort=ORDER was only introduced in tar 1.28 (2014), which is rather new and might not be available in some setups. This patch tries to replicate the previous behaviour as closely as possible to fix the kheaders build for older environments. It does not produce identical archives compared to the previous version due to minor sorting differences but produces reproducible results itself in my tests. Reported-by: Andreas Schwab <schwab@suse.de> Signed-off-by: Dmitry Goldin <dgoldin+lkml@protonmail.ch> Tested-by: Andreas Schwab <schwab@suse.de> Tested-by: Quentin Perret <qperret@google.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-10-16kthread: make __kthread_queue_delayed_work staticBen Dooks
The __kthread_queue_delayed_work is not exported so make it static, to avoid the following sparse warning: kernel/kthread.c:869:6: warning: symbol '__kthread_queue_delayed_work' was not declared. Should it be static? Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-15Merge branch 'parisc-5.4-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc fixes from Helge Deller: - Fix a parisc-specific fallout of Christoph's dma_set_mask_and_coherent() patches (Sven) - Fix a vmap memory leak in ioremap()/ioremap() (Helge) - Some minor cleanups and documentation updates (Nick, Helge) * 'parisc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Remove 32-bit DMA enforcement from sba_iommu parisc: Fix vmap memory leak in ioremap()/iounmap() parisc: prefer __section from compiler_attributes.h parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define MAINTAINERS: Add hp_sdc drivers to parisc arch
2019-10-14parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ defineHelge Deller
Signed-off-by: Helge Deller <deller@gmx.de>
2019-10-14hrtimer: Annotate lockless access to timer->baseEric Dumazet
Followup to commit dd2261ed45aa ("hrtimer: Protect lockless access to timer->base") lock_hrtimer_base() fetches timer->base without lock exclusion. Compiler is allowed to read timer->base twice (even if considered dumb) which could end up trying to lock migration_base and return &migration_base. base = timer->base; if (likely(base != &migration_base)) { /* compiler reads timer->base again, and now (base == &migration_base) raw_spin_lock_irqsave(&base->cpu_base->lock, *flags); if (likely(base == timer->base)) return base; /* == &migration_base ! */ Similarly the write sides must use WRITE_ONCE() to avoid store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191008173204.180879-1-edumazet@google.com
2019-10-13Merge tag 'trace-v5.4-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "A few tracing fixes: - Remove lockdown from tracefs itself and moved it to the trace directory. Have the open functions there do the lockdown checks. - Fix a few races with opening an instance file and the instance being deleted (Discovered during the lockdown updates). Kept separate from the clean up code such that they can be backported to stable easier. - Clean up and consolidated the checks done when opening a trace file, as there were multiple checks that need to be done, and it did not make sense having them done in each open instance. - Fix a regression in the record mcount code. - Small hw_lat detector tracer fixes. - A trace_pipe read fix due to not initializing trace_seq" * tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Initialize iter->seq after zeroing in tracing_read_pipe() tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency tracing/hwlat: Report total time spent in all NMIs during the sample recordmcount: Fix nop_mcount() function tracing: Do not create tracefs files if tracefs lockdown is in effect tracing: Add locked_down checks to the open calls of files created for tracefs tracing: Add tracing_check_open_get_tr() tracing: Have trace events system open call tracing_open_generic_tr() tracing: Get trace_array reference for available_tracers files ftrace: Get a reference counter for the trace_array on filter files tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
2019-10-12tracing: Initialize iter->seq after zeroing in tracing_read_pipe()Petr Mladek
A customer reported the following softlockup: [899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464] [899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4 [899688.160002] RIP: 0010:up_write+0x1a/0x30 [899688.160002] Kernel panic - not syncing: softlockup: hung tasks [899688.160002] RIP: 0010:up_write+0x1a/0x30 [899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12 [899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000 [899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00 [899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000 [899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8 [899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000 [899688.160002] tracing_read_pipe+0x336/0x3c0 [899688.160002] __vfs_read+0x26/0x140 [899688.160002] vfs_read+0x87/0x130 [899688.160002] SyS_read+0x42/0x90 [899688.160002] do_syscall_64+0x74/0x160 It caught the process in the middle of trace_access_unlock(). There is no loop. So, it must be looping in the caller tracing_read_pipe() via the "waitagain" label. Crashdump analyze uncovered that iter->seq was completely zeroed at this point, including iter->seq.seq.size. It means that print_trace_line() was never able to print anything and there was no forward progress. The culprit seems to be in the code: /* reset all but tr, trace, and overruns */ memset(&iter->seq, 0, sizeof(struct trace_iterator) - offsetof(struct trace_iterator, seq)); It was added by the commit 53d0aa773053ab182877 ("ftrace: add logic to record overruns"). It was v2.6.27-rc1. It was the time when iter->seq looked like: struct trace_seq { unsigned char buffer[PAGE_SIZE]; unsigned int len; }; There was no "size" variable and zeroing was perfectly fine. The solution is to reinitialize the structure after or without zeroing. Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12tracing/hwlat: Don't ignore outer-loop duration when calculating max_latencySrivatsa S. Bhat (VMware)
max_latency is intended to record the maximum ever observed hardware latency, which may occur in either part of the loop (inner/outer). So we need to also consider the outer-loop sample when updating max_latency. Link: http://lkml.kernel.org/r/157073345463.17189.18124025522664682811.stgit@srivatsa-ubuntu Fixes: e7c15cd8a113 ("tracing: Added hardware latency tracer") Cc: stable@vger.kernel.org Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12tracing/hwlat: Report total time spent in all NMIs during the sampleSrivatsa S. Bhat (VMware)
nmi_total_ts is supposed to record the total time spent in *all* NMIs that occur on the given CPU during the (active portion of the) sampling window. However, the code seems to be overwriting this variable for each NMI, thereby only recording the time spent in the most recent NMI. Fix it by accumulating the duration instead. Link: http://lkml.kernel.org/r/157073343544.17189.13911783866738671133.stgit@srivatsa-ubuntu Fixes: 7b2c86250122 ("tracing: Add NMI tracing in hwlat detector") Cc: stable@vger.kernel.org Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12tracing: Add locked_down checks to the open calls of files created for tracefsSteven Rostedt (VMware)
Added various checks on open tracefs calls to see if tracefs is in lockdown mode, and if so, to return -EPERM. Note, the event format files (which are basically standard on all machines) as well as the enabled_functions file (which shows what is currently being traced) are not lockde down. Perhaps they should be, but it seems counter intuitive to lockdown information to help you know if the system has been modified. Link: http://lkml.kernel.org/r/CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12tracing: Add tracing_check_open_get_tr()Steven Rostedt (VMware)
Currently, most files in the tracefs directory test if tracing_disabled is set. If so, it should return -ENODEV. The tracing_disabled is called when tracing is found to be broken. Originally it was done in case the ring buffer was found to be corrupted, and we wanted to prevent reading it from crashing the kernel. But it's also called if a tracing selftest fails on boot. It's a one way switch. That is, once it is triggered, tracing is disabled until reboot. As most tracefs files can also be used by instances in the tracefs directory, they need to be carefully done. Each instance has a trace_array associated to it, and when the instance is removed, the trace_array is freed. But if an instance is opened with a reference to the trace_array, then it requires looking up the trace_array to get its ref counter (as there could be a race with it being deleted and the open itself). Once it is found, a reference is added to prevent the instance from being removed (and the trace_array associated with it freed). Combine the two checks (tracing_disabled and trace_array_get()) into a single helper function. This will also make it easier to add lockdown to tracefs later. Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12tracing: Have trace events system open call tracing_open_generic_tr()Steven Rostedt (VMware)
Instead of having the trace events system open call open code the taking of the trace_array descriptor (with trace_array_get()) and then calling trace_open_generic(), have it use the tracing_open_generic_tr() that does the combination of the two. This requires making tracing_open_generic_tr() global. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12tracing: Get trace_array reference for available_tracers filesSteven Rostedt (VMware)
As instances may have different tracers available, we need to look at the trace_array descriptor that shows the list of the available tracers for the instance. But there's a race between opening the file and an admin deleting the instance. The trace_array_get() needs to be called before accessing the trace_array. Cc: stable@vger.kernel.org Fixes: 607e2ea167e56 ("tracing: Set up infrastructure to allow tracers for instances") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12ftrace: Get a reference counter for the trace_array on filter filesSteven Rostedt (VMware)
The ftrace set_ftrace_filter and set_ftrace_notrace files are specific for an instance now. They need to take a reference to the instance otherwise there could be a race between accessing the files and deleting the instance. It wasn't until the :mod: caching where these file operations started referencing the trace_array directly. Cc: stable@vger.kernel.org Fixes: 673feb9d76ab3 ("ftrace: Add :mod: caching infrastructure to trace_array") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Two fixes: a guest-cputime accounting fix, and a cgroup bandwidth quota precision fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/vtime: Fix guest/system mis-accounting on task switch sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
2019-10-12Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Mostly tooling fixes, but also a couple of updates for new Intel models (which are technically hw-enablement, but to users it's a fix to perf behavior on those new CPUs - hope this is fine), an AUX inheritance fix, event time-sharing fix, and a fix for lost non-perf NMI events on AMD systems" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) perf/x86/cstate: Add Tiger Lake CPU support perf/x86/msr: Add Tiger Lake CPU support perf/x86/intel: Add Tiger Lake CPU support perf/x86/cstate: Update C-state counters for Ice Lake perf/x86/msr: Add new CPU model numbers for Ice Lake perf/x86/cstate: Add Comet Lake CPU support perf/x86/msr: Add Comet Lake CPU support perf/x86/intel: Add Comet Lake CPU support perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp perf/core: Fix corner case in perf_rotate_context() perf/core: Rework memory accounting in perf_mmap() perf/core: Fix inheritance of aux_output groups perf annotate: Don't return -1 for error when doing BPF disassembly perf annotate: Return appropriate error code for allocation failures perf annotate: Fix arch specific ->init() failure errors perf annotate: Propagate the symbol__annotate() error return perf annotate: Fix the signedness of failure returns perf annotate: Propagate perf_env__arch() error perf evsel: Fall back to global 'perf_env' in perf_evsel__env() perf tools: Propagate get_cpuid() error ...
2019-10-11Merge tag 'for-linus-20191010' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Fix wbt performance regression introduced with the blk-rq-qos refactoring (Harshad) - Fix io_uring fileset removal inadvertently killing the workqueue (me) - Fix io_uring typo in linked command nonblock submission (Pavel) - Remove spurious io_uring wakeups on request free (Pavel) - Fix null_blk zoned command error return (Keith) - Don't use freezable workqueues for backing_dev, also means we can revert a previous libata hack (Mika) - Fix nbd sysfs mutex dropped too soon at removal time (Xiubo) * tag 'for-linus-20191010' of git://git.kernel.dk/linux-block: nbd: fix possible sysfs duplicate warning null_blk: Fix zoned command return code io_uring: only flush workqueues on fileset removal io_uring: remove wait loop spurious wakeups blk-wbt: fix performance regression in wbt scale_up/scale_down Revert "libata, freezer: avoid block device removal while system is frozen" bdi: Do not use freezable workqueue io_uring: fix reversed nonblock flag for link submission
2019-10-10PM: sleep: include <linux/pm_runtime.h> for pm_wqBen Dooks
Include the <linux/runtime_pm.h> for the definition of pm_wq to avoid the following warning: kernel/power/main.c:890:25: warning: symbol 'pm_wq' was not declared. Should it be static? Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-09perf/core: Fix corner case in perf_rotate_context()Song Liu
In perf_rotate_context(), when the first cpu flexible event fail to schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus cpuctx->ctx.is_active will have EVENT_FLEXIBLE set. Then, the next perf_event_sched_in() will skip all cpu flexible events because of the EVENT_FLEXIBLE bit. In the next call of perf_rotate_context(), cpu_rotate stays 1, and cpu_event stays NULL, so this process repeats. The end result is, flexible events on this cpu will not be scheduled (until another event being added to the cpuctx). Here is an easy repro of this issue. On Intel CPUs, where ref-cycles could only use one counter, run one pinned event for ref-cycles, one flexible event for ref-cycles, and one flexible event for cycles. The flexible ref-cycles is never scheduled, which is expected. However, because of this issue, the cycles event is never scheduled either. $ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000 time counts unit events 1.000152973 15,412,480 ref-cycles:D 1.000152973 <not counted> ref-cycles (0.00%) 1.000152973 <not counted> cycles (0.00%) 2.000486957 18,263,120 ref-cycles:D 2.000486957 <not counted> ref-cycles (0.00%) 2.000486957 <not counted> cycles (0.00%) To fix this, when the flexible_active list is empty, try rotate the first event in the flexible_groups. Also, rename ctx_first_active() to ctx_event_to_rotate(), which is more accurate. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <kernel-team@fb.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling") Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09perf/core: Rework memory accounting in perf_mmap()Song Liu
perf_mmap() always increases user->locked_vm. As a result, "extra" could grow bigger than "user_extra", which doesn't make sense. Here is an example case: (Note: Assume "user_lock_limit" is very small.) | # of perf_mmap calls |vma->vm_mm->pinned_vm|user->locked_vm| | 0 | 0 | 0 | | 1 | user_extra | user_extra | | 2 | 3 * user_extra | 2 * user_extra| | 3 | 6 * user_extra | 3 * user_extra| | 4 | 10 * user_extra | 4 * user_extra| Fix this by maintaining proper user_extra and extra. Reviewed-By: Hechao Li <hechaol@fb.com> Reported-by: Hechao Li <hechaol@fb.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <kernel-team@fb.com> Cc: Jie Meng <jmeng@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09sched/vtime: Fix guest/system mis-accounting on task switchFrederic Weisbecker
vtime_account_system() assumes that the target task to account cputime to is always the current task. This is most often true indeed except on task switch where we call: vtime_common_task_switch(prev) vtime_account_system(prev) Here prev is the scheduling-out task where we account the cputime to. It doesn't match current that is already the scheduling-in task at this stage of the context switch. So we end up checking the wrong task flags to determine if we are accounting guest or system time to the previous task. As a result the wrong task is used to check if the target is running in guest mode. We may then spuriously account or leak either system or guest time on task switch. Fix this assumption and also turn vtime_guest_enter/exit() to use the task passed in parameter as well to avoid future similar issues. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpengli@tencent.com> Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource") Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09sched/fair: Scale bandwidth quota and period without losing quota/period ↵Xuewei Zhang
ratio precision The quota/period ratio is used to ensure a child task group won't get more bandwidth than the parent task group, and is calculated as: normalized_cfs_quota() = [(quota_us << 20) / period_us] If the quota/period ratio was changed during this scaling due to precision loss, it will cause inconsistency between parent and child task groups. See below example: A userspace container manager (kubelet) does three operations: 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us. 2) Create a few children cgroups. 3) Set quota to 1,000us and period to 10,000us on a child cgroup. These operations are expected to succeed. However, if the scaling of 147/128 happens before step 3, quota and period of the parent cgroup will be changed: new_quota: 1148437ns, 1148us new_period: 11484375ns, 11484us And when step 3 comes in, the ratio of the child cgroup will be 104857, which will be larger than the parent cgroup ratio (104821), and will fail. Scaling them by a factor of 2 will fix the problem. Tested-by: Phil Auld <pauld@redhat.com> Signed-off-by: Xuewei Zhang <xueweiz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Phil Auld <pauld@redhat.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup") Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "The usual shower of hotfixes. Chris's memcg patches aren't actually fixes - they're mature but a few niggling review issues were late to arrive. The ocfs2 fixes are quite old - those took some time to get reviewer attention. Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg, mm/slab-generic" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) mm, sl[ou]b: improve memory accounting mm, memcg: make scan aggression always exclude protection mm, memcg: make memory.emin the baseline for utilisation determination mm, memcg: proportional memory.{low,min} reclaim mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() mm/page_alloc.c: fix a crash in free_pages_prepare() mm/z3fold.c: claim page in the beginning of free kernel/sysctl.c: do not override max_threads provided by userspace memcg: only record foreign writebacks with dirty pages when memcg is not disabled mm: fix -Wmissing-prototypes warnings writeback: fix use-after-free in finish_writeback_work() mm/memremap: drop unused SECTION_SIZE and SECTION_MASK panic: ensure preemption is disabled during panic() fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc() fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock() fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry() ocfs2: clear zero in unaligned direct IO
2019-10-07kernel/sysctl.c: do not override max_threads provided by userspaceMichal Hocko
Partially revert 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits") because the patch is causing a regression to any workload which needs to override the auto-tuning of the limit provided by kernel. set_max_threads is implementing a boot time guesstimate to provide a sensible limit of the concurrently running threads so that runaways will not deplete all the memory. This is a good thing in general but there are workloads which might need to increase this limit for an application to run (reportedly WebSpher MQ is affected) and that is simply not possible after the mentioned change. It is also very dubious to override an admin decision by an estimation that doesn't have any direct relation to correctness of the kernel operation. Fix this by dropping set_max_threads from sysctl_max_threads so any value is accepted as long as it fits into MAX_THREADS which is important to check because allowing more threads could break internal robust futex restriction. While at it, do not use MIN_THREADS as the lower boundary because it is also only a heuristic for automatic estimation and admin might have a good reason to stop new threads to be created even when below this limit. This became more severe when we switched x86 from 4k to 8k kernel stacks. Starting since 6538b8ea886e ("x86_64: expand kernel stack to 16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned value. In the particular case 3.12 kernel.threads-max = 515561 4.4 kernel.threads-max = 200000 Neither of the two values is really insane on 32GB machine. I am not sure we want/need to tune the max_thread value further. If anything the tuning should be removed altogether if proven not useful in general. But we definitely need a way to override this auto-tuning. Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz Fixes: 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits") Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07panic: ensure preemption is disabled during panic()Will Deacon
Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the calling CPU in an infinite loop, but with interrupts and preemption enabled. From this state, userspace can continue to be scheduled, despite the system being "dead" as far as the kernel is concerned. This is easily reproducible on arm64 when booting with "nosmp" on the command line; a couple of shell scripts print out a periodic "Ping" message whilst another triggers a crash by writing to /proc/sysrq-trigger: | sysrq: Trigger a crash | Kernel panic - not syncing: sysrq triggered crash | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1 | Hardware name: linux,dummy-virt (DT) | Call trace: | dump_backtrace+0x0/0x148 | show_stack+0x14/0x20 | dump_stack+0xa0/0xc4 | panic+0x140/0x32c | sysrq_handle_reboot+0x0/0x20 | __handle_sysrq+0x124/0x190 | write_sysrq_trigger+0x64/0x88 | proc_reg_write+0x60/0xa8 | __vfs_write+0x18/0x40 | vfs_write+0xa4/0x1b8 | ksys_write+0x64/0xf0 | __arm64_sys_write+0x14/0x20 | el0_svc_common.constprop.0+0xb0/0x168 | el0_svc_handler+0x28/0x78 | el0_svc+0x8/0xc | Kernel Offset: disabled | CPU features: 0x0002,24002004 | Memory Limit: none | ---[ end Kernel panic - not syncing: sysrq triggered crash ]--- | Ping 2! | Ping 1! | Ping 1! | Ping 2! The issue can also be triggered on x86 kernels if CONFIG_SMP=n, otherwise local interrupts are disabled in 'smp_send_stop()'. Disable preemption in 'panic()' before re-enabling interrupts. Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone Signed-off-by: Will Deacon <will@kernel.org> Reported-by: Xogium <contact@xogium.me> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Feng Tang <feng.tang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07perf/core: Fix inheritance of aux_output groupsAlexander Shishkin
Commit: ab43762ef010 ("perf: Allow normal events to output AUX data") forgets to configure aux_output relation in the inherited groups, which results in child PEBS events forever failing to schedule. Fix this by setting up the AUX output link in the inheritance path. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191004125729.32397-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-06Merge tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds
Pull dma-mapping regression fix from Christoph Hellwig: "Revert an incorret hunk from a patch that caused problems on various arm boards (Andrey Smirnov)" * tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: fix false positive warnings in dma_common_free_remap()
2019-10-06Revert "libata, freezer: avoid block device removal while system is frozen"Mika Westerberg
This reverts commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557. The commit was added as a quick band-aid for a hang that happened when a block device was removed during system suspend. Now that bdi_wq is not freezable anymore the hang should not be possible and we can get rid of this hack by reverting it. Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-05Merge tag 'kbuild-fixes-v5.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - remove unneeded ar-option and KBUILD_ARFLAGS - remove long-deprecated SUBDIRS - fix modpost to suppress false-positive warnings for UML builds - fix namespace.pl to handle relative paths to ${objtree}, ${srctree} - make setlocalversion work for /bin/sh - make header archive reproducible - fix some Makefiles and documents * tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kheaders: make headers archive reproducible kbuild: update compile-test header list for v5.4-rc2 kbuild: two minor updates for Documentation/kbuild/modules.rst scripts/setlocalversion: clear local variable to make it work for sh namespace: fix namespace.pl script to support relative paths video/logo: do not generate unneeded logo C files video/logo: remove unneeded *.o pattern from clean-files integrity: remove pointless subdir-$(CONFIG_...) integrity: remove unneeded, broken attempt to add -fshort-wchar modpost: fix static EXPORT_SYMBOL warnings for UML build kbuild: correct formatting of header in kbuild module docs kbuild: remove SUBDIRS support kbuild: remove ar-option and KBUILD_ARFLAGS
2019-10-05dma-mapping: fix false positivse warnings in dma_common_free_remap()Andrey Smirnov
Commit 5cf4537975bb ("dma-mapping: introduce a dma_common_find_pages helper") changed invalid input check in dma_common_free_remap() from: if (!area || !area->flags != VM_DMA_COHERENT) to if (!area || !area->flags != VM_DMA_COHERENT || !area->pages) which seem to produce false positives for memory obtained via dma_common_contiguous_remap() This triggers the following warning message when doing "reboot" on ZII VF610 Dev Board Rev B: WARNING: CPU: 0 PID: 1 at kernel/dma/remap.c:112 dma_common_free_remap+0x88/0x8c trying to free invalid coherent area: 9ef82980 Modules linked in: CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.3.0-rc6-next-20190820 #119 Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree) Backtrace: [<8010d1ec>] (dump_backtrace) from [<8010d588>] (show_stack+0x20/0x24) r7:8015ed78 r6:00000009 r5:00000000 r4:9f4d9b14 [<8010d568>] (show_stack) from [<8077e3f0>] (dump_stack+0x24/0x28) [<8077e3cc>] (dump_stack) from [<801197a0>] (__warn.part.3+0xcc/0xe4) [<801196d4>] (__warn.part.3) from [<80119830>] (warn_slowpath_fmt+0x78/0x94) r6:00000070 r5:808e540c r4:81c03048 [<801197bc>] (warn_slowpath_fmt) from [<8015ed78>] (dma_common_free_remap+0x88/0x8c) r3:9ef82980 r2:808e53e0 r7:00001000 r6:a0b1e000 r5:a0b1e000 r4:00001000 [<8015ecf0>] (dma_common_free_remap) from [<8010fa9c>] (remap_allocator_free+0x60/0x68) r5:81c03048 r4:9f4d9b78 [<8010fa3c>] (remap_allocator_free) from [<801100d0>] (__arm_dma_free.constprop.3+0xf8/0x148) r5:81c03048 r4:9ef82900 [<8010ffd8>] (__arm_dma_free.constprop.3) from [<80110144>] (arm_dma_free+0x24/0x2c) r5:9f563410 r4:80110120 [<80110120>] (arm_dma_free) from [<8015d80c>] (dma_free_attrs+0xa0/0xdc) [<8015d76c>] (dma_free_attrs) from [<8020f3e4>] (dma_pool_destroy+0xc0/0x154) r8:9efa8860 r7:808f02f0 r6:808f02d0 r5:9ef82880 r4:9ef82780 [<8020f324>] (dma_pool_destroy) from [<805525d0>] (ehci_mem_cleanup+0x6c/0x150) r7:9f563410 r6:9efa8810 r5:00000000 r4:9efd0148 [<80552564>] (ehci_mem_cleanup) from [<80558e0c>] (ehci_stop+0xac/0xc0) r5:9efd0148 r4:9efd0000 [<80558d60>] (ehci_stop) from [<8053c4bc>] (usb_remove_hcd+0xf4/0x1b0) r7:9f563410 r6:9efd0074 r5:81c03048 r4:9efd0000 [<8053c3c8>] (usb_remove_hcd) from [<8056361c>] (host_stop+0x48/0xb8) r7:9f563410 r6:9efd0000 r5:9f5f4040 r4:9f5f5040 [<805635d4>] (host_stop) from [<80563d0c>] (ci_hdrc_host_destroy+0x34/0x38) r7:9f563410 r6:9f5f5040 r5:9efa8800 r4:9f5f4040 [<80563cd8>] (ci_hdrc_host_destroy) from [<8055ef18>] (ci_hdrc_remove+0x50/0x10c) [<8055eec8>] (ci_hdrc_remove) from [<804a2ed8>] (platform_drv_remove+0x34/0x4c) r7:9f563410 r6:81c4f99c r5:9efa8810 r4:9efa8810 [<804a2ea4>] (platform_drv_remove) from [<804a18a8>] (device_release_driver_internal+0xec/0x19c) r5:00000000 r4:9efa8810 [<804a17bc>] (device_release_driver_internal) from [<804a1978>] (device_release_driver+0x20/0x24) r7:9f563410 r6:81c41ed0 r5:9efa8810 r4:9f4a1dac [<804a1958>] (device_release_driver) from [<804a01b8>] (bus_remove_device+0xdc/0x108) [<804a00dc>] (bus_remove_device) from [<8049c204>] (device_del+0x150/0x36c) r7:9f563410 r6:81c03048 r5:9efa8854 r4:9efa8810 [<8049c0b4>] (device_del) from [<804a3368>] (platform_device_del.part.2+0x20/0x84) r10:9f563414 r9:809177e0 r8:81cb07dc r7:81c78320 r6:9f563454 r5:9efa8800 r4:9efa8800 [<804a3348>] (platform_device_del.part.2) from [<804a3420>] (platform_device_unregister+0x28/0x34) r5:9f563400 r4:9efa8800 [<804a33f8>] (platform_device_unregister) from [<8055dce0>] (ci_hdrc_remove_device+0x1c/0x30) r5:9f563400 r4:00000001 [<8055dcc4>] (ci_hdrc_remove_device) from [<805652ac>] (ci_hdrc_imx_remove+0x38/0x118) r7:81c78320 r6:9f563454 r5:9f563410 r4:9f541010 [<8056538c>] (ci_hdrc_imx_shutdown) from [<804a2970>] (platform_drv_shutdown+0x2c/0x30) [<804a2944>] (platform_drv_shutdown) from [<8049e4fc>] (device_shutdown+0x158/0x1f0) [<8049e3a4>] (device_shutdown) from [<8013ac80>] (kernel_restart_prepare+0x44/0x48) r10:00000058 r9:9f4d8000 r8:fee1dead r7:379ce700 r6:81c0b280 r5:81c03048 r4:00000000 [<8013ac3c>] (kernel_restart_prepare) from [<8013ad14>] (kernel_restart+0x1c/0x60) [<8013acf8>] (kernel_restart) from [<8013af84>] (__do_sys_reboot+0xe0/0x1d8) r5:81c03048 r4:00000000 [<8013aea4>] (__do_sys_reboot) from [<8013b0ec>] (sys_reboot+0x18/0x1c) r8:80101204 r7:00000058 r6:00000000 r5:00000000 r4:00000000 [<8013b0d4>] (sys_reboot) from [<80101000>] (ret_fast_syscall+0x0/0x54) Exception stack(0x9f4d9fa8 to 0x9f4d9ff0) 9fa0: 00000000 00000000 fee1dead 28121969 01234567 379ce700 9fc0: 00000000 00000000 00000000 00000058 00000000 00000000 00000000 00016d04 9fe0: 00028e0c 7ec87c64 000135ec 76c1f410 Restore original invalid input check in dma_common_free_remap() to avoid this problem. Fixes: 5cf4537975bb ("dma-mapping: introduce a dma_common_find_pages helper") Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com> [hch: just revert the offending hunk instead of creating a new helper] Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-10-05kheaders: make headers archive reproducibleDmitry Goldin
In commit 43d8ce9d65a5 ("Provide in-kernel headers to make extending kernel easier") a new mechanism was introduced, for kernels >=5.2, which embeds the kernel headers in the kernel image or a module and exposes them in procfs for use by userland tools. The archive containing the header files has nondeterminism caused by header files metadata. This patch normalizes the metadata and utilizes KBUILD_BUILD_TIMESTAMP if provided and otherwise falls back to the default behaviour. In commit f7b101d33046 ("kheaders: Move from proc to sysfs") it was modified to use sysfs and the script for generation of the archive was renamed to what is being patched. Signed-off-by: Dmitry Goldin <dgoldin+lkml@protonmail.ch> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-10-04Merge tag 'copy-struct-from-user-v5.4-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull copy_struct_from_user() helper from Christian Brauner: "This contains the copy_struct_from_user() helper which got split out from the openat2() patchset. It is a generic interface designed to copy a struct from userspace. The helper will be especially useful for structs versioned by size of which we have quite a few. This allows for backwards compatibility, i.e. an extended struct can be passed to an older kernel, or a legacy struct can be passed to a newer kernel. For the first case (extended struct, older kernel) the new fields in an extended struct can be set to zero and the struct safely passed to an older kernel. The most obvious benefit is that this helper lets us get rid of duplicate code present in at least sched_setattr(), perf_event_open(), and clone3(). More importantly it will also help to ensure that users implementing versioning-by-size end up with the same core semantics. This point is especially crucial since we have at least one case where versioning-by-size is used but with slighly different semantics: sched_setattr(), perf_event_open(), and clone3() all do do similar checks to copy_struct_from_user() while rt_sigprocmask(2) always rejects differently-sized struct arguments. With this pull request we also switch over sched_setattr(), perf_event_open(), and clone3() to use the new helper" * tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: usercopy: Add parentheses around assignment in test_copy_struct_from_user perf_event_open: switch to copy_struct_from_user() sched_setattr: switch to copy_struct_from_user() clone3: switch to copy_struct_from_user() lib: introduce copy_struct_from_user() helper