summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-11-12genirq: Fix type of shifting literal 1 in __setup_irq()Rasmus Villemoes
If ffz() ever returns a value >= 31 then the following shift is undefined behaviour because the literal 1 which gets shifted is treated as signed integer. In practice, the bug is probably harmless, since the first undefined shift count is 31 which results - ignoring UB - in (int)(0x80000000). This gets sign extended so bit 32-63 will be set as well and all subsequent __setup_irq() calls would just end up hitting the -EBUSY branch. However, a sufficiently aggressive optimizer may use the UB of 1<<31 to decide that doesn't happen, and hence elide the sign-extension code, so that subsequent calls can indeed get ffz > 31. In any case, the right thing to do is to make the literal 1UL. [ tglx: For this to happen a single interrupt would have to be shared by 32 devices. Hardware like that does not exist and would have way more problems than that. ] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20171030213548.16831-1-linux@rasmusvillemoes.dk
2017-11-12irqdomain: Drop pointless NULL check in virq_debug_show_oneRasmus Villemoes
data has been already derefenced unconditionally, so it's pointless to do a NULL pointer check on it afterwards. Drop it. [ tglx: Depersonify changelog. ] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20171112212904.28574-1-linux@rasmusvillemoes.dk
2017-11-12genirq/proc: Return proper error code when irq_set_affinity() failsWen Yaxng
write_irq_affinity() returns the number of written bytes, which means success, unconditionally whether the actual irq_set_affinity() call succeeded or not. Add proper error handling and pass the error code returned from irq_set_affinity() back to user space in case of failure. [ tglx: Fixed coding style and massaged changelog ] Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: zhong.weidong@zte.com.cn Link: https://lkml.kernel.org/r/1510106103-184761-1-git-send-email-wen.yang99@zte.com.cn
2017-11-12timers: Add a function to start/reduce a timerDavid Howells
Add a function, similar to mod_timer(), that will start a timer if it isn't running and will modify it if it is running and has an expiry time longer than the new time. If the timer is running with an expiry time that's the same or sooner, no change is made. The function looks like: int timer_reduce(struct timer_list *timer, unsigned long expires); This can be used by code such as networking code to make it easier to share a timer for multiple timeouts. For instance, in upcoming AF_RXRPC code, the rxrpc_call struct will maintain a number of timeouts: unsigned long ack_at; unsigned long resend_at; unsigned long ping_at; unsigned long expect_rx_by; unsigned long expect_req_by; unsigned long expect_term_by; each of which is set independently of the others. With timer reduction available, when the code needs to set one of the timeouts, it only needs to look at that timeout and then call timer_reduce() to modify the timer, starting it or bringing it forward if necessary. There is no need to refer to the other timeouts to see which is earliest and no need to take any lock other than, potentially, the timer lock inside timer_reduce(). Note, that this does not protect against concurrent invocations of any of the timer functions. As an example, the expect_rx_by timeout above, which terminates a call if we don't get a packet from the server within a certain time window, would be set something like this: unsigned long now = jiffies; unsigned long expect_rx_by = now + packet_receive_timeout; WRITE_ONCE(call->expect_rx_by, expect_rx_by); timer_reduce(&call->timer, expect_rx_by); The timer service code (which might, say, be in a work function) would then check all the timeouts to see which, if any, had triggered, deal with those: t = READ_ONCE(call->ack_at); if (time_after_eq(now, t)) { cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET); set_bit(RXRPC_CALL_EV_ACK, &call->events); } and then restart the timer if necessary by finding the soonest timeout that hasn't yet passed and then calling timer_reduce(). The disadvantage of doing things this way rather than comparing the timers each time and calling mod_timer() is that you *will* take timer events unless you can finish what you're doing and delete the timer in time. The advantage of doing things this way is that you don't need to use a lock to work out when the next timer should be set, other than the timer's own lock - which you might not have to take. [ tglx: Fixed weird formatting and adopted it to pending changes ] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: keyrings@vger.kernel.org Cc: linux-afs@lists.infradead.org Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
2017-11-12pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()Arnd Bergmann
__getnstimeofday() is a rather odd interface, with a number of quirks: - The caller may come from NMI context, but the implementation is not NMI safe, one way to get there from NMI is NMI handler: something bad panic() kmsg_dump() pstore_dump() pstore_record_init() __getnstimeofday() - The calling conventions are different from any other timekeeping functions, to deal with returning an error code during suspended timekeeping. Address the above issues by using a completely different method to get the time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior when timekeeping is suspended: it returns the time at which it got suspended. As Thomas Gleixner explained, this is safe, as ktime_get_real_fast_ns() does not call into the clocksource driver that might be suspended. The result can easily be transformed into a timespec structure. Since ktime_get_real_fast_ns() was not exported to modules, add the export. The pstore behavior for the suspended case changes slightly, as it now stores the timestamp at which timekeeping was suspended instead of storing a zero timestamp. This change is not addressing y2038-safety, that's subject to a more complex follow up patch. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kees Cook <keescook@chromium.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Colin Cross <ccross@android.com> Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de
2017-11-12irq/work: Use llist_for_each_entry_safeThomas Gleixner
The llist_for_each_entry() loop in irq_work_run_list() is unsafe because once the works PENDING bit is cleared it can be requeued on another CPU. Use llist_for_each_entry_safe() instead. Fixes: 16c0890dc66d ("irq/work: Don't reinvent the wheel but use existing llist API") Reported-by:Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petri Latvala <petri.latvala@intel.com> Link: http://lkml.kernel.org/r/151027307351.14762.4611888896020658384@mail.alporthouse.com
2017-11-11bpf: Revert bpf_overrid_function() helper changes.David S. Miller
NACK'd by x86 maintainer. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11bpf: add a bpf_override_function helperJosef Bacik
Error injection is sloppy and very ad-hoc. BPF could fill this niche perfectly with it's kprobe functionality. We could make sure errors are only triggered in specific call chains that we care about with very specific situations. Accomplish this with the bpf_override_funciton helper. This will modify the probe'd callers return value to the specified value and set the PC to an override function that simply returns, bypassing the originally probed function. This gives us a nice clean way to implement systematic error injection for all of our code paths. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10kthread: zero the kthread data structureShaohua Li
kthread() could bail out early before we initialize blkcg_css (if the kthread is killed very early. Please see xchg() statement in kthread()), which confuses free_kthread_struct. Instead of moving the blkcg_css initialization early, we simply zero the whole 'self' data structure, which doesn't sound much overhead. Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 05e3db95ebfc ("kthread: add a mechanism to store cgroup info") Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10blktrace: fix unlocked registration of tracepointsJens Axboe
We need to ensure that tracepoints are registered and unregistered with the users of them. The existing atomic count isn't enough for that. Add a lock around the tracepoints, so we serialize access to them. This fixes cases where we have multiple users setting up and tearing down tracepoints, like this: CPU: 0 PID: 2995 Comm: syzkaller857118 Not tainted 4.14.0-rc5-next-20171018+ #36 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x41c kernel/panic.c:183 __warn+0x1c4/0x1e0 kernel/panic.c:546 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177 do_trap_no_signal arch/x86/kernel/traps.c:211 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:260 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline] RIP: 0010:tracepoint_probe_register_prio+0x397/0x9a0 kernel/tracepoint.c:283 RSP: 0018:ffff8801d1d1f6c0 EFLAGS: 00010293 RAX: ffff8801d22e8540 RBX: 00000000ffffffef RCX: ffffffff81710f07 RDX: 0000000000000000 RSI: ffffffff85b679c0 RDI: ffff8801d5f19818 RBP: ffff8801d1d1f7c8 R08: ffffffff81710c10 R09: 0000000000000004 R10: ffff8801d1d1f6b0 R11: 0000000000000003 R12: ffffffff817597f0 R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8801d1d1f7a0 tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:304 register_trace_block_rq_insert include/trace/events/block.h:191 [inline] blk_register_tracepoints+0x1e/0x2f0 kernel/trace/blktrace.c:1043 do_blk_trace_setup+0xa10/0xcf0 kernel/trace/blktrace.c:542 blk_trace_setup+0xbd/0x180 kernel/trace/blktrace.c:564 sg_ioctl+0xc71/0x2d90 drivers/scsi/sg.c:1089 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x444339 RSP: 002b:00007ffe05bb5b18 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000006d66c0 RCX: 0000000000444339 RDX: 000000002084cf90 RSI: 00000000c0481273 RDI: 0000000000000009 RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff R13: 00000000c0481273 R14: 0000000000000000 R15: 0000000000000000 since we can now run these in parallel. Ensure that the exported helpers for doing this are grabbing the queue trace mutex. Reported-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10blktrace: fix unlocked access to init/start-stop/teardownJens Axboe
sg.c calls into the blktrace functions without holding the proper queue mutex for doing setup, start/stop, or teardown. Add internal unlocked variants, and export the ones that do the proper locking. Fixes: 6da127ad0918 ("blktrace: Add blktrace ioctls to SCSI generic devices") Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10audit: filter PATH records keyed on filesystem magicRichard Guy Briggs
Tracefs or debugfs were causing hundreds to thousands of PATH records to be associated with the init_module and finit_module SYSCALL records on a few modules when the following rule was in place for startup: -a always,exit -F arch=x86_64 -S init_module -F key=mod-load Provide a method to ignore these large number of PATH records from overwhelming the logs if they are not of interest. Introduce a new filter list "AUDIT_FILTER_FS", with a new field type AUDIT_FSTYPE, which keys off the filesystem 4-octet hexadecimal magic identifier to filter specific filesystem PATH records. An example rule would look like: -a never,filesystem -F fstype=0x74726163 -F key=ignore_tracefs -a never,filesystem -F fstype=0x64626720 -F key=ignore_debugfs Arguably the better way to address this issue is to disable tracefs and debugfs on boot from production systems. See: https://github.com/linux-audit/audit-kernel/issues/16 See: https://github.com/linux-audit/audit-userspace/issues/8 Test case: https://github.com/linux-audit/audit-testsuite/issues/42 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: fixed the whitespace damage in kernel/auditsc.c] Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10Audit: remove unused audit_log_secctx functionCasey Schaufler
The function audit_log_secctx() is unused in the upstream kernel. All it does is wrap another function that doesn't need wrapping. It claims to give you the SELinux context, but that is not true if you are using a different security module. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: Allow auditd to set pid to 0 to end auditingSteve Grubb
The API to end auditing has historically been for auditd to set the pid to 0. This patch restores that functionality. See: https://github.com/linux-audit/audit-kernel/issues/69 Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: use audit_set_enabled() in audit_enable()Paul Moore
Use audit_set_enabled() to enable auditing during early boot. This obviously won't emit an audit change record, but it will work anyway and should help prevent in future problems by consolidating the enable/disable code in one function. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: convert audit_ever_enabled to a booleanPaul Moore
We were treating it as a boolean, let's make it a boolean to help avoid future mistakes. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: don't use simple_strtol() anymorePaul Moore
The simple_strtol() function is deprecated, use kstrtol() instead. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: initialize the audit subsystem as early as possiblePaul Moore
We can't initialize the audit subsystem until after the network layer is initialized (core_initcall), but do it soon after. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: ensure that 'audit=1' actually enables audit for PID 1Paul Moore
Prior to this patch we enabled audit in audit_init(), which is too late for PID 1 as the standard initcalls are run after the PID 1 task is forked. This means that we never allocate an audit_context (see audit_alloc()) for PID 1 and therefore miss a lot of audit events generated by PID 1. This patch enables audit as early as possible to help ensure that when PID 1 is forked it can allocate an audit_context if required. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10genirq: Track whether the trigger type has been setMarc Zyngier
When requesting a shared interrupt, we assume that the firmware support code (DT or ACPI) has called irqd_set_trigger_type already, so that we can retrieve it and check that the requester is being reasonnable. Unfortunately, we still have non-DT, non-ACPI systems around, and these guys won't call irqd_set_trigger_type before requesting the interrupt. The consequence is that we fail the request that would have worked before. We can either chase all these use cases (boring), or address it in core code (easier). Let's have a per-irq_desc flag that indicates whether irqd_set_trigger_type has been called, and let's just check it when checking for a shared interrupt. If it hasn't been set, just take whatever the interrupt requester asks. Fixes: 382bd4de6182 ("genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs") Cc: stable@vger.kernel.org Reported-and-tested-by: Petr Cvek <petrcvekcz@gmail.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2017-11-10Merge branch 'linus' into x86/asm, to resolve conflictIngo Molnar
Conflicts: arch/x86/mm/mem_encrypt.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Simple cases of overlapping changes in the packet scheduler. Must easier to resolve this time. Which probably means that I screwed it up somehow. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09Merge tag 'pm-final-4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull final power management fixes from Rafael Wysocki: "These fix a regression in the schedutil cpufreq governor introduced by a recent change and blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM interface which triggers serious problems on one of these machines. Specifics: - Prevent the schedutil cpufreq governor from using the utilization of a wrong CPU in some cases which started to happen after one of the recent changes in it (Chris Redpath). - Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM interface as that causes serious issue (related to NVMe) to appear on one of these machines, even though the other Dells XPS13 9360 in somewhat different HW configurations behave correctly (Rafael Wysocki)" * tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360 cpufreq: schedutil: Examine the correct CPU when we update util
2017-11-09sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG buildsPatrick Bellasi
When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect that all SCHED_FEAT are turned into compile time constants being propagated to support compiler optimizations. Specifically, we expect that code blocks like this: if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) { /* FEATURE CODE */ } are turned into dead-code in case FEATURE_NAME defaults to FALSE, and thus being removed by the compiler from the finale image. For this mechanism to properly work it's required for the compiler to have full access, from each translation unit, to whatever is the value defined by the sched_feat macro. This macro is defined as: #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x)) and thus, the compiler can optimize that code only if the value of sysctl_sched_features is visible within each translation unit. Since: 029632fbb ("sched: Make separate sched*.c translation units") the scheduler code has been split into separate translation units however the definition of sysctl_sched_features is part of kernel/sched/core.c while, for all the other scheduler modules, it is visible only via kernel/sched/sched.h as an: extern const_debug unsigned int sysctl_sched_features Unfortunately, an extern reference does not allow the compiler to apply constants propagation. Thus, on !CONFIG_SCHED_DEBUG kernel we still end up with code to load a memory reference and (eventually) doing an unconditional jump of a chunk of code. This mechanism is unavoidable when sched_features can be turned on and off at run-time. However, this is not the case for "production" kernels compiled with !CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value which cannot be changed at run-time and thus memory loads and jumps can be avoided altogether. This patch fixes the case of !CONFIG_SCHED_DEBUG kernel by declaring a local version of the sysctl_sched_features constant for each translation unit. This will ultimately allow the compiler to perform constants propagation and dead-code pruning. Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without the patch, by running 30 iterations of: perf bench sched messaging --pipe --thread --group 4 --loop 50000 on a 40 cores Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the powersave governor to rule out variations due to frequency scaling. Statistics on the reported completion time: count mean std min 99% max v4.14-rc8 30.0 15.7831 0.176032 15.442 16.01226 16.014 v4.14-rc8+patch 30.0 15.5033 0.189681 15.232 15.93938 15.962 ... show a 1.8% speedup on average completion time and 0.5% speedup in the 99 percentile. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-09Merge branch 'pm-cpufreq-sched'Rafael J. Wysocki
* pm-cpufreq-sched: cpufreq: schedutil: Examine the correct CPU when we update util
2017-11-08cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freqViresh Kumar
'cached_raw_freq' is used to get the next frequency quickly but should always be in sync with sg_policy->next_freq. There is a case where it is not and in such cases it should be reset to avoid switching to incorrect frequencies. Consider this case for example: - policy->cur is 1.2 GHz (Max) - New request comes for 780 MHz and we store that in cached_raw_freq. - Based on 780 MHz, we calculate the effective frequency as 800 MHz. - We then see the CPU wasn't idle recently and choose to keep the next freq as 1.2 GHz. - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is 1.2 GHz. - Now if the utilization doesn't change in then next request, then the next target frequency will still be 780 MHz and it will match with cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here. Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely) Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 4.12+ <stable@vger.kernel.org> # 4.12+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-08PM / s2idle: Clear the events_check_enabled flagRajat Jain
Problem: This flag does not get cleared currently in the suspend or resume path in the following cases: * In case some driver's suspend routine returns an error. * Successful s2idle case * etc? Why is this a problem: What happens is that the next suspend attempt could fail even though the user did not enable the flag by writing to /sys/power/wakeup_count. This is 1 use case how the issue can be seen (but similar use case with driver suspend failure can be thought of): 1. Read /sys/power/wakeup_count 2. echo count > /sys/power/wakeup_count 3. echo freeze > /sys/power/wakeup_count 4. Let the system suspend, and wakeup the system using some wake source that calls pm_wakeup_event() e.g. power button or something. 5. Note that the combined wakeup count would be incremented due to the pm_wakeup_event() in the resume path. 6. After resuming the events_check_enabled flag is still set. At this point if the user attempts to freeze again (without writing to /sys/power/wakeup_count), the suspend would fail even though there has been no wake event since the past resume. Address that by clearing the flag just before a resume is completed, so that it is always cleared for the corner cases mentioned above. Signed-off-by: Rajat Jain <rajatja@google.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-08stop using '%pK' for /proc/kallsyms pointer valuesLinus Torvalds
Not only is it annoying to have one single flag for all pointers, as if that was a global choice and all kernel pointers are the same, but %pK can't get the 'access' vs 'open' time check right anyway. So make the /proc/kallsyms pointer value code use logic specific to that particular file. We do continue to honor kptr_restrict, but the default (which is unrestricted) is changed to instead take expected users into account, and restrict access by default. Right now the only actual expected user is kernel profiling, which has a separate sysctl flag for kernel profile access. There may be others. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-08module: export module signature enforcement statusBruno E. O. Meneguele
A static variable sig_enforce is used as status var to indicate the real value of CONFIG_MODULE_SIG_FORCE, once this one is set the var will hold true, but if the CONFIG is not set the status var will hold whatever value is present in the module.sig_enforce kernel cmdline param: true when =1 and false when =0 or not present. Considering this cmdline param take place over the CONFIG value when it's not set, other places in the kernel could misbehave since they would have only the CONFIG_MODULE_SIG_FORCE value to rely on. Exporting this status var allows the kernel to rely in the effective value of module signature enforcement, being it from CONFIG value or cmdline param. Signed-off-by: Bruno E. O. Meneguele <brdeoliv@redhat.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2017-11-08rcu: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Lockdep now has an integrated IRQs disabled/enabled sanity check. Just use it instead of the ad-hoc RCU version. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-15-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-13-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-12-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08irq_work: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-11-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08irq/timings: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-10-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08perf/core: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-9-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08smp/core: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-7-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08timers/hrtimer: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-6-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08timers/nohz: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-5-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08workqueue: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1509980490-4285-4-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08irq/softirqs: Use lockdep to assert IRQs are disabled/enabledFrederic Weisbecker
Use lockdep to check that IRQs are enabled or disabled as expected. This way the sanity check only shows overhead when concurrency correctness debug code is enabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S . Miller <davem@davemloft.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1509980490-4285-3-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08Merge branch 'linus' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08locking/pvqspinlock: Implement hybrid PV queued/unfair locksWaiman Long
Currently, all the lock waiters entering the slowpath will do one lock stealing attempt to acquire the lock. That helps performance, especially in VMs with over-committed vCPUs. However, the current pvqspinlocks still don't perform as good as unfair locks in many cases. On the other hands, unfair locks do have the problem of lock starvation that pvqspinlocks don't have. This patch combines the best attributes of an unfair lock and a pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair mode. A lock waiter goes into the unfair mode when there are waiters in the wait queue but the pending bit isn't set. Otherwise, it will go into the queued mode waiting in the queue for its turn. On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build (make -j<n>) was done in a VM with unpinned vCPUs 3 times with the best time selected and <n> is the number of vCPUs available. The build times of the original pvqspinlock, hybrid pvqspinlock and unfair lock with various number of vCPUs are as follows: vCPUs pvqlock hybrid pvqlock unfair lock ----- ------- -------------- ----------- 30 342.1s 329.1s 329.1s 36 314.1s 305.3s 307.3s 45 345.0s 302.1s 306.6s 54 365.4s 308.6s 307.8s 72 358.9s 293.6s 303.9s 108 343.0s 285.9s 304.2s The hybrid pvqspinlock performs better or comparable to the unfair lock. By turning on QUEUED_LOCK_STAT, the table below showed the number of lock acquisitions in unfair mode and queue mode after a kernel build with various number of vCPUs. vCPUs queued mode unfair mode ----- ----------- ----------- 30 9,130,518 294,954 36 10,856,614 386,809 45 8,467,264 11,475,373 54 6,409,987 19,670,855 72 4,782,063 25,712,180 It can be seen that as the VM became more and more over-committed, the ratio of locks acquired in unfair mode increases. This is all done automatically to get the best overall performance as possible. Using a kernel locking microbenchmark with number of locking threads equals to the number of vCPUs available on the same machine, the minimum, average and maximum (min/avg/max) numbers of locking operations done per thread in a 5-second testing interval are shown below: vCPUs hybrid pvqlock unfair lock ----- -------------- ----------- 36 822,135/881,063/950,363 75,570/313,496/ 690,465 54 542,435/581,664/625,937 35,460/204,280/ 457,172 72 397,500/428,177/499,299 17,933/150,679/ 708,001 108 257,898/288,150/340,871 3,085/181,176/1,257,109 It can be seen that the hybrid pvqspinlocks are more fair and performant than the unfair locks in this test. The table below shows the kernel build times on a smaller 2-socket 16-core 32-thread E5-2620 v4 system. vCPUs pvqlock hybrid pvqlock unfair lock ----- ------- -------------- ----------- 16 436.8s 433.4s 435.6s 36 366.2s 364.8s 364.5s 48 423.6s 376.3s 370.2s 64 433.1s 376.6s 376.8s Again, the performance of the hybrid pvqspinlock was comparable to that of the unfair lock. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Eduardo Valentin <eduval@amazon.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1510089486-3466-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pagesTom Lendacky
In order for memory pages to be properly mapped when SEV is active, it's necessary to use the PAGE_KERNEL protection attribute as the base protection. This ensures that memory mapping of, e.g. ACPI tables, receives the proper mapping attributes. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Tested-by: Borislav Petkov <bp@suse.de> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: kvm@vger.kernel.org Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: https://lkml.kernel.org/r/20171020143059.3291-11-brijesh.singh@amd.com
2017-11-07resource: Provide resource struct in resource walk callbackTom Lendacky
In preperation for a new function that will need additional resource information during the resource walk, update the resource walk callback to pass the resource structure. Since the current callback start and end arguments are pulled from the resource structure, the callback functions can obtain them from the resource structure directly. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Borislav Petkov <bp@suse.de> Tested-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: linuxppc-dev@lists.ozlabs.org Link: https://lkml.kernel.org/r/20171020143059.3291-10-brijesh.singh@amd.com
2017-11-07resource: Consolidate resource walking codeTom Lendacky
The walk_iomem_res_desc(), walk_system_ram_res() and walk_system_ram_range() functions each have much of the same code. Create a new function that consolidates the common code from these functions in one place to reduce the amount of duplicated code. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Tested-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20171020143059.3291-9-brijesh.singh@amd.com
2017-11-07locking/rwlocks: Fix commentsCheng Jian
- fix the list of locking API headers in kernel/locking/spinlock.c - fix an #endif comment Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: huawei.libin@huawei.com Cc: xiexiuqi@huawei.com Link: http://lkml.kernel.org/r/1509706788-152547-1-git-send-email-cj.chengjian@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07kprobes, x86/alternatives: Use text_mutex to protect smp_alt_modulesZhou Chengming
We use alternatives_text_reserved() to check if the address is in the fixed pieces of alternative reserved, but the problem is that we don't hold the smp_alt mutex when call this function. So the list traversal may encounter a deleted list_head if another path is doing alternatives_smp_module_del(). One solution is that we can hold smp_alt mutex before call this function, but the difficult point is that the callers of this functions, arch_prepare_kprobe() and arch_prepare_optimized_kprobe(), are called inside the text_mutex. So we must hold smp_alt mutex before we go into these arch dependent code. But we can't now, the smp_alt mutex is the arch dependent part, only x86 has it. Maybe we can export another arch dependent callback to solve this. But there is a simpler way to handle this problem. We can reuse the text_mutex to protect smp_alt_modules instead of using another mutex. And all the arch dependent checks of kprobes are inside the text_mutex, so it's safe now. Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Fixes: 2cfa197 "ftrace/alternatives: Introducing *_text_reserved functions" Link: http://lkml.kernel.org/r/1509585501-79466-1-git-send-email-zhouchengming1@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07Merge branch 'linus' into x86/apic, to resolve conflictsIngo Molnar
Conflicts: arch/x86/include/asm/x2apic.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07Merge branch 'linus' into locking/core, to resolve conflictsIngo Molnar
Conflicts: include/linux/compiler-clang.h include/linux/compiler-gcc.h include/linux/compiler-intel.h include/uapi/linux/stddef.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07Merge branch 'linus' into perf/core, to fix conflictsIngo Molnar
Conflicts: tools/perf/arch/arm/annotate/instructions.c tools/perf/arch/arm64/annotate/instructions.c tools/perf/arch/powerpc/annotate/instructions.c tools/perf/arch/s390/annotate/instructions.c tools/perf/arch/x86/tests/intel-cqm.c tools/perf/ui/tui/progress.c tools/perf/util/zlib.c Signed-off-by: Ingo Molnar <mingo@kernel.org>