summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-01-10sched/fair: Correct obsolete comment about cpufreq_update_util()Joel Fernandes
Since the remote cpufreq callback work, the cpufreq_update_util() call can happen from remote CPUs. The comment about local CPUs is thus obsolete. Update it accordingly. Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Android Kernel <kernel-team@android.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: EAS Dev <eas-dev@lists.linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rohit Jain <rohit.k.jain@oracle.com> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/fair: Remove impossible condition from find_idlest_group_cpu()Joel Fernandes
find_idlest_group_cpu() goes through CPUs of a group previous selected by find_idlest_group(). find_idlest_group() returns NULL if the local group is the selected one and doesn't execute find_idlest_group_cpu if the group to which 'cpu' belongs to is chosen. So we're always guaranteed to call find_idlest_group_cpu() with a group to which 'cpu' is non-local. This makes one of the conditions in find_idlest_group_cpu() an impossible one, which we can get rid off. Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Android Kernel <kernel-team@android.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: EAS Dev <eas-dev@lists.linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rohit Jain <rohit.k.jain@oracle.com> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/cpufreq: Don't pass flags to sugov_set_iowait_boost()Viresh Kumar
We are already passing sg_cpu as argument to sugov_set_iowait_boost() helper and the same can be used to retrieve the flags value. Get rid of the redundant argument. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael Wysocki <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: morten.rasmussen@arm.com Cc: tkjos@android.com Link: http://lkml.kernel.org/r/4ec5562b1a87e146ebab11fb5dde1ca9c763a7fb.1513158452.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/cpufreq: Initialize sg_cpu->flags to 0Viresh Kumar
Initializing sg_cpu->flags to SCHED_CPUFREQ_RT has no obvious benefit. The flags field wouldn't be used until the utilization update handler is called for the first time, and once that is called we will overwrite flags anyway. Initialize it to 0. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael Wysocki <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: morten.rasmussen@arm.com Cc: tkjos@android.com Link: http://lkml.kernel.org/r/763feda6424ced8486b25a0c52979634e6104478.1513158452.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()Joel Fernandes
capacity_spare_wake() in the slow path influences choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a sub optimal group can be chosen and hurt performance of the task being woken up. Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake(). Tests results from improvements with this change are below. More tests were also done by myself and Matt Fleming to ensure no degradation in different benchmarks. 1) Rohit ran barrier.c test (details below) with following improvements: ------------------------------------------------------------------------ This was Rohit's original use case for a patch he posted at [1] however from his recent tests he showed my patch can replace his slow path changes [1] and there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees. barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and barrier sync at the end of each for loop. Here barrier,c is running in along with ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' barrier.c can be found at: http://www.spinics.net/lists/kernel/msg2506955.html Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads Intel x86 machine: +--------+------------------+---------------------------+ |Threads | Without patch | With patch | | | | | +--------+--------+---------+-----------------+---------+ | | Mean | Std Dev | Mean | Std Dev | +--------+--------+---------+-----------------+---------+ |1 | 539.36 | 60.16 | 572.54 (+6.15%) | 40.95 | |2 | 481.01 | 19.32 | 530.64 (+10.32%)| 56.16 | |4 | 474.78 | 22.28 | 479.46 (+0.99%) | 18.89 | |8 | 450.06 | 24.91 | 447.82 (-0.50%) | 12.36 | |16 | 436.99 | 22.57 | 441.88 (+1.12%) | 7.39 | |32 | 388.28 | 55.59 | 429.4 (+10.59%)| 31.14 | |64 | 314.62 | 6.33 | 311.81 (-0.89%) | 11.99 | +--------+--------+---------+-----------------+---------+ 2) ping+hackbench test on bare-metal sever (by Rohit) ----------------------------------------------------- Here hackbench is running in threaded mode along with, running ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' This test is running on 2 socket, 20 core and 40 threads Intel x86 machine: Number of loops is 10000 and runtime is in seconds (Lower is better). +--------------+-----------------+--------------------------+ |Task Groups | Without patch | With patch | | +-------+---------+----------------+---------+ |(Groups of 40)| Mean | Std Dev | Mean | Std Dev | +--------------+-------+---------+----------------+---------+ |1 | 0.851 | 0.007 | 0.828 (+2.77%)| 0.032 | |2 | 1.083 | 0.203 | 1.087 (-0.37%)| 0.246 | |4 | 1.601 | 0.051 | 1.611 (-0.62%)| 0.055 | |8 | 2.837 | 0.060 | 2.827 (+0.35%)| 0.031 | |16 | 5.139 | 0.133 | 5.107 (+0.63%)| 0.085 | |25 | 7.569 | 0.142 | 7.503 (+0.88%)| 0.143 | +--------------+-------+---------+----------------+---------+ [1] https://patchwork.kernel.org/patch/9991635/ Matt Fleming also ran several different hackbench tests and cyclic test to santiy-check that the patch doesn't harm other usecases. Tested-by: Matt Fleming <matt@codeblueprint.co.uk> Tested-by: Rohit Jain <rohit.k.jain@oracle.com> Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Brendan Jackman <brendan.jackman@arm.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/fair: Use 'unsigned long' for utilization, consistentlyPatrick Bellasi
Utilization and capacity are tracked as 'unsigned long', however some functions using them return an 'int' which is ultimately assigned back to 'unsigned long' variables. Since there is not scope on using a different and signed type, consolidate the signature of functions returning utilization to always use the native type. This change improves code consistency, and it also benefits code paths where utilizations should be clamped by avoiding further type conversions or ugly type casts. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chris Redpath <chris.redpath@arm.com> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@android.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10sched/core: Rework and clarify prepare_lock_switch()rodrigosiqueira
The prepare_lock_switch() function has an unused parameter, and also the function name was not descriptive. To improve readability and remove the extra parameter, do the following changes: * Move prepare_lock_switch() from kernel/sched/sched.h to kernel/sched/core.c, rename it to prepare_task(), and remove the unused parameter. * Split the smp_store_release() out from finish_lock_switch() to a function named finish_task. * Comments ajdustments. Signed-off-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171215140603.gxe5i2y6fg5ojfpp@smtp.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10membarrier: Disable preemption when calling smp_call_function_many()Mathieu Desnoyers
smp_call_function_many() requires disabling preemption around the call. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: <stable@vger.kernel.org> # v4.14+ Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Andrew Hunter <ahh@google.com> Cc: Avi Kivity <avi@scylladb.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Dave Watson <davejwatson@fb.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maged Michael <maged.michael@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10PM / sleep: Make lock/unlock_system_sleep() available to kernel modulesBart Van Assche
Since pm_mutex is not exported using lock/unlock_system_sleep() from inside a kernel module causes a "pm_mutex undefined" linker error. Hence move lock/unlock_system_sleep() into kernel/power/main.c and export these. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-01-09bpf: introduce BPF_JIT_ALWAYS_ON configAlexei Starovoitov
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715. A quote from goolge project zero blog: "At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel - while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel's text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets." To make attacker job harder introduce BPF_JIT_ALWAYS_ON config option that removes interpreter from the kernel in favor of JIT-only mode. So far eBPF JIT is supported by: x64, arm64, arm32, sparc64, s390, powerpc64, mips64 The start of JITed program is randomized and code page is marked as read-only. In addition "constant blinding" can be turned on with net.core.bpf_jit_harden v2->v3: - move __bpf_prog_ret0 under ifdef (Daniel) v1->v2: - fix init order, test_bpf and cBPF (Daniel's feedback) - fix offloaded bpf (Jakub's feedback) - add 'return 0' dummy in case something can invoke prog->bpf_func - retarget bpf tree. For bpf-next the patch would need one extra hunk. It will be sent when the trees are merged back to net-next Considered doing: int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT; but it seems better to land the patch as-is and in bpf-next remove bpf_jit_enable global variable from all JITs, consolidate in one place and remove this jit_init() function. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-01-09symbol lookup: introduce dereference_symbol_descriptor()Sergey Senozhatsky
dereference_symbol_descriptor() invokes appropriate ARCH specific function descriptor dereference callbacks: - dereference_kernel_function_descriptor() if the pointer is a kernel symbol; - dereference_module_function_descriptor() if the pointer is a module symbol. This is the last step needed to make '%pS/%ps' smart enough to handle function descriptor dereference on affected ARCHs and to retire '%pF/%pf'. To refresh it: Some architectures (ia64, ppc64, parisc64) use an indirect pointer for C function pointers - the function pointer points to a function descriptor and we need to dereference it to get the actual function pointer. Function descriptors live in .opd elf section and all affected ARCHs (ia64, ppc64, parisc64) handle it properly for kernel and modules. So we, technically, can decide if the dereference is needed by simply looking at the pointer: if it belongs to .opd section then we need to dereference it. The kernel and modules have their own .opd sections, obviously, that's why we need to split dereference_function_descriptor() and use separate kernel and module dereference arch callbacks. Link: http://lkml.kernel.org/r/20171206043649.GB15885@jagdpanzerIV Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: James Bottomley <jejb@parisc-linux.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-ia64@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Tested-by: Tony Luck <tony.luck@intel.com> #ia64 Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc Tested-by: Helge Deller <deller@gmx.de> #parisc64 Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-09sections: split dereference_function_descriptor()Sergey Senozhatsky
There are two format specifiers to print out a pointer in symbolic format: '%pS/%ps' and '%pF/%pf'. On most architectures, the two mean exactly the same thing, but some architectures (ia64, ppc64, parisc64) use an indirect pointer for C function pointers, where the function pointer points to a function descriptor (which in turn contains the actual pointer to the code). The '%pF/%pf, when used appropriately, automatically does the appropriate function descriptor dereference on such architectures. The "when used appropriately" part is tricky. Basically this is a subtle ABI detail, specific to some platforms, that made it to the API level and people can be unaware of it and miss the whole "we need to dereference the function" business out. [1] proves that point (note that it fixes only '%pF' and '%pS', there might be '%pf' and '%ps' cases as well). It appears that we can handle everything within the affected arches and make '%pS/%ps' smart enough to retire '%pF/%pf'. Function descriptors live in .opd elf section and all affected arches (ia64, ppc64, parisc64) handle it properly for kernel and modules. So we, technically, can decide if the dereference is needed by simply looking at the pointer: if it belongs to .opd section then we need to dereference it. The kernel and modules have their own .opd sections, obviously, that's why we need to split dereference_function_descriptor() and use separate kernel and module dereference arch callbacks. This patch does the first step, it a) adds dereference_kernel_function_descriptor() function. b) adds a weak alias to dereference_module_function_descriptor() function. So, for the time being, we will have: 1) dereference_function_descriptor() A generic function, that simply dereferences the pointer. There is bunch of places that call it: kgdbts, init/main.c, extable, etc. 2) dereference_kernel_function_descriptor() A function to call on kernel symbols that does kernel .opd section address range test. 3) dereference_module_function_descriptor() A function to call on modules' symbols that does modules' .opd section address range test. [1] https://marc.info/?l=linux-kernel&m=150472969730573 Link: http://lkml.kernel.org/r/20171109234830.5067-2-sergey.senozhatsky@gmail.com To: Fenghua Yu <fenghua.yu@intel.com> To: Benjamin Herrenschmidt <benh@kernel.crashing.org> To: Paul Mackerras <paulus@samba.org> To: Michael Ellerman <mpe@ellerman.id.au> To: James Bottomley <jejb@parisc-linux.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-ia64@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Tested-by: Tony Luck <tony.luck@intel.com> #ia64 Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc Tested-by: Helge Deller <deller@gmx.de> #parisc64 Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-09bpf: prevent out-of-bounds speculationAlexei Starovoitov
Under speculation, CPUs may mis-predict branches in bounds checks. Thus, memory accesses under a bounds check may be speculated even if the bounds check fails, providing a primitive for building a side channel. To avoid leaking kernel data round up array-based maps and mask the index after bounds check, so speculated load with out of bounds index will load either valid value from the array or zero from the padded area. Unconditionally mask index for all array types even when max_entries are not rounded to power of 2 for root user. When map is created by unpriv user generate a sequence of bpf insns that includes AND operation to make sure that JITed code includes the same 'index & index_mask' operation. If prog_array map is created by unpriv user replace bpf_tail_call(ctx, map, index); with if (index >= max_entries) { index &= map->index_mask; bpf_tail_call(ctx, map, index); } (along with roundup to power 2) to prevent out-of-bounds speculation. There is secondary redundant 'if (index >= max_entries)' in the interpreter and in all JITs, but they can be optimized later if necessary. Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array) cannot be used by unpriv, so no changes there. That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on all architectures with and without JIT. v2->v3: Daniel noticed that attack potentially can be crafted via syscall commands without loading the program, so add masking to those paths as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-08memremap: merge find_dev_pagemap into get_dev_pagemapChristoph Hellwig
There is only one caller of the trivial function find_dev_pagemap left, so just merge it into the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08memremap: change devm_memremap_pages interface to use struct dev_pagemapChristoph Hellwig
This new interface is similar to how struct device (and many others) work. The caller initializes a 'struct dev_pagemap' as required and calls 'devm_memremap_pages'. This allows the pagemap structure to be embedded in another structure and thus container_of can be used. In this way application specific members can be stored in a containing struct. This will be used by the P2P infrastructure and HMM could probably be cleaned up to use it as well (instead of having it's own, similar 'hmm_devmem_pages_create' function). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08memremap: drop private struct page_mapLogan Gunthorpe
'struct page_map' is a private structure of 'struct dev_pagemap' but the latter replicates all the same fields as the former so there isn't much value in it. Thus drop it in favour of a completely public struct. This is a clean up in preperation for a more generally useful 'devm_memeremap_pages' interface. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08memremap: simplify duplicate region handling in devm_memremap_pagesChristoph Hellwig
__radix_tree_insert already checks for duplicates and returns -EEXIST in that case, so remove the duplicate (and racy) duplicates check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08memremap: remove to_vmem_altmapChristoph Hellwig
All callers are gone now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08mm: optimize dev_pagemap reference counting around get_dev_pagemapChristoph Hellwig
Change the calling convention so that get_dev_pagemap always consumes the previous reference instead of doing this using an explicit earlier call to put_dev_pagemap in the callers. The callers will still need to put the final reference after finishing the loop over the pages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08mm: move get_dev_pagemap out of lineChristoph Hellwig
This is a pretty big function, which should be out of line in general, and a no-op stub if CONFIG_ZONE_DEVICЕ is not set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08mm: pass the vmem_altmap to memmap_init_zoneChristoph Hellwig
Pass the vmem_altmap two levels down instead of needing a lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08mm: pass the vmem_altmap to arch_remove_memory and __remove_pagesChristoph Hellwig
We can just pass this on instead of having to do a radix tree lookup without proper locking 2 levels into the callchain. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08mm: pass the vmem_altmap to arch_add_memory and __add_pagesChristoph Hellwig
We can just pass this on instead of having to do a radix tree lookup without proper locking 2 levels into the callchain. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08Merge branch 'for-4.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "This contains fixes for the following two non-trivial issues: - The task iterator got broken while adding thread mode support for v4.14. It was less visible because it only triggers when both cgroup1 and cgroup2 hierarchies are in use. The recent versions of systemd uses cgroup2 for process management even when cgroup1 is used for resource control exposing this issue. - cpuset CPU hotplug path could deadlock when racing against exits. There also are two patches to replace unlimited strcpy() usages with strlcpy()" * 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC cgroup: Fix deadlock in cpu hotplug path cgroup: use strlcpy() instead of strscpy() to avoid spurious warning cgroup: avoid copying strings longer than the buffers
2018-01-08irq/work: Improve the flag definitionsBartosz Golaszewski
IRQ_WORK_FLAGS is defined simply to 3UL. This is confusing as it says nothing about its purpose. Define IRQ_WORK_FLAGS as a bitwise OR of IRQ_WORK_PENDING and IRQ_WORK_BUSY and change its name to IRQ_WORK_CLAIMED. While we're at it: use the BIT() macro for all flags. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1515125996-21564-1-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-08bpf: fix verifier GPF in kmalloc failure pathAlexei Starovoitov
syzbot reported the following panic in the verifier triggered by kmalloc error injection: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline] RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431 Call Trace: pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449 push_stack kernel/bpf/verifier.c:491 [inline] check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline] do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731 bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489 bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198 SYSC_bpf kernel/bpf/syscall.c:1807 [inline] SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769 when copy_verifier_state() aborts in the middle due to kmalloc failure some of the frames could have been partially copied while current free_verifier_state() loop for (i = 0; i <= state->curframe; i++) assumed that all frames are non-null. Simply fix it by adding 'if (!state)' to free_func_state(). Also avoid stressing copy frame logic more if kzalloc fails in push_stack() free env->cur_state right away. Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)") Reported-by: syzbot+32ac5a3e473f2e01cfc7@syzkaller.appspotmail.com Reported-by: syzbot+fa99e24f3c29d269a7d5@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-08locking/lockdep: Remove cross-release leftoversIngo Molnar
There's two cross-release leftover facilities: - the crossrelease_hist_*() irq-tracing callbacks (NOPs currently) - the complete_release_commit() callback (NOP as well) Remove them. Cc: David Sterba <dsterba@suse.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-08perf: Return empty callchain instead of NULLJiri Olsa
It simplifies the code a bit, because we dump the callchain Link: http://lkml.kernel.org/n/tip-uqp7qd6aif47g39glnbu95yl@git.kernel.org even if it's empty. With 'empty' callchain we can remove all the NULL-checking code paths. Original-patch-from: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20180107160356.28203-7-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08perf: Make perf_callchain function staticJiri Olsa
And move it to core.c, because there's no caller of this function other than the one in core.c Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180107160356.28203-6-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08perf: Allocate context task_ctx_data for child eventJiri Olsa
Currently we use perf_event_context::task_ctx_data to save and restore the LBR status when the task is scheduled out and in. We don't allocate it for child contexts, which results in shorter task's LBR stack, because we don't save the history from previous run and start over every time we schedule the task in. I made a test to generate samples with LBR call stack and got higher numbers on bigger chain depths: before: after: LBR call chain: nr: 1 60561 498127 LBR call chain: nr: 2 0 0 LBR call chain: nr: 3 107030 2172 LBR call chain: nr: 4 466685 62758 LBR call chain: nr: 5 2307319 878046 LBR call chain: nr: 6 48713 495218 LBR call chain: nr: 7 1040 4551 LBR call chain: nr: 8 481 172 LBR call chain: nr: 9 878 120 LBR call chain: nr: 10 2377 6698 LBR call chain: nr: 11 28830 151487 LBR call chain: nr: 12 29347 339867 LBR call chain: nr: 13 4 22 LBR call chain: nr: 14 3 53 Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Fixes: 4af57ef28c2c ("perf: Add pmu specific data for perf task context") Link: http://lkml.kernel.org/r/20180107160356.28203-4-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08workqueue: allow WQ_MEM_RECLAIM on early init workqueuesTejun Heo
Workqueues can be created early during boot before workqueue subsystem in fully online - work items are queued waiting for later full initialization. However, early init wasn't supported for WQ_MEM_RECLAIM workqueues causing unnecessary annoyances for a subset of users. Expand early init support to include WQ_MEM_RECLAIM workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
2018-01-08workqueue: separate out init_rescuer()Tejun Heo
Separate out init_rescuer() from __alloc_workqueue_key() to prepare for early init support for WQ_MEM_RECLAIM. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
2018-01-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: - untangle sys_close() abuses in xt_bpf - deal with register_shrinker() failures in sget() * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'" sget(): handle failures of register_shrinker() mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
2018-01-07bpf: sockmap missing NULL psock checkJohn Fastabend
Add psock NULL check to handle a racing sock event that can get the sk_callback_lock before this case but after xchg happens causing the refcnt to hit zero and sock user data (psock) to be null and queued for garbage collection. Also add a comment in the code because this is a bit subtle and not obvious in my opinion. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-06bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace mapYonghong Song
Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not supported for stacktrace map. However, there are use cases where user space wants to enumerate all stacktrace map entries where BPF_MAP_GET_NEXT_KEY command will be really helpful. In addition, if user space wants to delete all map entries in order to save memory and does not want to close the map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve performance if map entries are sparsely populated. The implementation has similar behavior for BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides a NULL key pointer or an invalid key, the first key is returned. Otherwise, the first valid key after the input parameter "key" is returned, or -ENOENT if no valid key can be found. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-06block: convert to bio_first_bvec_all & bio_first_page_allMing Lei
This patch converts to bio_first_bvec_all() & bio_first_page_all() for retrieving the 1st bvec/page, and prepares for supporting multipage bvec. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"Al Viro
Descriptor table is a shared object; it's not a place where you can stick temporary references to files, especially when we don't need an opened file at all. Cc: stable@vger.kernel.org # v4.14 Fixes: 98589a0998b8 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05irq debug: do not use print_symbol()Sergey Senozhatsky
print_symbol() is a very old API that has been obsoleted by %pS format specifier in a normal printk() call. Replace print_symbol() with a direct printk("%pS") call and avoid using continuous lines. Link: http://lkml.kernel.org/r/20171212073453.21455-1-sergey.senozhatsky@gmail.com To: Andrew Morton <akpm@linux-foundation.org> To: Russell King <linux@armlinux.org.uk> To: Catalin Marinas <catalin.marinas@arm.com> To: Mark Salter <msalter@redhat.com> To: Tony Luck <tony.luck@intel.com> To: David Howells <dhowells@redhat.com> To: Yoshinori Sato <ysato@users.sourceforge.jp> To: Guan Xuetao <gxt@mprc.pku.edu.cn> To: Borislav Petkov <bp@alien8.de> To: Greg Kroah-Hartman <gregkh@linuxfoundation.org> To: Thomas Gleixner <tglx@linutronix.de> To: Peter Zijlstra <peterz@infradead.org> To: Vineet Gupta <vgupta@synopsys.com> To: Fengguang Wu <fengguang.wu@intel.com> To: David Laight <David.Laight@ACULAB.COM> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Petr Mladek <pmladek@suse.com> Cc: LKML <linux-kernel@vger.kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-am33-list@redhat.com Cc: linux-sh@vger.kernel.org Cc: linux-edac@vger.kernel.org Cc: x86@kernel.org Cc: linux-snps-arc@lists.infradead.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> [pmladek@suse.com: updated commit message] Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-05PM: hibernate: Do not subtract NR_FILE_MAPPED in minimum_image_size()Rainer Fiebig
s2disk/s2both may fail unnecessarily and erratically if NR_FILE_MAPPED is high - for instance when using VMs with VirtualBox and perhaps VMware Player. In those situations s2disk becomes unreliable and therefore unusable. A typical scenario is: user issues a s2disk and it fails. User issues a second s2disk immediately after that and it succeeds. And user wonders why. The problem is caused by minimum_image_size() in snapshot.c. The value it returns is roughly 100% too high because NR_FILE_MAPPED is subtracted in its calculation. Eventually the number of preallocated image pages is falsely too low. This doesn't matter as long as NR_FILE_MAPPED-values are in a normal range or in 32bit-environments as the code allows for allocation of additional pages from highmem. But with the high values generated by VirtualBox-VMs (a 2-GB-VM causes NR_FILE_MAPPED go up by 2 GB) it may lead to failure in 64bit-systems. Not subtracting NR_FILE_MAPPED in minimum_image_size() solves the problem. I've done at least hundreds of successful s2both/s2disk now on an x86_64 system (with and without VirtualBox) which gives me some confidence that this is right. It has turned s2disk/s2both from unusable into 100% reliable. Link: https://bugzilla.kernel.org/show_bug.cgi?id=97201 Signed-off-by: Rainer Fiebig <jrf@mailbox.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-01-05padata: add SPDX identifierCheah Kok Cheong
Add SPDX license identifier according to the type of license text found in the file. Cc: Philippe Ombredanne <pombredanne@nexb.com> Signed-off-by: Cheah Kok Cheong <thrust73@gmail.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-01-04kernel/exit.c: export abort() to modulesAndrew Morton
gcc -fisolate-erroneous-paths-dereference can generate calls to abort() from modular code too. [arnd@arndb.de: drop duplicate exports of abort()] Link: http://lkml.kernel.org/r/20180102103311.706364-1-arnd@arndb.de Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Brodkin <Alexey.Brodkin@synopsys.com> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-04kernel/acct.c: fix the acct->needcheck check in check_free_space()Oleg Nesterov
As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check is very wrong, we need time_is_after_jiffies() to make sys_acct() work. Ignoring the overflows, the code should "goto out" if needcheck > jiffies, while currently it checks "needcheck < jiffies" and thus in the likely case check_free_space() does nothing until jiffies overflow. In particular this means that sys_acct() is simply broken, acct_on() sets acct->needcheck = jiffies and expects that check_free_space() should set acct->active = 1 after the free-space check, but this won't happen if jiffies increments in between. This was broken by commit 32dc73086015 ("get rid of timer in kern/acct.c") in 2011, then another (correct) commit 795a2f22a8ea ("acct() should honour the limits from the very beginning") made the problem more visible. Link: http://lkml.kernel.org/r/20171213133940.GA6554@redhat.com Fixes: 32dc73086015 ("get rid of timer in kern/acct.c") Reported-by: TSUKADA Koutaro <tsukada@ascade.co.jp> Suggested-by: TSUKADA Koutaro <tsukada@ascade.co.jp> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-04bpf: only build sockmap with CONFIG_INETJohn Fastabend
The sockmap infrastructure is only aware of TCP sockets at the moment. In the future we plan to add UDP. In both cases CONFIG_NET should be built-in. So lets only build sockmap if CONFIG_INET is enabled. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-04bpf: sockmap remove unused functionJohn Fastabend
This was added for some work that was eventually factored out but the helper call was missed. Remove it now and add it back later if needed. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-04posix-timers: Prevent UB from shifting negative signed valueNick Desaulniers
Shifting a negative signed number is undefined behavior. Looking at the macros MAKE_PROCESS_CPUCLOCK and FD_TO_CLOCKID, it seems that the subexpression: (~(clockid_t) (pid) << 3) where clockid_t resolves to a signed int, which once negated, is undefined behavior to shift the value of if the results thus far are negative. It was further suggested to make these macros into inline functions. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dimitri Sivanich <sivanich@hpe.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-kselftest@vger.kernel.org Cc: Shuah Khan <shuah@kernel.org> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Link: https://lkml.kernel.org/r/1514517100-18051-1-git-send-email-nick.desaulniers@gmail.com
2018-01-04printk: add console_msg_format command line optionSergey Senozhatsky
0day and kernelCI automatically parse kernel log - basically some sort of grepping using the pre-defined text patterns - in order to detect and report regressions/errors. There are several sources they get the kernel logs from: a) dmesg or /proc/ksmg This is the preferred way. Because `dmesg --raw' (see later Note) and /proc/kmsg output contains facility and log level, which greatly simplifies grepping for EMERG/ALERT/CRIT/ERR messages. b) serial consoles This option is harder to maintain, because serial console messages don't contain facility and log level. This patch introduces a `console_msg_format=' command line option, to switch between different message formatting on serial consoles. For the time being we have just two options - default and syslog. The "default" option just keeps the existing format. While the "syslog" option makes serial console messages to appear in syslog format [syslog() syscall], matching the `dmesg -S --raw' and `cat /proc/kmsg' output formats: - facility and log level - time stamp (depends on printk_time/PRINTK_TIME) - message <%u>[time stamp] text\n NOTE: while Kevin and Fengguang talk about "dmesg --raw", it's actually "dmesg -S --raw" that always prints messages in syslog format [per Petr Mladek]. Running "dmesg --raw" may produce output in non-syslog format sometimes. console_msg_format=syslog enables syslog format, thus in documentation we mention "dmesg -S --raw", not "dmesg --raw". Per Kevin Hilman: : Right now we can get this info from a "dmesg --raw" after bootup, : but it would be really nice in certain automation frameworks to : have a kernel command-line option to enable printing of loglevels : in default boot log. : : This is especially useful when ingesting kernel logs into advanced : search/analytics frameworks (I'm playing with and ELK stack: Elastic : Search, Logstash, Kibana). : : The other important reason for having this on the command line is that : for testing linux-next (and other bleeding edge developer branches), : it's common that we never make it to userspace, so can't even run : "dmesg --raw" (or equivalent.) So we really want this on the primary : boot (serial) console. Per Fengguang Wu, 0day scripts should quickly benefit from that feature, because they will be able to switch to a more reliable parsing, based on messages' facility and log levels [1]: `#{grep} -a -E -e '^<[0123]>' -e '^kern :(err |crit |alert |emerg )' instead of doing text pattern matching `#{grep} -a -F -f /lkp/printk-error-messages #{kmsg_file} | grep -a -v -E -f #{LKP_SRC}/etc/oops-pattern | grep -a -v -F -f #{LKP_SRC}/etc/kmsg-blacklist` [1] https://github.com/fengguang/lkp-tests/blob/master/lib/dmesg.rb Link: http://lkml.kernel.org/r/20171221054149.4398-1-sergey.senozhatsky@gmail.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Kevin Hilman <khilman@baylibre.com> Cc: Mark Brown <broonie@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: LKML <linux-kernel@vger.kernel.org> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Kevin Hilman <khilman@baylibre.com> Tested-by: Kevin Hilman <khilman@baylibre.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-03signal: Simplify and fix kdb_send_sigEric W. Biederman
- Rename from kdb_send_sig_info to kdb_send_sig As there is no meaningful siginfo sent - Use SEND_SIG_PRIV instead of generating a siginfo for a kdb signal. The generated siginfo had a bogus rationale and was not correct in the face of pid namespaces. SEND_SIG_PRIV is simpler and actually correct. - As the code grabs siglock just send the signal with siglock held instead of dropping siglock and attempting to grab it again. - Move the sig_valid test into kdb_kill where it can generate a good error message. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>