summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-05-21irqdomain: Allow software nodes for IRQ domain creationAndy Shevchenko
In some cases we need to have an IRQ domain created out of software node. One of such cases is DesignWare GPIO driver when it's instantiated from half-baked ACPI table (alas, we can't fix it for devices which are few years on market) and thus using software nodes to quirk this. But the driver is using IRQ domains based on per GPIO port firmware nodes, which are in the above case software ones. This brings a warning message to be printed [ 73.957183] irq: Invalid fwnode type for irqdomain and creates an anonymous IRQ domain without a debugfs entry. Allowing software nodes to be valid for IRQ domains rids us of the warning and debugs gets correctly populated. % ls -1 /sys/kernel/debug/irq/domains/ ... intel-quark-dw-apb-gpio:portA Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> [maz: refactored commit message] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200520164927.39090-3-andriy.shevchenko@linux.intel.com
2020-05-21irqdomain: Get rid of special treatment for ACPI in __irq_domain_add()Andy Shevchenko
Now that __irq_domain_add() is able to better deals with generic fwnodes, there is no need to special-case ACPI anymore. Get rid of the special treatment for ACPI. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200520164927.39090-2-andriy.shevchenko@linux.intel.com
2020-05-21irqdomain: Make __irq_domain_add() less OF-dependentAndy Shevchenko
__irq_domain_add() relies in some places on the fact that the fwnode can be only of type OF. This prevents refactoring of the code to support other types of fwnode. Make it less OF-dependent by switching it to use the fwnode directly where it makes sense. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200520164927.39090-1-andriy.shevchenko@linux.intel.com
2020-05-20bpf: Prevent mmap()'ing read-only maps as writableAndrii Nakryiko
As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to be frozen and is read-only on BPF program side, because that allows user-space to actually store a writable view to the page even after it is frozen. This is exacerbated by BPF verifier making a strong assumption that contents of such frozen map will remain unchanged. To prevent this, disallow mapping BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever. [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/ Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com
2020-05-20audit: add subj creds to NETFILTER_CFG record toRichard Guy Briggs
Some table unregister actions seem to be initiated by the kernel to garbage collect unused tables that are not initiated by any userspace actions. It was found to be necessary to add the subject credentials to cover this case to reveal the source of these actions. A sample record: The uid, auid, tty, ses and exe fields have not been included since they are in the SYSCALL record and contain nothing useful in the non-user context. Here are two sample orphaned records: type=NETFILTER_CFG msg=audit(2020-05-20 12:14:36.505:5) : table=filter family=ipv4 entries=0 op=register pid=1 subj=kernel comm=swapper/0 type=NETFILTER_CFG msg=audit(2020-05-20 12:15:27.701:301) : table=nat family=bridge entries=0 op=unregister pid=30 subj=system_u:system_r:kernel_t:s0 comm=kworker/u4:1 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-05-20exec: Teach prepare_exec_creds how exec treats uids & gidsEric W. Biederman
It is almost possible to use the result of prepare_exec_creds with no modifications during exec. Update prepare_exec_creds to initialize the suid and the fsuid to the euid, and the sgid and the fsgid to the egid. This is all that is needed to handle the common case of exec when nothing special like a setuid exec is happening. That this preserves the existing behavior of exec can be verified by examing bprm_fill_uid and cap_bprm_set_creds. This change makes it clear that the later parts of exec that update bprm->cred are just need to handle special cases such as setuid exec and change of domains. Link: https://lkml.kernel.org/r/871rng22dm.fsf_-_@x220.int.ebiederm.org Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-20fs: rename pipe_buf ->steal to ->try_stealChristoph Hellwig
And replace the arcane return value convention with a simple bool where true means success and false means failure. [AV: braino fix folded in] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20fs: make the pipe_buf_operations ->confirm operation optionalChristoph Hellwig
Just return 0 for success if it is not present. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20fs: make the pipe_buf_operations ->steal operation optionalChristoph Hellwig
Just return 1 for failure if it is not present. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20trace: remove tracing_pipe_buf_opsChristoph Hellwig
tracing_pipe_buf_ops has identical ops to default_pipe_buf_ops, so use that instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20Merge branch 'topic/uaccess-ppc' into nextMichael Ellerman
Merge our uaccess-ppc topic branch. It is based on the uaccess topic branch that we're sharing with Viro. This includes the addition of user_[read|write]_access_begin(), as well as some powerpc specific changes to our uaccess routines that would conflict badly if merged separately.
2020-05-20Merge tag 'noinstr-x86-kvm-2020-05-16' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
2020-05-19locking/lockdep: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200507185804.GA15036@embeddedor
2020-05-19perf/core: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor
2020-05-19sched: Defend cfs and rt bandwidth quota against overflowHuaixin Chang
When users write some huge number into cpu.cfs_quota_us or cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of schedulable checks. to_ratio() could be altered to avoid unnecessary internal overflow, but min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to prevent overflow. Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
2020-05-19sched/cpuacct: Fix charge cpuacct.usage_sysMuchun Song
The user_mode(task_pt_regs(tsk)) always return true for user thread, and false for kernel thread. So it means that the cpuacct.usage_sys is the time that kernel thread uses not the time that thread uses in the kernel mode. We can try get_irq_regs() first, if it is NULL, then we can fall back to task_pt_regs(). Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200420070453.76815-1-songmuchun@bytedance.com
2020-05-19sched/fair: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200507192141.GA16183@embeddedor
2020-05-19sched/pelt: Sync util/runnable_sum with PELT window when propagatingVincent Guittot
update_tg_cfs_*() propagate the impact of the attach/detach of an entity down into the cfs_rq hierarchy and must keep the sync with the current pelt window. Even if we can't sync child cfs_rq and its group se, we can sync the group se and its parent cfs_rq with current position in the PELT window. In fact, we must keep them sync in order to stay also synced with others entities and group entities that are already attached to the cfs_rq. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200506155301.14288-1-vincent.guittot@linaro.org
2020-05-19sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()Muchun Song
The cpuacct_charge() and cpuacct_account_field() are called with rq->lock held, and this means preemption(and IRQs) are indeed disabled, so it is safe to use __this_cpu_*() to allow for better code-generation. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200507031039.32615-1-songmuchun@bytedance.com
2020-05-19sched/fair: Optimize enqueue_task_fair()Vincent Guittot
enqueue_task_fair jumps to enqueue_throttle label when cfs_rq_of(se) is throttled which means that se can't be NULL in such case and we can move the label after the if (!se) statement. Futhermore, the latter can be removed because se is always NULL when reaching this point. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200513135502.4672-1-vincent.guittot@linaro.org
2020-05-19Merge branch 'sched/urgent'Peter Zijlstra
2020-05-19sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq listVincent Guittot
Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair() are quite close and follow the same sequence for enqueuing an entity in the cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as enqueue_task_fair(). This fixes a problem already faced with the latter and add an optimization in the last for_each_sched_entity loop. Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning) Reported-by Tao Zhou <zohooouoto@zoho.com.cn> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
2020-05-19sched/debug: Fix requested task uclamp values shown in procfsPavankumar Kondeti
The intention of commit 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs") was to print requested and effective task uclamp values. The requested values printed are read from p->uclamp, which holds the last effective values. Fix this by printing the values from p->uclamp_req. Fixes: 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs") Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org
2020-05-19sched/fair: Fix enqueue_task_fair() warning some morePhil Auld
sched/fair: Fix enqueue_task_fair warning some more The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning) did not fully resolve the issues with the rq->tmp_alone_branch != &rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where the first for_each_sched_entity loop exits due to on_rq, having incompletely updated the list. In this case the second for_each_sched_entity loop can further modify se. The later code to fix up the list management fails to do what is needed because se does not point to the sched_entity which broke out of the first loop. The list is not fixed up because the throttled parent was already added back to the list by a task enqueue in a parallel child hierarchy. Address this by calling list_add_leaf_cfs_rq if there are throttled parents while doing the second for_each_sched_entity loop. Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning") Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com
2020-05-19bpf: Add get{peer, sock}name attach types for sock_addrDaniel Borkmann
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be transparent to applications. In Cilium we make use of these hooks [0] in order to enable E-W load balancing for existing Kubernetes service types for all Cilium managed nodes in the cluster. Those backends can be local or remote. The main advantage of this approach is that it operates as close as possible to the socket, and therefore allows to avoid packet-based NAT given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses. This also allows to expose NodePort services on loopback addresses in the host namespace, for example. As another advantage, this also efficiently blocks bind requests for applications in the host namespace for exposed ports. However, one missing item is that we also need to perform reverse xlation for inet{,6}_getname() hooks such that we can return the service IP/port tuple back to the application instead of the remote peer address. The vast majority of applications does not bother about getpeername(), but in a few occasions we've seen breakage when validating the peer's address since it returns unexpectedly the backend tuple instead of the service one. Therefore, this trivial patch allows to customise and adds a getpeername() as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address this situation. Simple example: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80 Before; curl's verbose output example, no getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] After; with getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed peer to the context similar as in inet{,6}_getname() fashion, but API-wise this is suboptimal as it always enforces programs having to test for ctx->peer which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split. Similarly, the checked return code is on tnum_range(1, 1), but if a use case comes up in future, it can easily be changed to return an error code instead. Helper and ctx member access is the same as with connect/sendmsg/etc hooks. [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19PM: hibernate: Split off snapshot dev optionDomenico Andreoli
Make it possible to reduce the attack surface in case the snapshot device is not to be used from userspace. Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19PM: hibernate: Incorporate concurrency handlingDomenico Andreoli
Hibernation concurrency handling is currently delegated to user.c, where it's also used for regulating the access to the snapshot device. In the prospective of making user.c a separate configuration option, such mutual exclusion is brought into hibernate.c and made available through accessor helpers hereby introduced. Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19kprobes: Prevent probes in .noinstr.text sectionThomas Gleixner
Instrumentation is forbidden in the .noinstr.text section. Make kprobes respect this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20200505134100.179862032@linutronix.de
2020-05-19rcu: Provide __rcu_is_watching()Thomas Gleixner
Same as rcu_is_watching() but without the preempt_disable/enable() pair inside the function. It is merked noinstr so it ends up in the non-instrumentable text section. This is useful for non-preemptible code especially in the low level entry section. Using rcu_is_watching() there results in a call to the preempt_schedule_notrace() thunk which triggers noinstr section warnings in objtool. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200512213810.518709291@linutronix.de
2020-05-19rcu: Provide rcu_irq_exit_preempt()Thomas Gleixner
Interrupts and exceptions invoke rcu_irq_enter() on entry and need to invoke rcu_irq_exit() before they either return to the interrupted code or invoke the scheduler due to preemption. The general assumption is that RCU idle code has to have preemption disabled so that a return from interrupt cannot schedule. So the return from interrupt code invokes rcu_irq_exit() and preempt_schedule_irq(). If there is any imbalance in the rcu_irq/nmi* invocations or RCU idle code had preemption enabled then this goes unnoticed until the CPU goes idle or some other RCU check is executed. Provide rcu_irq_exit_preempt() which can be invoked from the interrupt/exception return code in case that preemption is enabled. It invokes rcu_irq_exit() and contains a few sanity checks in case that CONFIG_PROVE_RCU is enabled to catch such issues directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134904.364456424@linutronix.de
2020-05-19rcu: Make RCU IRQ enter/exit functions rely on in_nmi()Paul E. McKenney
The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an "irq" parameter that indicates whether these functions have been invoked from an irq handler (irq==true) or an NMI handler (irq==false). However, recent changes have applied notrace to a few critical functions such that rcu_nmi_enter_common() and rcu_nmi_exit_common() many now rely on in_nmi(). Note that in_nmi() works no differently than before, but rather that tracing is now prohibited in code regions where in_nmi() would incorrectly report NMI state. Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(), respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.617130349@linutronix.de
2020-05-19rcu/tree: Mark the idle relevant functions noinstrThomas Gleixner
These functions are invoked from context tracking and other places in the low level entry code. Move them into the .noinstr.text section to exclude them from instrumentation. Mark the places which are safe to invoke traceable functions with instrumentation_begin/end() so objtool won't complain. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lkml.kernel.org/r/20200505134100.575356107@linutronix.de
2020-05-19sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exceptionPeter Zijlstra
SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(), remove it from the generic code and into the SuperH code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20200505134101.248881738@linutronix.de
2020-05-19lockdep: Always inline lockdep_{off,on}()Peter Zijlstra
These functions are called {early,late} in nmi_{enter,exit} and should not be traced or probed. They are also puny, so 'inline' them. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.048523500@linutronix.de
2020-05-19printk: Disallow instrumenting print_nmi_enter()Peter Zijlstra
It happens early in nmi_enter(), no tracing, probing or other funnies allowed. Specifically as nmi_enter() will be used in do_debug(), which would cause recursive exceptions when kprobed. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.139720912@linutronix.de
2020-05-19printk: Prepare for nested printk_nmi_enter()Petr Mladek
There is plenty of space in the printk_context variable. Reserve one byte there for the NMI context to be on the safe side. It should never overflow. The BUG_ON(in_nmi() == NMI_MASK) in nmi_enter() will trigger much earlier. Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.681374113@linutronix.de
2020-05-19Merge tag 'noinstr-lds-2020-05-19' into core/rcuThomas Gleixner
Get the noinstr section and annotation markers to base the RCU parts on.
2020-05-19lockdep: Prepare for noinstr sectionsPeter Zijlstra
Force inlining and prevent instrumentation of all sorts by marking the functions which are invoked from low level entry code with 'noinstr'. Split the irqflags tracking into two parts. One which does the heavy lifting while RCU is watching and the final one which can be invoked after RCU is turned off. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.484532537@linutronix.de
2020-05-19tracing: Provide lockdep less trace_hardirqs_on/off() variantsThomas Gleixner
trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer core itself is safe, but the resulting tracepoints can be utilized by e.g. BPF which is unsafe. Provide variants which do not contain the lockdep invocation so the lockdep and tracer invocations can be split at the call site and placed properly. This is required because lockdep needs to be aware of the state before switching away from RCU idle and after switching to RCU idle because these transitions can take locks. As these code pathes are going to be non-instrumentable the tracer can be invoked after RCU is turned on and before the switch to RCU idle. So for these new variants there is no need to invoke the rcuidle aware tracer functions. Name them so they match the lockdep counterparts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134100.270771162@linutronix.de
2020-05-19proc: proc_pid_ns takes super_block as an argumentAlexey Gladkov
syzbot found that touch /proc/testfile causes NULL pointer dereference at tomoyo_get_local_path() because inode of the dentry is NULL. Before c59f415a7cb6, Tomoyo received pid_ns from proc's s_fs_info directly. Since proc_pid_ns() can only work with inode, using it in the tomoyo_get_local_path() was wrong. To avoid creating more functions for getting proc_ns, change the argument type of the proc_pid_ns() function. Then, Tomoyo can use the existing super_block to get pid_ns. Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com Fixes: c59f415a7cb6 ("Use proc_pid_ns() to get pid_namespace from the proc superblock") Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-05-19ARM: 8976/1: module: allow arch overrides for .init section namesVincent Whitchurch
ARM stores unwind information for .init.text in sections named .ARM.extab.init.text and .ARM.exidx.init.text. Since those aren't currently recognized as init sections, they're allocated along with the core section, and relocation fails if the core and the init section are allocated from different regions and can't reach other. final section addresses: ... 0x7f800000 .init.text .. 0xcbb54078 .ARM.exidx.init.text .. section 16 reloc 0 sym '': relocation 42 out of range (0xcbb54078 -> 0x7f800000) Allow architectures to override the section name so that ARM can fix this. Acked-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2020-05-18kgdb: Add kgdb_has_hit_break functionVincent Chen
The break instruction in RISC-V does not have an immediate value field, so the kernel cannot identify the purpose of each trap exception through the opcode. This makes the existing identification schemes in other architecture unsuitable for the RISC-V kernel. To solve this problem, this patch adds kgdb_has_hit_break(), which can help RISC-V kernel identify the KGDB trap exception. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2020-05-18kgdboc: Add kgdboc_earlycon to support early kgdb using boot consolesDouglas Anderson
We want to enable kgdb to debug the early parts of the kernel. Unfortunately kgdb normally is a client of the tty API in the kernel and serial drivers don't register to the tty layer until fairly late in the boot process. Serial drivers do, however, commonly register a boot console. Let's enable the kgdboc driver to work with boot consoles to provide early debugging. This change co-opts the existing read() function pointer that's part of "struct console". It's assumed that if a boot console (with the flag CON_BOOT) has implemented read() that both the read() and write() function are polling functions. That means they work without interrupts and read() will return immediately (with 0 bytes read) if there's nothing to read. This should be a safe assumption since it appears that no current boot consoles implement read() right now and there seems no reason to do so unless they wanted to support "kgdboc_earlycon". The normal/expected way to make all this work is to use "kgdboc_earlycon" and "kgdboc" together. You should point them both to the same physical serial connection. At boot time, as the system transitions from the boot console to the normal console (and registers a tty), kgdb will switch over. One awkward part of all this, though, is that there can be a window where the boot console goes away and we can't quite transtion over to the main kgdboc that uses the tty layer. There are two main problems: 1. The act of registering the tty doesn't cause any call into kgdboc so there is a window of time when the tty is there but kgdboc's init code hasn't been called so we can't transition to it. 2. On some serial drivers the normal console inits (and replaces the boot console) quite early in the system. Presumably these drivers were coded up before earlycon worked as well as it does today and probably they don't need to do this anymore, but it causes us problems nontheless. Problem #1 is not too big of a deal somewhat due to the luck of probe ordering. kgdboc is last in the tty/serial/Makefile so its probe gets right after all other tty devices. It's not fun to rely on this, but it does work for the most part. Problem #2 is a big deal, but only for some serial drivers. Other serial drivers end up registering the console (which gets rid of the boot console) and tty at nearly the same time. The way we'll deal with the window when the system has stopped using the boot console and the time when we're setup using the tty is to keep using the boot console. This may sound surprising, but it has been found to work well in practice. If it doesn't work, it shouldn't be too hard for a given serial driver to make it keep working. Specifically, it's expected that the read()/write() function provided in the boot console should be the same (or nearly the same) as the normal kgdb polling functions. That means continuing to use them should work just fine. To make things even more likely to work work we'll also trap the recently added exit() function in the boot console we're using and delay any calls to it until we're all done with the boot console. NOTE: there could be ways to use all this in weird / unexpected ways. If you do something like this, it's a bit of a buyer beware situation. Specifically: - If you specify only "kgdboc_earlycon" but not "kgdboc" then (depending on your serial driver) things will probably work OK, but you'll get a warning printed the first time you use kgdb after the boot console is gone. You'd only be able to do this, of course, if the serial driver you're running atop provided an early boot console. - If your "kgdboc_earlycon" and "kgdboc" devices are not the same device things should work OK, but it'll be your job to switch over which device you're monitoring (including figuring out how to switch over gdb in-flight if you're using it). When trying to enable "kgdboc_earlycon" it should be noted that the names that are registered through the boot console layer and the tty layer are not the same for the same port. For example when debugging on one board I'd need to pass "kgdboc_earlycon=qcom_geni kgdboc=ttyMSM0" to enable things properly. Since digging up the boot console name is a pain and there will rarely be more than one boot console enabled, you can provide the "kgdboc_earlycon" parameter without specifying the name of the boot console. In this case we'll just pick the first boot that implements read() that we find. This new "kgdboc_earlycon" parameter should be contrasted to the existing "ekgdboc" parameter. While both provide a way to debug very early, the usage and mechanisms are quite different. Specifically "kgdboc_earlycon" is meant to be used in tandem with "kgdboc" and there is a transition from one to the other. The "ekgdboc" parameter, on the other hand, replaces the "kgdboc" parameter. It runs the same logic as the "kgdboc" parameter but just relies on your TTY driver being present super early. The only known usage of the old "ekgdboc" parameter is documented as "ekgdboc=kbd earlyprintk=vga". It should be noted that "kbd" has special treatment allowing it to init early as a tty device. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/20200507130644.v4.8.I8fba5961bf452ab92350654aa61957f23ecf0100@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18kgdb: Prevent infinite recursive entries to the debuggerDouglas Anderson
If we detect that we recursively entered the debugger we should hack our I/O ops to NULL so that the panic() in the next line won't actually cause another recursion into the debugger. The first line of kgdb_panic() will check this and return. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20200507130644.v4.6.I89de39f68736c9de610e6f241e68d8dbc44bc266@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18kgdb: Delay "kgdbwait" to dbg_late_init() by defaultDouglas Anderson
Using kgdb requires at least some level of architecture-level initialization. If nothing else, it relies on the architecture to pass breakpoints / crashes onto kgdb. On some architectures this all works super early, specifically it starts working at some point in time before Linux parses early_params's. On other architectures it doesn't. A survey of a few platforms: a) x86: Presumably it all works early since "ekgdboc" is documented to work here. b) arm64: Catching crashes works; with a simple patch breakpoints can also be made to work. c) arm: Nothing in kgdb works until paging_init() -> devicemaps_init() -> early_trap_init() Let's be conservative and, by default, process "kgdbwait" (which tells the kernel to drop into the debugger ASAP at boot) a bit later at dbg_late_init() time. If an architecture has tested it and wants to re-enable super early debugging, they can select the ARCH_HAS_EARLY_DEBUG KConfig option. We'll do this for x86 to start. It should be noted that dbg_late_init() is still called quite early in the system. Note that this patch doesn't affect when kgdb runs its init. If kgdb is set to initialize early it will still initialize when parsing early_param's. This patch _only_ inhibits the initial breakpoint from "kgdbwait". This means: * Without any extra patches arm64 platforms will at least catch crashes after kgdb inits. * arm platforms will catch crashes (and could handle a hardcoded kgdb_breakpoint()) any time after early_trap_init() runs, even before dbg_late_init(). Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20200507130644.v4.4.I3113aea1b08d8ce36dc3720209392ae8b815201b@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18scs: Remove references to asm/scs.h from core codeWill Deacon
asm/scs.h is no longer needed by the core code, so remove a redundant header inclusion and update the stale Kconfig text. Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18scs: Move scs_overflow_check() out of architecture codeWill Deacon
There is nothing architecture-specific about scs_overflow_check() as it's just a trivial wrapper around scs_corrupted(). For parity with task_stack_end_corrupted(), rename scs_corrupted() to task_scs_end_corrupted() and call it from schedule_debug() when CONFIG_SCHED_STACK_END_CHECK_is enabled, which better reflects its purpose as a debug feature to catch inadvertent overflow of the SCS. Finally, remove the unused scs_overflow_check() function entirely. This has absolutely no impact on architectures that do not support SCS (currently arm64 only). Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18scs: Move accounting into alloc/free functionsWill Deacon
There's no need to perform the shadow stack page accounting independently of the lifetime of the underlying allocation, so call the accounting code from the {alloc,free}() functions and simplify the code in the process. Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18arm64: scs: Store absolute SCS stack pointer value in thread_infoWill Deacon
Storing the SCS information in thread_info as a {base,offset} pair introduces an additional load instruction on the ret-to-user path, since the SCS stack pointer in x18 has to be converted back to an offset by subtracting the base. Replace the offset with the absolute SCS stack pointer value instead and avoid the redundant load. Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdbDouglas Anderson
In commit 81eaadcae81b ("kgdboc: disable the console lock when in kgdb") we avoided the WARN_CONSOLE_UNLOCKED() yell when we were in kgdboc. That still works fine, but it turns out that we get a similar yell when using other I/O drivers. One example is the "I/O driver" for the kgdb test suite (kgdbts). When I enabled that I again got the same yells. Even though "kgdbts" doesn't actually interact with the user over the console, using it still causes kgdb to print to the consoles. That trips the same warning: con_is_visible+0x60/0x68 con_scroll+0x110/0x1b8 lf+0x4c/0xc8 vt_console_print+0x1b8/0x348 vkdb_printf+0x320/0x89c kdb_printf+0x68/0x90 kdb_main_loop+0x190/0x860 kdb_stub+0x2cc/0x3ec kgdb_cpu_enter+0x268/0x744 kgdb_handle_exception+0x1a4/0x200 kgdb_compiled_brk_fn+0x34/0x44 brk_handler+0x7c/0xb8 do_debug_exception+0x1b4/0x228 Let's increment/decrement the "ignore_console_lock_warning" variable all the time when we enter the debugger. This will allow us to later revert commit 81eaadcae81b ("kgdboc: disable the console lock when in kgdb"). Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20200507130644.v4.1.Ied2b058357152ebcc8bf68edd6f20a11d98d7d4e@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>