summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2020-05-19perf/core: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor
2020-05-19Merge branch 'sched/urgent'Peter Zijlstra
2020-05-19bpf: Add get{peer, sock}name attach types for sock_addrDaniel Borkmann
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be transparent to applications. In Cilium we make use of these hooks [0] in order to enable E-W load balancing for existing Kubernetes service types for all Cilium managed nodes in the cluster. Those backends can be local or remote. The main advantage of this approach is that it operates as close as possible to the socket, and therefore allows to avoid packet-based NAT given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses. This also allows to expose NodePort services on loopback addresses in the host namespace, for example. As another advantage, this also efficiently blocks bind requests for applications in the host namespace for exposed ports. However, one missing item is that we also need to perform reverse xlation for inet{,6}_getname() hooks such that we can return the service IP/port tuple back to the application instead of the remote peer address. The vast majority of applications does not bother about getpeername(), but in a few occasions we've seen breakage when validating the peer's address since it returns unexpectedly the backend tuple instead of the service one. Therefore, this trivial patch allows to customise and adds a getpeername() as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address this situation. Simple example: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80 Before; curl's verbose output example, no getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] After; with getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed peer to the context similar as in inet{,6}_getname() fashion, but API-wise this is suboptimal as it always enforces programs having to test for ctx->peer which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split. Similarly, the checked return code is on tnum_range(1, 1), but if a use case comes up in future, it can easily be changed to return an error code instead. Helper and ctx member access is the same as with connect/sendmsg/etc hooks. [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19mm/memory-failure: Add memory_failure_queue_kick()James Morse
The GHES code calls memory_failure_queue() from IRQ context to schedule work on the current CPU so that memory_failure() can sleep. For synchronous memory errors the arch code needs to know any signals that memory_failure() will trigger are pending before it returns to user-space, possibly when exiting from the IRQ. Add a helper to kick the memory failure queue, to ensure the scheduled work has happened. This has to be called from process context, so may have been migrated from the original cpu. Pass the cpu the work was queued on. Change memory_failure_work_func() to permit being called on the 'wrong' cpu. Signed-off-by: James Morse <james.morse@arm.com> Tested-by: Tyler Baicar <baicar@os.amperecomputing.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19block: Document the bio_vec propertiesBart Van Assche
Since it is nontrivial that nth_page() does not have to be used for a bio_vec, document this. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19bio.h: Declare the arguments of the bio iteration functions constBart Van Assche
This change makes it possible to pass 'const struct bio *' arguments to these functions. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19smack: Implement the watch_key and post_notification hooksDavid Howells
Implement the watch_key security hook in Smack to make sure that a key grants the caller Read permission in order to set a watch on a key. Also implement the post_notification security hook to make sure that the notification source is granted Write permission by the watch queue. For the moment, the watch_devices security hook is left unimplemented as it's not obvious what the object should be since the queue is global and didn't previously exist. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com>
2020-05-19keys: Make the KEY_NEED_* perms an enum rather than a maskDavid Howells
Since the meaning of combining the KEY_NEED_* constants is undefined, make it so that you can't do that by turning them into an enum. The enum is also given some extra values to represent special circumstances, such as: (1) The '0' value is reserved and causes a warning to trap the parameter being unset. (2) The key is to be unlinked and we require no permissions on it, only the keyring, (this replaces the KEY_LOOKUP_FOR_UNLINK flag). (3) An override due to CAP_SYS_ADMIN. (4) An override due to an instantiation token being present. (5) The permissions check is being deferred to later key_permission() calls. The extra values give the opportunity for LSMs to audit these situations. [Note: This really needs overhauling so that lookup_user_key() tells key_task_permission() and the LSM what operation is being done and leaves it to those functions to decide how to map that onto the available permits. However, I don't really want to make these change in the middle of the notifications patchset.] Signed-off-by: David Howells <dhowells@redhat.com> cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> cc: Paul Moore <paul@paul-moore.com> cc: Stephen Smalley <stephen.smalley.work@gmail.com> cc: Casey Schaufler <casey@schaufler-ca.com> cc: keyrings@vger.kernel.org cc: selinux@vger.kernel.org
2020-05-19pipe: Add notification lossage handlingDavid Howells
Add handling for loss of notifications by having read() insert a loss-notification message after it has read the pipe buffer that was last in the ring when the loss occurred. Lossage can come about either by running out of notification descriptors or by running out of space in the pipe ring. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19pipe: Allow buffers to be marked read-whole-or-error for notificationsDavid Howells
Allow a buffer to be marked such that read() must return the entire buffer in one go or return ENOBUFS. Multiple buffers can be amalgamated into a single read, but a short read will occur if the next "whole" buffer won't fit. This is useful for watch queue notifications to make sure we don't split a notification across multiple reads, especially given that we need to fabricate an overrun record under some circumstances - and that isn't in the buffers. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19coresight: cti: Add CPU Hotplug handling to CTI driverMike Leach
Adds registration of CPU start and stop functions to CPU hotplug mechanisms - for any CPU bound CTI. Sets CTI powered flag according to state. Will enable CTI on CPU start if there are existing enable requests. Signed-off-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/20200518180242.7916-23-mathieu.poirier@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-19coresight: Fix support for sparsely populated portsSuzuki K Poulose
On some systems the firmware may not describe all the ports connected to a component (e.g, for security reasons). This could be especially problematic for "funnels" where we could end up in modifying memory beyond the allocated space for refcounts. e.g, for a funnel with input ports listed 0, 3, 5, nr_inport = 3. However the we could access refcnts[5] while checking for references, like : [ 526.110401] ================================================================== [ 526.117988] BUG: KASAN: slab-out-of-bounds in funnel_enable+0x54/0x1b0 [ 526.124706] Read of size 4 at addr ffffff8135f9549c by task bash/1114 [ 526.131324] [ 526.132886] CPU: 3 PID: 1114 Comm: bash Tainted: G S 5.4.25 #232 [ 526.140397] Hardware name: Qualcomm Technologies, Inc. SC7180 IDP (DT) [ 526.147113] Call trace: [ 526.149653] dump_backtrace+0x0/0x188 [ 526.153431] show_stack+0x20/0x2c [ 526.156852] dump_stack+0xdc/0x144 [ 526.160370] print_address_description+0x3c/0x494 [ 526.165211] __kasan_report+0x144/0x168 [ 526.169170] kasan_report+0x10/0x18 [ 526.172769] check_memory_region+0x1a4/0x1b4 [ 526.177164] __kasan_check_read+0x18/0x24 [ 526.181292] funnel_enable+0x54/0x1b0 [ 526.185072] coresight_enable_path+0x104/0x198 [ 526.189649] coresight_enable+0x118/0x26c ... [ 526.237782] Allocated by task 280: [ 526.241298] __kasan_kmalloc+0xf0/0x1ac [ 526.245249] kasan_kmalloc+0xc/0x14 [ 526.248849] __kmalloc+0x28c/0x3b4 [ 526.252361] coresight_register+0x88/0x250 [ 526.256587] funnel_probe+0x15c/0x228 [ 526.260365] dynamic_funnel_probe+0x20/0x2c [ 526.264679] amba_probe+0xbc/0x158 [ 526.268193] really_probe+0x144/0x408 [ 526.271970] driver_probe_device+0x70/0x140 ... [ 526.316810] [ 526.318364] Freed by task 0: [ 526.321344] (stack is not available) [ 526.325024] [ 526.326580] The buggy address belongs to the object at ffffff8135f95480 [ 526.326580] which belongs to the cache kmalloc-128 of size 128 [ 526.339439] The buggy address is located 28 bytes inside of [ 526.339439] 128-byte region [ffffff8135f95480, ffffff8135f95500) [ 526.351399] The buggy address belongs to the page: [ 526.356342] page:ffffffff04b7e500 refcount:1 mapcount:0 mapping:ffffff814b00c380 index:0x0 compound_mapcount: 0 [ 526.366711] flags: 0x4000000000010200(slab|head) [ 526.371475] raw: 4000000000010200 ffffffff05034008 ffffffff0501eb08 ffffff814b00c380 [ 526.379435] raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000 [ 526.387393] page dumped because: kasan: bad access detected [ 526.393128] [ 526.394681] Memory state around the buggy address: [ 526.399619] ffffff8135f95380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 526.407046] ffffff8135f95400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 526.414473] >ffffff8135f95480: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 526.421900] ^ [ 526.426029] ffffff8135f95500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 526.433456] ffffff8135f95580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 526.440883] ================================================================== To keep the code simple, we now track the maximum number of possible input/output connections to/from this component @ nr_inport and nr_outport in platform_data, respectively. Thus the output connections could be sparse and code is adjusted to skip the unspecified connections. Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Leach <mike.leach@linaro.org> Reported-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Tested-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/20200518180242.7916-13-mathieu.poirier@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-19coresight: Expose device connections via sysfsSuzuki K Poulose
Coresight device connections are a bit complicated and is not exposed currently to the user. One has to look at the platform descriptions (DT bindings or ACPI bindings) to make an understanding. Given the new naming scheme, it will be helpful to have this information to choose the appropriate devices for tracing. This patch exposes the device connections via links in the sysfs directories. e.g, for a connection devA[OutputPort_X] -> devB[InputPort_Y] is represented as two symlinks: /sys/bus/coresight/.../devA/out:X -> /sys/bus/coresight/.../devB /sys/bus/coresight/.../devB/in:Y -> /sys/bus/coresight/.../devA Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> [Revised to use the generic sysfs links functions & link structures. Provides a connections sysfs group in each device to hold the links.] Co-developed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/20200518180242.7916-5-mathieu.poirier@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-19coresight: Add generic sysfs link creation functionsMike Leach
To allow the connections between coresight components to be represented in sysfs, generic methods for creating sysfs links between two coresight devices are added. Signed-off-by: Mike Leach <mike.leach@linaro.org> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/20200518180242.7916-4-mathieu.poirier@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-19watch_queue: Add a key/keyring notification facilityDavid Howells
Add a key/keyring change notification facility whereby notifications about changes in key and keyring content and attributes can be received. Firstly, an event queue needs to be created: pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256); then a notification can be set up to report notifications via that queue: struct watch_notification_filter filter = { .nr_filters = 1, .filters = { [0] = { .type = WATCH_TYPE_KEY_NOTIFY, .subtype_filter[0] = UINT_MAX, }, }, }; ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); After that, records will be placed into the queue when events occur in which keys are changed in some way. Records are of the following format: struct key_notification { struct watch_notification watch; __u32 key_id; __u32 aux; } *n; Where: n->watch.type will be WATCH_TYPE_KEY_NOTIFY. n->watch.subtype will indicate the type of event, such as NOTIFY_KEY_REVOKED. n->watch.info & WATCH_INFO_LENGTH will indicate the length of the record. n->watch.info & WATCH_INFO_ID will be the second argument to keyctl_watch_key(), shifted. n->key will be the ID of the affected key. n->aux will hold subtype-dependent information, such as the key being linked into the keyring specified by n->key in the case of NOTIFY_KEY_LINKED. Note that it is permissible for event records to be of variable length - or, at least, the length may be dependent on the subtype. Note also that the queue can be shared between multiple notifications of various types. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com>
2020-05-19security: Add hooks to rule on setting a watchDavid Howells
Add security hooks that will allow an LSM to rule on whether or not a watch may be set. More than one hook is required as the watches watch different types of object. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jamorris@linux.microsoft.com> cc: Casey Schaufler <casey@schaufler-ca.com> cc: Stephen Smalley <sds@tycho.nsa.gov> cc: linux-security-module@vger.kernel.org
2020-05-19pipe: Add general notification queue supportDavid Howells
Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. splice(), vmsplice() and sendfile() are forbidden on pipes used for notifications as post_one_notification() cannot take pipe->mutex. This means that notifications could be posted in between individual pipe buffers, making iov_iter_revert() difficult to effect. The way the notification queue is used is: (1) An application opens a pipe with a special flag and indicates the number of messages it wishes to be able to queue at once (this can only be set once): pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); (2) The application then uses poll() and read() as normal to extract data from the pipe. read() will return multiple notifications if the buffer is big enough, but it will not split a notification across buffers - rather it will return a short read or EMSGSIZE. Notification messages include a length in the header so that the caller can split them up. Each message has a header that describes it: struct watch_notification { __u32 type:24; __u32 subtype:8; __u32 info; }; The type indicates the source (eg. mount tree changes, superblock events, keyring changes, block layer events) and the subtype indicates the event type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field indicates a number of things, including the entry length, an ID assigned to a watchpoint contributing to this buffer and type-specific flags. Supplementary data, such as the key ID that generated an event, can be attached in additional slots. The maximum message size is 127 bytes. Messages may not be padded or aligned, so there is no guarantee, for example, that the notification type will be on a 4-byte bounary. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19security: Add a hook for the point of notification insertionDavid Howells
Add a security hook that allows an LSM to rule on whether a notification message is allowed to be inserted into a particular watch queue. The hook is given the following information: (1) The credentials of the triggerer (which may be init_cred for a system notification, eg. a hardware error). (2) The credentials of the whoever set the watch. (3) The notification message. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jamorris@linux.microsoft.com> cc: Casey Schaufler <casey@schaufler-ca.com> cc: Stephen Smalley <sds@tycho.nsa.gov> cc: linux-security-module@vger.kernel.org
2020-05-19kprobes: Prevent probes in .noinstr.text sectionThomas Gleixner
Instrumentation is forbidden in the .noinstr.text section. Make kprobes respect this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20200505134100.179862032@linutronix.de
2020-05-19Merge tag 'noinstr-lds-2020-05-19' into core/kprobesThomas Gleixner
Get the noinstr section and markers to base the kprobe changes on.
2020-05-19rcu: Provide __rcu_is_watching()Thomas Gleixner
Same as rcu_is_watching() but without the preempt_disable/enable() pair inside the function. It is merked noinstr so it ends up in the non-instrumentable text section. This is useful for non-preemptible code especially in the low level entry section. Using rcu_is_watching() there results in a call to the preempt_schedule_notrace() thunk which triggers noinstr section warnings in objtool. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200512213810.518709291@linutronix.de
2020-05-19rcu: Provide rcu_irq_exit_preempt()Thomas Gleixner
Interrupts and exceptions invoke rcu_irq_enter() on entry and need to invoke rcu_irq_exit() before they either return to the interrupted code or invoke the scheduler due to preemption. The general assumption is that RCU idle code has to have preemption disabled so that a return from interrupt cannot schedule. So the return from interrupt code invokes rcu_irq_exit() and preempt_schedule_irq(). If there is any imbalance in the rcu_irq/nmi* invocations or RCU idle code had preemption enabled then this goes unnoticed until the CPU goes idle or some other RCU check is executed. Provide rcu_irq_exit_preempt() which can be invoked from the interrupt/exception return code in case that preemption is enabled. It invokes rcu_irq_exit() and contains a few sanity checks in case that CONFIG_PROVE_RCU is enabled to catch such issues directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134904.364456424@linutronix.de
2020-05-19x86/mce: Send #MC singal from task workPeter Zijlstra
Convert #MC over to using task_work_add(); it will run the same code slightly later, on the return to user path of the same exception. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.957390899@linutronix.de
2020-05-19sched,rcu,tracing: Avoid tracing before in_nmi() is correctPeter Zijlstra
If a tracer is invoked before in_nmi() becomes true, the tracer can no longer detect it is called from NMI context and behave correctly. Therefore change nmi_{enter,exit}() to use __preempt_count_{add,sub}() as the normal preempt_count_{add,sub}() have a (desired) function trace entry. This fixes a potential issue with the current code; when the function-tracer has stack-tracing enabled __trace_stack() will malfunction when it hits the preempt_count_add() function entry from NMI context. Suggested-by: Steven Rostedt (VMware) <rosted@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.434193525@linutronix.de
2020-05-19sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exceptionPeter Zijlstra
SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(), remove it from the generic code and into the SuperH code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20200505134101.248881738@linutronix.de
2020-05-19lockdep: Always inline lockdep_{off,on}()Peter Zijlstra
These functions are called {early,late} in nmi_{enter,exit} and should not be traced or probed. They are also puny, so 'inline' them. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.048523500@linutronix.de
2020-05-19hardirq/nmi: Allow nested nmi_enter()Peter Zijlstra
Since there are already a number of sites (ARM64, PowerPC) that effectively nest nmi_enter(), make the primitive support this before adding even more. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200505134100.864179229@linutronix.de
2020-05-19Merge tag 'noinstr-lds-2020-05-19' into core/rcuThomas Gleixner
Get the noinstr section and annotation markers to base the RCU parts on.
2020-05-19context_tracking: Make guest_enter/exit() .noinstr readyThomas Gleixner
Force inlining of the helpers and mark the instrumentable parts accordingly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134341.672545766@linutronix.de
2020-05-19lockdep: Prepare for noinstr sectionsPeter Zijlstra
Force inlining and prevent instrumentation of all sorts by marking the functions which are invoked from low level entry code with 'noinstr'. Split the irqflags tracking into two parts. One which does the heavy lifting while RCU is watching and the final one which can be invoked after RCU is turned off. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.484532537@linutronix.de
2020-05-19tracing: Provide lockdep less trace_hardirqs_on/off() variantsThomas Gleixner
trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer core itself is safe, but the resulting tracepoints can be utilized by e.g. BPF which is unsafe. Provide variants which do not contain the lockdep invocation so the lockdep and tracer invocations can be split at the call site and placed properly. This is required because lockdep needs to be aware of the state before switching away from RCU idle and after switching to RCU idle because these transitions can take locks. As these code pathes are going to be non-instrumentable the tracer can be invoked after RCU is turned on and before the switch to RCU idle. So for these new variants there is no need to invoke the rcuidle aware tracer functions. Name them so they match the lockdep counterparts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134100.270771162@linutronix.de
2020-05-19vmlinux.lds.h: Create section for protection against instrumentationThomas Gleixner
Some code pathes, especially the low level entry code, must be protected against instrumentation for various reasons: - Low level entry code can be a fragile beast, especially on x86. - With NO_HZ_FULL RCU state needs to be established before using it. Having a dedicated section for such code allows to validate with tooling that no unsafe functions are invoked. Add the .noinstr.text section and the noinstr attribute to mark functions. noinstr implies notrace. Kprobes will gain a section check later. Provide also a set of markers: instrumentation_begin()/end() These are used to mark code inside a noinstr function which calls into regular instrumentable text section as safe. The instrumentation markers are only active when CONFIG_DEBUG_ENTRY is enabled as the end marker emits a NOP to prevent the compiler from merging the annotation points. This means the objtool verification requires a kernel compiled with this option. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134100.075416272@linutronix.de
2020-05-19proc: proc_pid_ns takes super_block as an argumentAlexey Gladkov
syzbot found that touch /proc/testfile causes NULL pointer dereference at tomoyo_get_local_path() because inode of the dentry is NULL. Before c59f415a7cb6, Tomoyo received pid_ns from proc's s_fs_info directly. Since proc_pid_ns() can only work with inode, using it in the tomoyo_get_local_path() was wrong. To avoid creating more functions for getting proc_ns, change the argument type of the proc_pid_ns() function. Then, Tomoyo can use the existing super_block to get pid_ns. Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com Fixes: c59f415a7cb6 ("Use proc_pid_ns() to get pid_namespace from the proc superblock") Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-05-19thermal/drivers/cpuidle_cooling: Change the registration functionDaniel Lezcano
Today, there is no user for the cpuidle cooling device. The targetted platform is ARM and ARM64. The cpuidle and the cpufreq cooling device are based on the device tree. As the cpuidle cooling device can have its own configuration depending on the platform and the available idle states. The DT node description will give the optional properties to set the cooling device up. Do no longer rely on the CPU node which is prone to error and will lead to a confusion in the DT because the cpufreq cooling device is also using it. Let initialize the cpuidle cooling device with the DT binding. This was tested on: - hikey960 - hikey6220 - rock960 - db845c Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Tested-by: Amit Kucheria <amit.kucheria@linaro.org> Link: https://lore.kernel.org/r/20200429103644.5492-3-daniel.lezcano@linaro.org
2020-05-19powercap/drivers/idle_inject: Specify idle state max latencyDaniel Lezcano
Currently the idle injection framework uses the play_idle() function which puts the current CPU in an idle state. The idle state is the deepest one, as specified by the latency constraint when calling the subsequent play_idle_precise() function with the INT_MAX. The idle_injection is used by the cpuidle_cooling device which computes the idle / run duration to mitigate the temperature by injecting idle cycles. The cooling device has no control on the depth of the idle state. Allow finer control of the idle injection mechanism by allowing to specify the latency for the idle state. Thus the cooling device has the ability to have a guarantee on the exit latency of the idle states it is injecting. Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Link: https://lore.kernel.org/r/20200429103644.5492-1-daniel.lezcano@linaro.org
2020-05-19ARM: 8976/1: module: allow arch overrides for .init section namesVincent Whitchurch
ARM stores unwind information for .init.text in sections named .ARM.extab.init.text and .ARM.exidx.init.text. Since those aren't currently recognized as init sections, they're allocated along with the core section, and relocation fails if the core and the init section are allocated from different regions and can't reach other. final section addresses: ... 0x7f800000 .init.text .. 0xcbb54078 .ARM.exidx.init.text .. section 16 reloc 0 sym '': relocation 42 out of range (0xcbb54078 -> 0x7f800000) Allow architectures to override the section name so that ARM can fix this. Acked-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2020-05-19soundwire: bus_type: add sdw_master_device supportPierre-Louis Bossart
In the existing SoundWire code, Master Devices are not explicitly represented - only SoundWire Slave Devices are exposed (the use of capital letters follows the SoundWire specification conventions). With the existing code, the bus is handled without using a proper device, and bus->dev typically points to a platform device. The right thing to do as discussed in multiple reviews is use a device for each bus. The sdw_master_device addition is done with minimal internal plumbing and not exposed externally. The existing API based on sdw_bus_master_add() and sdw_bus_master_delete() will deal with the sdw_master_device life cycle, which minimizes changes to existing drivers. Note that the Intel code will be modified in follow-up patches (no impact on any platform since the connection with ASoC is not supported upstream so far). Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Acked-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20200518174322.31561-5-yung-chuan.liao@linux.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2020-05-19soundwire: bus: add unique bus idBard Liao
Adding an unique id for each bus. Suggested-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Acked-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20200518174322.31561-4-yung-chuan.liao@linux.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2020-05-19soundwire: bus_type: introduce sdw_slave_type and sdw_master_typePierre-Louis Bossart
this is a preparatory patch before the introduction of the sdw_master_type. The SoundWire slave support is slightly modified with the use of a sdw_slave_type, and the uevent handling move to slave.c (since it's not necessary for the master). No functionality change other than moving code around. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Acked-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20200518174322.31561-3-yung-chuan.liao@linux.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2020-05-19soundwire: bus: rename sdw_bus_master_add/delete, add argumentsPierre-Louis Bossart
In preparation for future extensions, rename functions to use sdw_bus_master prefix and add a parent and fwnode argument to sdw_bus_master_add to help with device registration in follow-up patches. No functionality change, just renames and additional arguments. The Intel code is currently unused, the two additional arguments are only needed for compilation. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com> Acked-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20200518174322.31561-2-yung-chuan.liao@linux.intel.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2020-05-18fscrypt: support test_dummy_encryption=v2Eric Biggers
v1 encryption policies are deprecated in favor of v2, and some new features (e.g. encryption+casefolding) are only being added for v2. Therefore, the "test_dummy_encryption" mount option (which is used for encryption I/O testing with xfstests) needs to support v2 policies. To do this, extend its syntax to be "test_dummy_encryption=v1" or "test_dummy_encryption=v2". The existing "test_dummy_encryption" (no argument) also continues to be accepted, to specify the default setting -- currently v1, but the next patch changes it to v2. To cleanly support both v1 and v2 while also making it easy to support specifying other encryption settings in the future (say, accepting "$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a pointer to the dummy fscrypt_context rather than using mount flags. To avoid concurrency issues, don't allow test_dummy_encryption to be set or changed during a remount. (The former restriction is new, but xfstests doesn't run into it, so no one should notice.) Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4, there are two regressions, both of which are test bugs: ext4/023 and ext4/028 fail because they set an xattr and expect it to be stored inline, but the increase in size of the fscrypt_context from 24 to 40 bytes causes this xattr to be spilled into an external block. Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-05-18net: phy: simplify phy_link_change argumentsDoug Berger
This function was introduced to allow for different handling of link up and link down events particularly with regard to the netif_carrier. The third argument do_carrier allowed the flag to be left unchanged. Since then the phylink has introduced an implementation that completely ignores the third parameter since it never wants to change the flag and the phylib always sets the third parameter to true so the flag is always changed. Therefore the third argument (i.e. do_carrier) is no longer necessary and can be removed. This also means that the phylib phy_link_down() function no longer needs its second argument. Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18kgdboc: Add kgdboc_earlycon to support early kgdb using boot consolesDouglas Anderson
We want to enable kgdb to debug the early parts of the kernel. Unfortunately kgdb normally is a client of the tty API in the kernel and serial drivers don't register to the tty layer until fairly late in the boot process. Serial drivers do, however, commonly register a boot console. Let's enable the kgdboc driver to work with boot consoles to provide early debugging. This change co-opts the existing read() function pointer that's part of "struct console". It's assumed that if a boot console (with the flag CON_BOOT) has implemented read() that both the read() and write() function are polling functions. That means they work without interrupts and read() will return immediately (with 0 bytes read) if there's nothing to read. This should be a safe assumption since it appears that no current boot consoles implement read() right now and there seems no reason to do so unless they wanted to support "kgdboc_earlycon". The normal/expected way to make all this work is to use "kgdboc_earlycon" and "kgdboc" together. You should point them both to the same physical serial connection. At boot time, as the system transitions from the boot console to the normal console (and registers a tty), kgdb will switch over. One awkward part of all this, though, is that there can be a window where the boot console goes away and we can't quite transtion over to the main kgdboc that uses the tty layer. There are two main problems: 1. The act of registering the tty doesn't cause any call into kgdboc so there is a window of time when the tty is there but kgdboc's init code hasn't been called so we can't transition to it. 2. On some serial drivers the normal console inits (and replaces the boot console) quite early in the system. Presumably these drivers were coded up before earlycon worked as well as it does today and probably they don't need to do this anymore, but it causes us problems nontheless. Problem #1 is not too big of a deal somewhat due to the luck of probe ordering. kgdboc is last in the tty/serial/Makefile so its probe gets right after all other tty devices. It's not fun to rely on this, but it does work for the most part. Problem #2 is a big deal, but only for some serial drivers. Other serial drivers end up registering the console (which gets rid of the boot console) and tty at nearly the same time. The way we'll deal with the window when the system has stopped using the boot console and the time when we're setup using the tty is to keep using the boot console. This may sound surprising, but it has been found to work well in practice. If it doesn't work, it shouldn't be too hard for a given serial driver to make it keep working. Specifically, it's expected that the read()/write() function provided in the boot console should be the same (or nearly the same) as the normal kgdb polling functions. That means continuing to use them should work just fine. To make things even more likely to work work we'll also trap the recently added exit() function in the boot console we're using and delay any calls to it until we're all done with the boot console. NOTE: there could be ways to use all this in weird / unexpected ways. If you do something like this, it's a bit of a buyer beware situation. Specifically: - If you specify only "kgdboc_earlycon" but not "kgdboc" then (depending on your serial driver) things will probably work OK, but you'll get a warning printed the first time you use kgdb after the boot console is gone. You'd only be able to do this, of course, if the serial driver you're running atop provided an early boot console. - If your "kgdboc_earlycon" and "kgdboc" devices are not the same device things should work OK, but it'll be your job to switch over which device you're monitoring (including figuring out how to switch over gdb in-flight if you're using it). When trying to enable "kgdboc_earlycon" it should be noted that the names that are registered through the boot console layer and the tty layer are not the same for the same port. For example when debugging on one board I'd need to pass "kgdboc_earlycon=qcom_geni kgdboc=ttyMSM0" to enable things properly. Since digging up the boot console name is a pain and there will rarely be more than one boot console enabled, you can provide the "kgdboc_earlycon" parameter without specifying the name of the boot console. In this case we'll just pick the first boot that implements read() that we find. This new "kgdboc_earlycon" parameter should be contrasted to the existing "ekgdboc" parameter. While both provide a way to debug very early, the usage and mechanisms are quite different. Specifically "kgdboc_earlycon" is meant to be used in tandem with "kgdboc" and there is a transition from one to the other. The "ekgdboc" parameter, on the other hand, replaces the "kgdboc" parameter. It runs the same logic as the "kgdboc" parameter but just relies on your TTY driver being present super early. The only known usage of the old "ekgdboc" parameter is documented as "ekgdboc=kbd earlyprintk=vga". It should be noted that "kbd" has special treatment allowing it to init early as a tty device. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/20200507130644.v4.8.I8fba5961bf452ab92350654aa61957f23ecf0100@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18scs: Move DEFINE_SCS macro into core codeWill Deacon
Defining static shadow call stacks is not architecture-specific, so move the DEFINE_SCS() macro into the core header file. Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18scs: Move scs_overflow_check() out of architecture codeWill Deacon
There is nothing architecture-specific about scs_overflow_check() as it's just a trivial wrapper around scs_corrupted(). For parity with task_stack_end_corrupted(), rename scs_corrupted() to task_scs_end_corrupted() and call it from schedule_debug() when CONFIG_SCHED_STACK_END_CHECK_is enabled, which better reflects its purpose as a debug feature to catch inadvertent overflow of the SCS. Finally, remove the unused scs_overflow_check() function entirely. This has absolutely no impact on architectures that do not support SCS (currently arm64 only). Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18arm64: scs: Store absolute SCS stack pointer value in thread_infoWill Deacon
Storing the SCS information in thread_info as a {base,offset} pair introduces an additional load instruction on the ret-to-user path, since the SCS stack pointer in x18 has to be converted back to an offset by subtracting the base. Replace the offset with the absolute SCS stack pointer value instead and avoid the redundant load. Tested-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18net/mlx5: Move iseg access helper routines close to mlx5_core driverParav Pandit
Only mlx5_core driver handles fw initialization check and command interface revision check. Hence move them inside the mlx5_core driver where it is used. This avoid exposing these helpers to all mlx5 drivers. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-18net/mlx5: Cleanup mlx5_ifc_fte_match_set_misc2_bitsRaed Salem
Remove the "metadata_reg_b" field and all uses of this field in code to match the device specification. As this field is not in use in SW steering it is safe to remove it. Signed-off-by: Raed Salem <raeds@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-18SUNRPC: Rename svc_sock::sk_reclenChuck Lever
Clean up. I find the name of the svc_sock::sk_reclen field confusing, so I've changed it to better reflect its function. This field is not read directly to get the record length. Rather, it is a buffer containing a record marker that needs to be decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18svcrdma: Remove the SVCRDMA_DEBUG macroChuck Lever
Clean up: Commit d21b05f101ae ("rdma: SVCRMDA Header File") introduced the SVCRDMA_DEBUG macro, but it doesn't seem to have been used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>