path: root/kernel
2018-05-24  bpf: introduce bpf subcommand BPF_TASK_FD_QUERY  (Yonghong Song)
Currently, suppose a userspace application has loaded a bpf program and attached it to a tracepoint/kprobe/uprobe, and a bpf introspection tool, e.g., bpftool, wants to show which bpf program is attached to which tracepoint/kprobe/uprobe. Such attachment information is really useful for understanding the overall bpf deployment in the system.

There is a name field (16 bytes) for each program, which could be used to encode the attachment point. This approach has some drawbacks. First, a bpftool user (e.g., an admin) may not really understand the association between the name and the attachment point. Second, if one program is attached to multiple places, encoding a proper name which can imply all these attachments becomes difficult.

This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY. Given a pid and fd, if the <pid, fd> is associated with a tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return to userspace:
  . prog_id
  . tracepoint name, or
  . k[ret]probe funcname + offset or kernel addr, or
  . u[ret]probe filename + offset
The user can use "bpftool prog" to find more information about the bpf program itself via prog_id.

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
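For illustration, a minimal userspace sketch of driving the new subcommand via the bpf(2) syscall; the task_fd_query field names follow the description above and should be double-checked against the uapi header added by this patch.

  /* Sketch: ask the kernel which BPF program sits behind a perf
   * event fd owned by <pid, fd>. Field names are assumptions from
   * the description; verify against include/uapi/linux/bpf.h.
   */
  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int task_fd_query(pid_t pid, int fd, char *buf, __u32 buf_len,
                           __u32 *prog_id)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.task_fd_query.pid = pid;
          attr.task_fd_query.fd = fd;
          attr.task_fd_query.buf = (__u64)(unsigned long)buf;
          attr.task_fd_query.buf_len = buf_len;

          if (syscall(__NR_bpf, BPF_TASK_FD_QUERY, &attr, sizeof(attr)) < 0)
                  return -1;

          *prog_id = attr.task_fd_query.prog_id;
          /* buf now holds the tracepoint name, k[ret]probe function
           * name or u[ret]probe file name, per the list above */
          return 0;
  }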
2018-05-24  perf/core: add perf_get_event() to return perf_event given a struct file  (Yonghong Song)
A new extern function, perf_get_event(), is added to return a perf event given a struct file. This function will be used in later patches. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24  bpf: properly enforce index mask to prevent out-of-bounds speculation  (Daniel Borkmann)
While reviewing the verifier code, I recently noticed that the following two program variants in relation to tail calls can be loaded.

Variant 1:

  # bpftool p d x i 15
    0: (15) if r1 == 0x0 goto pc+3
    1: (18) r2 = map[id:5]
    3: (05) goto pc+2
    4: (18) r2 = map[id:6]
    6: (b7) r3 = 7
    7: (35) if r3 >= 0xa0 goto pc+2
    8: (54) (u32) r3 &= (u32) 255
    9: (85) call bpf_tail_call#12
   10: (b7) r0 = 1
   11: (95) exit

  # bpftool m s i 5
  5: prog_array  flags 0x0
     key 4B  value 4B  max_entries 4  memlock 4096B
  # bpftool m s i 6
  6: prog_array  flags 0x0
     key 4B  value 4B  max_entries 160  memlock 4096B

Variant 2:

  # bpftool p d x i 20
    0: (15) if r1 == 0x0 goto pc+3
    1: (18) r2 = map[id:8]
    3: (05) goto pc+2
    4: (18) r2 = map[id:7]
    6: (b7) r3 = 7
    7: (35) if r3 >= 0x4 goto pc+2
    8: (54) (u32) r3 &= (u32) 3
    9: (85) call bpf_tail_call#12
   10: (b7) r0 = 1
   11: (95) exit

  # bpftool m s i 8
  8: prog_array  flags 0x0
     key 4B  value 4B  max_entries 160  memlock 4096B
  # bpftool m s i 7
  7: prog_array  flags 0x0
     key 4B  value 4B  max_entries 4  memlock 4096B

In both cases the index masking inserted by the verifier in order to control out-of-bounds speculation from a CPU via b2157399cc98 ("bpf: prevent out-of-bounds speculation") is incorrect in what it is enforcing. In the 1st variant, the mask is applied from the map with the significantly larger number of entries, so we would still allow some degree of out-of-bounds speculation for the smaller map; in the 2nd variant, where the mask is applied from the map with the smaller number of entries, we get buggy behavior since we truncate the index of the larger map.

The original intent of commit b2157399cc98 is to reject such occasions where two or more different tail call maps are used in the same tail call helper invocation. However, the check on BPF_MAP_PTR_POISON is never hit since we never poisoned the saved pointer in the first place! We do this explicitly for map lookups, but in the case of tail calls we basically used the tail call map in insn_aux_data that was processed in the most recent path which the verifier walked. Thus any prior path that stored a pointer in insn_aux_data at the helper location was always overridden.

Fix it by moving the map pointer poison logic into a small helper that covers both BPF helpers with the same logic. After that, the poison check in fixup_bpf_calls() is hit for tail calls and the program rejected. The latter only happens in the unprivileged case, since this is the *only* occasion where a rewrite needs to happen, and where such a rewrite is specific to the map (max_entries, index_mask). In the privileged case the rewrite is generic for the insn->imm / insn->code update, so multiple maps from different paths can be handled just fine since all the remaining logic happens in the instruction processing itself. This is similar to the case of map lookups: in case there is a collision of maps in fixup_bpf_calls(), we must skip the inlined rewrite since this would turn the generic instruction sequence into a non-generic one. Thus patch_call_imm will simply update the insn->imm location, where bpf_map_lookup_elem() will later take care of the dispatch.

Given we need this 'poison' state as a check, the information of whether a map is an unpriv_array gets lost, so enforcing it prior to that needs an additional state. In general this check is needed since there are some complex and tail call intensive BPF programs out there where LLVM tends to generate such code occasionally. We therefore convert the map_ptr into a map_state to store all this without extra memory overhead, and the bit whether one of the maps involved in the collision was from an unpriv_array thus needs to be retained there as well.

Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
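For context, the masking being discussed is the run-time clamping bpf_tail_call() applies before the array access; roughly, in simplified C (names follow kernel/bpf/arraymap.c loosely, so treat this as a sketch of b2157399cc98's mechanism rather than verbatim kernel code):

  /* Simplified sketch of the tail call index clamping. */
  if (index >= array->map.max_entries)
          goto out;                   /* architectural bounds check */
  index &= array->index_mask;         /* clamp speculative out-of-bounds
                                       * loads to the array's power-of-2
                                       * size; this mask must match the
                                       * map actually used at runtime */
  prog = READ_ONCE(array->ptrs[index]);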
2018-05-24  ipv6: sr: Add seg6local action End.BPF  (Mathieu Xhonneux)
This patch adds the End.BPF action to the LWT seg6local infrastructure. This action works like any other seg6local End action, meaning that an IPv6 header with an SRH is needed, whose DA has to be equal to the SID of the action. It will also advance the SRH to the next segment; the BPF program does not have to take care of this.

Since the BPF program must not be a source of instability in the kernel, it is important to ensure that the integrity of the packet is maintained before yielding it back to the IPv6 layer. The hook hence keeps track of whether the SRH has been altered through the helpers, and re-validates its content if needed with seg6_validate_srh. The state kept for validation is stored in a per-CPU buffer. The BPF program is not allowed to directly write into the packet, and only some fields of the SRH can be altered through the helper bpf_lwt_seg6_store_bytes. Performance profiling has shown that the SRH re-validation does not induce a significant overhead. If the altered SRH is deemed invalid, the packet is dropped. This validation is also done before executing any action through bpf_lwt_seg6_action, and will not be performed again if the SRH is not modified after calling the action.

The BPF program may return 3 types of return codes:
- BPF_OK: the End.BPF action will look up the next destination through seg6_lookup_nexthop.
- BPF_REDIRECT: if an action has been executed through the bpf_lwt_seg6_action helper, the BPF program should return this value, as the skb's destination is already set and the default lookup should not be performed.
- BPF_DROP: the packet will be dropped.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
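A minimal sketch of what an End.BPF program could look like; the section name and header location are assumptions based on the seg6local BPF plumbing described above, to be verified against the samples shipped with the series.

  /* Sketch of a trivial End.BPF program: do nothing special and let
   * the kernel perform the regular next-segment lookup.
   */
  #include <linux/bpf.h>
  #include "bpf_helpers.h"    /* SEC() macro, from the kernel selftests */

  SEC("lwt_seg6local")
  int srv6_end_bpf(struct __sk_buff *skb)
  {
          /* Reads/writes to the SRH must go through the dedicated
           * helpers (e.g. bpf_lwt_seg6_store_bytes); direct packet
           * writes are not allowed for this program type.
           */
          return BPF_OK;      /* continue with seg6_lookup_nexthop() */
  }

  char _license[] SEC("license") = "GPL";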
2018-05-24  cpufreq: schedutil: Avoid missing updates for one-CPU policies  (Rafael J. Wysocki)
Commit 152db033d775 (schedutil: Allow cpufreq requests to be made even when kthread kicked) made changes to prevent utilization updates from being discarded while a previous request is being processed, but it left a small window in which that can still happen in the one-CPU policy case. Namely, updates coming in after setting work_in_progress in sugov_update_commit() and clearing it in sugov_work() will still be dropped due to the work_in_progress check in sugov_update_single().

To close that window, rearrange the code so as to acquire the update lock around the deferred update branch in sugov_update_single() and drop the work_in_progress check from it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-05-24  bpf: get JITed image lengths of functions via syscall  (Sandipan Das)
This adds two new fields to struct bpf_prog_info. For multi-function programs, these fields can be used to pass a list of the JITed image lengths of each function for a given program to userspace using the bpf system call with the BPF_OBJ_GET_INFO_BY_FD command. This can be used by userspace applications like bpftool to split up the contiguous JITed dump, also obtained via the system call, into more relatable chunks corresponding to each function.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
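The usual pattern for variable-length bpf_prog_info fields is a two-pass query: ask for the counts first, allocate, then query again with a buffer attached. A hedged C sketch (field names per this patch's description; confirm the exact names in linux/bpf.h):

  /* Sketch: fetch per-function JITed image lengths for a prog fd. */
  #include <linux/bpf.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int prog_info(int fd, struct bpf_prog_info *info, __u32 len)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.info.bpf_fd = fd;
          attr.info.info = (__u64)(unsigned long)info;
          attr.info.info_len = len;
          return syscall(__NR_bpf, BPF_OBJ_GET_INFO_BY_FD, &attr,
                         sizeof(attr));
  }

  static __u32 *get_func_lens(int prog_fd, __u32 *nr)
  {
          struct bpf_prog_info info = {};
          __u32 *lens;

          if (prog_info(prog_fd, &info, sizeof(info)))  /* pass 1: counts */
                  return NULL;
          *nr = info.nr_jited_func_lens;
          lens = calloc(*nr, sizeof(*lens));
          if (!lens)
                  return NULL;

          memset(&info, 0, sizeof(info));
          info.nr_jited_func_lens = *nr;
          info.jited_func_lens = (__u64)(unsigned long)lens;
          if (prog_info(prog_fd, &info, sizeof(info))) { /* pass 2: data */
                  free(lens);
                  return NULL;
          }
          return lens;    /* one JITed image length per function */
  }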
2018-05-24  bpf: fix multi-function JITed dump obtained via syscall  (Sandipan Das)
Currently, for multi-function programs, we cannot get the JITed instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD command. Because of this, userspace tools such as bpftool fail to identify a multi-function program as being JITed or not.

With the JIT enabled and the test program running, this can be verified as follows:

  # cat /proc/sys/net/core/bpf_jit_enable
  1

Before applying this patch:

  # bpftool prog list
  1: kprobe  name foo  tag b811aab41a39ad3d  gpl
          loaded_at 2018-05-16T11:43:38+0530  uid 0
          xlated 216B  not jited  memlock 65536B
  ...

  # bpftool prog dump jited id 1
  no instructions returned

After applying this patch:

  # bpftool prog list
  1: kprobe  name foo  tag b811aab41a39ad3d  gpl
          loaded_at 2018-05-16T12:13:01+0530  uid 0
          xlated 216B  jited 308B  memlock 65536B
  ...

  # bpftool prog dump jited id 1
     0:   nop
     4:   nop
     8:   mflr    r0
     c:   std     r0,16(r1)
    10:   stdu    r1,-112(r1)
    14:   std     r31,104(r1)
    18:   addi    r31,r1,48
    1c:   li      r3,10
  ...

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24  bpf: get kernel symbol addresses via syscall  (Sandipan Das)
This adds two new fields to struct bpf_prog_info. For multi-function programs, these fields can be used to pass a list of kernel symbol addresses for all functions in a given program to userspace using the bpf system call with the BPF_OBJ_GET_INFO_BY_FD command.

When bpf_jit_kallsyms is enabled, we can get the address of the corresponding kernel symbol for a callee function and resolve the symbol's name. The address is determined by adding the value of the call instruction's imm field to __bpf_call_base. This offset gets assigned to the imm field by the verifier.

For some architectures, such as powerpc64, the imm field is not large enough to hold this offset. We resolve this by:

[1] Assigning the subprog id to the imm field of a call instruction in the verifier instead of the offset of the callee's symbol's address from __bpf_call_base.

[2] Determining the address of a callee's corresponding symbol by using the imm field as an index into the list of kernel symbol addresses now available from the program info.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24  bpf: support 64-bit offsets for bpf function calls  (Sandipan Das)
The imm field of a bpf instruction is a signed 32-bit integer. For JITed bpf-to-bpf function calls, it holds the offset of the start address of the callee's JITed image from __bpf_call_base. For some architectures, such as powerpc64, this offset may be as large as 64 bits and cannot be accommodated in the imm field without truncation. We resolve this by:

[1] Additionally using the auxiliary data of each function to keep a list of start addresses of the JITed images for all functions determined by the verifier.

[2] Retaining the subprog id inside the off field of the call instructions and using it to index into the list mentioned above and look up the callee's address.

To make sure that the existing JIT compilers continue to work without requiring changes, we keep the imm field as it is.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24  bpf: btf: Avoid variable length array  (Martin KaFai Lau)
Sparse warning:

  kernel/bpf/btf.c:1985:34: warning: Variable length array is used.

This patch directly uses ARRAY_SIZE().

Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23  workqueue: move function definitions within CONFIG_SMP block  (Mathieu Malaterre)
In commit 7ee681b25284 ("workqueue: Convert to state machine callbacks"), three new function definitions were added: ‘workqueue_prepare_cpu’, ‘workqueue_online_cpu’ and ‘workqueue_offline_cpu’.

Move these function definitions within a CONFIG_SMP block since they are not used outside of it. This will match the function declarations in header <include/linux/workqueue.h>, and silence the following gcc warnings (W=1):

  kernel/workqueue.c:4743:5: warning: no previous prototype for ‘workqueue_prepare_cpu’ [-Wmissing-prototypes]
  kernel/workqueue.c:4756:5: warning: no previous prototype for ‘workqueue_online_cpu’ [-Wmissing-prototypes]
  kernel/workqueue.c:4783:5: warning: no previous prototype for ‘workqueue_offline_cpu’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-23  cgroup: css_set_lock should nest inside tasklist_lock  (Tejun Heo)
cgroup_enable_task_cg_lists() incorrectly nests non-irq-safe tasklist_lock inside irq-safe css_set_lock, triggering the following lockdep warning.

  WARNING: possible irq lock inversion dependency detected
  4.17.0-rc1-00027-gb37d049 #6 Not tainted
  --------------------------------------------------------
  systemd/1 just changed the state of lock:
  00000000fe57773b (css_set_lock){..-.}, at: cgroup_free+0xf2/0x12a
  but this lock took another, SOFTIRQ-unsafe lock in the past:
   (tasklist_lock){.+.+}

  and interrupts could create inverse lock ordering between them.

  other info that might help us debug this:
   Possible interrupt unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(tasklist_lock);
                                 local_irq_disable();
                                 lock(css_set_lock);
                                 lock(tasklist_lock);
    <Interrupt>
      lock(css_set_lock);

   *** DEADLOCK ***

The condition is highly unlikely to actually happen, especially given that the path is executed only once per boot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
2018-05-23  umh: introduce fork_usermode_blob() helper  (Alexei Starovoitov)
Introduce a helper:

  int fork_usermode_blob(void *data, size_t len, struct umh_info *info);

  struct umh_info {
          struct file *pipe_to_umh;
          struct file *pipe_from_umh;
          pid_t pid;
  };

that GPLed kernel modules (signed or unsigned) can use to execute part of their own data as a swappable user mode process. The kernel will:
- allocate a unique file in tmpfs
- populate that file with [data, data + len] bytes
- the user-mode-helper code will do_execve that file and, before the process starts, the kernel will create two unix pipes for bidirectional communication between the kernel module and the umh
- close the tmpfs file, effectively deleting it
- fork_usermode_blob will return zero on success and populate 'struct umh_info' with the two unix pipes and the pid of the user process

As the first step in the development of the bpfilter project, the fork_usermode_blob() helper is introduced to allow user mode code to be invoked from a kernel module. The idea is that the user mode code plus normal kernel module code are built as part of the kernel build and installed as a traditional kernel module into the distro-specified location, such that from a distribution point of view there is no difference between regular kernel modules and kernel modules + umh code. Such modules can be signed, modprobed, rmmod'ed, etc. The use of this new helper by a kernel module doesn't make it any special from the kernel and user space tooling point of view.

Such an approach enables the kernel to delegate functionality traditionally done by kernel modules into user space processes (either root or !root) and reduces the security attack surface of the new code. Buggy umh code would crash the user process, but not the kernel. Another advantage is that umh code of the kernel module can be debugged and tested out of user space (e.g. opening the possibility to run clang sanitizers, fuzzers or user space test suites on the umh code). In the case of the bpfilter project, such an architecture allows the complex control plane to be done in user space while the bpf-based data plane stays in the kernel.

Since the umh can crash, be oom-ed by the kernel, or killed by the admin, the kernel module that uses it (like bpfilter) needs to manage the lifetime of the umh on its own via the two unix pipes and the pid of the umh. The exit path of such a kernel module should kill the umh it started, so that rmmod of the kernel module will clean up the corresponding umh; just like if the kernel module does kmalloc() it should kfree() it in the exit path.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
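A hedged sketch of how a module might drive the helper end to end, assuming only the signature quoted above; the blob symbols, error handling, and cleanup flow are illustrative, not code from the patch.

  /* Hypothetical module-side use of fork_usermode_blob(); the
   * blob_start/blob_end symbols (a linked-in user mode ELF blob)
   * are assumptions for illustration.
   */
  #include <linux/module.h>
  #include <linux/umh.h>
  #include <linux/file.h>

  extern char blob_start[], blob_end[];   /* embedded umh image */
  static struct umh_info umh;

  static int __init my_mod_init(void)
  {
          int err;

          err = fork_usermode_blob(blob_start, blob_end - blob_start,
                                   &umh);
          if (err)
                  return err;

          /* talk to the process over umh.pipe_to_umh /
           * umh.pipe_from_umh (struct file *), e.g. with
           * kernel_write() / kernel_read() */
          return 0;
  }

  static void __exit my_mod_exit(void)
  {
          /* the module owns the umh lifetime: tell it to exit on
           * rmmod (protocol-specific), then drop the pipes */
          fput(umh.pipe_to_umh);
          fput(umh.pipe_from_umh);
  }

  module_init(my_mod_init);
  module_exit(my_mod_exit);
  MODULE_LICENSE("GPL");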
2018-05-23  bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info  (Martin KaFai Lau)
In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id" could cause confusion because the "id" of "btf_id" means the BPF obj id given to the BTF object while "btf_key_id" and "btf_value_id" means the BTF type id within that BTF object. To make it clear, btf_key_id and btf_value_id are renamed to btf_key_type_id and btf_value_type_id. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23  bpf: btf: Remove unused bits from uapi/linux/btf.h  (Martin KaFai Lau)
This patch does the following:

1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k. We can raise them later.

2. Remove BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID. They are currently encoded at the highest bit of a u32. This is because the current use case does not require supporting a parent type (i.e. a type_id referring to a type in another BTF file). It also does not support referring to a string in ELF. The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK, which are defined in btf.c instead of uapi/linux/btf.h.

3. Limit BTF_INFO_KIND from 5 bits to 4 bits, which is enough. There is unused bit headroom if we ever need it later.

4. The root bit in BTF_INFO is also removed because it is not used in the current use case.

5. Remove BTF_INT_VARARGS since the func type is not supported now. BTF_INT_ENCODING is limited to 4 bits instead of 8 bits.

The above can be added back later because the verifier ensures the unused bits are zeros.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23  bpf: btf: Check array->index_type  (Martin KaFai Lau)
Instead of ignoring the array->index_type field, enforce that it must be a BTF_KIND_INT of size 1/2/4/8 bytes.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23  bpf: btf: Change how section is supported in btf_header  (Martin KaFai Lau)
There are currently unused section descriptions in the btf_header. Those sections are there to support future BTF use cases. For example, the func section (func_off) is to support function signatures (e.g. the BPF prog function signature).

Instead of spelling out all potential sections up front in the btf_header, this patch makes changes to btf_header such that extending it (e.g. adding a section) is possible later. The unused ones can be removed for now and added back later.

This patch:

1. adds a hdr_len to the btf_header. It will allow adding sections (and other info like parent_label and parent_name) later. The check is similar to the existing bpf_attr: if a user passes in a longer hdr_len, the kernel ensures the extra trailing bytes are 0.

2. allows the section order in the BTF object to be different from its sec_off order in the btf_header.

3. has each sec_off followed by a sec_len. There must be no gaps or overlaps among sections. The string section is ensured to be at the end due to the 4-byte alignment requirement of the type section.

The above changes will allow enough flexibility to add new sections (and other info) to the btf_header later.

This patch also removes an unnecessary !err check at the end of btf_parse().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
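Under this scheme the header reduces to a fixed preamble plus (offset, length) pairs per section; a sketch of the resulting layout (compare with the actual struct btf_header in include/uapi/linux/btf.h, which may differ in detail):

  struct btf_header {
          __u16   magic;
          __u8    version;
          __u8    flags;
          __u32   hdr_len;    /* lets the header grow: longer headers
                               * must have their extra tail bytes
                               * zeroed, like bpf_attr */

          /* each section is an (offset, length) pair relative to
           * the end of the header; no gaps or overlaps allowed */
          __u32   type_off;
          __u32   type_len;
          __u32   str_off;
          __u32   str_len;
  };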
2018-05-23  bpf: Expose check_uarg_tail_zero()  (Martin KaFai Lau)
This patch exposes check_uarg_tail_zero() which will be reused by a later BTF patch. Its name is changed to bpf_check_uarg_tail_zero(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23  schedutil: Allow cpufreq requests to be made even when kthread kicked  (Joel Fernandes (Google))
Currently there is a chance that a schedutil cpufreq update request will be dropped if there is a pending update request. This pending request can be delayed if there is a scheduling delay of the irq_work and the wakeup of the schedutil governor kthread.

A very bad scenario is when a schedutil request was already just made, such as to reduce the CPU frequency, and then a newer request to increase CPU frequency (even sched deadline urgent frequency increase requests) can be dropped, even though the rate limits suggest that it's OK to process a request. This is because of the way the work_in_progress flag is used.

This patch improves the situation by allowing new requests to happen even though the old one is still being processed. Note that in this approach, if an irq_work was already issued, we just update next_freq and don't bother to queue another request, so there's no extra work being done to make this happen.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
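A conceptual sketch of the "don't drop, just overwrite" idea described above; schedutil's real code paths differ in detail, and the function name here is hypothetical.

  /* Sketch: the newest frequency request always wins; the kthread
   * is only kicked once while work is pending.
   */
  static void sugov_deferred_update_sketch(struct sugov_policy *sg_policy,
                                           unsigned int next_freq)
  {
          raw_spin_lock(&sg_policy->update_lock);
          sg_policy->next_freq = next_freq;   /* overwrite, never drop */
          if (!sg_policy->work_in_progress) {
                  sg_policy->work_in_progress = true;
                  irq_work_queue(&sg_policy->irq_work);
          }
          raw_spin_unlock(&sg_policy->update_lock);
          /* if work was already in progress, the kthread picks up
           * the updated next_freq instead of the stale one */
  }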
2018-05-23  cpufreq: Rename cpufreq_can_do_remote_dvfs()  (Viresh Kumar)
This routine checks whether the CPU running this code belongs to the policy of the target CPU or, if not, whether it can do DVFS for it remotely. But its current name implies that it is only about doing remote updates. Rename it to make it more relevant.

Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-22  rcu/x86: Provide early rcu_cpu_starting() callback  (Peter Zijlstra)
The x86/mtrr code does horrific things because hardware. It uses stop_machine_from_inactive_cpu(), which does a wakeup (of the stopper thread on another CPU), which uses RCU, all before the CPU is onlined. RCU complains about this, because wakeups use RCU and RCU does (rightfully) not consider offline CPUs for grace-periods. Fix this by initializing RCU way early in the MTRR case. Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Add !SMP support, per 0day Test Robot report. ]
2018-05-22  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS  (Dan Williams)
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to be able to rely on the fact that they will get wakeups on dev_pagemap page-idle events. Introduce MEMORY_DEVICE_FS_DAX and generic_dax_page_free() as the common indicator / infrastructure for dax filesystems to require. With this change there are no users of the MEMORY_DEVICE_HOST designation, so remove it.

The HMM sub-system extended dev_pagemap to arrange a callback when a dev_pagemap managed page is freed. Since a dev_pagemap page is free / idle when its reference count is 1, it requires an additional branch to check the page-type at put_page() time. Given put_page() is a hot-path, we do not want to incur that check if HMM is not in use, so a static branch is used to avoid that overhead when not necessary.

Now the FS_DAX implementation wants to reuse this mechanism for receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific static-key into a generic mechanism that either HMM or FS_DAX code paths can enable.

For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support, care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure. However, we still need to support FS_DAX in the FS_DAX_LIMITED case implemented by the s390/dcssblk driver.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
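The static-branch pattern described above, sketched in isolation; the identifiers are assumptions modeled on the description, not necessarily the patch's own names.

  /* Sketch: gate a hot-path check behind a static branch so
   * put_page() pays nothing until HMM or FS_DAX enables it.
   */
  #include <linux/jump_label.h>
  #include <linux/mm.h>

  DEFINE_STATIC_KEY_FALSE(devmap_managed_key);    /* off by default */

  static inline bool page_is_devmap_managed(struct page *page)
  {
          /* compiles to a patched nop until the key is enabled */
          if (!static_branch_unlikely(&devmap_managed_key))
                  return false;
          return is_zone_device_page(page);   /* then the real check */
  }

  /* a subsystem needing ->page_free() callbacks flips the key: */
  static void enable_dev_pagemap_ops(void)
  {
          static_branch_enable(&devmap_managed_key);
  }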
2018-05-22  cpufreq: schedutil: Cleanup and document iowait boost  (Patrick Bellasi)
The iowait boosting code has been recently updated to add a progressive boosting behavior which allows it to be less aggressive in boosting tasks doing only sporadic IO operations, thus being more energy efficient for example on mobile platforms.

The current code is now however a bit convoluted. Some functionality (e.g. iowait boost reset) is replicated in different paths and its documentation is slightly misaligned. Let's clean up the code by consolidating all the IO wait boosting related functionality within a few dedicated functions and better define their roles:

- sugov_iowait_boost: set/increase the IO wait boost of a CPU
- sugov_iowait_apply: apply/reduce the IO wait boost of a CPU

Both of these functions are used at every sugov update and they make use of a unified IO wait boost reset policy provided by:

- sugov_iowait_reset: reset/disable the IO wait boost of a CPU if the CPU has not been updated for more than one tick

This makes possible a cleaner and more self-contained design for the IO wait boosting code, since the rest of the sugov update routines, both for single and shared frequency domains, follow the same template:

  /* Configure IO boost, if required */
  sugov_iowait_boost()

  /* Return here if freq change is in progress or throttled */

  /* Collect and aggregate utilization information */
  sugov_get_util()
  sugov_aggregate_util()

  /*
   * Add IO boost, if currently enabled, on top of the aggregated
   * utilization value
   */
  sugov_iowait_apply()

As an extra bonus, let's also add documentation for the new functions and better align the in-code documentation.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-22  cpufreq: schedutil: Fix iowait boost reset  (Patrick Bellasi)
A more energy efficient update of the IO wait boosting mechanism has been introduced in:

  commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost more energy efficient")

where the boost value is expected to be:

- doubled at each successive wakeup from IO, starting from the minimum frequency supported by a CPU

- reset when a CPU is not updated for more than one tick, by either disabling the IO wait boost or resetting its value to the minimum frequency if this new update requires an IO boost

This approach is supposed to "ignore" boosting for sporadic wakeups from IO, while still getting the frequency boosted to the maximum to benefit long sequences of wakeups from IO operations. However, these assumptions are not always satisfied. For example, when an IO-boosted CPU enters idle for more than one tick and then wakes up after an IO wait, since in sugov_set_iowait_boost() we first check the IOWAIT flag, we keep doubling the iowait boost instead of restarting from the minimum frequency value. This misbehavior could happen mainly on non-shared frequency domains, thus defeating the energy efficiency optimization, but it can also happen on shared frequency domain systems.

Let's fix this issue in sugov_set_iowait_boost() by:

- first checking the IO wait boost reset conditions to eventually reset the boost value

- then applying the correct IO boost value if required by the caller

Fixes: a5a0809bc58e (cpufreq: schedutil: Make iowait boost more energy efficient)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
S390 bpf_jit.S is removed in net-next and had changes in 'net'; since that code isn't used any more, take the removal. TLS data structures split the TX and RX components in 'net-next'; put the new struct members from the bug fix in 'net' into the RX part. The 'net-next' tree had some reworking of how the ERSPAN code works in the GRE tunneling code, overlapping with a one-line headroom calculation fix in 'net'. Overlapping changes in __sock_map_ctx_update_elem(): keep the bits that read the prog members via READ_ONCE() into local variables before using them.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-21  audit: Fix wrong task in comparison of session ID  (Ondrej Mosnáček)
The audit_filter_rules() function in auditsc.c compared the session ID with the credentials of the current task, while it should use the credentials of the task given to audit_filter_rules() as a parameter (tsk). GitHub issue: https://github.com/linux-audit/audit-kernel/issues/82 Fixes: 8fae47705685 ("audit: add support for session ID user filter") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> [PM: not user visible, dropped stable] Signed-off-by: Paul Moore <paul@paul-moore.com>
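The fix boils down to comparing against the task passed in rather than current; a sketch of the relevant hunk in audit_filter_rules(), with the surrounding code abridged:

  /* inside audit_filter_rules(struct task_struct *tsk, ...) */
  case AUDIT_SESSIONID:
          sessionid = audit_get_sessionid(tsk);   /* was: current */
          result = audit_comparator(sessionid, f->op, f->val);
          break;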
2018-05-21  Merge branch 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Merge speculative store buffer bypass fixes from Thomas Gleixner:

- rework of the SPEC_CTRL MSR management to accommodate the new fancy SSBD (Speculative Store Bypass Disable) bit handling.

- the CPU bug and sysfs infrastructure for the exciting new Speculative Store Bypass 'feature'.

- support for disabling SSB via LS_CFG MSR on AMD CPUs including Hyperthread synchronization on ZEN.

- PRCTL support for dynamic runtime control of SSB

- SECCOMP integration to automatically disable SSB for sandboxed processes with a filter flag for opt-out.

- KVM integration to allow guests fiddling with SSBD including the new software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on AMD.

- BPF protection against SSB

.. this is just the core and x86 side, other architecture support will come separately.

* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
  bpf: Prevent memory disambiguation attack
  x86/bugs: Rename SSBD_NO to SSB_NO
  KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
  x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
  x86/bugs: Rework spec_ctrl base and mask logic
  x86/bugs: Remove x86_spec_ctrl_set()
  x86/bugs: Expose x86_spec_ctrl_base directly
  x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
  x86/speculation: Rework speculative_store_bypass_update()
  x86/speculation: Add virtualized speculative store bypass disable support
  x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
  x86/speculation: Handle HT correctly on AMD
  x86/cpufeatures: Add FEATURE_ZEN
  x86/cpufeatures: Disentangle SSBD enumeration
  x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
  x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
  KVM: SVM: Move spec control call after restore of GS
  x86/cpu: Make alternative_msr_write work for 32-bit code
  x86/bugs: Fix the parameters alignment and missing void
  x86/bugs: Make cpu_show_common() static
  ...
2018-05-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds)
Pull networking fixes from David Miller:

 1) Fix refcounting bug for connections in on-packet scheduling mode of IPVS, from Julian Anastasov.

 2) Set network header properly in AF_PACKET's packet_snd, from Willem de Bruijn.

 3) Fix regressions in 3c59x by converting to generic DMA API. It was relying upon the hack that the PCI DMA interfaces would accept NULL for EISA devices. From Christoph Hellwig.

 4) Remove RDMA devices before unregistering netdev in QEDE driver, from Michal Kalderon.

 5) Use after free in TUN driver ptr_ring usage, from Jason Wang.

 6) Properly check for missing netlink attributes in SMC_PNETID requests, from Eric Biggers.

 7) Set DMA mask before performing any DMA operations in vmxnet3 driver, from Regis Duchesne.

 8) Fix mlx5 build with SMP=n, from Saeed Mahameed.

 9) Classifier fixes in bcm_sf2 driver from Florian Fainelli.

10) Tuntap use after free during release, from Jason Wang.

11) Don't use stack memory in scatterlists in tls code, from Matt Mullins.

12) Not fully initialized flow key object in ipv4 routing code, from David Ahern.

13) Various packet headroom bug fixes in ip6_gre driver, from Petr Machata.

14) Remove queues from XPS maps using correct index, from Amritha Nambiar.

15) Fix use after free in sock_diag, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (64 commits)
  net: ip6_gre: fix tunnel metadata device sharing.
  cxgb4: fix offset in collecting TX rate limit info
  net: sched: red: avoid hashing NULL child
  sock_diag: fix use-after-free read in __sk_free
  sh_eth: Change platform check to CONFIG_ARCH_RENESAS
  net: dsa: Do not register devlink for unused ports
  net: Fix a bug in removing queues from XPS map
  bpf: fix truncated jump targets on heavy expansions
  bpf: parse and verdict prog attach may race with bpf map update
  bpf: sockmap update rollback on error can incorrectly dec prog refcnt
  net: test tailroom before appending to linear skb
  net: ip6_gre: Fix ip6erspan hlen calculation
  net: ip6_gre: Split up ip6gre_changelink()
  net: ip6_gre: Split up ip6gre_newlink()
  net: ip6_gre: Split up ip6gre_tnl_change()
  net: ip6_gre: Split up ip6gre_tnl_link_config()
  net: ip6_gre: Fix headroom request in ip6erspan_tunnel_xmit()
  net: ip6_gre: Request headroom in __gre6_xmit()
  selftests/bpf: check return value of fopen in test_verifier.c
  erspan: fix invalid erspan version.
  ...
2018-05-21  workqueue: Make sure struct worker is accessible for wq_worker_comm()  (Tejun Heo)
The worker struct could already be freed when wq_worker_comm() tries to access it for reporting. This patch protects PF_WQ_WORKER modifications with wq_pool_attach_mutex and makes wq_worker_comm() test the flag before dereferencing worker from kthread_data(), which ensures that it only dereferences when the worker struct is valid. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Lai Jiangshan <jiangshanlai@gmail.com> Fixes: 6b59808bfe48 ("workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}")
2018-05-20  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull UP timer fix from Thomas Gleixner:
 "Work around the for_each_cpu() oddity on UP kernels in the tick broadcast code which causes boot failures because the CPU0 bit is always reported as set independent of the cpumask content"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/broadcast: Use for_each_cpu() specially on UP kernels
2018-05-20  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull scheduler fixlets from Thomas Gleixner:
 "Three trivial fixlets for the scheduler:

   - move print_rt_rq() and print_dl_rq() declarations to the right place

   - make grub_reclaim() static

   - fix the bogus documentation reference in Kconfig"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix documentation file path
  sched/deadline: Make the grub_reclaim() function static
  sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h
2018-05-19  bpf: Prevent memory disambiguation attack  (Alexei Starovoitov)
Detect code patterns where malicious 'speculative store bypass' can be used and sanitize such patterns.

  39: (bf) r3 = r10
  40: (07) r3 += -216
  41: (79) r8 = *(u64 *)(r7 +0)   // slow read
  42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
  43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
  44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
  45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                  // is now sanitized

Above code after x86 JIT becomes:

   e5: mov    %rbp,%rdx
   e8: add    $0xffffffffffffff28,%rdx
   ef: mov    0x0(%r13),%r14
   f3: movq   $0x0,-0x48(%rbp)
   fb: mov    %rdx,0x0(%r14)
   ff: mov    0x0(%rbx),%rdi
  103: movzbq 0x0(%rdi),%rsi

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-19  timekeeping: Add ktime_get_coarse_with_offset  (Arnd Bergmann)
I have run into a couple of drivers using current_kernel_time() suffering from the y2038 problem, and they could be converted to using ktime_t, but don't have interfaces that skip the nanosecond calculation at the moment. This introduces ktime_get_coarse_with_offset() as a simpler variant of ktime_get_with_offset(), and adds wrappers for the three time domains we support with the existing function. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: y2038@lists.linaro.org Cc: John Stultz <john.stultz@linaro.org> Link: https://lkml.kernel.org/r/20180427134016.2525989-5-arnd@arndb.de
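Usage is then a one-liner per time domain; a hedged sketch of the wrappers a driver would call (names inferred from the existing ktime_get_*() convention; confirm against include/linux/timekeeping.h):

  #include <linux/timekeeping.h>

  static void sample_coarse_clocks(void)
  {
          ktime_t real = ktime_get_coarse_real();      /* CLOCK_REALTIME */
          ktime_t boot = ktime_get_coarse_boottime();  /* CLOCK_BOOTTIME */
          ktime_t tai  = ktime_get_coarse_clocktai();  /* CLOCK_TAI */

          /* coarse variants skip the clocksource read and the
           * nanosecond calculation, returning the last tick's
           * timestamp instead */
          pr_debug("%lld %lld %lld\n", ktime_to_ns(real),
                   ktime_to_ns(boot), ktime_to_ns(tai));
  }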
2018-05-19  timekeeping: Standardize on ktime_get_*() naming  (Arnd Bergmann)
The current_kernel_time64, get_monotonic_coarse64, getrawmonotonic64, get_monotonic_boottime64 and timekeeping_clocktai64 interfaces have rather inconsistent naming, and they differ in the calling conventions by passing the output either by reference or as a return value. Rename them to ktime_get_coarse_real_ts64, ktime_get_coarse_ts64, ktime_get_raw_ts64, ktime_get_boottime_ts64 and ktime_get_clocktai_ts64 respectively, and provide the interfaces with macros or inline functions as needed. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: y2038@lists.linaro.org Cc: John Stultz <john.stultz@linaro.org> Link: https://lkml.kernel.org/r/20180427134016.2525989-4-arnd@arndb.de
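The mapping, shown as a before/after sketch in caller code (old names from the list above; note that the old interfaces differed in calling convention, some returning the result by value, while all new variants fill a timespec64 via pointer):

  #include <linux/timekeeping.h>

  static void show_renames(struct timespec64 *ts)
  {
          /* current_kernel_time64()     */ ktime_get_coarse_real_ts64(ts);
          /* get_monotonic_coarse64()    */ ktime_get_coarse_ts64(ts);
          /* getrawmonotonic64()         */ ktime_get_raw_ts64(ts);
          /* get_monotonic_boottime64()  */ ktime_get_boottime_ts64(ts);
          /* timekeeping_clocktai64()    */ ktime_get_clocktai_ts64(ts);
  }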
2018-05-19  timekeeping: Clean up ktime_get_real_ts64  (Arnd Bergmann)
In a move to make ktime_get_*() the preferred driver interface into the timekeeping code, this patch sanitizes ktime_get_real_ts64() to be a proper exported symbol rather than an alias for getnstimeofday64(). The internal __getnstimeofday64() is no longer used, so remove it and merge it into ktime_get_real_ts64().

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: y2038@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180427134016.2525989-3-arnd@arndb.de
2018-05-19  timekeeping: Remove timespec64 hack  (Arnd Bergmann)
At this point, we have converted most of the kernel to use timespec64 consistently in place of timespec, so it seems it's time to make timespec64 the native structure and define timespec in terms of that one on 64-bit architectures. Starting with gcc-5, the compiler can completely optimize away the timespec_to_timespec64 and timespec64_to_timespec functions on 64-bit architectures. With older compilers, we introduce a couple of extra copies of local variables, but those are easily avoided by using the timespec64 based interfaces consistently, as we do in most of the important code paths already. The main upside of removing the hack is that printing the tv_sec field of a timespec64 structure can now use the %lld format string on all architectures without a cast to time64_t. Without this patch, the field is a 'long' type and would have to be printed using %ld on 64-bit architectures. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: y2038@lists.linaro.org Cc: John Stultz <john.stultz@linaro.org> Link: https://lkml.kernel.org/r/20180427134016.2525989-2-arnd@arndb.de
2018-05-19  Merge branch 'linus' into timers/2038  (Thomas Gleixner)
Merge upstream to pick up changes that pending patches depend on.
2018-05-18  bpf: allow sk_msg programs to read sock fields  (John Fastabend)
Currently sk_msg programs only have access to the raw data. However, it is often useful when building policies to have the policies specific to the socket endpoint. This allows using the socket tuple as input into filters, etc. This patch adds ctx access to the sock fields. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
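A hedged example of a policy keyed off the socket endpoint; the field names follow the sk_msg_md context this patch extends, and the byte order of the port fields should be checked against the uapi comments in linux/bpf.h.

  /* Sketch: sk_msg policy using the newly exposed sock fields. */
  #include <linux/bpf.h>
  #include "bpf_helpers.h"    /* SEC() macro, from the kernel selftests */

  #ifndef AF_INET
  #define AF_INET 2
  #endif

  SEC("sk_msg")
  int msg_filter(struct sk_msg_md *msg)
  {
          /* only let IPv4 traffic on local port 80 through; by the
           * usual BPF convention local_port is host byte order while
           * remote_port is network byte order -- verify in the header */
          if (msg->family != AF_INET)
                  return SK_DROP;
          if (msg->local_port != 80)
                  return SK_DROP;
          return SK_PASS;
  }

  char _license[] SEC("license") = "GPL";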
2018-05-18  audit: use existing session info function  (Richard Guy Briggs)
Use the existing audit_log_session_info() function rather than hardcoding its functionality. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-18  workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}  (Tejun Heo)
There can be a lot of workqueue workers and they all show up with the cryptic kworker/* names making it difficult to understand which is doing what and how they came to be.

  # ps -ef | grep kworker
  root         4     2  0 Feb25 ?        00:00:00 [kworker/0:0H]
  root         6     2  0 Feb25 ?        00:00:00 [kworker/u112:0]
  root        19     2  0 Feb25 ?        00:00:00 [kworker/1:0H]
  root        25     2  0 Feb25 ?        00:00:00 [kworker/2:0H]
  root        31     2  0 Feb25 ?        00:00:00 [kworker/3:0H]
  ...

This patch makes workqueue workers report the latest workqueue it was executing for through /proc/PID/{comm,stat,status}. The extra information is appended to the kthread name with intervening '+' if currently executing, otherwise '-'.

  # cat /proc/25/comm
  kworker/2:0-events_power_efficient
  # cat /proc/25/stat
  25 (kworker/2:0-events_power_efficient) I 2 0 0 0 -1 69238880 0 0...
  # grep Name /proc/25/status
  Name:   kworker/2:0-events_power_efficient

Unfortunately, ps(1) truncates comm to 15 characters,

  # ps 25
    PID TTY      STAT   TIME COMMAND
     25 ?        I      0:00 [kworker/2:0-eve]

making it a lot less useful; however, this should be an easy fix from the ps(1) side.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Craig Small <csmall@enc.com.au>
2018-05-18  workqueue: Set worker->desc to workqueue name by default  (Tejun Heo)
Work functions can use set_worker_desc() to improve the visibility of what the worker task is doing. Currently, the desc field is unset at the beginning of each execution and there is a separate field to track whether desc was set during the current execution. Instead of leaving desc empty until it is set, worker->desc can be used to remember the last workqueue the worker worked on by default, and users that call set_worker_desc() can override it with something more informative as necessary. This simplifies desc handling and helps track the last workqueue the worker executed on, improving visibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-18  workqueue: Make worker_attach/detach_pool() update worker->pool  (Tejun Heo)
For historical reasons, the worker attach/detach functions don't currently manage worker->pool and the callers are manually and inconsistently updating it. This patch moves worker->pool updates into the worker attach/detach functions. This makes worker->pool consistent and clearly defines how worker->pool updates are synchronized. This will help later workqueue visibility improvements by allowing safe access to workqueue information from worker->task. Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-18  workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex  (Tejun Heo)
To improve workqueue visibility, we want to be able to access workqueue information from worker tasks. The per-pool attach mutex makes that difficult because there's no way of stabilizing task -> worker pool association without knowing the pool first. Worker attach/detach is a slow path and there's no need for different pools to be able to perform them concurrently. This patch replaces the per-pool attach_mutex with global wq_pool_attach_mutex to prepare for visibility improvement changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-18  scsi: zfcp: workqueue: set description for port work items with their WWPN as context  (Steffen Maier)

As a prerequisite, complement commit 3d1cb2059d93 ("workqueue: include workqueue info when printing debug dump of a worker task") to be usable with kernel modules by exporting the symbol set_worker_desc(). The current built-in user was introduced with commit ef3b101925f2 ("writeback: set worker desc to identify writeback workers in task dumps").

This can help distinguish work items which do not have adapter scope. The description is printed out with a task dump for debugging on WARN, BUG, panic, or magic-sysrq [show-task-states(t)].

Example:

  $ echo 0 >| /sys/bus/ccw/drivers/zfcp/0.0.1880/0x50050763031bd327/failed &
  $ echo 't' >| /proc/sysrq-trigger
  $ dmesg
  sysrq: SysRq : Show State
    task                        PC stack   pid father
  ...
  zfcp_q_0.0.1880 S14640  2165      2 0x02000000
  Call Trace:
  ([<00000000009df464>] __schedule+0xbf4/0xc78)
   [<00000000009df57c>] schedule+0x94/0xc0
   [<0000000000168654>] rescuer_thread+0x33c/0x3a0
   [<000000000016f8be>] kthread+0x166/0x178
   [<00000000009e71f2>] kernel_thread_starter+0x6/0xc
   [<00000000009e71ec>] kernel_thread_starter+0x0/0xc
  no locks held by zfcp_q_0.0.1880/2165.
  ...
  kworker/u512:2  D11280  2193      2 0x02000000
  Workqueue: zfcp_q_0.0.1880 zfcp_scsi_rport_work [zfcp] (zrpd-50050763031bd327)
                                                          ^^^^^^^^^^^^^^^^^^^^^
  Call Trace:
  ([<00000000009df464>] __schedule+0xbf4/0xc78)
   [<00000000009df57c>] schedule+0x94/0xc0
   [<00000000009e50c0>] schedule_timeout+0x488/0x4d0
   [<00000000001e425c>] msleep+0x5c/0x78             >>test code only<<
   [<000003ff8008a21e>] zfcp_scsi_rport_work+0xbe/0x100 [zfcp]
   [<0000000000167154>] process_one_work+0x3b4/0x718
   [<000000000016771c>] worker_thread+0x264/0x408
   [<000000000016f8be>] kthread+0x166/0x178
   [<00000000009e71f2>] kernel_thread_starter+0x6/0xc
   [<00000000009e71ec>] kernel_thread_starter+0x0/0xc
  2 locks held by kworker/u512:2/2193:
   #0:  (name){++++.+}, at: [<0000000000166f4e>] process_one_work+0x1ae/0x718
   #1:  ((&(&port->rport_work)->work)){+.+.+.}, at: [<0000000000166f4e>] process_one_work+0x1ae/0x718
  ...
  =============================================
  Showing busy workqueues and worker pools:
  workqueue zfcp_q_0.0.1880: flags=0x2000a
    pwq 512: cpus=0-255 flags=0x4 nice=0 active=1/1
      in-flight: 2193:zfcp_scsi_rport_work [zfcp]
  pool 512: cpus=0-255 flags=0x4 nice=0 hung=0s workers=4 idle: 5 2354 2311

Work items with adapter scope are already identified by the workqueue name "zfcp_q_<devbusid>" and the work item function name.

Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
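The exported call is then a one-liner in the work function; a sketch of how a driver might tag a port work item with its WWPN (the "zrpd-" prefix matches the dump above; the surrounding zfcp structures are assumed for illustration):

  #include <linux/workqueue.h>

  static void zfcp_scsi_rport_work_sketch(struct work_struct *work)
  {
          struct zfcp_port *port =
                  container_of(work, struct zfcp_port, rport_work);

          /* shows up as "(zrpd-<wwpn>)" in debug dumps of this worker */
          set_worker_desc("zrpd-%16llx", (unsigned long long)port->wwpn);

          /* ... actual rport add/delete handling ... */
  }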
2018-05-18  xsk: clean up SPDX headers  (Björn Töpel)
Clean up SPDX-License-Identifier usage and remove licensing leftovers.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-18  fsnotify: add fsnotify_add_inode_mark() wrappers  (Amir Goldstein)
Before changing the arguments of the functions fsnotify_add_mark() and fsnotify_add_mark_locked(), convert most callers to use a wrapper. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
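A hedged sketch of what such a wrapper looks like; the exact argument list of fsnotify_add_mark() at this point in the series is an assumption, so compare with the real wrappers in include/linux/fsnotify_backend.h.

  /* Sketch: type-specific wrapper so callers stop passing the
   * generic connector/type arguments themselves.
   */
  static inline int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
                                            struct inode *inode,
                                            int allow_dups)
  {
          return fsnotify_add_mark(mark, &inode->i_fsnotify_marks,
                                   FSNOTIFY_OBJ_TYPE_INODE, allow_dups);
  }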
2018-05-18  fsnotify: remove redundant arguments to handle_event()  (Amir Goldstein)
inode_mark and vfsmount_mark arguments are passed to the handle_event() operation as function arguments as well as in the iter_info struct. The difference is that the iter_info struct may contain marks that should not be handled, which are represented as NULL inode_mark or vfsmount_mark arguments. Instead of passing the inode_mark and vfsmount_mark arguments, add a report_mask member to the iter_info struct to indicate which marks should be handled, versus marks that should only be kept alive during user wait.

This change is going to be used for passing more mark types with handle_event() (i.e. super block marks).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18  sched/deadline: Make the grub_reclaim() function static  (Mathieu Malaterre)
Since the grub_reclaim() function can be made static, make it so. Silences the following GCC warning (W=1):

  kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18  sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h  (Mathieu Malaterre)

In the following commit:

  6b55c9654fcc ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")

the print_cfs_rq() prototype was added to <kernel/sched/sched.h>, right next to the prototypes for print_cfs_stats(), print_rt_stats() and print_dl_stats().

Finish what that commit started and also move the related prototypes for print_rt_rq() and print_dl_rq(). Remove the existing extern declarations now that they are not needed anymore.

Silences the following GCC warnings, triggered by W=1:

  kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
  kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-17  bpf: fix truncated jump targets on heavy expansions  (Daniel Borkmann)
Recently during testing, I ran into the following panic:

  [ 207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
  [ 207.901637] Modules linked in: binfmt_misc [...]
  [ 207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W  4.17.0-rc3+ #7
  [ 207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
  [ 207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
  [ 207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
  [ 207.992603] lr : 0xffff000000bdb754
  [ 207.996080] sp : ffff000013703ca0
  [ 207.999384] x29: ffff000013703ca0 x28: 0000000000000001
  [ 208.004688] x27: 0000000000000001 x26: 0000000000000000
  [ 208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
  [ 208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
  [ 208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
  [ 208.025903] x19: ffff000009578000 x18: 0000000000000a03
  [ 208.031206] x17: 0000000000000000 x16: 0000000000000000
  [ 208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
  [ 208.041813] x13: 0000000000000000 x12: 0000000000000000
  [ 208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
  [ 208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
  [ 208.057723] x7 : 000000000000000a x6 : 00280c6160000000
  [ 208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
  [ 208.068329] x3 : 000000000008647a x2 : 19868179b1484500
  [ 208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
  [ 208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
  [ 208.086235] Call trace:
  [ 208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
  [ 208.093713]  0xffff000000bdb754
  [ 208.096845]  bpf_test_run+0x78/0xf8
  [ 208.100324]  bpf_prog_test_run_skb+0x148/0x230
  [ 208.104758]  sys_bpf+0x314/0x1198
  [ 208.108064]  el0_svc_naked+0x30/0x34
  [ 208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
  [ 208.117717] ---[ end trace 263cb8a59b5bf29f ]---

The program itself which caused this had a long jump over the whole instruction sequence where all of the inner instructions required heavy expansions into multiple BPF instructions. Additionally, I also had BPF hardening enabled, which requires once more rewrites of all constant values in order to blind them. Each time we rewrite insns, bpf_adj_branches() would need to potentially adjust branch targets which cross the patchlet boundary to accommodate for the additional delta. Eventually that led to the case where the target offset could not fit into insn->off's upper 0x7fff limit anymore, where the offset then wraps around, becoming negative (in the s16 universe), or vice versa depending on the jump direction.

Therefore it becomes necessary to detect and reject any such occasions in a generic way for native eBPF and cBPF to eBPF migrations. For the latter we can simply check bounds in the bpf_convert_filter()'s BPF_EMIT_JMP helper macro and bail out once we surpass limits. The bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in the case of subsequent hardening) is a bit more complex in that we need to detect such truncations before hitting the bpf_prog_realloc(). Thus the latter is split into an extra pass to probe problematic offsets on the original program in order to fail early. With that in place and carefully tested, I no longer hit the panic and the rewrites are rejected properly. I saw the above example panic on bpf-next, but the issue itself is generic, so a guard against it in bpf seems appropriate.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
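The guard amounts to range-checking each adjusted branch offset against what a s16 can hold before committing the patch; a hedged sketch of the probe (function name and exact adjustment conditions are illustrative, mirroring the description rather than the final code):

  #include <linux/kernel.h>   /* S16_MIN / S16_MAX */
  #include <linux/errno.h>

  /* Sketch: would the branch at insn index 'curr', with current
   * offset 'off', still fit in insn->off after 'delta' insns are
   * inserted at position 'pos'? Run over the original program
   * before bpf_prog_realloc() so we can fail early.
   */
  static int bpf_probe_adj_off_sketch(s32 off, u32 curr, u32 pos, u32 delta)
  {
          s64 new_off = off;

          /* only branches that cross the patched region change */
          if (curr < pos && curr + off + 1 > pos)
                  new_off += delta;
          else if (curr > pos && curr + off + 1 <= pos)
                  new_off -= delta;

          if (new_off < S16_MIN || new_off > S16_MAX)
                  return -ERANGE;   /* would truncate: reject program */
          return 0;
  }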