summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2019-01-14audit: give a clue what CONFIG_CHANGE op was involvedRichard Guy Briggs
The failure to add an audit rule due to audit locked gives no clue what CONFIG_CHANGE operation failed. Similarly the set operation is the only other operation that doesn't give the "op=" field to indicate the action. All other CONFIG_CHANGE records include an op= field to give a clue as to what sort of configuration change is being executed. Since these are the only CONFIG_CHANGE records that that do not have an op= field, add them to bring them in line with the rest. Old records: type=CONFIG_CHANGE msg=audit(1519812997.781:374): pid=610 uid=0 auid=0 ses=1 subj=... audit_enabled=2 res=0 type=CONFIG_CHANGE msg=audit(2018-06-14 14:55:04.507:47) : audit_enabled=1 old=1 auid=unset ses=unset subj=... res=yes New records: type=CONFIG_CHANGE msg=audit(1520958477.855:100): pid=610 uid=0 auid=0 ses=1 subj=... op=add_rule audit_enabled=2 res=0 type=CONFIG_CHANGE msg=audit(2018-06-14 14:55:04.507:47) : op=set audit_enabled=1 old=1 auid=unset ses=unset subj=... res=yes See: https://github.com/linux-audit/audit-kernel/issues/59 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: fixed checkpatch.pl line length problems] Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-01-14kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymoreJonathan Neuschäfer
UNAME26 is a mechanism to report Linux's version as 2.6.x, for compatibility with old/broken software. Due to the way it is implemented, it would have to be updated after 5.0, to keep the resulting versions unique. Linus Torvalds argued: "Do we actually need this? I'd rather let it bitrot, and just let it return random versions. It will just start again at 2.4.60, won't it? Anybody who uses UNAME26 for a 5.x kernel might as well think it's still 4.x. The user space is so old that it can't possibly care about differences between 4.x and 5.x, can it? The only thing that matters is that it shows "2.4.<largeenough>", which it will do regardless" Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-11umh: add exit routine for UMH processTaehee Yoo
A UMH process which is created by the fork_usermode_blob() such as bpfilter needs to release members of the umh_info when process is terminated. But the do_exit() does not release members of the umh_info. hence module which uses UMH needs own code to detect whether UMH process is terminated or not. But this implementation needs extra code for checking the status of UMH process. it eventually makes the code more complex. The new PF_UMH flag is added and it is used to identify UMH processes. The exit_umh() does not release members of the umh_info. Hence umh_info->cleanup callback should release both members of the umh_info and the private data. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-11livepatch: Remove ordering (stacking) of the livepatchesPetr Mladek
The atomic replace and cumulative patches were introduced as a more secure way to handle dependent patches. They simplify the logic: + Any new cumulative patch is supposed to take over shadow variables and changes made by callbacks from previous livepatches. + All replaced patches are discarded and the modules can be unloaded. As a result, there is only one scenario when a cumulative livepatch gets disabled. The different handling of "normal" and cumulative patches might cause confusion. It would make sense to keep only one mode. On the other hand, it would be rude to enforce using the cumulative livepatches even for trivial and independent (hot) fixes. However, the stack of patches is not really necessary any longer. The patch ordering was never clearly visible via the sysfs interface. Also the "normal" patches need a lot of caution anyway. Note that the list of enabled patches is still necessary but the ordering is not longer enforced. Otherwise, the code is ready to disable livepatches in an random order. Namely, klp_check_stack_func() always looks for the function from the livepatch that is being disabled. klp_func structures are just removed from the related func_stack. Finally, the ftrace handlers is removed only when the func_stack becomes empty. Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11livepatch: Remove Nop structures when unusedPetr Mladek
Replaced patches are removed from the stack when the transition is finished. It means that Nop structures will never be needed again and can be removed. Why should we care? + Nop structures give the impression that the function is patched even though the ftrace handler has no effect. + Ftrace handlers do not come for free. They cause slowdown that might be visible in some workloads. The ftrace-related slowdown might actually be the reason why the function is no longer patched in the new cumulative patch. One would expect that cumulative patch would help solve these problems as well. + Cumulative patches are supposed to replace any earlier version of the patch. The amount of NOPs depends on which version was replaced. This multiplies the amount of scenarios that might happen. One might say that NOPs are innocent. But there are even optimized NOP instructions for different processors, for example, see arch/x86/kernel/alternative.c. And klp_ftrace_handler() is much more complicated. + It sounds natural to clean up a mess that is no longer needed. It could only be worse if we do not do it. This patch allows to unpatch and free the dynamic structures independently when the transition finishes. The free part is a bit tricky because kobject free callbacks are called asynchronously. We could not wait for them easily. Fortunately, we do not have to. Any further access can be avoided by removing them from the dynamic lists. Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11livepatch: Add atomic replaceJason Baron
Sometimes we would like to revert a particular fix. Currently, this is not easy because we want to keep all other fixes active and we could revert only the last applied patch. One solution would be to apply new patch that implemented all the reverted functions like in the original code. It would work as expected but there will be unnecessary redirections. In addition, it would also require knowing which functions need to be reverted at build time. Another problem is when there are many patches that touch the same functions. There might be dependencies between patches that are not enforced on the kernel side. Also it might be pretty hard to actually prepare the patch and ensure compatibility with the other patches. Atomic replace && cumulative patches: A better solution would be to create cumulative patch and say that it replaces all older ones. This patch adds a new "replace" flag to struct klp_patch. When it is enabled, a set of 'nop' klp_func will be dynamically created for all functions that are already being patched but that will no longer be modified by the new patch. They are used as a new target during the patch transition. The idea is to handle Nops' structures like the static ones. When the dynamic structures are allocated, we initialize all values that are normally statically defined. The only exception is "new_func" in struct klp_func. It has to point to the original function and the address is known only when the object (module) is loaded. Note that we really need to set it. The address is used, for example, in klp_check_stack_func(). Nevertheless we still need to distinguish the dynamically allocated structures in some operations. For this, we add "nop" flag into struct klp_func and "dynamic" flag into struct klp_object. They need special handling in the following situations: + The structures are added into the lists of objects and functions immediately. In fact, the lists were created for this purpose. + The address of the original function is known only when the patched object (module) is loaded. Therefore it is copied later in klp_init_object_loaded(). + The ftrace handler must not set PC to func->new_func. It would cause infinite loop because the address points back to the beginning of the original function. + The various free() functions must free the structure itself. Note that other ways to detect the dynamic structures are not considered safe. For example, even the statically defined struct klp_object might include empty funcs array. It might be there just to run some callbacks. Also note that the safe iterator must be used in the free() functions. Otherwise already freed structures might get accessed. Special callbacks handling: The callbacks from the replaced patches are _not_ called by intention. It would be pretty hard to define a reasonable semantic and implement it. It might even be counter-productive. The new patch is cumulative. It is supposed to include most of the changes from older patches. In most cases, it will not want to call pre_unpatch() post_unpatch() callbacks from the replaced patches. It would disable/break things for no good reasons. Also it should be easier to handle various scenarios in a single script in the new patch than think about interactions caused by running many scripts from older patches. Not to say that the old scripts even would not expect to be called in this situation. Removing replaced patches: One nice effect of the cumulative patches is that the code from the older patches is no longer used. Therefore the replaced patches can be removed. It has several advantages: + Nops' structs will no longer be necessary and might be removed. This would save memory, restore performance (no ftrace handler), allow clear view on what is really patched. + Disabling the patch will cause using the original code everywhere. Therefore the livepatch callbacks could handle only one scenario. Note that the complication is already complex enough when the patch gets enabled. It is currently solved by calling callbacks only from the new cumulative patch. + The state is clean in both the sysfs interface and lsmod. The modules with the replaced livepatches might even get removed from the system. Some people actually expected this behavior from the beginning. After all a cumulative patch is supposed to "completely" replace an existing one. It is like when a new version of an application replaces an older one. This patch does the first step. It removes the replaced patches from the list of patches. It is safe. The consistency model ensures that they are no longer used. By other words, each process works only with the structures from klp_transition_patch. The removal is done by a special function. It combines actions done by __disable_patch() and klp_complete_transition(). But it is a fast track without all the transaction-related stuff. Signed-off-by: Jason Baron <jbaron@akamai.com> [pmladek@suse.com: Split, reuse existing code, simplified] Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Miroslav Benes <mbenes@suse.cz> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11livepatch: Use lists to manage patches, objects and functionsJason Baron
Currently klp_patch contains a pointer to a statically allocated array of struct klp_object and struct klp_objects contains a pointer to a statically allocated array of klp_func. In order to allow for the dynamic allocation of objects and functions, link klp_patch, klp_object, and klp_func together via linked lists. This allows us to more easily allocate new objects and functions, while having the iterator be a simple linked list walk. The static structures are added to the lists early. It allows to add the dynamically allocated objects before klp_init_object() and klp_init_func() calls. Therefore it reduces the further changes to the code. This patch does not change the existing behavior. Signed-off-by: Jason Baron <jbaron@akamai.com> [pmladek@suse.com: Initialize lists before init calls] Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11livepatch: Simplify API by removing registration stepPetr Mladek
The possibility to re-enable a registered patch was useful for immediate patches where the livepatch module had to stay until the system reboot. The improved consistency model allows to achieve the same result by unloading and loading the livepatch module again. Also we are going to add a feature called atomic replace. It will allow to create a patch that would replace all already registered patches. The aim is to handle dependent patches more securely. It will obsolete the stack of patches that helped to handle the dependencies so far. Then it might be unclear when a cumulative patch re-enabling is safe. It would be complicated to support the many modes. Instead we could actually make the API and code easier to understand. Therefore, remove the two step public API. All the checks and init calls are moved from klp_register_patch() to klp_enabled_patch(). Also the patch is automatically freed, including the sysfs interface when the transition to the disabled state is completed. As a result, there is never a disabled patch on the top of the stack. Therefore we do not need to check the stack in __klp_enable_patch(). And we could simplify the check in __klp_disable_patch(). Also the API and logic is much easier. It is enough to call klp_enable_patch() in module_init() call. The patch can be disabled by writing '0' into /sys/kernel/livepatch/<patch>/enabled. Then the module can be removed once the transition finishes and sysfs interface is freed. The only problem is how to free the structures and kobjects safely. The operation is triggered from the sysfs interface. We could not put the related kobject from there because it would cause lock inversion between klp_mutex and kernfs locks, see kn->count lockdep map. Therefore, offload the free task to a workqueue. It is perfectly fine: + The patch can no longer be used in the livepatch operations. + The module could not be removed until the free operation finishes and module_put() is called. + The operation is asynchronous already when the first klp_try_complete_transition() fails and another call is queued with a delay. Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11livepatch: Don't block the removal of patches loaded after a forced transitionPetr Mladek
module_put() is currently never called in klp_complete_transition() when klp_force is set. As a result, we might keep the reference count even when klp_enable_patch() fails and klp_cancel_transition() is called. This might give the impression that a module might get blocked in some strange init state. Fortunately, it is not the case. The reference count is ignored when mod->init fails and erroneous modules are always removed. Anyway, this might be confusing. Instead, this patch moves the global klp_forced flag into struct klp_patch. As a result, we block only modules that might still be in use after a forced transition. Newly loaded livepatches might be eventually completely removed later. It is not a big deal. But the code is at least consistent with the reality. Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11livepatch: Consolidate klp_free functionsPetr Mladek
The code for freeing livepatch structures is a bit scattered and tricky: + direct calls to klp_free_*_limited() and kobject_put() are used to release partially initialized objects + klp_free_patch() removes the patch from the public list and releases all objects except for patch->kobj + object_put(&patch->kobj) and the related wait_for_completion() are called directly outside klp_mutex; this code is duplicated; Now, we are going to remove the registration stage to simplify the API and the code. This would require handling more situations in klp_enable_patch() error paths. More importantly, we are going to add a feature called atomic replace. It will need to dynamically create func and object structures. We will want to reuse the existing init() and free() functions. This would create even more error path scenarios. This patch implements more straightforward free functions: + checks kobj_added flag instead of @limit[*] + initializes patch->list early so that the check for empty list always works + The action(s) that has to be done outside klp_mutex are done in separate klp_free_patch_finish() function. It waits only when patch->kobj was really released via the _start() part. The patch does not change the existing behavior. [*] We need our own flag to track that the kobject was successfully added to the hierarchy. Note that kobj.state_initialized only indicates that kobject has been initialized, not whether is has been added (and needs to be removed on cleanup). Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Jessica Yu <jeyu@kernel.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Jason Baron <jbaron@akamai.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11livepatch: Shuffle klp_enable_patch()/klp_disable_patch() codePetr Mladek
We are going to simplify the API and code by removing the registration step. This would require calling init/free functions from enable/disable ones. This patch just moves the code to prevent more forward declarations. This patch does not change the code except for two forward declarations. Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11livepatch: Change unsigned long old_addr -> void *old_func in struct klp_funcPetr Mladek
The address of the to be patched function and new function is stored in struct klp_func as: void *new_func; unsigned long old_addr; The different naming scheme and type are derived from the way the addresses are set. @old_addr is assigned at runtime using kallsyms-based search. @new_func is statically initialized, for example: static struct klp_func funcs[] = { { .old_name = "cmdline_proc_show", .new_func = livepatch_cmdline_proc_show, }, { } }; This patch changes unsigned long old_addr -> void *old_func. It removes some confusion when these address are later used in the code. It is motivated by a followup patch that adds special NOP struct klp_func where we want to assign func->new_func = func->old_addr respectively func->new_func = func->old_func. This patch does not modify the existing behavior. Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Alice Ferrazzi <alice.ferrazzi@gmail.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-01-11bpf: fix bpffs bitfield pretty printYonghong Song
Commit 9d5f9f701b18 ("bpf: btf: fix struct/union/fwd types with kind_flag") introduced kind_flag and used bitfield_size in the btf_member to directly pretty print member values. The commit contained a bug where the incorrect parameters could be passed to function btf_bitfield_seq_show(). The bits_offset parameter in the function expects a value less than 8. Instead, the member offset in the structure is passed. The below is btf_bitfield_seq_show() func signature: void btf_bitfield_seq_show(void *data, u8 bits_offset, u8 nr_bits, struct seq_file *m) both bits_offset and nr_bits are u8 type. If the bitfield member offset is greater than 256, incorrect value will be printed. This patch fixed the issue by calculating correct proper data offset and bits_offset similar to non kind_flag case. Fixes: 9d5f9f701b18 ("bpf: btf: fix struct/union/fwd types with kind_flag") Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-10LSM: generalize flag passing to security_capableMicah Morton
This patch provides a general mechanism for passing flags to the security_capable LSM hook. It replaces the specific 'audit' flag that is used to tell security_capable whether it should log an audit message for the given capability check. The reason for generalizing this flag passing is so we can add an additional flag that signifies whether security_capable is being called by a setid syscall (which is needed by the proposed SafeSetID LSM). Signed-off-by: Micah Morton <mortonm@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <james.morris@microsoft.com>
2019-01-10bpf: fix panic in stack_map_get_build_id() on i386 and arm32Song Liu
As Naresh reported, test_stacktrace_build_id() causes panic on i386 and arm32 systems. This is caused by page_address() returns NULL in certain cases. This patch fixes this error by using kmap_atomic/kunmap_atomic instead of page_address. Fixes: 615755a77b24 (" bpf: extend stackmap to save binary_build_id+offset instead of address") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "14 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, page_alloc: do not wake kswapd with zone lock held hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization" hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race" mm: page_mapped: don't assume compound page is huge or THP mm/memory.c: initialise mmu_notifier_range correctly tools/vm/page_owner: use page_owner_sort in the use example kasan: fix krealloc handling for tag-based mode kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY kasan, arm64: use ARCH_SLAB_MINALIGN instead of manual aligning mm, memcg: fix reclaim deadlock with writeback mm/usercopy.c: no check page span for stack objects slab: alien caches must not be initialized if the allocation of the alien cache failed fork, memcg: fix cached_stacks case zram: idle writeback fixes and cleanup
2019-01-08fork, memcg: fix cached_stacks caseShakeel Butt
Commit 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail") fixes a crash caused due to failed memcg charge of the kernel stack. However the fix misses the cached_stacks case which this patch fixes. So, the same crash can happen if the memcg charge of a cached stack is failed. Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com Fixes: 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Rik van Riel <riel@surriel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08SELinux: Remove cred security blob poisoningCasey Schaufler
The SELinux specific credential poisioning only makes sense if SELinux is managing the credentials. As the intent of this patch set is to move the blob management out of the modules and into the infrastructure, the SELinux specific code has to go. The poisioning could be introduced into the infrastructure at some later date. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2019-01-08fork: record start_time lateDavid Herrmann
This changes the fork(2) syscall to record the process start_time after initializing the basic task structure but still before making the new process visible to user-space. Technically, we could record the start_time anytime during fork(2). But this might lead to scenarios where a start_time is recorded long before a process becomes visible to user-space. For instance, with userfaultfd(2) and TLS, user-space can delay the execution of fork(2) for an indefinite amount of time (and will, if this causes network access, or similar). By recording the start_time late, it much closer reflects the point in time where the process becomes live and can be observed by other processes. Lastly, this makes it much harder for user-space to predict and control the start_time they get assigned. Previously, user-space could fork a process and stall it in copy_thread_tls() before its pid is allocated, but after its start_time is recorded. This can be misused to later-on cycle through PIDs and resume the stalled fork(2) yielding a process that has the same pid and start_time as a process that existed before. This can be used to circumvent security systems that identify processes by their pid+start_time combination. Even though user-space was always aware that start_time recording is flaky (but several projects are known to still rely on start_time-based identification), changing the start_time to be recorded late will help mitigate existing attacks and make it much harder for user-space to control the start_time a process gets assigned. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Tom Gundersen <teg@jklm.no> Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-06Merge tag 'kbuild-v4.21-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - improve boolinit.cocci and use_after_iter.cocci semantic patches - fix alignment for kallsyms - move 'asm goto' compiler test to Kconfig and clean up jump_label CONFIG option - generate asm-generic wrappers automatically if arch does not implement mandatory UAPI headers - remove redundant generic-y defines - misc cleanups * tag 'kbuild-v4.21-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kconfig: rename generated .*conf-cfg to *conf-cfg kbuild: remove unnecessary stubs for archheader and archscripts kbuild: use assignment instead of define ... endef for filechk_* rules arch: remove redundant UAPI generic-y defines kbuild: generate asm-generic wrappers if mandatory headers are missing arch: remove stale comments "UAPI Header export list" riscv: remove redundant kernel-space generic-y kbuild: change filechk to surround the given command with { } kbuild: remove redundant target cleaning on failure kbuild: clean up rule_dtc_dt_yaml kbuild: remove UIMAGE_IN and UIMAGE_OUT jump_label: move 'asm goto' support test to Kconfig kallsyms: lower alignment on ARM scripts: coccinelle: boolinit: drop warnings on named constants scripts: coccinelle: check for redeclaration kconfig: remove unused "file" field of yylval union nds32: remove redundant kernel-space generic-y nios2: remove unneeded HAS_DMA define
2019-01-06Merge tag 'dma-mapping-4.21-1' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds
Pull dma-mapping fixes from Christoph Hellwig: "Fix various regressions introduced in this cycles: - fix dma-debug tracking for the map_page / map_single consolidatation - properly stub out DMA mapping symbols for !HAS_DMA builds to avoid link failures - fix AMD Gart direct mappings - setup the dma address for no kernel mappings using the remap allocator" * tag 'dma-mapping-4.21-1' of git://git.infradead.org/users/hch/dma-mapping: dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING for remapped allocations x86/amd_gart: fix unmapping of non-GART mappings dma-mapping: remove a few unused exports dma-mapping: properly stub out the DMA API for !CONFIG_HAS_DMA dma-mapping: remove dmam_{declare,release}_coherent_memory dma-mapping: implement dmam_alloc_coherent using dmam_alloc_attrs dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs
2019-01-05bpf: fix sanitation of alu op with pointer / scalar type from different pathsDaniel Borkmann
While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") took care of rejecting alu op on pointer when e.g. pointer came from two different map values with different map properties such as value size, Jann reported that a case was not covered yet when a given alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from different branches where we would incorrectly try to sanitize based on the pointer's limit. Catch this corner case and reject the program instead. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-06kbuild: change filechk to surround the given command with { }Masahiro Yamada
filechk_* rules often consist of multiple 'echo' lines. They must be surrounded with { } or ( ) to work correctly. Otherwise, only the string from the last 'echo' would be written into the target. Let's take care of that in the 'filechk' in scripts/Kbuild.include to clean up filechk_* rules. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-01-06jump_label: move 'asm goto' support test to KconfigMasahiro Yamada
Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label". The jump label is controlled by HAVE_JUMP_LABEL, which is defined like this: #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) # define HAVE_JUMP_LABEL #endif We can improve this by testing 'asm goto' support in Kconfig, then make JUMP_LABEL depend on CC_HAS_ASM_GOTO. Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will match to the real kernel capability. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
2019-01-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - procfs updates - various misc bits - lib/ updates - epoll updates - autofs - fatfs - a few more MM bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits) mm/page_io.c: fix polled swap page in checkpatch: add Co-developed-by to signature tags docs: fix Co-Developed-by docs drivers/base/platform.c: kmemleak ignore a known leak fs: don't open code lru_to_page() fs/: remove caller signal_pending branch predictions mm/: remove caller signal_pending branch predictions arch/arc/mm/fault.c: remove caller signal_pending_branch predictions kernel/sched/: remove caller signal_pending branch predictions kernel/locking/mutex.c: remove caller signal_pending branch predictions mm: select HAVE_MOVE_PMD on x86 for faster mremap mm: speed up mremap by 20x on large regions mm: treewide: remove unused address argument from pte_alloc functions initramfs: cleanup incomplete rootfs scripts/gdb: fix lx-version string output kernel/kcov.c: mark write_comp_data() as notrace kernel/sysctl: add panic_print into sysctl panic: add options to print system info when panic happens bfs: extra sanity checking and static inode bitmap exec: separate MM_ANONPAGES and RLIMIT_STACK accounting ...
2019-01-05dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING for remapped allocationsChristoph Hellwig
We need to return a dma_addr_t even if we don't have a kernel mapping. Do so by consolidating the phys_to_dma call in a single place and jump to it from all the branches that return successfully. Fixes: bfd56cd60521 ("dma-mapping: support highmem in the generic remap allocator") Reported-by: Liviu Dudau <liviu@dudau.co.uk Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Liviu Dudau <liviu@dudau.co.uk>
2019-01-04kernel/sched/: remove caller signal_pending branch predictionsDavidlohr Bueso
This is already done for us internally by the signal machinery. Link: http://lkml.kernel.org/r/20181116002713.8474-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04kernel/locking/mutex.c: remove caller signal_pending branch predictionsDavidlohr Bueso
This is already done for us internally by the signal machinery. Link: http://lkml.kernel.org/r/20181116002713.8474-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04kernel/kcov.c: mark write_comp_data() as notraceAnders Roxell
Since __sanitizer_cov_trace_const_cmp4 is marked as notrace, the function called from __sanitizer_cov_trace_const_cmp4 shouldn't be traceable either. ftrace_graph_caller() gets called every time func write_comp_data() gets called if it isn't marked 'notrace'. This is the backtrace from gdb: #0 ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 #1 0xffffff8010201920 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151 #2 0xffffff8010439714 in write_comp_data (type=5, arg1=0, arg2=0, ip=18446743524224276596) at ../kernel/kcov.c:116 #3 0xffffff8010439894 in __sanitizer_cov_trace_const_cmp4 (arg1=<optimized out>, arg2=<optimized out>) at ../kernel/kcov.c:188 #4 0xffffff8010201874 in prepare_ftrace_return (self_addr=18446743524226602768, parent=0xffffff801014b918, frame_pointer=18446743524223531344) at ./include/generated/atomic-instrumented.h:27 #5 0xffffff801020194c in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182 Rework so that write_comp_data() that are called from __sanitizer_cov_trace_*_cmp*() are marked as 'notrace'. Commit 903e8ff86753 ("kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace") missed to mark write_comp_data() as 'notrace'. When that patch was created gcc-7 was used. In lib/Kconfig.debug config KCOV_ENABLE_COMPARISONS depends on $(cc-option,-fsanitize-coverage=trace-cmp) That code path isn't hit with gcc-7. However, it were that with gcc-8. Link: http://lkml.kernel.org/r/20181206143011.23719-1-anders.roxell@linaro.org Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04kernel/sysctl: add panic_print into sysctlFeng Tang
So that we can also runtime chose to print out the needed system info for panic, other than setting the kernel cmdline. Link: http://lkml.kernel.org/r/1543398842-19295-3-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04panic: add options to print system info when panic happensFeng Tang
Kernel panic issues are always painful to debug, partially because it's not easy to get enough information of the context when panic happens. And we have ramoops and kdump for that, while this commit tries to provide a easier way to show the system info by adding a cmdline parameter, referring some idea from sysrq handler. Link: http://lkml.kernel.org/r/1543398842-19295-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fork: fix some -Wmissing-prototypes warningsYi Wang
We get a warning when building kernel with W=1: kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes] kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes] Add the missing declaration in head file to fix this. Also, remove arch_release_thread_stack() completely because no arch seems to implement it since bb9d81264 (arch: remove tile port). Link: http://lkml.kernel.org/r/1542170087-23645-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04kernel/hung_task.c: break RCU locks based on jiffiesTetsuo Handa
check_hung_uninterruptible_tasks() is currently calling rcu_lock_break() for every 1024 threads. But check_hung_task() is very slow if printk() was called, and is very fast otherwise. If many threads within some 1024 threads called printk(), the RCU grace period might be extended enough to trigger RCU stall warnings. Therefore, calling rcu_lock_break() for every some fixed jiffies will be safer. Link: http://lkml.kernel.org/r/1544800658-11423-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04kernel/hung_task.c: force console verbose before panicLiu, Chuansheng
Based on commit 401c636a0eeb ("kernel/hung_task.c: show all hung tasks before panic"), we could get the call stack of hung task. However, if the console loglevel is not high, we still can not see the useful panic information in practice, and in most cases users don't set console loglevel to high level. This patch is to force console verbose before system panic, so that the real useful information can be seen in the console, instead of being like the following, which doesn't have hung task information. INFO: task init:1 blocked for more than 120 seconds. Tainted: G U W 4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Kernel panic - not syncing: hung_task: blocked tasks CPU: 2 PID: 479 Comm: khungtaskd Tainted: G U W 4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1 Call Trace: dump_stack+0x4f/0x65 panic+0xde/0x231 watchdog+0x290/0x410 kthread+0x12c/0x150 ret_from_fork+0x35/0x40 reboot: panic mode set: p,w Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A6015B675@SHSMSX101.ccr.corp.intel.com Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04proc/sysctl: fix return error for proc_doulongvec_minmax()Cheng Lin
If the number of input parameters is less than the total parameters, an EINVAL error will be returned. For example, we use proc_doulongvec_minmax to pass up to two parameters with kern_table: { .procname = "monitor_signals", .data = &monitor_sigs, .maxlen = 2*sizeof(unsigned long), .mode = 0644, .proc_handler = proc_doulongvec_minmax, }, Reproduce: When passing two parameters, it's work normal. But passing only one parameter, an error "Invalid argument"(EINVAL) is returned. [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals 1 2 [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals -bash: echo: write error: Invalid argument [root@cl150 ~]# echo $? 1 [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals 3 2 [root@cl150 ~]# The following is the result after apply this patch. No error is returned when the number of input parameters is less than the total parameters. [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals 1 2 [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals [root@cl150 ~]# echo $? 0 [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals 3 2 [root@cl150 ~]# There are three processing functions dealing with digital parameters, __do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax. This patch deals with __do_proc_doulongvec_minmax, just as __do_proc_dointvec does, adding a check for parameters 'left'. In __do_proc_douintvec, its code implementation explicitly does not support multiple inputs. static int __do_proc_douintvec(...){ ... /* * Arrays are not supported, keep this simple. *Do not* add * support for them. */ if (vleft != 1) { *lenp = 0; return -EINVAL; } ... } So, just __do_proc_doulongvec_minmax has the problem. And most use of proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax just have one parameter. Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04make 'user_access_begin()' do 'access_ok()'Linus Torvalds
Originally, the rule used to be that you'd have to do access_ok() separately, and then user_access_begin() before actually doing the direct (optimized) user access. But experience has shown that people then decide not to do access_ok() at all, and instead rely on it being implied by other operations or similar. Which makes it very hard to verify that the access has actually been range-checked. If you use the unsafe direct user accesses, hardware features (either SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged Access Never - on ARM) do force you to use user_access_begin(). But nothing really forces the range check. By putting the range check into user_access_begin(), we actually force people to do the right thing (tm), and the range check vill be visible near the actual accesses. We have way too long a history of people trying to avoid them. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04dma-mapping: remove a few unused exportsChristoph Hellwig
Now that the slow path DMA API calls are implemented out of line a few helpers only used by them don't need to be exported anymore. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-04dma-mapping: remove dmam_{declare,release}_coherent_memoryChristoph Hellwig
These functions have never been used. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-04dma-mapping: implement dmam_alloc_coherent using dmam_alloc_attrsChristoph Hellwig
dmam_alloc_coherent is just the default no-flags case of dmam_alloc_attrs, so take advantage of this similar to the non-managed version. Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-04dma-mapping: implement dma_map_single_attrs using dma_map_page_attrsChristoph Hellwig
And also switch the way we implement the unmap side around to stay consistent. This ensures dma-debug works again because it records which function we used for mapping to ensure it is also used for unmapping, and also reduces further code duplication. Last but not least this also officially allows calling dma_sync_single_* for mappings created using dma_map_page, which is perfectly fine given that the sync calls only take a dma_addr_t, but not a virtual address or struct page. Fixes: 7f0fee242e ("dma-mapping: merge dma_unmap_page_attrs and dma_unmap_single_attrs") Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: LABBE Corentin <clabbe.montjoie@gmail.com>
2019-01-03Remove 'type' argument from access_ok() functionLinus Torvalds
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Several fixes here. Basically split down the line between newly introduced regressions and long existing problems: 1) Double free in tipc_enable_bearer(), from Cong Wang. 2) Many fixes to nf_conncount, from Florian Westphal. 3) op->get_regs_len() can throw an error, check it, from Yunsheng Lin. 4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman driver, from Scott Wood. 5) Inifnite loop in fib_empty_table(), from Yue Haibing. 6) Use after free in ax25_fillin_cb(), from Cong Wang. 7) Fix socket locking in nr_find_socket(), also from Cong Wang. 8) Fix WoL wakeup enable in r8169, from Heiner Kallweit. 9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani. 10) Fix ptr_ring wrap during queue swap, from Cong Wang. 11) Missing shutdown callback in hinic driver, from Xue Chaojing. 12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano Brivio. 13) BPF out of bounds speculation fixes from Daniel Borkmann" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits) ipv6: Consider sk_bound_dev_if when binding a socket to an address ipv6: Fix dump of specific table with strict checking bpf: add various test cases to selftests bpf: prevent out of bounds speculation on pointer arithmetic bpf: fix check_map_access smin_value test when pointer contains offset bpf: restrict unknown scalars of mixed signed bounds for unprivileged bpf: restrict stack pointer arithmetic for unprivileged bpf: restrict map value pointer arithmetic for unprivileged bpf: enable access to ax register also from verifier rewrite bpf: move tmp variable into ax register in interpreter bpf: move {prev_,}insn_idx into verifier env isdn: fix kernel-infoleak in capi_unlocked_ioctl ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error net/hamradio/6pack: use mod_timer() to rearm timers net-next/hinic:add shutdown callback net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT ip: validate header length on virtual device xmit tap: call skb_probe_transport_header after setting skb->dev ptr_ring: wrap back ->producer in __ptr_ring_swap_queue() net: rds: remove unnecessary NULL check ...
2019-01-02Merge tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "Stable bugfixes: - xprtrdma: Yet another double DMA-unmap # v4.20 Features: - Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG - Per-xprt rdma receive workqueues - Drop support for FMR memory registration - Make port= mount option optional for RDMA mounts Other bugfixes and cleanups: - Remove unused nfs4_xdev_fs_type declaration - Fix comments for behavior that has changed - Remove generic RPC credentials by switching to 'struct cred' - Fix crossing mountpoints with different auth flavors - Various xprtrdma fixes from testing and auditing the close code - Fixes for disconnect issues when using xprtrdma with krb5 - Clean up and improve xprtrdma trace points - Fix NFS v4.2 async copy reboot recovery" * tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits) sunrpc: convert to DEFINE_SHOW_ATTRIBUTE sunrpc: Add xprt after nfs4_test_session_trunk() sunrpc: convert unnecessary GFP_ATOMIC to GFP_NOFS sunrpc: handle ENOMEM in rpcb_getport_async NFS: remove unnecessary test for IS_ERR(cred) xprtrdma: Prevent leak of rpcrdma_rep objects NFSv4.2 fix async copy reboot recovery xprtrdma: Don't leak freed MRs xprtrdma: Add documenting comment for rpcrdma_buffer_destroy xprtrdma: Replace outdated comment for rpcrdma_ep_post xprtrdma: Update comments in frwr_op_send SUNRPC: Fix some kernel doc complaints SUNRPC: Simplify defining common RPC trace events NFS: Fix NFSv4 symbolic trace point output xprtrdma: Trace mapping, alloc, and dereg failures xprtrdma: Add trace points for calls to transport switch methods xprtrdma: Relocate the xprtrdma_mr_map trace points xprtrdma: Clean up of xprtrdma chunk trace points xprtrdma: Remove unused fields from rpcrdma_ia xprtrdma: Cull dprintk() call sites ...
2019-01-02bpf: prevent out of bounds speculation on pointer arithmeticDaniel Borkmann
Jann reported that the original commit back in b2157399cc98 ("bpf: prevent out-of-bounds speculation") was not sufficient to stop CPU from speculating out of bounds memory access: While b2157399cc98 only focussed on masking array map access for unprivileged users for tail calls and data access such that the user provided index gets sanitized from BPF program and syscall side, there is still a more generic form affected from BPF programs that applies to most maps that hold user data in relation to dynamic map access when dealing with unknown scalars or "slow" known scalars as access offset, for example: - Load a map value pointer into R6 - Load an index into R7 - Do a slow computation (e.g. with a memory dependency) that loads a limit into R8 (e.g. load the limit from a map for high latency, then mask it to make the verifier happy) - Exit if R7 >= R8 (mispredicted branch) - Load R0 = R6[R7] - Load R0 = R6[R0] For unknown scalars there are two options in the BPF verifier where we could derive knowledge from in order to guarantee safe access to the memory: i) While </>/<=/>= variants won't allow to derive any lower or upper bounds from the unknown scalar where it would be safe to add it to the map value pointer, it is possible through ==/!= test however. ii) another option is to transform the unknown scalar into a known scalar, for example, through ALU ops combination such as R &= <imm> followed by R |= <imm> or any similar combination where the original information from the unknown scalar would be destroyed entirely leaving R with a constant. The initial slow load still precedes the latter ALU ops on that register, so the CPU executes speculatively from that point. Once we have the known scalar, any compare operation would work then. A third option only involving registers with known scalars could be crafted as described in [0] where a CPU port (e.g. Slow Int unit) would be filled with many dependent computations such that the subsequent condition depending on its outcome has to wait for evaluation on its execution port and thereby executing speculatively if the speculated code can be scheduled on a different execution port, or any other form of mistraining as described in [1], for example. Given this is not limited to only unknown scalars, not only map but also stack access is affected since both is accessible for unprivileged users and could potentially be used for out of bounds access under speculation. In order to prevent any of these cases, the verifier is now sanitizing pointer arithmetic on the offset such that any out of bounds speculation would be masked in a way where the pointer arithmetic result in the destination register will stay unchanged, meaning offset masked into zero similar as in array_index_nospec() case. With regards to implementation, there are three options that were considered: i) new insn for sanitation, ii) push/pop insn and sanitation as inlined BPF, iii) reuse of ax register and sanitation as inlined BPF. Option i) has the downside that we end up using from reserved bits in the opcode space, but also that we would require each JIT to emit masking as native arch opcodes meaning mitigation would have slow adoption till everyone implements it eventually which is counter-productive. Option ii) and iii) have both in common that a temporary register is needed in order to implement the sanitation as inlined BPF since we are not allowed to modify the source register. While a push / pop insn in ii) would be useful to have in any case, it requires once again that every JIT needs to implement it first. While possible, amount of changes needed would also be unsuitable for a -stable patch. Therefore, the path which has fewer changes, less BPF instructions for the mitigation and does not require anything to be changed in the JITs is option iii) which this work is pursuing. The ax register is already mapped to a register in all JITs (modulo arm32 where it's mapped to stack as various other BPF registers there) and used in constant blinding for JITs-only so far. It can be reused for verifier rewrites under certain constraints. The interpreter's tmp "register" has therefore been remapped into extending the register set with hidden ax register and reusing that for a number of instructions that needed the prior temporary variable internally (e.g. div, mod). This allows for zero increase in stack space usage in the interpreter, and enables (restricted) generic use in rewrites otherwise as long as such a patchlet does not make use of these instructions. The sanitation mask is dynamic and relative to the offset the map value or stack pointer currently holds. There are various cases that need to be taken under consideration for the masking, e.g. such operation could look as follows: ptr += val or val += ptr or ptr -= val. Thus, the value to be sanitized could reside either in source or in destination register, and the limit is different depending on whether the ALU op is addition or subtraction and depending on the current known and bounded offset. The limit is derived as follows: limit := max_value_size - (smin_value + off). For subtraction: limit := umax_value + off. This holds because we do not allow any pointer arithmetic that would temporarily go out of bounds or would have an unknown value with mixed signed bounds where it is unclear at verification time whether the actual runtime value would be either negative or positive. For example, we have a derived map pointer value with constant offset and bounded one, so limit based on smin_value works because the verifier requires that statically analyzed arithmetic on the pointer must be in bounds, and thus it checks if resulting smin_value + off and umax_value + off is still within map value bounds at time of arithmetic in addition to time of access. Similarly, for the case of stack access we derive the limit as follows: MAX_BPF_STACK + off for subtraction and -off for the case of addition where off := ptr_reg->off + ptr_reg->var_off.value. Subtraction is a special case for the masking which can be in form of ptr += -val, ptr -= -val, or ptr -= val. In the first two cases where we know that the value is negative, we need to temporarily negate the value in order to do the sanitation on a positive value where we later swap the ALU op, and restore original source register if the value was in source. The sanitation of pointer arithmetic alone is still not fully sufficient as is, since a scenario like the following could happen ... PTR += 0x1000 (e.g. K-based imm) PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON PTR += 0x1000 PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON [...] ... which under speculation could end up as ... PTR += 0x1000 PTR -= 0 [ truncated by mitigation ] PTR += 0x1000 PTR -= 0 [ truncated by mitigation ] [...] ... and therefore still access out of bounds. To prevent such case, the verifier is also analyzing safety for potential out of bounds access under speculative execution. Meaning, it is also simulating pointer access under truncation. We therefore "branch off" and push the current verification state after the ALU operation with known 0 to the verification stack for later analysis. Given the current path analysis succeeded it is likely that the one under speculation can be pruned. In any case, it is also subject to existing complexity limits and therefore anything beyond this point will be rejected. In terms of pruning, it needs to be ensured that the verification state from speculative execution simulation must never prune a non-speculative execution path, therefore, we mark verifier state accordingly at the time of push_stack(). If verifier detects out of bounds access under speculative execution from one of the possible paths that includes a truncation, it will reject such program. Given we mask every reg-based pointer arithmetic for unprivileged programs, we've been looking into how it could affect real-world programs in terms of size increase. As the majority of programs are targeted for privileged-only use case, we've unconditionally enabled masking (with its alu restrictions on top of it) for privileged programs for the sake of testing in order to check i) whether they get rejected in its current form, and ii) by how much the number of instructions and size will increase. We've tested this by using Katran, Cilium and test_l4lb from the kernel selftests. For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb we've used test_l4lb.o as well as test_l4lb_noinline.o. We found that none of the programs got rejected by the verifier with this change, and that impact is rather minimal to none. balancer_kern.o had 13,904 bytes (1,738 insns) xlated and 7,797 bytes JITed before and after the change. Most complex program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated and 18,538 bytes JITed before and after and none of the other tail call programs in bpf_lxc.o had any changes either. For the older bpf_lxc_opt_-DUNKNOWN.o object we found a small increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed after the change. Other programs from that object file had similar small increase. Both test_l4lb.o had no change and remained at 6,544 bytes (817 insns) xlated and 3,401 bytes JITed and for test_l4lb_noinline.o constant at 5,080 bytes (634 insns) xlated and 3,313 bytes JITed. This can be explained in that LLVM typically optimizes stack based pointer arithmetic by using K-based operations and that use of dynamic map access is not overly frequent. However, in future we may decide to optimize the algorithm further under known guarantees from branch and value speculation. Latter seems also unclear in terms of prediction heuristics that today's CPUs apply as well as whether there could be collisions in e.g. the predictor's Value History/Pattern Table for triggering out of bounds access, thus masking is performed unconditionally at this point but could be subject to relaxation later on. We were generally also brainstorming various other approaches for mitigation, but the blocker was always lack of available registers at runtime and/or overhead for runtime tracking of limits belonging to a specific pointer. Thus, we found this to be minimally intrusive under given constraints. With that in place, a simple example with sanitized access on unprivileged load at post-verification time looks as follows: # bpftool prog dump xlated id 282 [...] 28: (79) r1 = *(u64 *)(r7 +0) 29: (79) r2 = *(u64 *)(r7 +8) 30: (57) r1 &= 15 31: (79) r3 = *(u64 *)(r0 +4608) 32: (57) r3 &= 1 33: (47) r3 |= 1 34: (2d) if r2 > r3 goto pc+19 35: (b4) (u32) r11 = (u32) 20479 | 36: (1f) r11 -= r2 | Dynamic sanitation for pointer 37: (4f) r11 |= r2 | arithmetic with registers 38: (87) r11 = -r11 | containing bounded or known 39: (c7) r11 s>>= 63 | scalars in order to prevent 40: (5f) r11 &= r2 | out of bounds speculation. 41: (0f) r4 += r11 | 42: (71) r4 = *(u8 *)(r4 +0) 43: (6f) r4 <<= r1 [...] For the case where the scalar sits in the destination register as opposed to the source register, the following code is emitted for the above example: [...] 16: (b4) (u32) r11 = (u32) 20479 17: (1f) r11 -= r2 18: (4f) r11 |= r2 19: (87) r11 = -r11 20: (c7) r11 s>>= 63 21: (5f) r2 &= r11 22: (0f) r2 += r0 23: (61) r0 = *(u32 *)(r2 +0) [...] JIT blinding example with non-conflicting use of r10: [...] d5: je 0x0000000000000106 _ d7: mov 0x0(%rax),%edi | da: mov $0xf153246,%r10d | Index load from map value and e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f. e7: and %r10,%rdi |_ ea: mov $0x2f,%r10d | f0: sub %rdi,%r10 | Sanitized addition. Both use r10 f3: or %rdi,%r10 | but do not interfere with each f6: neg %r10 | other. (Neither do these instructions f9: sar $0x3f,%r10 | interfere with the use of ax as temp fd: and %r10,%rdi | in interpreter.) 100: add %rax,%rdi |_ 103: mov 0x0(%rdi),%eax [...] Tested that it fixes Jann's reproducer, and also checked that test_verifier and test_progs suite with interpreter, JIT and JIT with hardening enabled on x86-64 and arm64 runs successfully. [0] Speculose: Analyzing the Security Implications of Speculative Execution in CPUs, Giorgi Maisuradze and Christian Rossow, https://arxiv.org/pdf/1801.04084.pdf [1] A Systematic Evaluation of Transient Execution Attacks and Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, Daniel Gruss, https://arxiv.org/pdf/1811.05441.pdf Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02bpf: fix check_map_access smin_value test when pointer contains offsetDaniel Borkmann
In check_map_access() we probe actual bounds through __check_map_access() with offset of reg->smin_value + off for lower bound and offset of reg->umax_value + off for the upper bound. However, even though the reg->smin_value could have a negative value, the final result of the sum with off could be positive when pointer arithmetic with known and unknown scalars is combined. In this case we reject the program with an error such as "R<x> min value is negative, either use unsigned index or do a if (index >=0) check." even though the access itself would be fine. Therefore extend the check to probe whether the actual resulting reg->smin_value + off is less than zero. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02bpf: restrict unknown scalars of mixed signed bounds for unprivilegedDaniel Borkmann
For unknown scalars of mixed signed bounds, meaning their smin_value is negative and their smax_value is positive, we need to reject arithmetic with pointer to map value. For unprivileged the goal is to mask every map pointer arithmetic and this cannot reliably be done when it is unknown at verification time whether the scalar value is negative or positive. Given this is a corner case, the likelihood of breaking should be very small. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02bpf: restrict stack pointer arithmetic for unprivilegedDaniel Borkmann
Restrict stack pointer arithmetic for unprivileged users in that arithmetic itself must not go out of bounds as opposed to the actual access later on. Therefore after each adjust_ptr_min_max_vals() with a stack pointer as a destination we simulate a check_stack_access() of 1 byte on the destination and once that fails the program is rejected for unprivileged program loads. This is analog to map value pointer arithmetic and needed for masking later on. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02bpf: restrict map value pointer arithmetic for unprivilegedDaniel Borkmann
Restrict map value pointer arithmetic for unprivileged users in that arithmetic itself must not go out of bounds as opposed to the actual access later on. Therefore after each adjust_ptr_min_max_vals() with a map value pointer as a destination it will simulate a check_map_access() of 1 byte on the destination and once that fails the program is rejected for unprivileged program loads. We use this later on for masking any pointer arithmetic with the remainder of the map value space. The likelihood of breaking any existing real-world unprivileged eBPF program is very small for this corner case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02bpf: enable access to ax register also from verifier rewriteDaniel Borkmann
Right now we are using BPF ax register in JIT for constant blinding as well as in interpreter as temporary variable. Verifier will not be able to use it simply because its use will get overridden from the former in bpf_jit_blind_insn(). However, it can be made to work in that blinding will be skipped if there is prior use in either source or destination register on the instruction. Taking constraints of ax into account, the verifier is then open to use it in rewrites under some constraints. Note, ax register already has mappings in every eBPF JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02bpf: move tmp variable into ax register in interpreterDaniel Borkmann
This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run() into the hidden ax register. The latter is currently only used in JITs for constant blinding as a temporary scratch register, meaning the BPF interpreter will never see the use of ax. Therefore it is safe to use it for the cases where tmp has been used earlier. This is needed to later on allow restricted hidden use of ax in both interpreter and JITs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>