summaryrefslogtreecommitdiff
path: root/arch/um/kernel/skas
AgeCommit message (Collapse)Author
2025-06-02um: pass FD for memory operations when neededBenjamin Berg
Instead of always sharing the FDs with the userspace process, only hand over the FDs needed for mmap when required. The idea is that userspace might be able to force the stub into executing an mmap syscall, however, it will not be able to manipulate the control flow sufficiently to have access to an FD that would allow mapping arbitrary memory. Security wise, we need to be sure that only the expected syscalls are executed after the kernel sends FDs through the socket. This is currently not the case, as userspace can trivially jump to the rt_sigreturn syscall instruction to execute any syscall that the stub is permitted to do. With this, it can trick the kernel to send the FD, which in turn allows userspace to freely map any physical memory. As such, this is currently *not* secure. However, in principle the approach should be fine with a more strict SECCOMP filter and a careful review of the stub control flow (as userspace can prepare a stack). With some care, it is likely possible to extend the security model to SMP if desired. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-8-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02um: Implement kernel side of SECCOMP based process handlingBenjamin Berg
This adds the kernel side of the seccomp based process handling. Co-authored-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-6-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02um: Track userspace children dying in SECCOMP modeBenjamin Berg
When in seccomp mode, we would hang forever on the futex if a child has died unexpectedly. In contrast, ptrace mode will notice it and kill the corresponding thread when it fails to run it. Fix this issue using a new IRQ that is fired after a SIGCHLD and keeping an (internal) list of all MMs. In the IRQ handler, find the affected MM and set its PID to -1 as well as the futex variable to FUTEX_IN_KERN. This, together with futex returning -EINTR after the signal is sufficient to implement a race-free detection of a child dying. Note that this also enables IRQ handling while starting a userspace process. This should be safe and SECCOMP requires the IRQ in case the process does not come up properly. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-5-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-02um: Add stub side of SECCOMP/futex based process handlingBenjamin Berg
This adds the stub side for the new seccomp process management code. In this case we do register save/restore through the signal handler mcontext. Add special code for handling TLS, which for x86_64 means setting the FS_BASE/GS_BASE registers while for i386 it means calling the set_thread_area syscall. Co-authored-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250602130052.545733-3-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-18um: work around sched_yield not yielding in time-travel modeBenjamin Berg
sched_yield by a userspace may not actually cause scheduling in time-travel mode as no time has passed. In the case seen it appears to be a badly implemented userspace spinlock in ASAN. Unfortunately, with time-travel it causes an extreme slowdown or even deadlock depending on the kernel configuration (CONFIG_UML_MAX_USERSPACE_ITERATIONS). Work around it by accounting time to the process whenever it executes a sched_yield syscall. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20250314130815.226872-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-12um: move thread info into taskBenjamin Berg
This selects the THREAD_INFO_IN_TASK option for UM and changes the way that the current task is discovered. This is trivial though, as UML already tracks the current task in cpu_tasks[] and this can be used to retrieve it. Also remove the signal handler code that copies the thread information into the IRQ stack. It is obsolete now, which also means that the mentioned race condition cannot happen anymore. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Hajime Tazaki <thehajime@gmail.com> Link: https://patch.msgid.link/20241111102910.46512-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-07um: Remove double zero checkShaojie Dong
free_pages() performs a parameter null check inside therefore remove double zero check here. Signed-off-by: Shaojie Dong <quic_shaojied@quicinc.com> Link: https://patch.msgid.link/20241025-upstream_branch-v5-1-b8998beb2c64@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-26um: fix stub exe build with CONFIG_GCOVJohannes Berg
CONFIG_GCOV is special and only in UML since it builds the kernel with a "userspace" option. This is fine, but the stub is even more special and not really a full userspace process, so it then fails to link as reported. Remove the GCOV options from the stub build. For good measure, also remove the GPROF options, even though they don't seem to cause build failures now. To be able to do this, export the specific options (GCOV_OPT and GPROF_OPT) but rename them so there's less chance of any conflicts. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410242238.SXhs2kQ4-lkp@intel.com/ Fixes: 32e8eaf263d9 ("um: use execveat to create userspace MMs") Link: https://patch.msgid.link/20241025102700.9fbb9c34473f.I7f1537fe075638f8da64beb52ef6c9e5adc51bc3@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23um: Abandon the _PAGE_NEWPROT bitTiwei Bie
When a PTE is updated in the page table, the _PAGE_NEWPAGE bit will always be set. And the corresponding page will always be mapped or unmapped depending on whether the PTE is present or not. The check on the _PAGE_NEWPROT bit is not really reachable. Abandoning it will allow us to simplify the code and remove the unreachable code. Reviewed-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20241011102354.1682626-2-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23um: make stub_exe _start() pure inline asmJohannes Berg
Since __attribute__((naked)) cannot be used with functions containing C statements, just generate the few instructions it needs in assembly directly. While at it, fix the stack usage ("1 + 2*x - 1" is odd) and document what it must do, and why it must adjust the stack. Fixes: 8508a5e0e9db ("um: Fix misaligned stack in stub_exe") Link: https://lore.kernel.org/linux-um/CABVgOSntH-uoOFMP5HwMXjx_f1osMnVdhgKRKm4uz6DFm2Lb8Q@mail.gmail.com/ Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-18um: Fix misaligned stack in stub_exeDavid Gow
The stub_exe could segfault when built with some compilers (e.g. gcc 13.2.0), as SSE instructions which relied on stack alignment could be generated, but the stack was misaligned. This seems to be due to the __start entry point being run with a 16-byte aligned stack, but the x86_64 SYSV ABI wanting the stack to be so aligned _before_ a function call (so it is misaligned when the function is entered due to the return address being pushed). The function prologue then realigns it. Because the entry point is never _called_, and hence there is no return address, the prologue is therefore actually misaligning it, and causing the generated movaps instructions to SIGSEGV. This results in the following error: start_userspace : expected SIGSTOP, got status = 139 Don't generate this prologue for __start by using __attribute__((naked)), which resolves the issue. Fixes: 32e8eaf263d9 ("um: use execveat to create userspace MMs") Signed-off-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/linux-um/CABVgOS=boUoG6=LHFFhxEd8H8jDP1zOaPKFEjH+iy2n2Q5S2aQ@mail.gmail.com/ Link: https://patch.msgid.link/20241017231007.1500497-2-davidgow@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-17um: Disable auto variable initialization for stub_exe.cNathan Chancellor
When automatic variable initialization is enabled via CONFIG_INIT_STACK_ALL_{PATTERN,ZERO}, clang will insert a call to memset() to initialize an object created with __builtin_alloca(). This ultimately breaks the build when linking stub_exe because it is a standalone executable that does not include or link against memset(). ld: arch/um/kernel/skas/stub_exe.o: in function `_start': arch/um/kernel/skas/stub_exe.c:83:(.ltext+0x15): undefined reference to `memset' Disable automatic variable initialization for stub_exe.c by passing the default value of 'uninitialized' to '-ftrivial-auto-var-init', which avoids generating the call to memset(). This code is small and runs quickly as it is just designed to set up an environment, so stack variable initialization is unnecessary overhead for little gain. Fixes: 32e8eaf263d9 ("um: use execveat to create userspace MMs") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://patch.msgid.link/20241016-uml-fix-stub_exe-clang-v1-2-3d6381dc5a78@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-17um: Fix passing '-n' to linker for stub_exeNathan Chancellor
When building stub_exe with clang, there is an error because '-n' is not a recognized flag by the clang driver (which is being used to invoke the linker): clang: error: unknown argument: '-n' '-n' should be passed along to the linker, as it is the short flag for '--nmagic', so prefix it with '-Wl,'. Fixes: 32e8eaf263d9 ("um: use execveat to create userspace MMs") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://patch.msgid.link/20241016-uml-fix-stub_exe-clang-v1-1-3d6381dc5a78@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10um: clear all memory in new userspace processesBenjamin Berg
With the change to use execve() we can now safely clear the memory up to STUB_START as rseq will not be trying to use memory in that region. Also, on 64 bit the previous changes should mean that there is no usable memory range above the stub. Make the change and remove the comment as it is not needed anymore. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240919124511.282088-10-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10um: Set parent death signal for userspace processBenjamin Berg
Enable PR_SET_PDEATHSIG so that the UML userspace process will be killed when the kernel exits unexpectedly. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240919124511.282088-4-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-10um: use execveat to create userspace MMsBenjamin Berg
Using clone will not undo features that have been enabled by libc. An example of this already happening is rseq, which could cause the kernel to read/write memory of the userspace process. In the future the standard library might also use mseal by default to protect itself, which would also thwart our attempts at unmapping everything. Solve all this by taking a step back and doing an execve into a tiny static binary that sets up the minimal environment required for the stub without using any standard library. That way we have a clean execution environment that is fully under the control of UML. Note that this changes things a bit as the FDs are not anymore shared with the kernel. Instead, we explicitly share the FDs for the physical memory and all existing iomem regions. Doing this is fine, as iomem regions cannot be added at runtime. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240919124511.282088-3-benjamin@sipsolutions.net [use pipe() instead of pipe2(), remove unneeded close() calls] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-09-12um: fix time-travel syscall scheduling hackJohannes Berg
The schedule() call there really never did anything at least since the introduction of the EEVDF scheduler, but now I found a case where we permanently hang in a loop of -ERESTARTNOINTR (due to locking.) Work around it by making any syscalls with error return take time (and then schedule after) so we cannot hang in such a loop forever. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-09-12um: Remove unused mm_fd field from mm_idTiwei Bie
It's no longer used since the removal of the SKAS3/4 support. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-09-12um: Remove unused fields from thread_structTiwei Bie
These fields are no longer used since the removal of tt mode. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-03um: refactor TLB update handlingBenjamin Berg
Conceptually, we want the memory mappings to always be up to date and represent whatever is in the TLB. To ensure that, we need to sync them over in the userspace case and for the kernel we need to process the mappings. The kernel will call flush_tlb_* if page table entries that were valid before become invalid. Unfortunately, this is not the case if entries are added. As such, change both flush_tlb_* and set_ptes to track the memory range that has to be synchronized. For the kernel, we need to execute a flush_tlb_kern_* immediately but we can wait for the first page fault in case of set_ptes. For userspace in contrast we only store that a range of memory needs to be synced and do so whenever we switch to that process. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240703134536.1161108-13-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-03um: Do not flush MM in flush_threadBenjamin Berg
There should be no need to flush the memory in flush_thread. Doing this likely worked around some issue where memory was still incorrectly mapped when creating or cloning an MM. With the removal of the special clone path, that isn't relevant anymore. However, add the flush into MM initialization so that any new userspace MM is guaranteed to be clean. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240703134536.1161108-10-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-03um: Delay flushing syscalls until the thread is restartedBenjamin Berg
As running the syscalls is expensive due to context switches, we should do so as late as possible in case more syscalls need to be queued later on. This will also benefit a later move to a SECCOMP enabled userspace as in that case the need for extra context switches is removed entirely. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Link: https://patch.msgid.link/20240703134536.1161108-9-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-03um: remove copy_context_skas0Benjamin Berg
The kernel flushes the memory ranges anyway for CoW and does not assume that the userspace process has anything set up already. So, start with a fresh process for the new mm context. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240703134536.1161108-8-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-03um: remove LDT supportBenjamin Berg
The current LDT code has a few issues that mean it should be redone in a different way once we always start with a fresh MM even when cloning. In a new and better world, the kernel would just ensure its own LDT is clear at startup. At that point, all that is needed is a simple function to populate the LDT from another MM in arch_dup_mmap combined with some tracking of the installed LDT entries for each MM. Note that the old implementation was even incorrect with regard to reading, as it copied out the LDT entries in the internal format rather than converting them to the userspace structure. Removal should be fine as the LDT is not used for thread-local storage anymore. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20240703134536.1161108-7-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-03um: Rework syscall handlingBenjamin Berg
Rework syscall handling to be platform independent. Also create a clean split between queueing of syscalls and flushing them out, removing the need to keep state in the code that triggers the syscalls. The code adds syscall_data_len to the global mm_id structure. This will be used later to allow surrounding code to track whether syscalls still need to run and if errors occurred. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Link: https://patch.msgid.link/20240703134536.1161108-5-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-03um: Create signal stack memory assignment in stub_dataBenjamin Berg
When we switch to use seccomp, we need both the signal stack and other data (i.e. syscall information) to co-exist in the stub data. To facilitate this, start by defining separate memory areas for the stack and syscall data. This moves the signal stack onto a new page as the memory area is not sufficient to hold both signal stack and syscall information. Only change the signal stack setup for now, as the syscall code will be reworked later. Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net> Link: https://patch.msgid.link/20240703134536.1161108-3-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-04-30um: Stop tracking host PID in cpu_tasksTiwei Bie
The host PID tracked in 'cpu_tasks' is no longer used. Stopping tracking it will also save some cycles. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-04-22um: Add missing headersTiwei Bie
This will address below -Wmissing-prototypes warnings: arch/um/kernel/mem.c:202:8: warning: no previous prototype for ‘pgd_alloc’ [-Wmissing-prototypes] arch/um/kernel/mem.c:215:7: warning: no previous prototype for ‘uml_kmalloc’ [-Wmissing-prototypes] arch/um/kernel/process.c:207:6: warning: no previous prototype for ‘arch_cpu_idle’ [-Wmissing-prototypes] arch/um/kernel/process.c:328:15: warning: no previous prototype for ‘arch_align_stack’ [-Wmissing-prototypes] arch/um/kernel/reboot.c:45:6: warning: no previous prototype for ‘machine_restart’ [-Wmissing-prototypes] arch/um/kernel/reboot.c:51:6: warning: no previous prototype for ‘machine_power_off’ [-Wmissing-prototypes] arch/um/kernel/reboot.c:57:6: warning: no previous prototype for ‘machine_halt’ [-Wmissing-prototypes] arch/um/kernel/skas/mmu.c:17:5: warning: no previous prototype for ‘init_new_context’ [-Wmissing-prototypes] arch/um/kernel/skas/mmu.c:60:6: warning: no previous prototype for ‘destroy_context’ [-Wmissing-prototypes] arch/um/kernel/skas/process.c:36:12: warning: no previous prototype for ‘start_uml’ [-Wmissing-prototypes] arch/um/kernel/time.c:807:15: warning: no previous prototype for ‘calibrate_delay_is_known’ [-Wmissing-prototypes] arch/um/kernel/tlb.c:594:6: warning: no previous prototype for ‘force_flush_all’ [-Wmissing-prototypes] arch/x86/um/bugs_32.c:22:6: warning: no previous prototype for ‘arch_check_bugs’ [-Wmissing-prototypes] arch/x86/um/bugs_32.c:44:6: warning: no previous prototype for ‘arch_examine_signal’ [-Wmissing-prototypes] arch/x86/um/bugs_64.c:9:6: warning: no previous prototype for ‘arch_check_bugs’ [-Wmissing-prototypes] arch/x86/um/bugs_64.c:13:6: warning: no previous prototype for ‘arch_examine_signal’ [-Wmissing-prototypes] arch/x86/um/elfcore.c:10:12: warning: no previous prototype for ‘elf_core_extra_phdrs’ [-Wmissing-prototypes] arch/x86/um/elfcore.c:15:5: warning: no previous prototype for ‘elf_core_write_extra_phdrs’ [-Wmissing-prototypes] arch/x86/um/elfcore.c:42:5: warning: no previous prototype for ‘elf_core_write_extra_data’ [-Wmissing-prototypes] arch/x86/um/elfcore.c:63:8: warning: no previous prototype for ‘elf_core_extra_data_size’ [-Wmissing-prototypes] arch/x86/um/fault.c:18:5: warning: no previous prototype for ‘arch_fixup’ [-Wmissing-prototypes] arch/x86/um/os-Linux/mcontext.c:7:6: warning: no previous prototype for ‘get_regs_from_mc’ [-Wmissing-prototypes] arch/x86/um/os-Linux/tls.c:22:6: warning: no previous prototype for ‘check_host_supports_tls’ [-Wmissing-prototypes] Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-01-04um: document arch_futex_atomic_op_inuserAnton Ivanov
arch_futex_atomic_op_inuser was not documented correctly resulting in build time warnings. Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2023-04-20um: make stub data pages size tweakableJohannes Berg
There's a lot of code here that hard-codes that the data is a single page, and right now that seems to be sufficient, but to make it easier to change this in the future, add a new STUB_DATA_PAGES constant and use it throughout the code. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2023-02-05kbuild: remove --include-dir MAKEFLAG from top MakefileMasahiro Yamada
I added $(srctree)/ to some included Makefiles in the following commits: - 3204a7fb98a3 ("kbuild: prefix $(srctree)/ to some included Makefiles") - d82856395505 ("kbuild: do not require sub-make for separate output tree builds") They were a preparation for removing --include-dir flag. I have never thought --include-dir useful. Rather, it _is_ harmful. For example, run the following commands: $ make -s ARCH=x86 mrproper defconfig $ make ARCH=arm O=foo dtbs make[1]: Entering directory '/tmp/linux/foo' HOSTCC scripts/basic/fixdep Error: kernelrelease not valid - run 'make prepare' to update it UPD include/config/kernel.release make[1]: Leaving directory '/tmp/linux/foo' The first command configures the source tree for x86. The next command tries to build ARM device trees in the separate foo/ directory - this must stop because the directory foo/ has not been configured yet. However, due to --include-dir=$(abs_srctree), the top Makefile includes the wrong include/config/auto.conf from the source tree and continues building. Kbuild traverses the directory tree, but of course it does not work correctly. The Error message is also pointless - 'make prepare' does not help at all for fixing the issue. This commit fixes more arch Makefile, and finally removes --include-dir from the top Makefile. There are more breakages under drivers/, but I do not volunteer to fix them all. I just moved --include-dir to drivers/Makefile. With this commit, the second command will stop with a sensible message. $ make -s ARCH=x86 mrproper defconfig $ make ARCH=arm O=foo dtbs make[1]: Entering directory '/tmp/linux/foo' SYNC include/config/auto.conf.cmd *** *** The source tree is not clean, please run 'make ARCH=arm mrproper' *** in /tmp/linux *** make[2]: *** [../Makefile:646: outputmakefile] Error 1 /tmp/linux/Makefile:770: include/config/auto.conf.cmd: No such file or directory make[1]: *** [/tmp/linux/Makefile:793: include/config/auto.conf.cmd] Error 2 make[1]: Leaving directory '/tmp/linux/foo' make: *** [Makefile:226: __sub-make] Error 2 Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-01-11Merge tag 'locking_core_for_v5.17_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Borislav Petkov: "Lots of cleanups and preparation. Highlights: - futex: Cleanup and remove runtime futex_cmpxchg detection - rtmutex: Some fixes for the PREEMPT_RT locking infrastructure - kcsan: Share owner_on_cpu() between mutex,rtmutex and rwsem and annotate the racy owner->on_cpu access *once*. - atomic64: Dead-Code-Elemination" [ Description above by Peter Zijlstra ] * tag 'locking_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/atomic: atomic64: Remove unusable atomic ops futex: Fix additional regressions locking: Allow to include asm/spinlock_types.h from linux/spinlock_types_raw.h x86/mm: Include spinlock_t definition in pgtable. locking: Mark racy reads of owner->on_cpu locking: Make owner_on_cpu() into <linux/sched.h> lockdep/selftests: Adapt ww-tests for PREEMPT_RT lockdep/selftests: Skip the softirq related tests on PREEMPT_RT lockdep/selftests: Unbalanced migrate_disable() & rcu_read_lock(). lockdep/selftests: Avoid using local_lock_{acquire|release}(). lockdep: Remove softirq accounting on PREEMPT_RT. locking/rtmutex: Add rt_mutex_lock_nest_lock() and rt_mutex_lock_killable(). locking/rtmutex: Squash self-deadlock check for ww_rt_mutex. locking: Remove rt_rwlock_is_contended(). sched: Trigger warning if ->migration_disabled counter underflows. futex: Fix sparc32/m68k/nds32 build regression futex: Remove futex_cmpxchg detection futex: Ensure futex_atomic_cmpxchg_inatomic() is present kernel/locking: Use a pointer in ww_mutex_trylock().
2021-12-22um: remove set_fsChristoph Hellwig
Remove address space overrides using set_fs() for User Mode Linux. Note that just like the existing kernel access case of the uaccess routines the new nofault kernel handlers do not actually have any exception handling. This is probably broken, but not change to the status quo. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-11-25futex: Remove futex_cmpxchg detectionArnd Bergmann
Now that all architectures have a working futex implementation in any configuration, remove the runtime detection code. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Vineet Gupta <vgupta@kernel.org> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Link: https://lore.kernel.org/r/20211026100432.1730393-2-arnd@kernel.org
2021-09-09Merge tag 'for-linus-5.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Support for VMAP_STACK - Support for splice_write in hostfs - Fixes for virt-pci - Fixes for virtio_uml - Various fixes * tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: fix stub location calculation um: virt-pci: fix uapi documentation um: enable VMAP_STACK um: virt-pci: don't do DMA from stack hostfs: support splice_write um: virtio_uml: fix memory leak on init failures um: virtio_uml: include linux/virtio-uml.h lib/logic_iomem: fix sparse warnings um: make PCI emulation driver init/exit static
2021-08-26um: fix stub location calculationJohannes Berg
In commit 9f0b4807a44f ("um: rework userspace stubs to not hard-code stub location") I changed stub_segv_handler() to do a calculation with a pointer to a stack variable to find the data page that we're using for the stack and the rest of the data. This same commit was meant to do it as well for stub_clone_handler(), but the change inadvertently went into commit 84b2789d6115 ("um: separate child and parent errors in clone stub") instead. This was reported to not be compiled correctly by gcc 5, causing the code to crash here. I'm not sure why, perhaps it's UB because the var isn't initialized? In any case, this trick always seemed bad, so just create a new inline function that does the calculation in assembly. Reported-by: subashab@codeaurora.org Fixes: 9f0b4807a44f ("um: rework userspace stubs to not hard-code stub location") Fixes: 84b2789d6115 ("um: separate child and parent errors in clone stub") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-07-23asm-generic/uaccess.h: remove __strncpy_from_user/__strnlen_userArnd Bergmann
This is a preparation for changing over architectures to the generic implementation one at a time. As there are no callers of either __strncpy_from_user() or __strnlen_user(), fold these into the strncpy_from_user() and strnlen_user() functions to make each implementation independent of the others. Many of these implementations have known bugs, but the intention here is to not change behavior at all and stay compatible with those bugs for the moment. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-07-09Merge tag 'for-linus-5.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Support for optimized routines based on the host CPU - Support for PCI via virtio - Various fixes * tag 'for-linus-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: remove unneeded semicolon in um_arch.c um: Remove the repeated declaration um: fix error return code in winch_tramp() um: fix error return code in slip_open() um: Fix stack pointer alignment um: implement flush_cache_vmap/flush_cache_vunmap um: add a UML specific futex implementation um: enable the use of optimized xor routines in UML um: Add support for host CPU flags and alignment um: allow not setting extra rpaths in the linux binary um: virtio/pci: enable suspend/resume um: add PCI over virtio emulation driver um: irqs: allow invoking time-travel handler multiple times um: time-travel/signals: fix ndelay() in interrupt um: expose time-travel mode to userspace side um: export signals_enabled directly um: remove unused smp_sigio_handler() declaration lib: add iomem emulation (logic_iomem) um: allow disabling NO_IOMEM
2021-06-17um: Fix stack pointer alignmentYiFei Zhu
GCC assumes that stack is aligned to 16-byte on call sites [1]. Since GCC 8, GCC began using 16-byte aligned SSE instructions to implement assignments to structs on stack. When CC_OPTIMIZE_FOR_PERFORMANCE is enabled, this affects os-Linux/sigio.c, write_sigio_thread: struct pollfds *fds, tmp; tmp = current_poll; Note that struct pollfds is exactly 16 bytes in size. GCC 8+ generates assembly similar to: movdqa (%rdi),%xmm0 movaps %xmm0,-0x50(%rbp) This is an issue, because movaps will #GP if -0x50(%rbp) is not aligned to 16 bytes [2], and how rbp gets assigned to is via glibc clone thread_start, then function prologue, going though execution trace similar to (showing only relevant instructions): sub $0x10,%rsi mov %rcx,0x8(%rsi) mov %rdi,(%rsi) syscall pop %rax pop %rdi callq *%rax push %rbp mov %rsp,%rbp The stack pointer always points to the topmost element on stack, rather then the space right above the topmost. On push, the pointer decrements first before writing to the memory pointed to by it. Therefore, there is no need to have the stack pointer pointer always point to valid memory unless the stack is poped; so the `- sizeof(void *)` in the code is unnecessary. On the other hand, glibc reserves the 16 bytes it needs on stack and pops itself, so by the call instruction the stack pointer is exactly the caller-supplied sp. It then push the 16 bytes of the return address and the saved stack pointer, so the base pointer will be 16-byte aligned if and only if the caller supplied sp is 16-byte aligned. Therefore, the caller must supply a 16-byte aligned pointer, which `stack + UM_KERN_PAGE_SIZE` already satisfies. On a side note, musl is unaffected by this issue because it forces 16 byte alignment via `and $-16,%rsi` in its clone wrapper. Similarly, glibc i386 is also unaffected because it has `andl $0xfffffff0, %ecx`. To reproduce this bug, enable CONFIG_UML_RTC and CC_OPTIMIZE_FOR_PERFORMANCE. uml_rtc will call add_sigio_fd which will then cause write_sigio_thread to either go into segfault loop or panic with "Segfault with no mm". Similarly, signal stacks will be aligned by the host kernel upon signal delivery. `- sizeof(void *)` to sigaltstack is unconventional and extraneous. On a related note, initialization of longjmp buffers do require `- sizeof(void *)`. This is to account for the return address that would have been pushed to the stack at the call site. The reason for uml to respect 16-byte alignment, rather than telling GCC to assume 8-byte alignment like the host kernel since commit d9b0cde91c60 ("x86-64, gcc: Use -mpreferred-stack-boundary=3 if supported"), is because uml links against libc. There is no reason to assume libc is also compiled with that flag and assumes 8-byte alignment rather than 16-byte. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838 [2] https://c9x.me/x86/html/file_module_x86_id_180.html Signed-off-by: YiFei Zhu <zhuyifei1999@gmail.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-17um: add a UML specific futex implementationAnton Ivanov
The generic asm futex implementation emulates atomic access to memory by doing a get_user followed by put_user. These translate to two mapping operations on UML with paging enabled in the meantime. This, in turn may end up changing interrupts, invoking the signal loop, etc. This replaces the generic implementation by a mapping followed by an operation on the mapped segment. Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-05-20x86/syscalls: Use __NR_syscalls instead of __NR_syscall_maxMasahiro Yamada
__NR_syscall_max is only used by x86 and UML. In contrast, __NR_syscalls is widely used by all the architectures. Convert __NR_syscall_max to __NR_syscalls and adjust the usage sites. This prepares x86 to switch to the generic syscallhdr.sh script. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210517073815.97426-6-masahiroy@kernel.org
2021-02-12um: remove process stub VMAJohannes Berg
This mostly reverts the old commit 3963333fe676 ("uml: cover stubs with a VMA") which had added a VMA to the existing PTEs. However, there's no real reason to have the PTEs in the first place and the VMA cannot be 'fixed' in place, which leads to bugs that userspace could try to unmap them and be forcefully killed, or such. Also, there's a bit of an ugly hole in userspace's address space. Simplify all this: just install the stub code/page at the top of the (inner) address space, i.e. put it just above TASK_SIZE. The pages are simply hard-coded to be mapped in the userspace process we use to implement an mm context, and they're out of reach of the inner mmap/munmap/mprotect etc. since they're above TASK_SIZE. Getting rid of the VMA also makes vma_merge() no longer hit one of the VM_WARN_ON()s there because we installed a VMA while the code assumes the stack VMA is the first one. It also removes a lockdep warning about mmap_sem usage since we no longer have uml_setup_stubs() and thus no longer need to do any manipulation that would require mmap_sem in activate_mm(). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12um: rework userspace stubs to not hard-code stub locationJohannes Berg
The userspace stacks mostly have a stack (and in the case of the syscall stub we can just set their stack pointer) that points to the location of the stub data page already. Rework the stubs to use the stack pointer to derive the start of the data page, rather than requiring it to be hard-coded. In the clone stub, also integrate the int3 into the stack remap, since we really must not use the stack while we remap it. This prepares for putting the stub at a variable location that's not part of the normal address space of the userspace processes running inside the UML machine. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12um: separate child and parent errors in clone stubJohannes Berg
If the two are mixed up, then it looks as though the parent returned an error if the child failed (before) the mmap(), and then the resulting process never gets killed. Fix this by splitting the child and parent errors, reporting and using them appropriately. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-10-26arch/um: partially revert the conversion to __section() macroLinus Torvalds
A couple of um files ended up not including the header file that defines the __section() macro, and the simplest fix is to just revert the change for those files. Fixes: 33def8498fdd treewide: Convert macro and uses of __section(foo) to __section("foo") Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Joe Perches <joe@perches.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25treewide: Convert macro and uses of __section(foo) to __section("foo")Joe Perches
Use a more generic form for __section that requires quotes to avoid complications with clang and gcc differences. Remove the quote operator # from compiler_attributes.h __section macro. Convert all unquoted __section(foo) uses to quoted __section("foo"). Also convert __attribute__((section("foo"))) uses to __section("foo") even if the __attribute__ has multiple list entry forms. Conversion done using the script at: https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09mmap locking API: convert mmap_sem commentsMichel Lespinasse
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09mm: don't include asm/pgtable.h if linux/mm.h is already includedMike Rapoport
Patch series "mm: consolidate definitions of page table accessors", v2. The low level page table accessors (pXY_index(), pXY_offset()) are duplicated across all architectures and sometimes more than once. For instance, we have 31 definition of pgd_offset() for 25 supported architectures. Most of these definitions are actually identical and typically it boils down to, e.g. static inline unsigned long pmd_index(unsigned long address) { return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1); } static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) { return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); } These definitions can be shared among 90% of the arches provided XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined. For architectures that really need a custom version there is always possibility to override the generic version with the usual ifdefs magic. These patches introduce include/linux/pgtable.h that replaces include/asm-generic/pgtable.h and add the definitions of the page table accessors to the new header. This patch (of 12): The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the functions involving page table manipulations, e.g. pte_alloc() and pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h> in the files that include <linux/mm.h>. The include statements in such cases are remove with a simple loop: for f in $(git grep -l "include <linux/mm.h>") ; do sed -i -e '/include <asm\/pgtable.h>/ d' $f done Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-29um: syscall.c: include <asm/unistd.h>Johannes Berg
Without CONFIG_SECCOMP, we don't get this include recursively through the existing includes, thus failing the build on not having __NR_syscall_max defined. Add the necessary include to fix this. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-29um: Implement time-travel=extJohannes Berg
This implements synchronized time-travel mode which - using a special application on a unix socket - lets multiple machines take part in a time-travelling simulation together. The protocol for the unix domain socket is defined in the new file include/uapi/linux/um_timetravel.h. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>