summaryrefslogtreecommitdiff
path: root/arch/x86/entry
AgeCommit message (Collapse)Author
2019-06-22x86/entry/64: Introduce the FIND_PERCPU_BASE macroChang S. Bae
GSBASE is used to find per-CPU data in the kernel. But when GSBASE is unknown, the per-CPU base can be found from the per_cpu_offset table with a CPU NR. The CPU NR is extracted from the limit field of the CPUNODE entry in GDT, or by the RDPID instruction. This is a prerequisite for using FSGSBASE in the low level entry code. Also, add the GAS-compatible RDPID macro as binutils 2.21 do not support it. Support is added in version 2.27. [ tglx: Massaged changelog ] Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/1557309753-24073-12-git-send-email-chang.seok.bae@intel.com
2019-06-22x86/entry/64: Switch CR3 before SWAPGS in paranoid entryChang S. Bae
When FSGSBASE is enabled, the GSBASE handling in paranoid entry will need to retrieve the kernel GSBASE which requires that the kernel page table is active. As the CR3 switch to the kernel page tables (PTI is active) does not depend on kernel GSBASE, move the CR3 switch in front of the GSBASE handling. Comment the EBX content while at it. No functional change. [ tglx: Rewrote changelog and comments ] Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Link: https://lkml.kernel.org/r/1557309753-24073-11-git-send-email-chang.seok.bae@intel.com
2019-06-21x86/vdso: Prevent segfaults due to hoisted vclock readsAndy Lutomirski
GCC 5.5.0 sometimes cleverly hoists reads of the pvclock and/or hvclock pages before the vclock mode checks. This creates a path through vclock_gettime() in which no vclock is enabled at all (due to disabled TSC on old CPUs, for example) but the pvclock or hvclock page nevertheless read. This will segfault on bare metal. This fixes commit 459e3a21535a ("gcc-9: properly declare the {pv,hv}clock_page storage") in the sense that, before that commit, GCC didn't seem to generate the offending code. There was nothing wrong with that commit per se, and -stable maintainers should backport this to all supported kernels regardless of whether the offending commit was present, since the same crash could just as easily be triggered by the phase of the moon. On GCC 9.1.1, this doesn't seem to affect the generated code at all, so I'm not too concerned about performance regressions from this fix. Cc: stable@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Reported-by: Duncan Roe <duncan_roe@optusnet.com.au> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 474Thomas Gleixner
Based on 1 normalized pattern(s): subject to the gnu public license v 2 no warranty of any kind extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 2 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081203.641025917@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-14Merge tag 'v5.2-rc4' into mauroJonathan Corbet
We need to pick up post-rc1 changes to various document files so they don't get lost in Mauro's massive RST conversion push.
2019-06-11x86/acrn: Use HYPERVISOR_CALLBACK_VECTOR for ACRN guest upcall vectorZhao Yakui
Use the HYPERVISOR_CALLBACK_VECTOR to notify an ACRN guest. Co-developed-by: Jason Chen CJ <jason.cj.chen@intel.com> Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com> Signed-off-by: Zhao Yakui <yakui.zhao@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/1559108037-18813-4-git-send-email-yakui.zhao@intel.com
2019-06-09arch: wire-up clone3() syscallChristian Brauner
Wire up the clone3() call on all arches that don't require hand-rolled assembly. Some of the arches look like they need special assembly massaging and it is probably smarter if the appropriate arch maintainers would do the actual wiring. Arches that are wired-up are: - x86{_32,64} - arm{64} - xtensa Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <adrian@lisas.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
2019-06-08docs: fix broken documentation linksMauro Carvalho Chehab
Mostly due to x86 and acpi conversion, several documentation links are still pointing to the old file. Fix them. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Reviewed-by: Wolfram Sang <wsa@the-dreams.de> Reviewed-by: Sven Van Asbroeck <TheSven73@gmail.com> Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com> Acked-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 257Thomas Gleixner
Based on 1 normalized pattern(s): gpl v2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 19 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141333.108140152@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 223Thomas Gleixner
Based on 1 normalized pattern(s): subject to the gnu public license v 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 9 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Steve Winslow <swinslow@gmail.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171440.130801526@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 214Thomas Gleixner
Based on 1 normalized pattern(s): subject to the gpl v 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 2 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171439.372657724@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 180Thomas Gleixner
Based on 1 normalized pattern(s): subject to the gnu general public license version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Armijn Hemel <armijn@tjaldur.nl> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528170026.343113277@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-29signal: Remove the task parameter from force_sig_faultEric W. Biederman
As synchronous exceptions really only make sense against the current task (otherwise how are you synchronous) remove the task parameter from from force_sig_fault to make it explicit that is what is going on. The two known exceptions that deliver a synchronous exception to a stopped ptraced task have already been changed to force_sig_fault_to_task. The callers have been changed with the following emacs regular expression (with obvious variations on the architectures that take more arguments) to avoid typos: force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)] -> force_sig_fault(\1,\2,\3) Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27signal: Remove task parameter from force_sigEric W. Biederman
All of the remaining callers pass current into force_sig so remove the task parameter to make this obvious and to make misuse more difficult in the future. This also makes it clear force_sig passes current into force_sig_info. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-17Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull more vfs mount updates from Al Viro: "Propagation of new syscalls to other architectures + cosmetic change from Christian (fscontext didn't follow the convention for anon inode names)" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: uapi: Wire up the mount API syscalls on non-x86 arches [ver #2] uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2] uapi, fsopen: use square brackets around "fscontext" [ver #2]
2019-05-16Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes and updates: - a handful of MDS documentation/comment updates - a cleanup related to hweight interfaces - a SEV guest fix for large pages - a kprobes LTO fix - and a final cleanup commit for vDSO HPET support removal" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/mds: Improve CPU buffer clear documentation x86/speculation/mds: Revert CPU buffer clear on double fault exit x86/kconfig: Disable CONFIG_GENERIC_HWEIGHT and remove __HAVE_ARCH_SW_HWEIGHT x86/mm: Do not use set_{pud, pmd}_safe() when splitting a large page x86/kprobes: Make trampoline_handler() global and visible x86/vdso: Remove hpet_page from vDSO
2019-05-16uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]David Howells
Fix the syscall numbering of the mount API syscalls so that the numbers match between i386 and x86_64 and that they're in the common numbering scheme space. Fixes: a07b20004793 ("vfs: syscall: Add open_tree(2) to reference or clone a mount") Fixes: 2db154b3ea8e ("vfs: syscall: Add move_mount(2) to move mounts around") Fixes: 24dcb3d90a1f ("vfs: syscall: Add fsopen() to prepare for superblock creation") Fixes: ecdab150fddb ("vfs: syscall: Add fsconfig() for configuring and managing a context") Fixes: 93766fbd2696 ("vfs: syscall: Add fsmount() to create a mount for a superblock") Fixes: cf3cba4a429b ("vfs: syscall: Add fspick() to select a superblock for reconfiguration") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-16Merge branch 'linus' into x86/urgent, to pick up dependent changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-15Merge tag 'trace-v5.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "The major changes in this tracing update includes: - Removal of non-DYNAMIC_FTRACE from 32bit x86 - Removal of mcount support from x86 - Emulating a call from int3 on x86_64, fixes live kernel patching - Consolidated Tracing Error logs file Minor updates: - Removal of klp_check_compiler_support() - kdb ftrace dumping output changes - Accessing and creating ftrace instances from inside the kernel - Clean up of #define if macro - Introduction of TRACE_EVENT_NOP() to disable trace events based on config options And other minor fixes and clean ups" * tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits) x86: Hide the int3_emulate_call/jmp functions from UML livepatch: Remove klp_check_compiler_support() ftrace/x86: Remove mcount support ftrace/x86_32: Remove support for non DYNAMIC_FTRACE tracing: Simplify "if" macro code tracing: Fix documentation about disabling options using trace_options tracing: Replace kzalloc with kcalloc tracing: Fix partial reading of trace event's id file tracing: Allow RCU to run between postponed startup tests tracing: Fix white space issues in parse_pred() function tracing: Eliminate const char[] auto variables ring-buffer: Fix mispelling of Calculate tracing: probeevent: Fix to make the type of $comm string tracing: probeevent: Do not accumulate on ret variable tracing: uprobes: Re-enable $comm support for uprobe events ftrace/x86_64: Emulate call function while updating in breakpoint handler x86_64: Allow breakpoints to emulate call instructions x86_64: Add gap to int3 to allow for call emulation tracing: kdb: Allow ftdump to skip all but the last few entries tracing: Add trace_total_entries() / trace_total_entries_cpu() ...
2019-05-14Merge branch 'x86-mds-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MDS mitigations from Thomas Gleixner: "Microarchitectural Data Sampling (MDS) is a hardware vulnerability which allows unprivileged speculative access to data which is available in various CPU internal buffers. This new set of misfeatures has the following CVEs assigned: CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory MDS attacks target microarchitectural buffers which speculatively forward data under certain conditions. Disclosure gadgets can expose this data via cache side channels. Contrary to other speculation based vulnerabilities the MDS vulnerability does not allow the attacker to control the memory target address. As a consequence the attacks are purely sampling based, but as demonstrated with the TLBleed attack samples can be postprocessed successfully. The mitigation is to flush the microarchitectural buffers on return to user space and before entering a VM. It's bolted on the VERW instruction and requires a microcode update. As some of the attacks exploit data structures shared between hyperthreads, full protection requires to disable hyperthreading. The kernel does not do that by default to avoid breaking unattended updates. The mitigation set comes with documentation for administrators and a deeper technical view" * 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/speculation/mds: Fix documentation typo Documentation: Correct the possible MDS sysfs values x86/mds: Add MDSUM variant to the MDS documentation x86/speculation/mds: Add 'mitigations=' support for MDS x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off x86/speculation/mds: Fix comment x86/speculation/mds: Add SMT warning message x86/speculation: Move arch_smt_update() call to after mitigation decisions x86/speculation/mds: Add mds=full,nosmt cmdline option Documentation: Add MDS vulnerability documentation Documentation: Move L1TF to separate directory x86/speculation/mds: Add mitigation mode VMWERV x86/speculation/mds: Add sysfs reporting for MDS x86/speculation/mds: Add mitigation control for MDS x86/speculation/mds: Conditionally clear CPU buffers on idle entry x86/kvm/vmx: Add MDS protection when L1D Flush is not active x86/speculation/mds: Clear CPU buffers on exit to user x86/speculation/mds: Add mds_clear_cpu_buffers() x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests x86/speculation/mds: Add BUG_MSBDS_ONLY ...
2019-05-08x86_64: Add gap to int3 to allow for call emulationJosh Poimboeuf
To allow an int3 handler to emulate a call instruction, it must be able to push a return address onto the stack. Add a gap to the stack to allow the int3 handler to push the return address and change the return from int3 to jump straight to the emulated called function target. Link: http://lkml.kernel.org/r/20181130183917.hxmti5josgq4clti@treble Link: http://lkml.kernel.org/r/20190502162133.GX2623@hirez.programming.kicks-ass.net [ Note, this is needed to allow Live Kernel Patching to not miss calling a patched function when tracing is enabled. -- Steven Rostedt ] Cc: stable@vger.kernel.org Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching") Tested-by: Nicolai Stange <nstange@suse.de> Reviewed-by: Nicolai Stange <nstange@suse.de> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08x86/vdso: Remove hpet_page from vDSOJia Zhang
This trivial cleanup finalizes the removal of vDSO HPET support. Fixes: 1ed95e52d902 ("x86/vdso: Remove direct HPET access through the vDSO") Signed-off-by: Jia Zhang <zhang.jia@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: luto@kernel.org Cc: bp@alien8.de Link: https://lkml.kernel.org/r/20190401114045.7280-1-zhang.jia@linux.alibaba.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-07Merge branch 'work.mount-syscalls' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount ABI updates from Al Viro: "The syscalls themselves, finally. That's not all there is to that stuff, but switching individual filesystems to new methods is fortunately independent from everything else, so e.g. NFS series can go through NFS tree, etc. As those conversions get done, we'll be finally able to get rid of a bunch of duplication in fs/super.c introduced in the beginning of the entire thing. I expect that to be finished in the next window..." * 'work.mount-syscalls' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Add a sample program for the new mount API vfs: syscall: Add fspick() to select a superblock for reconfiguration vfs: syscall: Add fsmount() to create a mount for a superblock vfs: syscall: Add fsconfig() for configuring and managing a context vfs: Implement logging through fs_context vfs: syscall: Add fsopen() to prepare for superblock creation Make anon_inodes unconditional teach move_mount(2) to work with OPEN_TREE_CLONE vfs: syscall: Add move_mount(2) to move mounts around vfs: syscall: Add open_tree(2) to reference or clone a mount
2019-05-07Merge branch 'x86-fpu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FPU state handling updates from Borislav Petkov: "This contains work started by Rik van Riel and brought to fruition by Sebastian Andrzej Siewior with the main goal to optimize when to load FPU registers: only when returning to userspace and not on every context switch (while the task remains in the kernel). In addition, this optimization makes kernel_fpu_begin() cheaper by requiring registers saving only on the first invocation and skipping that in following ones. What is more, this series cleans up and streamlines many aspects of the already complex FPU code, hopefully making it more palatable for future improvements and simplifications. Finally, there's a __user annotations fix from Jann Horn" * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits) x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails x86/pkeys: Add PKRU value to init_fpstate x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath x86/fpu: Add a fastpath to copy_fpstate_to_sigframe() x86/fpu: Add a fastpath to __fpu__restore_sig() x86/fpu: Defer FPU state load until return to userspace x86/fpu: Merge the two code paths in __fpu__restore_sig() x86/fpu: Restore from kernel memory on the 64-bit path too x86/fpu: Inline copy_user_to_fpregs_zeroing() x86/fpu: Update xstate's PKRU value on write_pkru() x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD x86/fpu: Always store the registers in copy_fpstate_to_sigframe() x86/entry: Add TIF_NEED_FPU_LOAD x86/fpu: Eager switch PKRU state x86/pkeys: Don't check if PKRU is zero before writing it x86/fpu: Only write PKRU if it is different from current x86/pkeys: Provide *pkru() helpers x86/fpu: Use a feature number instead of mask in two more helpers x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask x86/fpu: Add an __fpregs_load_activate() internal helper ...
2019-05-06Merge branch 'x86-irq-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 irq updates from Ingo Molnar: "Here are the main changes in this tree: - Introduce x86-64 IRQ/exception/debug stack guard pages to detect stack overflows immediately and deterministically. - Clean up over a decade worth of cruft accumulated. The outcome of this should be more clear-cut faults/crashes when any of the low level x86 CPU stacks overflow, instead of silent memory corruption and sporadic failures much later on" * 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) x86/irq: Fix outdated comments x86/irq/64: Remove stack overflow debug code x86/irq/64: Remap the IRQ stack with guard pages x86/irq/64: Split the IRQ stack into its own pages x86/irq/64: Init hardirq_stack_ptr during CPU hotplug x86/irq/32: Handle irq stack allocation failure proper x86/irq/32: Invoke irq_ctx_init() from init_IRQ() x86/irq/64: Rename irq_stack_ptr to hardirq_stack_ptr x86/irq/32: Rename hard/softirq_stack to hard/softirq_stack_ptr x86/irq/32: Make irq stack a character array x86/irq/32: Define IRQ_STACK_SIZE x86/dumpstack/64: Speedup in_exception_stack() x86/exceptions: Split debug IST stack x86/exceptions: Enable IST guard pages x86/exceptions: Disconnect IST index and stack order x86/cpu: Remove orig_ist array x86/cpu: Prepare TSS.IST setup for guard pages x86/dumpstack/64: Use cpu_entry_area instead of orig_ist x86/irq/64: Use cpu entry area instead of orig_ist x86/traps: Use cpu_entry_area instead of orig_ist ...
2019-05-06Merge branch 'x86-entry-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry cleanup from Ingo Molnar: "A single commit that removes a redundant complication from preempt-schedule handling in the x86 entry code" * 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Remove unneeded need_resched() loop
2019-05-06Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: "This includes the following changes: - cpu_has() cleanups - sync_bitops.h modernization to the rmwcc.h facility, similarly to bitops.h - continued LTO annotations/fixes - misc cleanups and smaller cleanups" * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/um/vdso: Drop unnecessary cc-ldoption x86/vdso: Rename variable to fix -Wshadow warning x86/cpu/amd: Exclude 32bit only assembler from 64bit build x86/asm: Mark all top level asm statements as .text x86/build/vdso: Add FORCE to the build rule of %.so x86/asm: Modernize sync_bitops.h x86/mm: Convert some slow-path static_cpu_has() callers to boot_cpu_has() x86: Convert some slow-path static_cpu_has() callers to boot_cpu_has() x86/asm: Clarify static_cpu_has()'s intended use x86/uaccess: Fix implicit cast of __user pointer x86/cpufeature: Remove __pure attribute to _static_cpu_has()
2019-05-06Merge branch 'core-objtool-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: "This is a series from Peter Zijlstra that adds x86 build-time uaccess validation of SMAP to objtool, which will detect and warn about the following uaccess API usage bugs and weirdnesses: - call to %s() with UACCESS enabled - return with UACCESS enabled - return with UACCESS disabled from a UACCESS-safe function - recursive UACCESS enable - redundant UACCESS disable - UACCESS-safe disables UACCESS As it turns out not leaking uaccess permissions outside the intended uaccess functionality is hard when the interfaces are complex and when such bugs are mostly dormant. As a bonus we now also check the DF flag. We had at least one high-profile bug in that area in the early days of Linux, and the checking is fairly simple. The checks performed and warnings emitted are: - call to %s() with DF set - return with DF set - return with modified stack frame - recursive STD - redundant CLD It's all x86-only for now, but later on this can also be used for PAN on ARM and objtool is fairly cross-platform in principle. While all warnings emitted by this new checking facility that got reported to us were fixed, there might be GCC version dependent warnings that were not reported yet - which we'll address, should they trigger. The warnings are non-fatal build warnings" * 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) mm/uaccess: Use 'unsigned long' to placate UBSAN warnings on older GCC versions x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation sched/x86_64: Don't save flags on context switch objtool: Add Direction Flag validation objtool: Add UACCESS validation objtool: Fix sibling call detection objtool: Rewrite alt->skip_orig objtool: Add --backtrace support objtool: Rewrite add_ignores() objtool: Handle function aliases objtool: Set insn->func for alternatives x86/uaccess, kcov: Disable stack protector x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP x86/uaccess, ubsan: Fix UBSAN vs. SMAP x86/uaccess, kasan: Fix KASAN vs SMAP x86/smap: Ditch __stringify() x86/uaccess: Introduce user_access_{save,restore}() x86/uaccess, signal: Fix AC=1 bloat x86/uaccess: Always inline user_access_begin() x86/uaccess, xen: Suppress SMAP warnings ...
2019-05-01gcc-9: properly declare the {pv,hv}clock_page storageLinus Torvalds
The pvlock_page and hvclock_page variables are (as the name implies) addresses to pages, created by the linker script. But we declared them as just "extern u8" variables, which _works_, but now that gcc does some more bounds checking, it causes warnings like warning: array subscript 1 is outside array bounds of ‘u8[1]’ when we then access more than one byte from those variables. Fix this by simply making the declaration of the variables match reality, which makes the compiler happy too. Signed-off-by: Linus Torvalds <torvalds@-linux-foundation.org>
2019-04-19x86/vdso: Rename variable to fix -Wshadow warningLeonardo Brás
The go32() and go64() functions has an argument and a local variable called ‘name’. Rename both to clarify the code and to fix a warning with -Wshadow. Signed-off-by: Leonardo Brás <leobras.c@gmail.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David.Laight@aculab.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: helen@koikeco.de Cc: linux-kbuild@vger.kernel.org Cc: lkcamp@lists.libreplanetbr.org Link: http://lkml.kernel.org/r/20181023011022.GA6574@WindFlash Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18x86/build/vdso: Add FORCE to the build rule of %.soMasahiro Yamada
$(call if_changed,...) must have FORCE as a prerequisite. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1554280212-10578-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-17x86/irq/64: Split the IRQ stack into its own pagesAndy Lutomirski
Currently, the IRQ stack is hardcoded as the first page of the percpu area, and the stack canary lives on the IRQ stack. The former gets in the way of adding an IRQ stack guard page, and the latter is a potential weakness in the stack canary mechanism. Split the IRQ stack into its own private percpu pages. [ tglx: Make 64 and 32 bit share struct irq_stack ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Feng Tang <feng.tang@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jordan Borgner <mail@jordan-borgner.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Maran Wilson <maran.wilson@oracle.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolai Stange <nstange@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pu Wen <puwen@hygon.cn> Cc: "Rafael Ávila de Espíndola" <rafael@espindo.la> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: x86-ml <x86@kernel.org> Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/20190414160146.267376656@linutronix.de
2019-04-17x86/irq/64: Rename irq_stack_ptr to hardirq_stack_ptrThomas Gleixner
Preparatory patch to share code with 32bit. No functional changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolai Stange <nstange@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190414160145.912584074@linutronix.de
2019-04-17x86/exceptions: Split debug IST stackThomas Gleixner
The debug IST stack is actually two separate debug stacks to handle #DB recursion. This is required because the CPU starts always at top of stack on exception entry, which means on #DB recursion the second #DB would overwrite the stack of the first. The low level entry code therefore adjusts the top of stack on entry so a secondary #DB starts from a different stack page. But the stack pages are adjacent without a guard page between them. Split the debug stack into 3 stacks which are separated by guard pages. The 3rd stack is never mapped into the cpu_entry_area and is only there to catch triple #DB nesting: --- top of DB_stack <- Initial stack --- end of DB_stack guard page --- top of DB1_stack <- Top of stack after entering first #DB --- end of DB1_stack guard page --- top of DB2_stack <- Top of stack after entering second #DB --- end of DB2_stack guard page If DB2 would not act as the final guard hole, a second #DB would point the top of #DB stack to the stack below #DB1 which would be valid and not catch the not so desired triple nesting. The backing store does not allocate any memory for DB2 and its guard page as it is not going to be mapped into the cpu_entry_area. - Adjust the low level entry code so it adjusts top of #DB with the offset between the stacks instead of exception stack size. - Make the dumpstack code aware of the new stacks. - Adjust the in_debug_stack() implementation and move it into the NMI code where it belongs. As this is NMI hotpath code, it just checks the full area between top of DB_stack and bottom of DB1_stack without checking for the guard page. That's correct because the NMI cannot hit a stackpointer pointing to the guard page between DB and DB1 stack. Even if it would, then the NMI operation still is unaffected, but the resume of the debug exception on the topmost DB stack will crash by touching the guard page. [ bp: Make exception_stack_names static const char * const ] Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: linux-doc@vger.kernel.org Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190414160145.439944544@linutronix.de
2019-04-17x86/exceptions: Disconnect IST index and stack orderThomas Gleixner
The entry order of the TSS.IST array and the order of the stack storage/mapping are not required to be the same. With the upcoming split of the debug stack this is going to fall apart as the number of TSS.IST array entries stays the same while the actual stacks are increasing. Make them separate so that code like dumpstack can just utilize the mapping order. The IST index is solely required for the actual TSS.IST array initialization. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Nicolai Stange <nstange@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190414160145.241588113@linutronix.de
2019-04-17x86/exceptions: Make IST index zero basedThomas Gleixner
The defines for the exception stack (IST) array in the TSS are using the SDM convention IST1 - IST7. That causes all sorts of code to subtract 1 for array indices related to IST. That's confusing at best and does not provide any value. Make the indices zero based and fixup the usage sites. The only code which needs to adjust the 0 based index is the interrupt descriptor setup which needs to add 1 now. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: linux-doc@vger.kernel.org Cc: Nicolai Stange <nstange@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190414160144.331772825@linutronix.de
2019-04-12x86/fpu: Defer FPU state load until return to userspaceRik van Riel
Defer loading of FPU state until return to userspace. This gives the kernel the potential to skip loading FPU state for tasks that stay in kernel mode, or for tasks that end up with repeated invocations of kernel_fpu_begin() & kernel_fpu_end(). The fpregs_lock/unlock() section ensures that the registers remain unchanged. Otherwise a context switch or a bottom half could save the registers to its FPU context and the processor's FPU registers would became random if modified at the same time. KVM swaps the host/guest registers on entry/exit path. This flow has been kept as is. First it ensures that the registers are loaded and then saves the current (host) state before it loads the guest's registers. The swap is done at the very end with disabled interrupts so it should not change anymore before theg guest is entered. The read/save version seems to be cheaper compared to memcpy() in a micro benchmark. Each thread gets TIF_NEED_FPU_LOAD set as part of fork() / fpu__copy(). For kernel threads, this flag gets never cleared which avoids saving / restoring the FPU state for kernel threads and during in-kernel usage of the FPU registers. [ bp: Correct and update commit message and fix checkpatch warnings. s/register/registers/ where it is used in plural. minor comment corrections. remove unused trace_x86_fpu_activate_state() TP. ] Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Babu Moger <Babu.Moger@amd.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dmitry Safonov <dima@arista.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Nicolai Stange <nstange@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Waiman Long <longman@redhat.com> Cc: x86-ml <x86@kernel.org> Cc: Yi Wang <wang.yi59@zte.com.cn> Link: https://lkml.kernel.org/r/20190403164156.19645-24-bigeasy@linutronix.de
2019-04-05x86/entry: Remove unneeded need_resched() loopValentin Schneider
Since the enabling and disabling of IRQs within preempt_schedule_irq() is contained in a need_resched() loop, there is no need for the outer architecture specific loop. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20190311224752.8337-14-valentin.schneider@arm.com
2019-04-03sched/x86_64: Don't save flags on context switchPeter Zijlstra
Now that we have objtool validating AC=1 state for all x86_64 code, we can once again guarantee clean flags on schedule. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03sched/x86: Save [ER]FLAGS on context switchPeter Zijlstra
Effectively reverts commit: 2c7577a75837 ("sched/x86_64: Don't save flags on context switch") Specifically because SMAP uses FLAGS.AC which invalidates the claim that the kernel has clean flags. In particular; while preemption from interrupt return is fine (the IRET frame on the exception stack contains FLAGS) it breaks any code that does synchonous scheduling, including preempt_enable(). This has become a significant issue ever since commit: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses") provided for means of having 'normal' C code between STAC / CLAC, exposing the FLAGS.AC state. So far this hasn't led to trouble, however fix it before it comes apart. Reported-by: Julien Thierry <julien.thierry@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org Fixes: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-20vfs: syscall: Add fspick() to select a superblock for reconfigurationDavid Howells
Provide an fspick() system call that can be used to pick an existing mountpoint into an fs_context which can thereafter be used to reconfigure a superblock (equivalent of the superblock side of -o remount). This looks like: int fd = fspick(AT_FDCWD, "/mnt", FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT); fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0); fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0); fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0); At the point of fspick being called, the file descriptor referring to the filesystem context is in exactly the same state as the one that was created by fsopen() after fsmount() has been successfully called. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-api@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-03-20vfs: syscall: Add fsmount() to create a mount for a superblockDavid Howells
Provide a system call by which a filesystem opened with fsopen() and configured by a series of fsconfig() calls can have a detached mount object created for it. This mount object can then be attached to the VFS mount hierarchy using move_mount() by passing the returned file descriptor as the from directory fd. The system call looks like: int mfd = fsmount(int fsfd, unsigned int flags, unsigned int attr_flags); where fsfd is the file descriptor returned by fsopen(). flags can be 0 or FSMOUNT_CLOEXEC. attr_flags is a bitwise-OR of the following flags: MOUNT_ATTR_RDONLY Mount read-only MOUNT_ATTR_NOSUID Ignore suid and sgid bits MOUNT_ATTR_NODEV Disallow access to device special files MOUNT_ATTR_NOEXEC Disallow program execution MOUNT_ATTR__ATIME Setting on how atime should be updated MOUNT_ATTR_RELATIME - Update atime relative to mtime/ctime MOUNT_ATTR_NOATIME - Do not update access times MOUNT_ATTR_STRICTATIME - Always perform atime updates MOUNT_ATTR_NODIRATIME Do not update directory access times In the event that fsmount() fails, it may be possible to get an error message by calling read() on fsfd. If no message is available, ENODATA will be reported. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-api@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-03-20vfs: syscall: Add fsconfig() for configuring and managing a contextDavid Howells
Add a syscall for configuring a filesystem creation context and triggering actions upon it, to be used in conjunction with fsopen, fspick and fsmount. long fsconfig(int fs_fd, unsigned int cmd, const char *key, const void *value, int aux); Where fs_fd indicates the context, cmd indicates the action to take, key indicates the parameter name for parameter-setting actions and, if needed, value points to a buffer containing the value and aux can give more information for the value. The following command IDs are proposed: (*) FSCONFIG_SET_FLAG: No value is specified. The parameter must be boolean in nature. The key may be prefixed with "no" to invert the setting. value must be NULL and aux must be 0. (*) FSCONFIG_SET_STRING: A string value is specified. The parameter can be expecting boolean, integer, string or take a path. A conversion to an appropriate type will be attempted (which may include looking up as a path). value points to a NUL-terminated string and aux must be 0. (*) FSCONFIG_SET_BINARY: A binary blob is specified. value points to the blob and aux indicates its size. The parameter must be expecting a blob. (*) FSCONFIG_SET_PATH: A non-empty path is specified. The parameter must be expecting a path object. value points to a NUL-terminated string that is the path and aux is a file descriptor at which to start a relative lookup or AT_FDCWD. (*) FSCONFIG_SET_PATH_EMPTY: As fsconfig_set_path, but with AT_EMPTY_PATH implied. (*) FSCONFIG_SET_FD: An open file descriptor is specified. value must be NULL and aux indicates the file descriptor. (*) FSCONFIG_CMD_CREATE: Trigger superblock creation. (*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration. For the "set" command IDs, the idea is that the file_system_type will point to a list of parameters and the types of value that those parameters expect to take. The core code can then do the parse and argument conversion and then give the LSM and FS a cooked option or array of options to use. Source specification is also done the same way same way, using special keys "source", "source1", "source2", etc.. [!] Note that, for the moment, the key and value are just glued back together and handed to the filesystem. Every filesystem that uses options uses match_token() and co. to do this, and this will need to be changed - but not all at once. Example usage: fd = fsopen("ext4", FSOPEN_CLOEXEC); fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD); fsconfig(fd, fsconfig_set_path_empty, "journal_path", "", journal_fd); fsconfig(fd, fsconfig_set_fd, "journal_fd", "", journal_fd); fsconfig(fd, fsconfig_set_flag, "user_xattr", NULL, 0); fsconfig(fd, fsconfig_set_flag, "noacl", NULL, 0); fsconfig(fd, fsconfig_set_string, "sb", "1", 0); fsconfig(fd, fsconfig_set_string, "errors", "continue", 0); fsconfig(fd, fsconfig_set_string, "data", "journal", 0); fsconfig(fd, fsconfig_set_string, "context", "unconfined_u:...", 0); fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0); mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC); or: fd = fsopen("ext4", FSOPEN_CLOEXEC); fsconfig(fd, fsconfig_set_string, "source", "/dev/sda1", 0); fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0); mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC); or: fd = fsopen("afs", FSOPEN_CLOEXEC); fsconfig(fd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0); fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0); mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC); or: fd = fsopen("jffs2", FSOPEN_CLOEXEC); fsconfig(fd, fsconfig_set_string, "source", "mtd0", 0); fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0); mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC); Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-api@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-03-20vfs: syscall: Add fsopen() to prepare for superblock creationDavid Howells
Provide an fsopen() system call that starts the process of preparing to create a superblock that will then be mountable, using an fd as a context handle. fsopen() is given the name of the filesystem that will be used: int mfd = fsopen(const char *fsname, unsigned int flags); where flags can be 0 or FSOPEN_CLOEXEC. For example: sfd = fsopen("ext4", FSOPEN_CLOEXEC); fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD); fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0); fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0); fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0); fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0); fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); fsinfo(sfd, NULL, ...); // query new superblock attributes mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME); move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); sfd = fsopen("afs", -1); fsconfig(fd, FSCONFIG_SET_STRING, "source", "#grand.central.org:root.cell", 0); fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); mfd = fsmount(sfd, 0, MS_NODEV); move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); If an error is reported at any step, an error message may be available to be read() back (ENODATA will be reported if there isn't an error available) in the form: "e <subsys>:<problem>" "e SELinux:Mount on mountpoint not permitted" Once fsmount() has been called, further fsconfig() calls will incur EBUSY, even if the fsmount() fails. read() is still possible to retrieve error information. The fsopen() syscall creates a mount context and hangs it of the fd that it returns. Netlink is not used because it is optional and would make the core VFS dependent on the networking layer and also potentially add network namespace issues. Note that, for the moment, the caller must have SYS_CAP_ADMIN to use fsopen(). Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-api@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-03-20vfs: syscall: Add move_mount(2) to move mounts aroundDavid Howells
Add a move_mount() system call that will move a mount from one place to another and, in the next commit, allow to attach an unattached mount tree. The new system call looks like the following: int move_mount(int from_dfd, const char *from_path, int to_dfd, const char *to_path, unsigned int flags); Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-api@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-03-20vfs: syscall: Add open_tree(2) to reference or clone a mountAl Viro
open_tree(dfd, pathname, flags) Returns an O_PATH-opened file descriptor or an error. dfd and pathname specify the location to open, in usual fashion (see e.g. fstatat(2)). flags should be an OR of some of the following: * AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW - same meanings as usual * OPEN_TREE_CLOEXEC - make the resulting descriptor close-on-exec * OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE - instead of opening the location in question, create a detached mount tree matching the subtree rooted at location specified by dfd/pathname. With AT_RECURSIVE the entire subtree is cloned, without it - only the part within in the mount containing the location in question. In other words, the same as mount --rbind or mount --bind would've taken. The detached tree will be dissolved on the final close of obtained file. Creation of such detached trees requires the same capabilities as doing mount --bind. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-api@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-03-16Merge tag 'pidfd-v5.1-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd system call from Christian Brauner: "This introduces the ability to use file descriptors from /proc/<pid>/ as stable handles on struct pid. Even if a pid is recycled the handle will not change. For a start these fds can be used to send signals to the processes they refer to. With the ability to use /proc/<pid> fds as stable handles on struct pid we can fix a long-standing issue where after a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. With this patchset we enable a variety of use cases. One obvious example is that we can now safely delegate an important part of process management - sending signals - to processes other than the parent of a given process by sending file descriptors around via scm rights and not fearing that the given process will have been recycled in the meantime. It also allows for easy testing whether a given process is still alive or not by sending signal 0 to a pidfd which is quite handy. There has been some interest in this feature e.g. from systems management (systemd, glibc) and container managers. I have requested and gotten comments from glibc to make sure that this syscall is suitable for their needs as well. In the future I expect it to take on most other pid-based signal syscalls. But such features are left for the future once they are needed. This has been sitting in linux-next for quite a while and has not caused any issues. It comes with selftests which verify basic functionality and also test that a recycled pid cannot be signaled via a pidfd. Jon has written about a prior version of this patchset. It should cover the basic functionality since not a lot has changed since then: https://lwn.net/Articles/773459/ The commit message for the syscall itself is extensively documenting the syscall, including it's functionality and extensibility" * tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: selftests: add tests for pidfd_send_signal() signal: add pidfd_send_signal() syscall
2019-03-08Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring IO interface from Jens Axboe: "Second attempt at adding the io_uring interface. Since the first one, we've added basic unit testing of the three system calls, that resides in liburing like the other unit tests that we have so far. It'll take a while to get full coverage of it, but we're working towards it. I've also added two basic test programs to tools/io_uring. One uses the raw interface and has support for all the various features that io_uring supports outside of standard IO, like fixed files, fixed IO buffers, and polled IO. The other uses the liburing API, and is a simplified version of cp(1). This adds support for a new IO interface, io_uring. io_uring allows an application to communicate with the kernel through two rings, the submission queue (SQ) and completion queue (CQ) ring. This allows for very efficient handling of IOs, see the v5 posting for some basic numbers: https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/ Outside of just efficiency, the interface is also flexible and extendable, and allows for future use cases like the upcoming NVMe key-value store API, networked IO, and so on. It also supports async buffered IO, something that we've always failed to support in the kernel. Outside of basic IO features, it supports async polled IO as well. This particular feature has already been tested at Facebook months ago for flash storage boxes, with 25-33% improvements. It makes polled IO actually useful for real world use cases, where even basic flash sees a nice win in terms of efficiency, latency, and performance. These boxes were IOPS bound before, now they are not. This series adds three new system calls. One for setting up an io_uring instance (io_uring_setup(2)), one for submitting/completing IO (io_uring_enter(2)), and one for aux functions like registrating file sets, buffers, etc (io_uring_register(2)). Through the help of Arnd, I've coordinated the syscall numbers so merge on that front should be painless. Jon did a writeup of the interface a while back, which (except for minor details that have been tweaked) is still accurate. Find that here: https://lwn.net/Articles/776703/ Huge thanks to Al Viro for helping getting the reference cycle code correct, and to Jann Horn for his extensive reviews focused on both security and bugs in general. There's a userspace library that provides basic functionality for applications that don't need or want to care about how to fiddle with the rings directly. It has helpers to allow applications to easily set up an io_uring instance, and submit/complete IO through it without knowing about the intricacies of the rings. It also includes man pages (thanks to Jeff Moyer), and will continue to grow support helper functions and features as time progresses. Find it here: git://git.kernel.dk/liburing Fio has full support for the raw interface, both in the form of an IO engine (io_uring), but also with a small test application (t/io_uring) that can exercise and benchmark the interface" * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block: io_uring: add a few test tools io_uring: allow workqueue item to handle multiple buffered requests io_uring: add support for IORING_OP_POLL io_uring: add io_kiocb ref count io_uring: add submission polling io_uring: add file set registration net: split out functions related to registering inflight socket files io_uring: add support for pre-mapped user IO buffers block: implement bio helper to add iter bvec pages to bio io_uring: batch io_kiocb allocation io_uring: use fget/fput_many() for file references fs: add fget_many() and fput_many() io_uring: support for IO polling io_uring: add fsync support Add io_uring IO interface
2019-03-06x86/speculation/mds: Clear CPU buffers on exit to userThomas Gleixner
Add a static key which controls the invocation of the CPU buffer clear mechanism on exit to user space and add the call into prepare_exit_to_usermode() and do_nmi() right before actually returning. Add documentation which kernel to user space transition this covers and explain why some corner cases are not mitigated. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>