summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/process_64.c
AgeCommit message (Collapse)Author
2024-07-02x86/mm: Cleanup prctl_enable_tagged_addr() nr_bits error checkingYosry Ahmed
There are two separate checks in prctl_enable_tagged_addr() that nr_bits is in the correct range. The checks are arranged such the correct case is sandwiched between both error cases, which do exactly the same thing. Simplify the if condition and pull the correct case outside with the rest of the success code path. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20240702132139.3332013-4-yosryahmed%40google.com
2024-07-02x86/mm: Fix LAM inconsistency during context switchYosry Ahmed
LAM can only be enabled when a process is single-threaded. But _kernel_ threads can temporarily use a single-threaded process's mm. That means that a context-switching kernel thread can race and observe the mm's LAM metadata (mm->context.lam_cr3_mask) change. The context switch code does two logical things with that metadata: populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race, 'cpu_tlbstate.lam' and CR3 can end up out of sync. This de-synchronization is currently harmless. But it is confusing and might lead to warnings or real bugs. Update set_tlbstate_lam_mode() to take in the LAM mask and untag mask instead of an mm_struct pointer, and while we are at it, rename it to cpu_tlbstate_update_lam(). This should also make it clearer that we are updating cpu_tlbstate. In switch_mm_irqs_off(), read the LAM mask once and use it for both the cpu_tlbstate update and the CR3 update. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20240702132139.3332013-3-yosryahmed%40google.com
2024-07-02x86/mm: Use IPIs to synchronize LAM enablementYosry Ahmed
LAM can only be enabled when a process is single-threaded. But _kernel_ threads can temporarily use a single-threaded process's mm. If LAM is enabled by a userspace process while a kthread is using its mm, the kthread will not observe LAM enablement (i.e. LAM will be disabled in CR3). This could be fine for the kthread itself, as LAM only affects userspace addresses. However, if the kthread context switches to a thread in the same userspace process, CR3 may or may not be updated because the mm_struct doesn't change (based on pending TLB flushes). If CR3 is not updated, the userspace thread will run incorrectly with LAM disabled, which may cause page faults when using tagged addresses. Example scenario: CPU 1 CPU 2 /* kthread */ kthread_use_mm() /* user thread */ prctl_enable_tagged_addr() /* LAM enabled on CPU 2 */ /* LAM disabled on CPU 1 */ context_switch() /* to CPU 1 */ /* Switching to user thread */ switch_mm_irqs_off() /* CR3 not updated */ /* LAM is still disabled on CPU 1 */ Synchronize LAM enablement by sending an IPI to all CPUs running with the mm_struct to enable LAM. This makes sure LAM is enabled on CPU 1 in the above scenario before prctl_enable_tagged_addr() returns and userspace starts using tagged addresses, and before it's possible to run the userspace process on CPU 1. In switch_mm_irqs_off(), move reading the LAM mask until after mm_cpumask() is updated. This ensures that if an outdated LAM mask is written to CR3, an IPI is received to update it right after IRQs are re-enabled. [ dhansen: Add a LAM enabling helper and comment it ] Fixes: 82721d8b25d7 ("x86/mm: Handle LAM on context switch") Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20240702132139.3332013-2-yosryahmed%40google.com
2024-04-24x86/cpu: Fix check for RDPKRU in __show_regs()David Kaplan
cpu_feature_enabled(X86_FEATURE_OSPKE) does not necessarily reflect whether CR4.PKE is set on the CPU. In particular, they may differ on non-BSP CPUs before setup_pku() is executed. In this scenario, RDPKRU will #UD causing the system to hang. Fix by checking CR4 for PKE enablement which is always correct for the current CPU. The scenario happens by inserting a WARN* before setup_pku() in identiy_cpu() or some other diagnostic which would lead to calling __show_regs(). [ bp: Massage commit message. ] Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240421191728.32239-1-bp@kernel.org
2024-03-11Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-01-31x86/fred: Allow single-step trap and NMI when starting a new taskH. Peter Anvin (Intel)
Entering a new task is logically speaking a return from a system call (exec, fork, clone, etc.). As such, if ptrace enables single stepping a single step exception should be allowed to trigger immediately upon entering user space. This is not optional. NMI should *never* be disabled in user space. As such, this is an optional, opportunistic way to catch errors. Allow single-step trap and NMI when starting a new task, thus once the new task enters user space, single-step trap and NMI are both enabled immediately. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20231205105030.8698-21-xin3.li@intel.com
2024-01-31x86/fred: Disallow the swapgs instruction when FRED is enabledH. Peter Anvin (Intel)
SWAPGS is no longer needed thus NOT allowed with FRED because FRED transitions ensure that an operating system can _always_ operate with its own GS base address: - For events that occur in ring 3, FRED event delivery swaps the GS base address with the IA32_KERNEL_GS_BASE MSR. - ERETU (the FRED transition that returns to ring 3) also swaps the GS base address with the IA32_KERNEL_GS_BASE MSR. And the operating system can still setup the GS segment for a user thread without the need of loading a user thread GS with: - Using LKGS, available with FRED, to modify other attributes of the GS segment without compromising its ability always to operate with its own GS base address. - Accessing the GS segment base address for a user thread as before using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR. Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE MSR instead of the GS segment's descriptor cache. As such, the operating system never changes its runtime GS base address. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20231205105030.8698-19-xin3.li@intel.com
2024-01-31x86/ptrace: Cleanup the definition of the pt_regs structureXin Li
struct pt_regs is hard to read because the member or section related comments are not aligned with the members. The 'cs' and 'ss' members of pt_regs are type of 'unsigned long' while in reality they are only 16-bit wide. This works so far as the remaining space is unused, but FRED will use the remaining bits for other purposes. To prepare for FRED: - Cleanup the formatting - Convert 'cs' and 'ss' to u16 and embed them into an union with a u64 - Fixup the related printk() format strings Suggested-by: Thomas Gleixner <tglx@linutronix.de> Originally-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20231205105030.8698-14-xin3.li@intel.com
2023-10-20x86/fpu: Clean up FPU switching in the middle of task switchingLinus Torvalds
It happens to work, but it's very very wrong, because our 'current' macro is magic that is supposedly loading a stable value. It just happens to be not quite stable enough and the compilers re-load the value enough for this code to work. But it's wrong. The whole struct fpu *prev_fpu = &prev->fpu; thing in __switch_to() is pretty ugly. There's no reason why we should look at that 'prev_fpu' pointer there, or pass it down. And it only generates worse code, in how it loads 'current' when __switch_to() has the right task pointers. The attached patch not only cleans this up, it actually generates better code too: (a) it removes one push/pop pair at entry/exit because there's one less register used (no 'current') (b) it removes that pointless load of 'current' because it just uses the right argument: - movq %gs:pcpu_hot(%rip), %r12 - testq $16384, (%r12) + testq $16384, (%rdi) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231018184227.446318-1-ubizjak@gmail.com
2023-08-02x86/shstk: Add ARCH_SHSTK_STATUSRick Edgecombe
CRIU and GDB need to get the current shadow stack and WRSS enablement status. This information is already available via /proc/pid/status, but this is inconvenient for CRIU because it involves parsing the text output in an area of the code where this is difficult. Provide a status arch_prctl(), ARCH_SHSTK_STATUS for retrieving the status. Have arg2 be a userspace address, and make the new arch_prctl simply copy the features out to userspace. Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-43-rick.p.edgecombe%40intel.com
2023-08-02x86/shstk: Add ARCH_SHSTK_UNLOCKMike Rapoport
Userspace loaders may lock features before a CRIU restore operation has the chance to set them to whatever state is required by the process being restored. Allow a way for CRIU to unlock features. Add it as an arch_prctl() like the other shadow stack operations, but restrict it being called by the ptrace arch_pctl() interface. [Merged into recent API changes, added commit log and docs] Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-42-rick.p.edgecombe%40intel.com
2023-08-02x86: Introduce userspace API for shadow stackRick Edgecombe
Add three new arch_prctl() handles: - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified feature. Returns 0 on success or a negative value on error. - ARCH_SHSTK_LOCK prevents future disabling or enabling of the specified feature. Returns 0 on success or a negative value on error. The features are handled per-thread and inherited over fork(2)/clone(2), but reset on exec(). Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-27-rick.p.edgecombe%40intel.com
2023-04-30Merge tag 'iommu-updates-v6.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: - Convert to platform remove callback returning void - Extend changing default domain to normal group - Intel VT-d updates: - Remove VT-d virtual command interface and IOASID - Allow the VT-d driver to support non-PRI IOPF - Remove PASID supervisor request support - Various small and misc cleanups - ARM SMMU updates: - Device-tree binding updates: * Allow Qualcomm GPU SMMUs to accept relevant clock properties * Document Qualcomm 8550 SoC as implementing an MMU-500 * Favour new "qcom,smmu-500" binding for Adreno SMMUs - Fix S2CR quirk detection on non-architectural Qualcomm SMMU implementations - Acknowledge SMMUv3 PRI queue overflow when consuming events - Document (in a comment) why ATS is disabled for bypass streams - AMD IOMMU updates: - 5-level page-table support - NUMA awareness for memory allocations - Unisoc driver: Support for reattaching an existing domain - Rockchip driver: Add missing set_platform_dma_ops callback - Mediatek driver: Adjust the dma-ranges - Various other small fixes and cleanups * tag 'iommu-updates-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (82 commits) iommu: Remove iommu_group_get_by_id() iommu: Make iommu_release_device() static iommu/vt-d: Remove BUG_ON in dmar_insert_dev_scope() iommu/vt-d: Remove a useless BUG_ON(dev->is_virtfn) iommu/vt-d: Remove BUG_ON in map/unmap() iommu/vt-d: Remove BUG_ON when domain->pgd is NULL iommu/vt-d: Remove BUG_ON in handling iotlb cache invalidation iommu/vt-d: Remove BUG_ON on checking valid pfn range iommu/vt-d: Make size of operands same in bitwise operations iommu/vt-d: Remove PASID supervisor request support iommu/vt-d: Use non-privileged mode for all PASIDs iommu/vt-d: Remove extern from function prototypes iommu/vt-d: Do not use GFP_ATOMIC when not needed iommu/vt-d: Remove unnecessary checks in iopf disabling path iommu/vt-d: Move PRI handling to IOPF feature path iommu/vt-d: Move pfsid and ats_qdep calculation to device probe path iommu/vt-d: Move iopf code from SVA to IOPF enabling path iommu/vt-d: Allow SVA with device-specific IOPF dmaengine: idxd: Add enable/disable device IOPF feature arm64: dts: mt8186: Add dma-ranges for the parent "soc" node ...
2023-04-06x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outsideKirill A. Shutemov
arch_prctl(ARCH_FORCE_TAGGED_SVA) overrides the default and allows LAM and SVA to co-exist in the process. It is expected by called by the process when it knows what it is doing. arch_prctl() operates on the current process, but the same code is reachable from ptrace where it can be called on arbitrary task. Make it strict and only allow to set MM_CONTEXT_FORCE_TAGGED_SVA for the current process. Fixes: 23e5d9ec2bab ("x86/mm/iommu/sva: Make LAM and SVA mutually exclusive") Suggested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/all/20230403111020.3136-3-kirill.shutemov%40linux.intel.com
2023-04-06x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVAKirill A. Shutemov
Normally, LAM and SVA are mutually exclusive. LAM enabling will fail if SVA is already in use. Correct error code for the failure. EINTR is nonsensical there. Fixes: 23e5d9ec2bab ("x86/mm/iommu/sva: Make LAM and SVA mutually exclusive") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/all/CACT4Y+YfqSMsZArhh25TESmG-U4jO5Hjphz87wKSnTiaw2Wrfw@mail.gmail.com Link: https://lore.kernel.org/all/20230403111020.3136-2-kirill.shutemov%40linux.intel.com
2023-03-16x86/mm/iommu/sva: Make LAM and SVA mutually exclusiveKirill A. Shutemov
IOMMU and SVA-capable devices know nothing about LAM and only expect canonical addresses. An attempt to pass down tagged pointer will lead to address translation failure. By default do not allow to enable both LAM and use SVA in the same process. The new ARCH_FORCE_TAGGED_SVA arch_prctl() overrides the limitation. By using the arch_prctl() userspace takes responsibility to never pass tagged address to the device. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230312112612.31869-12-kirill.shutemov%40linux.intel.com
2023-03-16x86/mm: Provide arch_prctl() interface for LAMKirill A. Shutemov
Add a few of arch_prctl() handles: - ARCH_ENABLE_TAGGED_ADDR enabled LAM. The argument is required number of tag bits. It is rounded up to the nearest LAM mode that can provide it. For now only LAM_U57 is supported, with 6 tag bits. - ARCH_GET_UNTAG_MASK returns untag mask. It can indicates where tag bits located in the address. - ARCH_GET_MAX_TAG_BITS returns the maximum tag bits user can request. Zero if LAM is not supported. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-9-kirill.shutemov%40linux.intel.com
2023-03-16x86: Allow atomic MM_CONTEXT flags settingKirill A. Shutemov
So far there's no need in atomic setting of MM context flags in mm_context_t::flags. The flags set early in exec and never change after that. LAM enabling requires atomic flag setting. The upcoming flag MM_CONTEXT_FORCE_TAGGED_SVA can be set much later in the process lifetime where multiple threads exist. Convert the field to unsigned long and do MM_CONTEXT_* accesses with __set_bit() and test_bit(). No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-3-kirill.shutemov%40linux.intel.com
2023-03-08x86/resctl: fix scheduler confusion with 'current'Linus Torvalds
The implementation of 'current' on x86 is very intentionally special: it is a very common thing to look up, and it uses 'this_cpu_read_stable()' to get the current thread pointer efficiently from per-cpu storage. And the keyword in there is 'stable': the current thread pointer never changes as far as a single thread is concerned. Even if when a thread is preempted, or moved to another CPU, or even across an explicit call 'schedule()' that thread will still have the same value for 'current'. It is, after all, the kernel base pointer to thread-local storage. That's why it's stable to begin with, but it's also why it's important enough that we have that special 'this_cpu_read_stable()' access for it. So this is all done very intentionally to allow the compiler to treat 'current' as a value that never visibly changes, so that the compiler can do CSE and combine multiple different 'current' accesses into one. However, there is obviously one very special situation when the currently running thread does actually change: inside the scheduler itself. So the scheduler code paths are special, and do not have a 'current' thread at all. Instead there are _two_ threads: the previous and the next thread - typically called 'prev' and 'next' (or prev_p/next_p) internally. So this is all actually quite straightforward and simple, and not all that complicated. Except for when you then have special code that is run in scheduler context, that code then has to be aware that 'current' isn't really a valid thing. Did you mean 'prev'? Did you mean 'next'? In fact, even if then look at the code, and you use 'current' after the new value has been assigned to the percpu variable, we have explicitly told the compiler that 'current' is magical and always stable. So the compiler is quite free to use an older (or newer) value of 'current', and the actual assignment to the percpu storage is not relevant even if it might look that way. Which is exactly what happened in the resctl code, that blithely used 'current' in '__resctrl_sched_in()' when it really wanted the new process state (as implied by the name: we're scheduling 'into' that new resctl state). And clang would end up just using the old thread pointer value at least in some configurations. This could have happened with gcc too, and purely depends on random compiler details. Clang just seems to have been more aggressive about moving the read of the per-cpu current_task pointer around. The fix is trivial: just make the resctl code adhere to the scheduler rules of using the prev/next thread pointer explicitly, instead of using 'current' in a situation where it just wasn't valid. That same code is then also used outside of the scheduler context (when a thread resctl state is explicitly changed), and then we will just pass in 'current' as that pointer, of course. There is no ambiguity in that case. The fix may be trivial, but noticing and figuring out what went wrong was not. The credit for that goes to Stephane Eranian. Reported-by: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/ Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/ Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Stephane Eranian <eranian@google.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-14Merge tag 'x86_core_for_v6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Borislav Petkov: - Add the call depth tracking mitigation for Retbleed which has been long in the making. It is a lighterweight software-only fix for Skylake-based cores where enabling IBRS is a big hammer and causes a significant performance impact. What it basically does is, it aligns all kernel functions to 16 bytes boundary and adds a 16-byte padding before the function, objtool collects all functions' locations and when the mitigation gets applied, it patches a call accounting thunk which is used to track the call depth of the stack at any time. When that call depth reaches a magical, microarchitecture-specific value for the Return Stack Buffer, the code stuffs that RSB and avoids its underflow which could otherwise lead to the Intel variant of Retbleed. This software-only solution brings a lot of the lost performance back, as benchmarks suggest: https://lore.kernel.org/all/20220915111039.092790446@infradead.org/ That page above also contains a lot more detailed explanation of the whole mechanism - Implement a new control flow integrity scheme called FineIBT which is based on the software kCFI implementation and uses hardware IBT support where present to annotate and track indirect branches using a hash to validate them - Other misc fixes and cleanups * tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits) x86/paravirt: Use common macro for creating simple asm paravirt functions x86/paravirt: Remove clobber bitmask from .parainstructions x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit x86/Kconfig: Enable kernel IBT by default x86,pm: Force out-of-line memcpy() objtool: Fix weak hole vs prefix symbol objtool: Optimize elf_dirty_reloc_sym() x86/cfi: Add boot time hash randomization x86/cfi: Boot time selection of CFI scheme x86/ibt: Implement FineIBT objtool: Add --cfi to generate the .cfi_sites section x86: Add prefix symbols for function padding objtool: Add option to generate prefix symbols objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf objtool: Slice up elf_create_section_symbol() kallsyms: Revert "Take callthunks into account" x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces x86/retpoline: Fix crash printing warning x86/paravirt: Fix a !PARAVIRT build warning ...
2022-11-22x86/cpu: Switch to cpu_feature_enabled() for X86_FEATURE_XENPVJuergen Gross
Convert the remaining cases of static_cpu_has(X86_FEATURE_XENPV) and boot_cpu_has(X86_FEATURE_XENPV) to use cpu_feature_enabled(), allowing more efficient code in case the kernel is configured without CONFIG_XEN_PV. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20221104072701.20283-6-jgross@suse.com
2022-10-17x86/percpu: Move irq_stack variables next to current_taskThomas Gleixner
Further extend struct pcpu_hot with the hard and soft irq stack pointers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111145.599170752@infradead.org
2022-10-17x86/percpu: Move current_top_of_stack next to current_taskThomas Gleixner
Extend the struct pcpu_hot cacheline with current_top_of_stack; another very frequently used value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org
2022-10-17x86: Put hot per CPU variables into a structThomas Gleixner
The layout of per-cpu variables is at the mercy of the compiler. This can lead to random performance fluctuations from build to build. Create a structure to hold some of the hottest per-cpu variables, starting with current_task. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111145.179707194@infradead.org
2022-10-03x86: kmsan: skip shadow checks in __switch_to()Alexander Potapenko
When instrumenting functions, KMSAN obtains the per-task state (mostly pointers to metadata for function arguments and return values) once per function at its beginning, using the `current` pointer. Every time the instrumented function calls another function, this state (`struct kmsan_context_state`) is updated with shadow/origin data of the passed and returned values. When `current` changes in the low-level arch code, instrumented code can not notice that, and will still refer to the old state, possibly corrupting it or using stale data. This may result in false positive reports. To deal with that, we need to apply __no_kmsan_checks to the functions performing context switching - this will result in skipping all KMSAN shadow checks and marking newly created values as initialized, preventing all false positive reports in those functions. False negatives are still possible, but we expect them to be rare and impersistent. Link: https://lkml.kernel.org/r/20220915150417.722975-34-glider@google.com Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13x86/prctl: Remove pointless task argumentThomas Gleixner
The functions invoked via do_arch_prctl_common() can only operate on the current task and none of these function uses the task argument. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/87lev7vtxj.ffs@tglx
2022-03-15x86: Remove toolchain check for X32 ABI capabilityMasahiro Yamada
Commit 0bf6276392e9 ("x32: Warn and disable rather than error if binutils too old") added a small test in arch/x86/Makefile because binutils 2.22 or newer is needed to properly support elf32-x86-64. This check is no longer necessary, as the minimum supported version of binutils is 2.23, which is enforced at configuration time with scripts/min-tool-version.sh. Remove this check and replace all uses of CONFIG_X86_X32 with CONFIG_X86_X32_ABI, as two symbols are no longer necessary. [nathan: Rebase, fix up a few places where CONFIG_X86_X32 was still used, and simplify commit message to satisfy -tip requirements] Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220314194842.3452-2-nathan@kernel.org
2021-10-20x86/fpu: Move context switch and exit to user inlines into sched.hThomas Gleixner
internal.h is a kitchen sink which needs to get out of the way to prepare for the upcoming changes. Move the context switch and exit to user inlines into a separate header, which is all that code needs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011539.349132461@linutronix.de
2021-10-20x86/fpu: Remove pointless argument from switch_fpu_finish()Thomas Gleixner
Unused since the FPU switching rework. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011538.433135710@linutronix.de
2021-06-23x86/fpu: Add PKRU storage outside of task XSAVE bufferDave Hansen
PKRU is currently partly XSAVE-managed and partly not. It has space in the task XSAVE buffer and is context-switched by XSAVE/XRSTOR. However, it is switched more eagerly than FPU because there may be a need for PKRU to be up-to-date for things like copy_to/from_user() since PKRU affects user-permission memory accesses, not just accesses from userspace itself. This leaves PKRU in a very odd position. XSAVE brings very little value to the table for how Linux uses PKRU except for signal related XSTATE handling. Prepare to move PKRU away from being XSAVE-managed. Allocate space in the thread_struct for it and save/restore it in the context-switch path separately from the XSAVE-managed features. task->thread_struct.pkru is only valid when the task is scheduled out. For the current task the authoritative source is the hardware, i.e. it has to be retrieved via rdpkru(). Leave the XSAVE code in place for now to ensure bisectability. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121456.399107624@linutronix.de
2021-06-23x86/cpu: Sanitize X86_FEATURE_OSPKEThomas Gleixner
X86_FEATURE_OSPKE is enabled first on the boot CPU and the feature flag is set. Secondary CPUs have to enable CR4.PKE as well and set their per CPU feature flag. That's ineffective because all call sites have checks for boot_cpu_data. Make it smarter and force the feature flag when PKU is enabled on the boot cpu which allows then to use cpu_feature_enabled(X86_FEATURE_OSPKE) all over the place. That either compiles the code out when PKEY support is disabled in Kconfig or uses a static_cpu_has() for the feature check which makes a significant difference in hotpaths, e.g. context switch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121455.305113644@linutronix.de
2021-06-23x86/pkeys: Move read_pkru() and write_pkru()Dave Hansen
write_pkru() was originally used just to write to the PKRU register. It was mercifully short and sweet and was not out of place in pgtable.h with some other pkey-related code. But, later work included a requirement to also modify the task XSAVE buffer when updating the register. This really is more related to the XSAVE architecture than to paging. Move the read/write_pkru() to asm/pkru.h. pgtable.h won't miss them. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121455.102647114@linutronix.de
2021-02-10x86/irq: Sanitize irq stack trackingThomas Gleixner
The recursion protection for hard interrupt stacks is an unsigned int per CPU variable initialized to -1 named __irq_count. The irq stack switching is only done when the variable is -1, which creates worse code than just checking for 0. When the stack switching happens it uses this_cpu_add/sub(1), but there is no reason to do so. It simply can use straight writes. This is a historical leftover from the low level ASM code which used inc and jz to make a decision. Rename it to hardirq_stack_inuse, make it a bool and use plain stores. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210210002512.228830141@linutronix.de
2020-10-26x86: Reclaim TIF_IA32 and TIF_X32Gabriel Krisman Bertazi
Now that these flags are no longer used, reclaim those TIF bits. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201004032536.1229030-11-krisman@collabora.com
2020-10-26x86/mm: Convert mmu context ia32_compat into a proper flags fieldGabriel Krisman Bertazi
The ia32_compat attribute is a weird thing. It mirrors TIF_IA32 and TIF_X32 and is used only in two very unrelated places: (1) to decide if the vsyscall page is accessible (2) for uprobes to find whether the patched instruction is 32 or 64 bit. In preparation to remove the TIF flags, a new mechanism is required for ia32_compat, but given its odd semantics, adding a real flags field which configures these specific behaviours is the best option. So, set_personality_x64() can ask for the vsyscall page, which is not available in x32/ia32 and set_personality_ia32() can configure the uprobe code as needed. uprobe cannot rely on other methods like user_64bit_mode() to decide how to patch, so it needs some specific flag like this. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski<luto@kernel.org> Link: https://lore.kernel.org/r/20201004032536.1229030-10-krisman@collabora.com
2020-10-26x86/elf: Use e_machine to select start_thread for x32Gabriel Krisman Bertazi
Since TIF_X32 is going away, avoid using it to find the ELF type in compat_start_thread. According to SysV AMD64 ABI Draft, an AMD64 ELF object using ILP32 must have ELFCLASS32 with (E_MACHINE == EM_X86_64), so use that ELF field to differentiate a x32 object from a IA32 object when executing start_thread() in compat mode. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20201004032536.1229030-7-krisman@collabora.com
2020-08-24x86/fsgsbase: Replace static_cpu_has() with boot_cpu_has()Borislav Petkov
ptrace and prctl() are not really fast paths to warrant the use of static_cpu_has() and cause alternatives patching for no good reason. Replace with boot_cpu_has() which is simple and fast enough. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200818103715.32736-1-bp@alien8.de
2020-08-14x86/fsgsbase/64: Fix NULL deref in 86_fsgsbase_read_taskEric Dumazet
syzbot found its way in 86_fsgsbase_read_task() and triggered this oops: KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 6866 Comm: syz-executor262 Not tainted 5.8.0-syzkaller #0 RIP: 0010:x86_fsgsbase_read_task+0x16d/0x310 arch/x86/kernel/process_64.c:393 Call Trace: putreg32+0x3ab/0x530 arch/x86/kernel/ptrace.c:876 genregs32_set arch/x86/kernel/ptrace.c:1026 [inline] genregs32_set+0xa4/0x100 arch/x86/kernel/ptrace.c:1006 copy_regset_from_user include/linux/regset.h:326 [inline] ia32_arch_ptrace arch/x86/kernel/ptrace.c:1061 [inline] compat_arch_ptrace+0x36c/0xd90 arch/x86/kernel/ptrace.c:1198 __do_compat_sys_ptrace kernel/ptrace.c:1420 [inline] __se_compat_sys_ptrace kernel/ptrace.c:1389 [inline] __ia32_compat_sys_ptrace+0x220/0x2f0 kernel/ptrace.c:1389 do_syscall_32_irqs_on arch/x86/entry/common.c:84 [inline] __do_fast_syscall_32+0x57/0x80 arch/x86/entry/common.c:126 do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:149 entry_SYSENTER_compat_after_hwframe+0x4d/0x5c This can happen if ptrace() or sigreturn() pokes an LDT selector into FS or GS for a task with no LDT and something tries to read the base before a return to usermode notices the bad selector and fixes it. The fix is to make sure ldt pointer is not NULL. Fixes: 07e1d88adaae ("x86/fsgsbase/64: Fix ptrace() to read the FS/GS base accurately") Co-developed-by: Jann Horn <jannh@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Markus T Metzger <markus.t.metzger@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-04Merge tag 'x86-fsgsbase-2020-08-04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fsgsbase from Thomas Gleixner: "Support for FSGSBASE. Almost 5 years after the first RFC to support it, this has been brought into a shape which is maintainable and actually works. This final version was done by Sasha Levin who took it up after Intel dropped the ball. Sasha discovered that the SGX (sic!) offerings out there ship rogue kernel modules enabling FSGSBASE behind the kernels back which opens an instantanious unpriviledged root hole. The FSGSBASE instructions provide a considerable speedup of the context switch path and enable user space to write GSBASE without kernel interaction. This enablement requires careful handling of the exception entries which go through the paranoid entry path as they can no longer rely on the assumption that user GSBASE is positive (as enforced via prctl() on non FSGSBASE enabled systemn). All other entries (syscalls, interrupts and exceptions) can still just utilize SWAPGS unconditionally when the entry comes from user space. Converting these entries to use FSGSBASE has no benefit as SWAPGS is only marginally slower than WRGSBASE and locating and retrieving the kernel GSBASE value is not a free operation either. The real benefit of RD/WRGSBASE is the avoidance of the MSR reads and writes. The changes come with appropriate selftests and have held up in field testing against the (sanitized) Graphene-SGX driver" * tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) x86/fsgsbase: Fix Xen PV support x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbase selftests/x86/fsgsbase: Add a missing memory constraint selftests/x86/fsgsbase: Fix a comment in the ptrace_write_gsbase test selftests/x86: Add a syscall_arg_fault_64 test for negative GSBASE selftests/x86/fsgsbase: Test ptracer-induced GS base write with FSGSBASE selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base write Documentation/x86/64: Add documentation for GS/FS addressing mode x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2 x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit x86/entry/64: Introduce the FIND_PERCPU_BASE macro x86/entry/64: Switch CR3 before SWAPGS in paranoid entry x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation x86/process/64: Use FSGSBASE instructions on thread copy and ptrace x86/process/64: Use FSBSBASE in switch_to() if available x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE ...
2020-07-22x86/dumpstack: Add log_lvl to __show_regs()Dmitry Safonov
show_trace_log_lvl() provides x86 platform-specific way to unwind backtrace with a given log level. Unfortunately, registers dump(s) are not printed with the same log level - instead, KERN_DEFAULT is always used. Arista's switches uses quite common setup with rsyslog, where only urgent messages goes to console (console_log_level=KERN_ERR), everything else goes into /var/log/ as the console baud-rate often is indecently slow (9600 bps). Backtrace dumps without registers printed have proven to be as useful as morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED (which I believe is still the most elegant way to fix raciness of sysrq[1]) the log level should be passed down the stack to register dumping functions. Besides, there is a potential use-case for printing traces with KERN_DEBUG level [2] (where registers dump shouldn't appear with higher log level). Add log_lvl parameter to __show_regs(). Keep the used log level intact to separate visible change. [1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/ [2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/ Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Petr Mladek <pmladek@suse.com> Link: https://lkml.kernel.org/r/20200629144847.492794-3-dima@arista.com
2020-07-22x86/dumpstack: Add log_lvl to show_iret_regs()Dmitry Safonov
show_trace_log_lvl() provides x86 platform-specific way to unwind backtrace with a given log level. Unfortunately, registers dump(s) are not printed with the same log level - instead, KERN_DEFAULT is always used. Arista's switches uses quite common setup with rsyslog, where only urgent messages goes to console (console_log_level=KERN_ERR), everything else goes into /var/log/ as the console baud-rate often is indecently slow (9600 bps). Backtrace dumps without registers printed have proven to be as useful as morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED (which I believe is still the most elegant way to fix raciness of sysrq[1]) the log level should be passed down the stack to register dumping functions. Besides, there is a potential use-case for printing traces with KERN_DEBUG level [2] (where registers dump shouldn't appear with higher log level). Add log_lvl parameter to show_iret_regs() as a preparation to add it to __show_regs() and show_regs_if_on_stack(). [1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/ [2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/ Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Petr Mladek <pmladek@suse.com> Link: https://lkml.kernel.org/r/20200629144847.492794-2-dima@arista.com
2020-07-01x86/fsgsbase: Fix Xen PV supportAndy Lutomirski
On Xen PV, SWAPGS doesn't work. Teach __rdfsbase_inactive() and __wrgsbase_inactive() to use rdmsrl()/wrmsrl() on Xen PV. The Xen pvop code will understand this and issue the correct hypercalls. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/f07c08f178fe9711915862b656722a207cd52c28.1593192140.git.luto@kernel.org
2020-07-01x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbaseAndy Lutomirski
Debuggers expect that doing PTRACE_GETREGS, then poking at a tracee and maybe letting it run for a while, then doing PTRACE_SETREGS will put the tracee back where it was. In the specific case of a 32-bit tracer and tracee, the PTRACE_GETREGS/SETREGS data structure doesn't have fs_base or gs_base fields, so FSBASE and GSBASE fields are never stored anywhere. Everything used to still work because nonzero FS or GS would result full reloads of the segment registers when the tracee resumes, and the bases associated with FS==0 or GS==0 are irrelevant to 32-bit code. Adding FSGSBASE support broke this: when FSGSBASE is enabled, FSBASE and GSBASE are now restored independently of FS and GS for all tasks when context-switched in. This means that, if a 32-bit tracer restores a previous state using PTRACE_SETREGS but the tracee's pre-restore and post-restore bases don't match, then the tracee is resumed with the wrong base. Fix it by explicitly loading the base when a 32-bit tracer pokes FS or GS on a 64-bit kernel. Also add a test case. Fixes: 673903495c85 ("x86/process/64: Use FSBSBASE in switch_to() if available") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/229cc6a50ecbb701abd50fe4ddaf0eda888898cd.1593192140.git.luto@kernel.org
2020-06-18x86/process/64: Use FSGSBASE instructions on thread copy and ptraceChang S. Bae
When FSGSBASE is enabled, copying threads and reading fsbase and gsbase using ptrace must read the actual values. When copying a thread, use save_fsgs() and copy the saved values. For ptrace, the bases must be read from memory regardless of the selector if FSGSBASE is enabled. [ tglx: Invoke __rdgsbase_inactive() with interrupts disabled ] [ luto: Massage changelog ] Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1557309753-24073-9-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-8-sashal@kernel.org
2020-06-18x86/process/64: Use FSBSBASE in switch_to() if availableAndy Lutomirski
With the new FSGSBASE instructions, FS and GSABSE can be efficiently read and writen in __switch_to(). Use that capability to preserve the full state. This will enable user code to do whatever it wants with the new instructions without any kernel-induced gotchas. (There can still be architectural gotchas: movl %gs,%eax; movl %eax,%gs may change GSBASE if WRGSBASE was used, but users are expected to read the CPU manual before doing things like that.) This is a considerable speedup. It seems to save about 100 cycles per context switch compared to the baseline 4.6-rc1 behavior on a Skylake laptop. This is mostly due to avoiding the WRMSR operation. [ chang: 5~10% performance improvements were seen with a context switch benchmark that ran threads with different FS/GSBASE values (to the baseline 4.16). Minor edit on the changelog. ] [ tglx: Masaage changelog ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1557309753-24073-8-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-6-sashal@kernel.org
2020-06-18x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASEThomas Gleixner
save_fsgs_for_kvm() is invoked via vcpu_enter_guest() kvm_x86_ops.prepare_guest_switch(vcpu) vmx_prepare_switch_to_guest() save_fsgs_for_kvm() with preemption disabled, but interrupts enabled. The upcoming FSGSBASE based GS safe needs interrupts to be disabled. This could be done in the helper function, but that function is also called from switch_to() which has interrupts disabled already. Disable interrupts inside save_fsgs_for_kvm() and rename the function to current_save_fsgs() so it can be invoked from other places. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200528201402.1708239-7-sashal@kernel.org
2020-06-18x86/fsgsbase/64: Enable FSGSBASE instructions in helper functionsChang S. Bae
Add cpu feature conditional FSGSBASE access to the relevant helper functions. That allows to accelerate certain FS/GS base operations in subsequent changes. Note, that while possible, the user space entry/exit GSBASE operations are not going to use the new FSGSBASE instructions. The reason is that it would require additional storage for the user space value which adds more complexity to the low level code and experiments have shown marginal benefit. This may be revisited later but for now the SWAPGS based handling in the entry code is preserved except for the paranoid entry/exit code. To preserve the SWAPGS entry mechanism introduce __[rd|wr]gsbase_inactive() helpers. Note, for Xen PV, paravirt hooks can be added later as they might allow a very efficient but different implementation. [ tglx: Massaged changelog, convert it to noinstr and force inline native_swapgs() ] Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1557309753-24073-7-git-send-email-chang.seok.bae@intel.com Link: https://lkml.kernel.org/r/20200528201402.1708239-5-sashal@kernel.org
2020-06-09mm: don't include asm/pgtable.h if linux/mm.h is already includedMike Rapoport
Patch series "mm: consolidate definitions of page table accessors", v2. The low level page table accessors (pXY_index(), pXY_offset()) are duplicated across all architectures and sometimes more than once. For instance, we have 31 definition of pgd_offset() for 25 supported architectures. Most of these definitions are actually identical and typically it boils down to, e.g. static inline unsigned long pmd_index(unsigned long address) { return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1); } static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) { return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); } These definitions can be shared among 90% of the arches provided XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined. For architectures that really need a custom version there is always possibility to override the generic version with the usual ifdefs magic. These patches introduce include/linux/pgtable.h that replaces include/asm-generic/pgtable.h and add the definitions of the page table accessors to the new header. This patch (of 12): The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the functions involving page table manipulations, e.g. pte_alloc() and pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h> in the files that include <linux/mm.h>. The include statements in such cases are remove with a simple loop: for f in $(git grep -l "include <linux/mm.h>") ; do sed -i -e '/include <asm\/pgtable.h>/ d' $f done Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-06x86/resctrl: Rename asm/resctrl_sched.h to asm/resctrl.hReinette Chatre
asm/resctrl_sched.h is dedicated to the code used for configuration of the CPU resource control state when a task is scheduled. Rename resctrl_sched.h to resctrl.h in preparation of additions that will no longer make this file dedicated to work done during scheduling. No functional change. Suggested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/6914e0ef880b539a82a6d889f9423496d471ad1d.1588715690.git.reinette.chatre@intel.com
2020-03-21x86: Remove unneeded includesBrian Gerst
Clean up includes of and in <asm/syscalls.h> Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200313195144.164260-19-brgerst@gmail.com