path: root/arch/x86
Age        Commit message        Author
2023-07-19  Merge tag 'v6.5-rc2' into sched/core, to pick up fixes  [Ingo Molnar]
Sync with upstream fixes before applying EEVDF. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-18  x86/bugs: Increase the x86 bugs vector size to two u32s  [Borislav Petkov (AMD)]
There was never any doubt in my mind that the bug bits would eventually outgrow a single u32. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-17  x86/mm: Fix PAT bit missing from page protection modify mask  [Janusz Krzysztofik]
Visible glitches have been observed when running graphics applications on Linux under Xen hypervisor. Those observations have been confirmed with failures from kms_pwrite_crc Intel GPU test that verifies data coherency of DRM frame buffer objects using hardware CRC checksums calculated by display controllers, exposed to userspace via debugfs. Affected processing paths have then been identified with new IGT test variants that mmap the objects using different methods and caching modes [1]. When running as a Xen PV guest, Linux uses Xen provided PAT configuration which is different from its native one. In particular, Xen specific PTE encoding of write-combining caching, likely used by graphics applications, differs from the Linux default one found among statically defined minimal set of supported modes. Since Xen defines PTE encoding of the WC mode as _PAGE_PAT, it no longer belongs to the minimal set, depends on correct handling of _PAGE_PAT bit, and can be mismatched with write-back caching. When a user calls mmap() for a DRM buffer object, DRM device specific .mmap file operation, called from mmap_region(), takes care of setting PTE encoding bits in a vm_page_prot field of an associated virtual memory area structure. Unfortunately, _PAGE_PAT bit is not preserved when the vma's .vm_flags are then applied to .vm_page_prot via vm_set_page_prot(). Bits to be preserved are determined with _PAGE_CHG_MASK symbol that doesn't cover _PAGE_PAT. As a consequence, WB caching is requested instead of WC when running under Xen (also, WP is silently changed to WT, and UC downgraded to UC_MINUS). When running on bare metal, WC is not affected, but WP and WT extra modes are unintentionally replaced with WC and UC, respectively. WP and WT modes, encoded with _PAGE_PAT bit set, were introduced by commit 281d4078bec3 ("x86: Make page cache mode a real type"). Care was taken to extend _PAGE_CACHE_MASK symbol with that additional bit, but that symbol has never been used for identification of bits preserved when applying page protection flags. Support for all cache modes under Xen, including the problematic WC mode, was then introduced by commit 47591df50512 ("xen: Support Xen pv-domains using PAT"). The issue needs to be fixed by including _PAGE_PAT bit into a bitmask used by pgprot_modify() for selecting bits to be preserved. We can do that either internally to pgprot_modify() (as initially proposed), or by making _PAGE_PAT a part of _PAGE_CHG_MASK. If we go for the latter then, since _PAGE_PAT is the same as _PAGE_PSE, we need to note that _HPAGE_CHG_MASK -- a huge pmds' counterpart of _PAGE_CHG_MASK, introduced by commit c489f1257b8c ("thp: add pmd_modify"), defined as (_PAGE_CHG_MASK | _PAGE_PSE) -- will no longer differ from _PAGE_CHG_MASK. If such modification of _PAGE_CHG_MASK was irrelevant to its users then one might wonder why that new _HPAGE_CHG_MASK symbol was introduced instead of reusing the existing one with that otherwise irrelevant bit (_PAGE_PSE in that case) added. Add _PAGE_PAT to _PAGE_CHG_MASK and _PAGE_PAT_LARGE to _HPAGE_CHG_MASK for symmetry. Split out common bits from both symbols to a common symbol for clarity. 
[ dhansen: tweak the solution changelog description ] [1] https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/commit/0f0754413f14 Fixes: 281d4078bec3 ("x86: Make page cache mode a real type") Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Link: https://gitlab.freedesktop.org/drm/intel/-/issues/7648 Link: https://lore.kernel.org/all/20230710073613.8006-2-janusz.krzysztofik%40linux.intel.com
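For orientation, the mask relationship described above works out to roughly the following sketch (the common-bit list is abbreviated and the _COMMON_PAGE_CHG_MASK name mirrors the changelog's "common symbol"; this is not a verbatim copy of the final header):

  /* sketch: split the shared bits out, then add PAT / PAT_LARGE on top */
  #define _COMMON_PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
                                  _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY /* ... */)
  /* _PAGE_PAT is now preserved across pgprot_modify()/pte_modify() */
  #define _PAGE_CHG_MASK         (_COMMON_PAGE_CHG_MASK | _PAGE_PAT)
  /* huge-page counterpart keeps PSE and gains PAT_LARGE for symmetry */
  #define _HPAGE_CHG_MASK        (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE)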
2023-07-17  x86/cpu/amd: Add a Zenbleed fix  [Borislav Petkov (AMD)]
Add a fix for the Zen2 VZEROUPPER data corruption bug where under certain circumstances executing VZEROUPPER can cause register corruption or leak data. The optimal fix is through microcode but in the case the proper microcode revision has not been applied, enable a fallback fix using a chicken bit. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-17  x86/cpu/amd: Move the errata checking functionality up  [Borislav Petkov (AMD)]
Avoid new and remove old forward declarations. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-16  Merge tag 'perf_urgent_for_v6.5_rc2' of ...  [Linus Torvalds]
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Fix a lockdep warning when the event given is the first one, no event group exists yet but the code still goes and iterates over event siblings * tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
2023-07-14  Merge tag 'x86_urgent_for_6.5_rc2' of ...  [Linus Torvalds]
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CFI fixes from Peter Zijlstra:
 "Fix kCFI/FineIBT weaknesses

  The primary bug Alyssa noticed was that with FineIBT enabled, function
  prologues have a spurious ENDBR instruction:

    __cfi_foo:
        endbr64
        subl    $hash, %r10d
        jz      1f
        ud2
        nop
    1:
    foo:
        endbr64 <--- *sadface*

  This means that any indirect call that fails to target the __cfi symbol and instead targets (the regular old) foo+0 will succeed due to that second ENDBR.

  Fixing this led to the discovery of a single indirect call that was still doing this: ret_from_fork(). Since that's an assembly stub the compiler would not generate the proper kCFI indirect call magic and it would not get patched.

  Brian came up with the most comprehensive fix -- convert the thing to C with only a very thin asm wrapper. This ensures the kernel thread bootstrap is a proper kCFI call.

  While discussing all this, Kees noted that kCFI hashes could/should be poisoned to seal all functions whose address is never taken, further limiting the valid kCFI targets -- much like we already do for IBT.

  So what was a 'simple' observation and fix cascaded into a bunch of inter-related CFI infrastructure fixes"

* tag 'x86_urgent_for_6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y x86/fineibt: Poison ENDBR at +0 x86: Rewrite ret_from_fork() in C x86/32: Remove schedule_tail_wrapper() x86/cfi: Extend ENDBR sealing to kCFI x86/alternative: Rename apply_ibt_endbr() x86/cfi: Extend {JMP,CAKK}_NOSPEC comment
2023-07-14  x86/tsc: Extend watchdog check exemption to 4-Sockets platform  [Feng Tang]
There were reports again that the TSC clocksource on 4-socket x86 servers was wrongly judged as 'unstable' by 'jiffies' and other watchdogs, and disabled [1][2]. Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms") was introduced to deal with these false alarms of TSC instability, covering qualified platforms of 2 sockets or fewer. And from the history of chasing TSC issues, Thomas and Peter have only seen real TSC synchronization issues on 8-socket machines. So extend the exemption to 4 sockets to fix the issue. Rui also proposed another way to disable 'jiffies' as a clocksource watchdog [3], which can also solve the problem in [1] in an architecture-independent way, but can't cure the problem in [2], whose watchdog is HPET or PMTIMER, while 'jiffies' is mostly used as the watchdog in the boot phase. 'nr_online_nodes' has known accuracy problems for cases like platforms with CPU-less memory nodes, sub-NUMA clustering enabled, fakenuma, the kernel cmdline parameter 'maxcpus=', etc. The harmful case is the 'maxcpus' one, which could underestimate the package number and disable the watchdog, but the bright side is that it is mostly for debug usage. All these will be addressed in other patches, as discussed in thread [4]. [1]. https://lore.kernel.org/all/9d3bf570-3108-0336-9c52-9bee15767d29@huawei.com/ [2]. https://lore.kernel.org/lkml/06df410c-2177-4671-832f-339cff05b1d9@paulmck-laptop/ [3]. https://lore.kernel.org/all/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/ [4]. https://lore.kernel.org/all/20221021062131.1826810-1-feng.tang@intel.com/ Reported-by: Yu Liao <liaoyu15@huawei.com> Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14  ACPI: processor: Introduce acpi_processor_osc()  [Michal Wilczynski]
The processor _OSC method is already used for a workaround introduced in commit a21211672c9a ("ACPI / processor: Request native thermal interrupt handling via _OSC"), but in accordance with ACPI 6.5 (and earlier), it should be used for negotiating all of the processor capabilities instead of _PDC (which has been deprecated since ACPI 3.0 and got removed from ACPI 6.5 entirely). Create a new callback function called acpi_processor_osc() to be invoked for every processor object and processor device in the ACPI namespace, in analogy with the already existing acpi_hwp_native_thermal_lvt_osc(). Make this function implement the workaround mentioned above and convey all of the OSPM processor support information to the platform firmware by setting all of the appropriate processor capabilities bits before evaluating _OSC for the given processor. For this purpose, make it call arch_acpi_set_proc_cap_bits() and modify the latter to set ACPI_PROC_CAP_COLLAB_PROC_PERF along with the other processor capabilities bits. Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> [ rjw: Subject and changelog edits, whitespace fixup ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-14  ACPI: processor: Set CAP_SMP_T_SWCOORD in arch_acpi_set_proc_cap_bits()  [Michal Wilczynski]
Currently, ACPI_PROC_CAP_SMP_T_SWCOORD is set in acpi_set_pdc_bits(), but it is not _PDC-specific. It should be set along with the other processor capability bits. Move the setting of ACPI_PROC_CAP_SMP_T_SWCOORD to arch_acpi_set_proc_cap_bits(). Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-14  ACPI: processor: Clear C_C2C3_FFH and C_C1_FFH in arch_acpi_set_proc_cap_bits()  [Michal Wilczynski]
Currently, arch_acpi_set_proc_cap_bits() clears the ACPI_PDC_C_C2C3_FFH bit when the MWAIT instruction is not supported. It should also clear ACPI_PDC_C_C1_FFH, as when MWAIT is not supported, C1 is entered by executing the HLT instruction. Quote from the C_C1_FFH description: "If set, OSPM is capable of performing native C State instructions (beyond halt) for the C1 handler in multi-processor configurations". As without MWAIT there are no native C-state instructions beyond HALT, this bit should be toggled off. Clear ACPI_PDC_C_C1_FFH and ACPI_PDC_C_C2C3_FFH in arch_acpi_set_proc_cap_bits() in case MWAIT is not supported or overridden. Remove setting those bits from the processor_pdc.c code. Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-14  ACPI: processor: Rename ACPI_PDC symbols  [Michal Wilczynski]
The prefix in the names of the ACPI_PDC symbols suggests that they are only relevant for _PDC, but in fact they can also be used in the _OSC. Change that prefix to a more generic ACPI_PROC_CAP that will better reflect the purpose of those symbols as they represent bits in a general processor capabilities buffer. Rename pdc_intel.h to proc_cap_intel.h to follow the change of the symbol name prefix. No intentional functional impact. Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-14  ACPI: processor: Refactor arch_acpi_set_pdc_bits()  [Michal Wilczynski]
The capabilities buffer modified by the arch_acpi_set_pdc_bits() is not _PDC-specific, as it can be used by _OSC too. Change the name of that function to better reflect its independence from _PDC and make it take the capabilities buffer address as the argument directly, without any offset, as _OSC and _PDC use different capabilities buffer offsets. No intentional functional impact. Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-14  crypto: x86/aesni - Align the address before aes_set_key_common()  [Chang S. Bae]
aes_set_key_common() performs runtime alignment of the void *raw_ctx pointer. This facilitates consistent access to the 16-byte-aligned address during key extension. However, the alignment is already handled in the GCM-related setkey functions before invoking the common function. Consequently, the alignment in the common function is unnecessary for those functions. To establish a consistent approach throughout the glue code, remove the aes_ctx() call from its current location. Instead, place it at each call site where the runtime alignment is currently absent. Link: https://lore.kernel.org/lkml/20230605024623.GA4653@quark.localdomain/ Suggested-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Cc: linux-crypto@vger.kernel.org Cc: x86@kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-07-13  Merge tag 'for-netdev' of ...  [Jakub Kicinski]
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-07-13 We've added 67 non-merge commits during the last 15 day(s) which contain a total of 106 files changed, 4444 insertions(+), 619 deletions(-). The main changes are: 1) Fix bpftool build in presence of stale vmlinux.h, from Alexander Lobakin. 2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress, from Alexei Starovoitov. 3) Teach verifier actual bounds of bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling, from Andrii Nakryiko. 4) Introduce bpf map element count, from Anton Protopopov. 5) Check skb ownership against full socket, from Kui-Feng Lee. 6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong. 7) Export rcu_request_urgent_qs_task, from Paul E. McKenney. 8) Fix BTF walking of unions, from Yafang Shao. 9) Extend link_info for kprobe_multi and perf_event links, from Yafang Shao. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits) selftests/bpf: Add selftest for PTR_UNTRUSTED bpf: Fix an error in verifying a field in a union selftests/bpf: Add selftests for nested_trust bpf: Fix an error around PTR_UNTRUSTED selftests/bpf: add testcase for TRACING with 6+ arguments bpf, x86: allow function arguments up to 12 for TRACING bpf, x86: save/restore regs with BPF_DW size bpftool: Use "fallthrough;" keyword instead of comments bpf: Add object leak check. bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu. bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu(). selftests/bpf: Improve test coverage of bpf_mem_alloc. rcu: Export rcu_request_urgent_qs_task() bpf: Allow reuse from waiting_for_gp_ttrace list. bpf: Add a hint to allocated objects. bpf: Change bpf_mem_cache draining process. bpf: Further refactor alloc_bulk(). bpf: Factor out inc/dec of active flag into helpers. bpf: Refactor alloc_bulk(). bpf: Let free_all() return the number of freed elements. ... ==================== Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-13  bpf, x86: allow function arguments up to 12 for TRACING  [Menglong Dong]
For now, a BPF program of type BPF_PROG_TYPE_TRACING can only be used on kernel functions whose argument count is less than or equal to 6, not counting '> 8 bytes' struct arguments. This is not friendly at all, as too many functions have an argument count greater than 6. According to the current kernel version, below are statistics of function argument counts:

  argument count | function count
               7 | 704
               8 | 270
               9 |  84
              10 |  47
              11 |  47
              12 |  27
              13 |  22
              14 |   5
              15 |   0
              16 |   1

Therefore, let's enhance it by increasing the function argument count allowed in arch_prepare_bpf_trampoline(), for now only on x86_64. For the case where we don't need to call the origin function, which means without BPF_TRAMP_F_CALL_ORIG, we only need to copy the function arguments stored in the caller's frame to the current frame. The 7th and later arguments are stored at "$rbp + 0x18", and they will be copied to the stack area following where register values are saved. For the case with BPF_TRAMP_F_CALL_ORIG, we need to prepare the arguments on the stack before calling the origin function, which means we need to allocate an extra "8 * (arg_count - 6)" bytes at the top of the stack. Note, no data should be pushed to the stack before calling the origin function, so the 'rbx' value will be stored at a stack position higher than where the stack arguments are stored for BPF_TRAMP_F_CALL_ORIG. According to Yonghong's research, struct members should be either all in registers or all on the stack. Meanwhile, the compiler will pass an argument in registers only if the remaining registers can hold it. Therefore, we need to save the arguments in order; otherwise the args can end up disordered. For example:

  struct foo_struct {
      long a;
      int b;
  };
  int foo(char, char, char, char, char, struct foo_struct, char);

arg1-5 and arg7 will be passed in registers, and arg6 on the stack. Therefore, we should save/restore the arguments in the same order as the declaration of foo(). And the args used as ctx on the stack will look like this:

  reg_arg6   -- copy from regs
  stack_arg2 -- copy from stack
  stack_arg1
  reg_arg5   -- copy from regs
  reg_arg4
  reg_arg3
  reg_arg2
  reg_arg1

We use EMIT3_off32() or EMIT4() for "lea" and "sub". The range of the imm in "lea" and "sub" is [-128, 127] if EMIT4() is used; therefore, we use EMIT3_off32() instead if the imm is out of that range. It works well for FENTRY/FEXIT/MODIFY_RETURN. Signed-off-by: Menglong Dong <imagedong@tencent.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230713040738.1789742-3-imagedong@tencent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13  bpf, x86: save/restore regs with BPF_DW size  [Menglong Dong]
As we already reserve 8 bytes on the stack for each reg, it is ok to store/restore the regs in BPF_DW size. This will make the code in save_regs()/restore_regs() simpler. Signed-off-by: Menglong Dong <imagedong@tencent.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230713040738.1789742-2-imagedong@tencent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13  Merge tag 'trace-v6.5-rc1-3' of ...  [Linus Torvalds]
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix some missing-prototype warnings - Fix user events struct args (did not include size of struct) When creating a user event, the "struct" keyword is to denote that the size of the field will be passed in. But the parsing failed to handle this case. - Add selftest to struct sizes for user events - Fix sample code for direct trampolines. The sample code for direct trampolines attached to handle_mm_fault(). But the prototype changed and the direct trampoline sample code was not updated. Direct trampolines need to have the arguments correct otherwise they can fail or crash the system. - Remove unused ftrace_regs_caller_ret() prototype. - Quiet false positive of FORTIFY_SOURCE Due to backward compatibility, the structure used to save stack traces in the kernel had a fixed size of 8. This structure is exported to user space via the tracing format file. A change was made to allow more than 8 functions to be recorded, and user space now uses the size field to know how many functions are actually in the stack. But the structure still has a size of 8 (even though it points into the ring buffer, which has the required amount allocated to hold a full stack). This was fine until the fortifier noticed that the memcpy(&entry->caller, stack, size) was greater than the 8 functions and would complain at runtime about it. Hide this by using a pointer to the stack location on the ring buffer instead of using the address of the entry structure caller field. - Fix a deadloop in reading trace_pipe that was caused by a mismatch between ring_buffer_empty() returning false, which then asked to read the data, but the read code uses rb_num_of_entries() that returned zero, causing an infinite "retry". - Fix a warning caused by not using all pages allocated to store ftrace functions, where this can happen if the linker inserts a bunch of "NULL" entries, causing the accounting of how many pages are needed to be off. - Fix histogram synthetic event crashing when the start event is removed and the end event is still using a variable from it - Fix memory leak in freeing iter->temp in tracing_release_pipe() * tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix memory leak of iter->temp when reading trace_pipe tracing/histograms: Add histograms to hist_vars if they have referenced variables tracing: Stop FORTIFY_SOURCE complaining about stack trace caller ftrace: Fix possible warning on checking all pages used in ftrace_process_locs() ring-buffer: Fix deadloop issue on reading trace_pipe tracing: arm64: Avoid missing-prototype warnings selftests/user_events: Test struct size match cases tracing/user_events: Fix struct arg size match check x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret() arm64: ftrace: Add direct call trampoline samples support samples: ftrace: Save required argument registers in sample trampolines
2023-07-13  Merge tag 'for-linus-6.5-rc2-tag' of ...  [Linus Torvalds]
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - a cleanup of the Xen related ELF-notes - a fix for virtio handling in Xen dom0 when running Xen in a VM * tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parent x86/Xen: tidy xen-head.S
2023-07-13  x86/PCI: Use struct_size() in pirq_convert_irt_table()  [Christophe JAILLET]
Use struct_size() instead of hand-writing it. It is less verbose, more robust and more informative. Link: https://lore.kernel.org/r/00a5cc2cd322e7dea26579916ac6dda9c637aa57.1684518118.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
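To illustrate what the helper replaces, here is a minimal userspace re-creation (the struct layout is hypothetical and the macro is a simplified stand-in; the kernel's struct_size() from <linux/overflow.h> additionally saturates on arithmetic overflow):

  #include <stdio.h>
  #include <stddef.h>

  /* Hypothetical table with a flexible array member, for illustration only. */
  struct irq_routing_table {
          unsigned int size;
          struct { unsigned char bus, devfn; } slots[];
  };

  /* Simplified stand-in for the kernel's struct_size(); the real helper
   * also checks the multiplication and addition for overflow. */
  #define struct_size_demo(p, member, count) \
          (sizeof(*(p)) + (count) * sizeof(*(p)->member))

  int main(void)
  {
          struct irq_routing_table *rt = NULL;

          /* instead of hand-writing: sizeof(*rt) + 8 * sizeof(rt->slots[0]) */
          printf("%zu bytes for 8 slots\n", struct_size_demo(rt, slots, 8));
          return 0;
  }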
2023-07-13  x86/sched: Enable cluster scheduling on Hybrid  [Peter Zijlstra]
With the SMT vs non-SMT balancing issues sorted, also enable the cluster domain for Hybrid machines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-07-11  x86/mm: Teach pte_mkwrite() about stack memory  [Rick Edgecombe]
If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So when it is made writable with pte_mkwrite(), it should create shadow stack memory, not conventionally writable memory. Now that all the places where shadow stack memory might be created pass a VMA into pte_mkwrite(), it can know when it should do this. So make pte_mkwrite() create shadow stack memory when the VMA has the VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite(). This requires referencing VM_SHADOW_STACK in these functions, which are currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK is located) can't be pulled in without causing problems for files that reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where they can safely reference VM_SHADOW_STACK. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Deepak Gupta <debug@rivosinc.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-22-rick.p.edgecombe%40intel.com
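In rough outline, the resulting helper behaves like the sketch below (hedged: the shadow-stack variant's name and exact placement are assumptions based on the changelog, not a verbatim copy of arch/x86/mm/pgtable.c):

  pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
  {
          if (vma->vm_flags & VM_SHADOW_STACK)
                  return pte_mkwrite_shstk(pte);  /* Write=0,Dirty=1 shadow stack encoding */

          return pte_mkwrite_novma(pte);          /* conventionally writable memory */
  }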
2023-07-11  x86/mm: Introduce MAP_ABOVE4G  [Rick Edgecombe]
The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which require some core mm changes to function properly. One of the properties is that the shadow stack pointer (SSP), which is a CPU register that points to the shadow stack like the stack pointer points to the stack, can't be pointing outside of the 32 bit address space when the CPU is executing in 32 bit mode. It is desirable to prevent executing in 32 bit mode when shadow stack is enabled because the kernel can't easily support 32 bit signals. On x86 it is possible to transition to 32 bit mode without any special interaction with the kernel, by doing a "far call" to a 32 bit segment. So the shadow stack implementation can use this address space behavior as a feature, by enforcing that shadow stack memory is always mapped outside of the 32 bit address space. This way userspace will trigger a general protection fault which will in turn trigger a segfault if it tries to transition to 32 bit mode with shadow stack enabled. This provides a clean error generating border for the user if they attempt to do 32 bit mode shadow stack, rather than leave the kernel in a half working state for userspace to be surprised by. So to allow future shadow stack enabling patches to map shadow stacks out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior is pretty much like MAP_32BIT, except that it has the opposite address range. There are a few differences though. If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit syscall. Since the default search behavior is top down, the normal kaslr base can be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its own randomization in the bottom up case. For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G both are potentially valid, so both are used. In the bottomup search path, the default behavior is already consistent with MAP_ABOVE4G since mmap base should be above 4GB. Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB. So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode with shadow stack enabled would usually segfault anyway. This is already pretty decent guard rails. But the addition of MAP_ABOVE4G is some small complexity spent to make it more complete. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-21-rick.p.edgecombe%40intel.com
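A minimal userspace sketch of how the new flag is intended to be used (the numeric fallback value matches the x86 uapi header at the time of writing but is an assumption here; prefer the system headers over hard-coding it in real code):

  #include <stdio.h>
  #include <sys/mman.h>

  #ifndef MAP_ABOVE4G
  #define MAP_ABOVE4G 0x8000000   /* x86-specific; value assumed from the uapi header */
  #endif

  int main(void)
  {
          /* Request an anonymous mapping constrained to land above 4 GiB. */
          void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_ABOVE4G, -1, 0);
          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }
          printf("mapped at %p (expected at or above 0x100000000)\n", p);
          return 0;
  }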
2023-07-11  x86/mm: Warn if create Write=0,Dirty=1 with raw prot  [Rick Edgecombe]
When user shadow stack is in use, Write=0,Dirty=1 is treated by the CPU as shadow stack memory. So for shadow stack memory this bit combination is valid, but when Dirty=1,Write=1 (conventionally writable) memory is being write protected, the kernel has been taught to transition the Dirty=1 bit to SavedDirty=1, to avoid inadvertently creating shadow stack memory. It does this inside pte_wrprotect() because it knows the PTE is not intended to be a writable shadow stack entry, it is supposed to be write protected. However, when a PTE is created by a raw prot using mk_pte(), mk_pte() can't know whether to adjust Dirty=1 to SavedDirty=1. It can't distinguish between the caller intending to create a shadow stack PTE or needing the SavedDirty shift. The kernel has been updated to not do this, and so Write=0,Dirty=1 memory should only be created by the pte_mkfoo() helpers. Add a warning to make sure no new mk_pte() users start doing this, like, for example, set_memory_rox() did. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-19-rick.p.edgecombe%40intel.com
2023-07-11  mm: Warn on shadow stack memory in wrong vma  [Rick Edgecombe]
The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One sharp edge is that PTEs that are both Write=0 and Dirty=1 are treated as shadow stack by the CPU, but this combination used to be created by the kernel on x86. Previous patches have changed the kernel to now avoid creating these PTEs unless they are for shadow stack memory. In case any missed corners of the kernel are still creating PTEs like this for non-shadow stack memory, and to catch any re-introductions of the logic, warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow stack VMAs when they are being zapped. This won't catch transient cases but should have decent coverage. In order to check if a PTE is shadow stack in core mm code, add two arch breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack specific code to be kept in arch/x86. Only do the check if shadow stack is supported by the CPU and configured, because in rare cases older CPUs may write Dirty=1 to a Write=0 PTE. This check is handled in pte_shstk()/pmd_shstk(). Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-18-rick.p.edgecombe%40intel.com
2023-07-11  x86/mm: Check shadow stack page fault errors  [Rick Edgecombe]
The CPU performs "shadow stack accesses" when it expects to encounter shadow stack mappings. These accesses can be implicit (via CALL/RET instructions) or explicit (instructions like WRSS). Shadow stack accesses to shadow-stack mappings can result in faults in normal, valid operation just like regular accesses to regular mappings. Shadow stacks need some of the same features like delayed allocation, swap and copy-on-write. The kernel needs to use faults to implement those features. The architecture has concepts of both shadow stack reads and shadow stack writes. Any shadow stack access to non-shadow stack memory will generate a fault with the shadow stack error code bit set. This means that, unlike normal write protection, the fault handler needs to create a type of memory that can be written to (with instructions that generate shadow stack writes), even to fulfill a read access. So in the case of COW memory, the COW needs to take place even with a shadow stack read. Otherwise the page will be left (shadow stack) writable in userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE for shadow stack accesses, even if the access was a shadow stack read. For the purpose of making this clearer, consider the following example. If a process has a shadow stack, and forks, the shadow stack PTEs will become read-only due to COW. If the CPU in one process performs a shadow stack read access to the shadow stack, for example executing a RET and causing the CPU to read the shadow stack copy of the return address, then in order for the fault to be resolved the PTE will need to be set with shadow stack permissions. But then the memory would be changeable from userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger COW, otherwise the shared page would be changeable from both processes. Shadow stack accesses can also result in errors, such as when a shadow stack overflows, or if a shadow stack access occurs to a non-shadow-stack mapping. Also, generate the errors for invalid shadow stack accesses. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-16-rick.p.edgecombe%40intel.com
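In fault-handler terms, the behaviour described above boils down to something like the following sketch (the error-code bit and flag names are assumptions based on the changelog and the architectural naming, not verbatim kernel code):

  /* sketch: flag setup in the x86 #PF handler */
  if (error_code & X86_PF_SHSTK)
          flags |= FAULT_FLAG_WRITE;      /* even a shadow stack *read* must trigger COW */
  if (error_code & X86_PF_WRITE)
          flags |= FAULT_FLAG_WRITE;
  if (error_code & X86_PF_INSTR)
          flags |= FAULT_FLAG_INSTRUCTION;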
2023-07-11  x86/mm: Remove _PAGE_DIRTY from kernel RO pages  [Rick Edgecombe]
New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as shadow stack pages. In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1 if HW dirty tracking is not needed, because if the Dirty bit is not already set the CPU has to set Dirty=1 when the memory gets written to. This creates additional work for the CPU. So traditional wisdom was to simply set the Dirty bit whenever you didn't care about it. However, it was never really very helpful for read-only kernel memory. When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any functional reason. But having Write=0,Dirty=1 kernel memory doesn't have any functional benefit either, so to reduce ambiguity between shadow stack and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-14-rick.p.edgecombe%40intel.com
2023-07-11  x86/mm: Start actually marking _PAGE_SAVED_DIRTY  [Rick Edgecombe]
The recently introduced _PAGE_SAVED_DIRTY should be used instead of the HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply the same changes to pmd and pud. Since there is no x86 version of pte_mkwrite() to hold this arch specific logic, create one. Add it to x86/mm/pgtable.c instead of x86/asm/include/pgtable.h as future patches will require it to live in pgtable.c and it will make the diff easier for reviewers. Since CPUs without shadow stack support could create Write=0,Dirty=1 PTEs, only return true for pte_shstk() if the CPU also supports shadow stack. This will prevent these HW-created PTEs from showing as true for pte_write(). For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which was not necessarily created with any of the existing PTE bit helpers. That means that it can return a pte_t with Write=0,Dirty=1, a shadow stack PTE, when it did not intend to create one. Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid: 1. Marking Write=0 PTEs Dirty=1 2. Marking Dirty=1 PTEs Write=0 The first case cannot happen as the existing behavior of pte_modify() is to filter out any Dirty bit passed in newprot. Handle the second case by shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1 if the PTE was write protected by the pte_modify() call. Apply the same changes to pmd_modify(). Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-13-rick.p.edgecombe%40intel.com
2023-07-11  x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY  [Rick Edgecombe]
When shadow stack is in use, Write=0,Dirty=1 PTEs are preserved for shadow stack. Copy-on-write PTEs then have Write=0,SavedDirty=1. When a PTE goes from Write=1,Dirty=1 to Write=0,SavedDirty=1, it could become a transient shadow stack PTE in two cases: 1. Some processors can start a write but end up seeing a Write=0 PTE by the time they get to the Dirty bit, creating a transient shadow stack PTE. However, this will not occur on processors supporting shadow stack, and a TLB flush is not necessary. 2. When _PAGE_DIRTY is replaced with _PAGE_SAVED_DIRTY non-atomically, a transient shadow stack PTE can be created as a result. Prevent the second case, when doing a write protection and Dirty->SavedDirty shift at the same time, with a CMPXCHG loop; the first case needs no extra handling since it does not occur on processors supporting shadow stack. Note, in the PAE case CMPXCHG will need to operate on 8 bytes, but try_cmpxchg() will not use CMPXCHG8B, so it cannot operate on a full PAE PTE. However the existing logic is not operating on a full 8 byte region either, and relies on the fact that the Write bit is in the first 4 bytes when doing the clear_bit(). Since the Dirty, SavedDirty and Write bits are all in the first 4 bytes, casting to a long will be similar to the existing behavior, which also casts to a long. Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many insights to the issue. Jann Horn provided the CMPXCHG solution. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-12-rick.p.edgecombe%40intel.com
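The CMPXCHG loop described above reads roughly as follows (a hedged sketch; the real ptep_set_wrprotect() also keeps the legacy clear_bit() path for CPUs without shadow stack and deals with the PAE caveat discussed above):

  pte_t old_pte, new_pte;

  old_pte = READ_ONCE(*ptep);
  do {
          /* clear Write and shift Dirty -> SavedDirty in one new value ... */
          new_pte = pte_wrprotect(old_pte);
          /* ... and publish it only if the PTE did not change under us */
  } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));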
2023-07-11  x86/mm: Introduce _PAGE_SAVED_DIRTY  [Rick Edgecombe]
Some OSes have a greater dependence on software available bits in PTEs than Linux. That left the hardware architects looking for a way to represent a new memory type (shadow stack) within the existing bits. They chose to repurpose a lightly-used state: Write=0,Dirty=1. So in order to support shadow stack memory, Linux should avoid creating memory with this PTE bit combination unless it intends for it to be shadow stack. The reason it's lightly used is that Dirty=1 is normally set by HW _before_ a write. A write with a Write=0 PTE would typically only generate a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and* generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which supports shadow stacks will no longer exhibit this oddity. So that leaves Write=0,Dirty=1 PTEs created in software. To avoid inadvertently created shadow stack memory, in places where Linux normally creates Write=0,Dirty=1, it can use the software-defined _PAGE_SAVED_DIRTY in place of the hardware _PAGE_DIRTY. In other words, whenever Linux needs to create Write=0,Dirty=1, it instead creates Write=0,SavedDirty=1 except for shadow stack, which is Write=0,Dirty=1. There are six bits left available to software in the 64-bit PTE after consuming a bit for _PAGE_SAVED_DIRTY. For 32 bit, the same bit as _PAGE_BIT_UFFD_WP is used, since user fault fd is not supported on 32 bit. This leaves one unused software bit on 32 bit (_PAGE_BIT_SOFT_DIRTY, as this is also not supported on 32 bit). Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to actually begin creating _PAGE_SAVED_DIRTY PTEs will follow once other pieces are in place. Since this SavedDirty shifting is done for all x86 CPUs, this leaves the possibility for the hardware oddity to still create Write=0,Dirty=1 PTEs in rare cases. Since these CPUs also don't support shadow stack, this will be harmless as it was before the introduction of SavedDirty. Implement the shifting logic to be branchless. Embed the logic of whether to do the shifting (including checking the Write bits) so that it can be called by future callers that would otherwise need additional branching logic. This efficiency allows the logic of when to do the shifting to be centralized, making the code easier to reason about. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-11-rick.p.edgecombe%40intel.com
2023-07-11  x86/mm: Move pmd_write(), pud_write() up in the file  [Rick Edgecombe]
To prepare the introduction of _PAGE_SAVED_DIRTY, move pmd_write() and pud_write() up in the file, so that they can be used by other helpers below. No functional changes. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-10-rick.p.edgecombe%40intel.com
2023-07-11  x86/cpufeatures: Add CPU feature flags for shadow stacks  [Rick Edgecombe]
The Control-Flow Enforcement Technology contains two related features, one of which is Shadow Stacks. Future patches will utilize this feature for shadow stack support in KVM, so add a CPU feature flag for Shadow Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]). To protect shadow stack state from malicious modification, the registers are only accessible in supervisor mode. This implementation context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend on XSAVES. The shadow stack feature, enumerated by the CPUID bit described above, encompasses both supervisor and userspace support for shadow stack. In near future patches, only userspace shadow stack will be enabled. In expectation of future supervisor shadow stack support, create a software CPU capability to enumerate kernel utilization of userspace shadow stack support. This user shadow stack bit should depend on the HW "shstk" capability and that logic will be implemented in future patches. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-9-rick.p.edgecombe%40intel.com
2023-07-11  x86/traps: Move control protection handler to separate file  [Rick Edgecombe]
Today the control protection handler is defined in traps.c and used only for the kernel IBT feature. To reduce ifdeffery, move it to its own file. In future patches, functionality will be added to make this handler also handle user shadow stack faults. So name the file cet.c. No functional change. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-8-rick.p.edgecombe%40intel.com
2023-07-11  x86/shstk: Add Kconfig option for shadow stack  [Rick Edgecombe]
Shadow stack provides protection for applications against function return address corruption. It is active when the processor supports it, the kernel has CONFIG_X86_SHADOW_STACK enabled, and the application is built for the feature. This is only implemented for the 64-bit kernel. When it is enabled, legacy non-shadow stack applications continue to work, but without protection. Since there is another feature that utilizes CET (Kernel IBT) that will share implementation with shadow stacks, create CONFIG_CET to signify that at least one CET feature is configured. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-7-rick.p.edgecombe%40intel.com
2023-07-11  mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()  [Rick Edgecombe]
The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Earlier work did the first step, so next move the callers that don't have a VMA to pte_mkwrite_novma(). Also do the same for pmd_mkwrite(). This will be ok for the shadow stack feature, as these callers are on kernel memory which will not need to be made shadow stack, and the other architectures only currently support one type of memory in pte_mkwrite() Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/20230613001108.3040476-3-rick.p.edgecombe%40intel.com
2023-07-11  mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()  [Rick Edgecombe]
The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). The goal is to make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(), create a new arch config HAS_HUGE_PAGE that can be used to tell if pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the compiler would not be able to find pmd_mkwrite_novma(). No functional change. Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/ Link: https://lore.kernel.org/all/20230613001108.3040476-2-rick.p.edgecombe%40intel.com
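The generic side of this first step amounts to a fallback wrapper along these lines (a sketch; the guard names follow the changelog's wording and are assumptions, not necessarily the final header):

  /* include/linux/pgtable.h (sketch): archs that override this define the macro */
  #ifndef pte_mkwrite
  static inline pte_t pte_mkwrite(pte_t pte)
  {
          return pte_mkwrite_novma(pte);
  }
  #endif

  #ifdef CONFIG_HAS_HUGE_PAGE   /* per the changelog; config name assumed */
  #ifndef pmd_mkwrite
  static inline pmd_t pmd_mkwrite(pmd_t pmd)
  {
          return pmd_mkwrite_novma(pmd);
  }
  #endif
  #endif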
2023-07-11  x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y  [Ingo Molnar]
poison_cfi() was introduced in: 9831c6253ace ("x86/cfi: Extend ENDBR sealing to kCFI") ... but it's only ever used under CONFIG_X86_KERNEL_IBT=y, and if that option is disabled, we get: arch/x86/kernel/alternative.c:1243:13: error: ‘poison_cfi’ defined but not used [-Werror=unused-function] Guard the definition with CONFIG_X86_KERNEL_IBT. Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-10  x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()  [YueHaibing]
This is now unused, so can remove it. Link: https://lore.kernel.org/linux-trace-kernel/20230623091640.21952-1-yuehaibing@huawei.com Cc: <mark.rutland@arm.com> Cc: <tglx@linutronix.de> Cc: <mingo@redhat.com> Cc: <bp@alien8.de> Cc: <dave.hansen@linux.intel.com> Cc: <x86@kernel.org> Cc: <hpa@zytor.com> Cc: <peterz@infradead.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-10  locking/arch: Avoid variable shadowing in local_try_cmpxchg()  [Uros Bizjak]
Several architectures define the arch_try_local_cmpxchg macro using internal temporary variables named ___old, __old or _old. Remove the temporary variable in local_try_cmpxchg() to avoid variable shadowing. No functional change intended. Fixes: d994f2c8e241 ("locking/arch: Wire up local_try_cmpxchg()") Closes: https://lore.kernel.org/lkml/CAFGhKbyxtuk=LoW-E3yLXgcmR93m+Dfo5-u9oQA_YC5Fcy_t9g@mail.gmail.com/ Reported-by: Charlemagne Lasse <charlemagnelasse@gmail.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230708090048.63046-1-ubizjak@gmail.com
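The hazard is easy to reproduce outside the kernel; the toy macros below (illustrative only, GNU C statement expressions) show how an outer macro's __old is shadowed by an inner macro's __old, leaving the inner initializer self-referential. Build with -Wshadow to see the warning:

  #include <stdio.h>

  #define inner_cmpxchg(p, o, n) ({                \
          __typeof__(*(p)) __old = (o);            \
          int __ok = (*(p) == __old);              \
          if (__ok)                                \
                  *(p) = (n);                      \
          __ok;                                    \
  })

  #define outer_try_cmpxchg(p, op, n) ({           \
          __typeof__(*(p)) __old = *(op);          \
          inner_cmpxchg(p, __old, n);  /* inner __old shadows this one */ \
  })

  int main(void)
  {
          int v = 1, expected = 1;

          /* Expected to succeed, but the shadowed __old makes it undefined. */
          printf("exchanged: %d, v = %d\n", outer_try_cmpxchg(&v, &expected, 2), v);
          return 0;
  }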
2023-07-10  perf/x86: Use local64_try_cmpxchg  [Uros Bizjak]
Use local64_try_cmpxchg() instead of the local64_cmpxchg(*ptr, old, new) == old pattern. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after cmpxchg (and the related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns the old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. No functional change intended. Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230706141720.2672-1-ubizjak@gmail.com
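The pattern change can be illustrated in plain userspace C (the kernel's local64_* API is not usable here, so GCC/Clang __atomic builtins stand in for it; field names are illustrative):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  static int64_t prev_count;

  static int64_t cmpxchg64(int64_t *p, int64_t old, int64_t new)
  {
          __atomic_compare_exchange_n(p, &old, new, false,
                                      __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
          return old;   /* returns the value found, like local64_cmpxchg() */
  }

  static bool try_cmpxchg64(int64_t *p, int64_t *old, int64_t new)
  {
          /* on failure *old is refreshed, like local64_try_cmpxchg() */
          return __atomic_compare_exchange_n(p, old, new, false,
                                             __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  }

  int main(void)
  {
          int64_t prev, new;

          /* old pattern: explicit compare, re-read on every retry */
          do {
                  prev = __atomic_load_n(&prev_count, __ATOMIC_SEQ_CST);
                  new = prev + 1;
          } while (cmpxchg64(&prev_count, prev, new) != prev);

          /* new pattern: no explicit compare, no re-read on failure */
          prev = __atomic_load_n(&prev_count, __ATOMIC_SEQ_CST);
          do {
                  new = prev + 1;
          } while (!try_cmpxchg64(&prev_count, &prev, new));

          printf("prev_count = %lld\n", (long long)prev_count);
          return 0;
  }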
2023-07-10  perf/amd: Prevent grouping of IBS events  [Ravi Bangoria]
IBS PMUs can have only one event active at any point in time. Restrict grouping of multiple IBS events. Reported-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230620091603.269-1-ravi.bangoria@amd.com
2023-07-10  x86/fineibt: Poison ENDBR at +0  [Peter Zijlstra]
Alyssa noticed that when building the kernel with CFI_CLANG+IBT and booting on IBT enabled hardware to obtain FineIBT, the indirect functions look like:

  __cfi_foo:
      endbr64
      subl    $hash, %r10d
      jz      1f
      ud2
      nop
  1:
  foo:
      endbr64

This is because the compiler generates code for kCFI+IBT. In that case the caller does the hash check and will jump to +0, so there must be an ENDBR there. The compiler doesn't know about FineIBT at all; also it is possible to actually use kCFI+IBT when booting with 'cfi=kcfi' on IBT enabled hardware. Having this second ENDBR however makes it possible to elide the CFI check. Therefore, we should poison this second ENDBR when switching to FineIBT mode. Fixes: 931ab63664f0 ("x86/ibt: Implement FineIBT") Reported-by: "Milburn, Alyssa" <alyssa.milburn@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20230615193722.194131053@infradead.org
2023-07-10  x86: Rewrite ret_from_fork() in C  [Brian Gerst]
When kCFI is enabled, special handling is needed for the indirect call to the kernel thread function. Rewrite the ret_from_fork() function in C so that the compiler can properly handle the indirect call. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230623225529.34590-3-brgerst@gmail.com
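A hedged sketch of what the C version plausibly looks like (the exact signature, helper names and the thin asm wrapper are assumptions based on the changelog, not a verbatim copy of the patch):

  __visible void ret_from_fork(struct task_struct *prev, struct pt_regs *regs,
                               int (*fn)(void *), void *fn_arg)
  {
          schedule_tail(prev);

          /* fn is non-NULL only for kernel threads */
          if (unlikely(fn))
                  fn(fn_arg);   /* may return after a successful kernel_execve() */

          syscall_exit_to_user_mode(regs);
  }

With the indirect call fn(fn_arg) now in C, the compiler emits the proper kCFI check sequence for it instead of the raw call the old assembly stub used.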
2023-07-10  x86/32: Remove schedule_tail_wrapper()  [Brian Gerst]
The unwinder expects a return address at the very top of the kernel stack just below pt_regs and before any stack frame is created. Instead of calling a wrapper, set up a return address as if ret_from_fork() was called from the syscall entry code. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230623225529.34590-2-brgerst@gmail.com
2023-07-10  x86/cfi: Extend ENDBR sealing to kCFI  [Peter Zijlstra]
Kees noted that IBT sealing could be extended to kCFI. Fundamentally it is the list of functions that do not have their address taken and are thus never called indirectly. It doesn't matter that objtool uses IBT infrastructure to determine this list, once we have it it can also be used to clobber kCFI hashes and avoid kCFI indirect calls. Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230622144321.494426891%40infradead.org
2023-07-10  x86/alternative: Rename apply_ibt_endbr()  [Peter Zijlstra]
The current name doesn't reflect what it does very well. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230622144321.427441595%40infradead.org
2023-07-10  x86/cfi: Extend {JMP,CAKK}_NOSPEC comment  [Peter Zijlstra]
With the introduction of kCFI these helpers are no longer equivalent to C indirect calls and should be used with care. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lkml.kernel.org/r/20230622144321.360957723%40infradead.org
2023-07-10  perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR  [Namhyung Kim]
On SPR, the load latency event needs an auxiliary event in the same group to work properly. There's a check in intel_pmu_hw_config() for this which iterates sibling events to find a mem-loads-aux event. The for_each_sibling_event() has a lockdep assert to make sure it runs with hardirqs disabled or leader->ctx->mutex held. This works well if the given event has a separate leader event, since perf_try_init_event() grabs the leader->ctx->mutex to protect the sibling list. But it can cause a problem when the event itself is a leader, since the event is not initialized yet and there's no ctx for the event. Actually I got a lockdep warning when I ran the below command on SPR, but I guess it could be a NULL pointer dereference.

  $ perf record -d -e cpu/mem-loads/uP true

The code path to the warning is:

  sys_perf_event_open()
    perf_event_alloc()
      perf_init_event()
        perf_try_init_event()
          x86_pmu_event_init()
            hsw_hw_config()
              intel_pmu_hw_config()
                for_each_sibling_event()
                  lockdep_assert_event_ctx()

We don't need for_each_sibling_event() when it's a standalone event. Let's return the error code directly. Fixes: f3c0eba28704 ("perf: Add a few assertions") Reported-by: Greg Thelen <gthelen@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20230704181516.3293665-1-namhyung@kernel.org
2023-07-09  Merge tag 'x86_urgent_for_v6.5_rc1' of ...  [Linus Torvalds]
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu fix from Borislav Petkov: - Do FPU AP initialization on Xen PV too which got missed by the recent boot reordering work * tag 'x86_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/xen: Fix secondary processors' FPU initialization
2023-07-09  Merge tag 'x86-core-2023-07-09' of ...  [Linus Torvalds]
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "A single fix for the mechanism to park CPUs with an INIT IPI. On shutdown or kexec, the kernel tries to park the non-boot CPUs with an INIT IPI. But the same code path is also used by the crash utility. If the CPU which panics is not the boot CPU then it sends an INIT IPI to the boot CPU which resets the machine. Prevent this by validating that the CPU which runs the stop mechanism is the boot CPU. If not, leave the other CPUs in HLT" * tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/smp: Don't send INIT to boot CPU