summaryrefslogtreecommitdiff
path: root/arch/x86/coco/tdx/tdx.c
AgeCommit message (Collapse)Author
2023-12-07x86/tdx: Allow 32-bit emulation by defaultKirill A. Shutemov
32-bit emulation was disabled on TDX to prevent a possible attack by a VMM injecting an interrupt on vector 0x80. Now that int80_emulation() has a check for external interrupts the limitation can be lifted. To distinguish software interrupts from external ones, int80_emulation() checks the APIC ISR bit relevant to the 0x80 vector. For software interrupts, this bit will be 0. On TDX, the VAPIC state (including ISR) is protected and cannot be manipulated by the VMM. The ISR bit is set by the microcode flow during the handling of posted interrupts. [ dhansen: more changelog tweaks ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@vger.kernel.org> # v6.0+
2023-12-07x86/coco: Disable 32-bit emulation by default on TDX and SEVKirill A. Shutemov
The INT 0x80 instruction is used for 32-bit x86 Linux syscalls. The kernel expects to receive a software interrupt as a result of the INT 0x80 instruction. However, an external interrupt on the same vector triggers the same handler. The kernel interprets an external interrupt on vector 0x80 as a 32-bit system call that came from userspace. A VMM can inject external interrupts on any arbitrary vector at any time. This remains true even for TDX and SEV guests where the VMM is untrusted. Put together, this allows an untrusted VMM to trigger int80 syscall handling at any given point. The content of the guest register file at that moment defines what syscall is triggered and its arguments. It opens the guest OS to manipulation from the VMM side. Disable 32-bit emulation by default for TDX and SEV. User can override it with the ia32_emulation=y command line option. [ dhansen: reword the changelog ] Reported-by: Supraja Sridhara <supraja.sridhara@inf.ethz.ch> Reported-by: Benedict Schlüter <benedict.schlueter@inf.ethz.ch> Reported-by: Mark Kuhne <mark.kuhne@inf.ethz.ch> Reported-by: Andrin Bertschi <andrin.bertschi@inf.ethz.ch> Reported-by: Shweta Shinde <shweta.shinde@inf.ethz.ch> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@vger.kernel.org> # v6.0+: 1da5c9b x86: Introduce ia32_enabled() Cc: <stable@vger.kernel.org> # v6.0+
2023-11-04Merge tag 'tsm-for-6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux Pull unified attestation reporting from Dan Williams: "In an ideal world there would be a cross-vendor standard attestation report format for confidential guests along with a common device definition to act as the transport. In the real world the situation ended up with multiple platform vendors inventing their own attestation report formats with the SEV-SNP implementation being a first mover to define a custom sev-guest character device and corresponding ioctl(). Later, this configfs-tsm proposal intercepted an attempt to add a tdx-guest character device and a corresponding new ioctl(). It also anticipated ARM and RISC-V showing up with more chardevs and more ioctls(). The proposal takes for granted that Linux tolerates the vendor report format differentiation until a standard arrives. From talking with folks involved, it sounds like that standardization work is unlikely to resolve anytime soon. It also takes the position that kernfs ABIs are easier to maintain than ioctl(). The result is a shared configfs mechanism to return per-vendor report-blobs with the option to later support a standard when that arrives. Part of the goal here also is to get the community into the "uncomfortable, but beneficial to the long term maintainability of the kernel" state of talking to each other about their differentiation and opportunities to collaborate. Think of this like the device-driver equivalent of the common memory-management infrastructure for confidential-computing being built up in KVM. As for establishing an "upstream path for cross-vendor confidential-computing device driver infrastructure" this is something I want to discuss at Plumbers. At present, the multiple vendor proposals for assigning devices to confidential computing VMs likely needs a new dedicated repository and maintainer team, but that is a discussion for v6.8. For now, Greg and Thomas have acked this approach and this is passing is AMD, Intel, and Google tests. Summary: - Introduce configfs-tsm as a shared ABI for confidential computing attestation reports - Convert sev-guest to additionally support configfs-tsm alongside its vendor specific ioctl() - Added signed attestation report retrieval to the tdx-guest driver forgoing a new vendor specific ioctl() - Misc cleanups and a new __free() annotation for kvfree()" * tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux: virt: tdx-guest: Add Quote generation support using TSM_REPORTS virt: sevguest: Add TSM_REPORTS support for SNP_GET_EXT_REPORT mm/slab: Add __free() support for kvfree virt: sevguest: Prep for kernel internal get_ext_report() configfs-tsm: Introduce a shared ABI for attestation reports virt: coco: Add a coco/Makefile and coco/Kconfig virt: sevguest: Fix passing a stack buffer as a scatterlist target
2023-11-01Merge tag 'x86_tdx_for_6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TDX updates from Dave Hansen: "The majority of this is a rework of the assembly and C wrappers that are used to talk to the TDX module and VMM. This is a nice cleanup in general but is also clearing the way for using this code when Linux is the TDX VMM. There are also some tidbits to make TDX guests play nicer with Hyper-V and to take advantage the hardware TSC. Summary: - Refactor and clean up TDX hypercall/module call infrastructure - Handle retrying/resuming page conversion hypercalls - Make sure to use the (shockingly) reliable TSC in TDX guests" [ TLA reminder: TDX is "Trust Domain Extensions", Intel's guest VM confidentiality technology ] * tag 'x86_tdx_for_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Mark TSC reliable x86/tdx: Fix __noreturn build warning around __tdx_hypercall_failed() x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP x86/virt/tdx: Wire up basic SEAMCALL functions x86/tdx: Remove 'struct tdx_hypercall_args' x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafs x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure x86/tdx: Rename __tdx_module_call() to __tdcall() x86/tdx: Make macros of TDCALLs consistent with the spec x86/tdx: Skip saving output regs when SEAMCALL fails with VMFailInvalid x86/tdx: Zero out the missing RSI in TDX_HYPERCALL macro x86/tdx: Retry partially-completed page conversion hypercalls
2023-10-19virt: tdx-guest: Add Quote generation support using TSM_REPORTSKuppuswamy Sathyanarayanan
In TDX guest, the attestation process is used to verify the TDX guest trustworthiness to other entities before provisioning secrets to the guest. The first step in the attestation process is TDREPORT generation, which involves getting the guest measurement data in the format of TDREPORT, which is further used to validate the authenticity of the TDX guest. TDREPORT by design is integrity-protected and can only be verified on the local machine. To support remote verification of the TDREPORT in a SGX-based attestation, the TDREPORT needs to be sent to the SGX Quoting Enclave (QE) to convert it to a remotely verifiable Quote. SGX QE by design can only run outside of the TDX guest (i.e. in a host process or in a normal VM) and guest can use communication channels like vsock or TCP/IP to send the TDREPORT to the QE. But for security concerns, the TDX guest may not support these communication channels. To handle such cases, TDX defines a GetQuote hypercall which can be used by the guest to request the host VMM to communicate with the SGX QE. More details about GetQuote hypercall can be found in TDX Guest-Host Communication Interface (GHCI) for Intel TDX 1.0, section titled "TDG.VP.VMCALL<GetQuote>". Trusted Security Module (TSM) [1] exposes a common ABI for Confidential Computing Guest platforms to get the measurement data via ConfigFS. Extend the TSM framework and add support to allow an attestation agent to get the TDX Quote data (included usage example below). report=/sys/kernel/config/tsm/report/report0 mkdir $report dd if=/dev/urandom bs=64 count=1 > $report/inblob hexdump -C $report/outblob rmdir $report GetQuote TDVMCALL requires TD guest pass a 4K aligned shared buffer with TDREPORT data as input, which is further used by the VMM to copy the TD Quote result after successful Quote generation. To create the shared buffer, allocate a large enough memory and mark it shared using set_memory_decrypted() in tdx_guest_init(). This buffer will be re-used for GetQuote requests in the TDX TSM handler. Although this method reserves a fixed chunk of memory for GetQuote requests, such one time allocation can help avoid memory fragmentation related allocation failures later in the uptime of the guest. Since the Quote generation process is not time-critical or frequently used, the current version uses a polling model for Quote requests and it also does not support parallel GetQuote requests. Link: https://lore.kernel.org/lkml/169342399185.3934343.3035845348326944519.stgit@dwillia2-xfh.jf.intel.com/ [1] Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Erdem Aktas <erdemaktas@google.com> Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Tested-by: Peter Gonda <pgonda@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-10-06x86/tdx: Mark TSC reliableKirill A. Shutemov
In x86 virtualization environments, including TDX, RDTSC instruction is handled without causing a VM exit, resulting in minimal overhead and jitters. On the other hand, other clock sources (such as HPET, ACPI timer, APIC, etc.) necessitate VM exits to implement, resulting in more fluctuating measurements compared to TSC. Thus, those clock sources are not effective for calibrating TSC. As a foundation, the host TSC is guaranteed to be invariant on any system which enumerates TDX support. TDX guests and the TDX module build on that foundation by enforcing: - Virtual TSC is monotonously incrementing for any single VCPU; - Virtual TSC values are consistent among all the TD’s VCPUs at the level supported by the CPU: + VMM is required to set the same TSC_ADJUST; + VMM must not modify from initial value of TSC_ADJUST before SEAMCALL; - The frequency is determined by TD configuration: + Virtual TSC frequency is specified by VMM on TDH.MNG.INIT; + Virtual TSC starts counting from 0 at TDH.MNG.INIT; The result is that a reliable TSC is a TDX architectural guarantee. Use the TSC as the only reliable clock source in TD guests, bypassing unstable calibration. This is similar to what the kernel already does in some VMWare and HyperV environments. [ dhansen: changelog tweaks ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Erdem Aktas <erdemaktas@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20231006144549.2633-1-kirill.shutemov%40linux.intel.com
2023-10-04x86/tdx: Replace deprecated strncpy() with strtomem_pad()Justin Stitt
strncpy() works perfectly here in all cases, however, it is deprecated and as such we should prefer more robust and less ambiguous string APIs: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings Let's use strtomem_pad() as this matches the functionality of strncpy() and is _not_ deprecated. Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://github.com/KSPP/linux/issues/90 Link: https://lore.kernel.org/r/20231003-strncpy-arch-x86-coco-tdx-tdx-c-v2-1-0bd21174a217@google.com
2023-09-12x86/tdx: Remove 'struct tdx_hypercall_args'Kai Huang
Now 'struct tdx_hypercall_args' is basically 'struct tdx_module_args' minus RCX. Although from __tdx_hypercall()'s perspective RCX isn't used as shared register thus not part of input/output registers, it's not worth to have a separate structure just due to one register. Remove the 'struct tdx_hypercall_args' and use 'struct tdx_module_args' instead in __tdx_hypercall() related code. This also saves the memory copy between the two structures within __tdx_hypercall(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/798dad5ce24e9d745cf0e16825b75ccc433ad065.1692096753.git.kai.huang%40intel.com
2023-09-12x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALLKai Huang
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro share similar code pattern too. The __tdx_hypercall() and __tdcall() should be unified to use the same assembly code. As a preparation to unify them, simplify the TDX_HYPERCALL to make it more like the TDX_MODULE_CALL. The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as function call argument, and does below extra things comparing to the TDX_MODULE_CALL: 1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally; 2) It sets RCX to the (fixed) bitmap of shared registers internally; 3) It calls __tdx_hypercall_failed() internally (and panics) when the TDCALL instruction itself fails; 4) After TDCALL, it moves R10 to RAX to return the return code of the VMCALL leaf, regardless the '\ret' asm macro argument; Firstly, change the TDX_HYPERCALL to take the same function call arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer to 'struct tdx_module_args'. Then 1) and 2) can be moved to the caller: - TDG.VP.VMCALL leaf ID can be passed via the function call argument; - 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus the bitmap of shared registers can be passed via RCX in the structure. Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL always save output registers to the structure. The caller then can: - Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error; - Return R10 in the structure as the return code of the VMCALL leaf; With above changes, change the asm function from __tdx_hypercall() to __tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper of it. This avoids having to add another wrapper of __tdx_hypercall() (_tdx_hypercall() is already taken). The __tdcall_hypercall() will be replaced with a __tdcall() variant using TDX_MODULE_CALL in a later commit as the final goal is to have one assembly to handle both TDCALL and TDVMCALL. Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep this unchanged, annotate __tdx_hypercall(), which is a C function now, as 'noinstr'. Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so. Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the compressed code. Opportunistically fix a checkpatch error complaining using space around parenthesis '(' and ')' while moving the bitmap of shared registers to <asm/shared/tdx.h>. [ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structureKai Huang
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and SEAMCALL, takes one parameter for each input register and an optional 'struct tdx_module_output' (a collection of output registers) as output. This is different from the TDX_HYPERCALL macro which uses a single 'struct tdx_hypercall_args' to carry all input/output registers. The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more input/output registers. Also, the TDH.VP.ENTER (which isn't covered by the current TDX_MODULE_CALL macro) basically can use all registers that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't extendible to cover those cases. Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro to use a single structure 'struct tdx_module_args' to carry all the input/output registers. Currently, R10/R11 are only used as output register but not as input by any TDCALL/SEAMCALL. Change to also use R10/R11 as input register to make input/output registers symmetric. Currently, the TDX_MODULE_CALL macro depends on the caller to pass a non-NULL 'struct tdx_module_output' to get additional output registers. Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to take a new 'ret' macro argument to indicate whether to save the output registers to the 'struct tdx_module_args'. Also introduce a new __tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret(). Note the tdcall(), which is a wrapper of __tdcall(), is called by three callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init(). The former two need the additional output but the last one doesn't. For simplicity, make tdcall() always call __tdcall_ret() to avoid another "_ret()" wrapper. The last caller tdx_early_init() isn't performance critical anyway. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Rename __tdx_module_call() to __tdcall()Kai Huang
__tdx_module_call() is only used by the TDX guest to issue TDCALL to the TDX module. Rename it to __tdcall() to match its behaviour, e.g., it cannot be used to make host-side SEAMCALL. Also rename tdx_module_call() which is a wrapper of __tdx_module_call() to tdcall(). No functional change intended. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/785d20d99fbcd0db8262c94da6423375422d8c75.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Make macros of TDCALLs consistent with the specKai Huang
The TDX spec names all TDCALLs with prefix "TDG". Currently, the kernel doesn't follow such convention for the macros of those TDCALLs but uses prefix "TDX_" for all of them. Although it's arguable whether the TDX spec names those TDCALLs properly, it's better for the kernel to follow the spec when naming those macros. Change all macros of TDCALLs to make them consistent with the spec. As a bonus, they get distinguished easily from the host-side SEAMCALLs, which all have prefix "TDH". No functional change intended. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/516dccd0bd8fb9a0b6af30d25bb2d971aa03d598.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Retry partially-completed page conversion hypercallsDexuan Cui
TDX guest memory is private by default and the VMM may not access it. However, in cases where the guest needs to share data with the VMM, the guest and the VMM can coordinate to make memory shared between them. The guest side of this protocol includes the "MapGPA" hypercall. This call takes a guest physical address range. The hypercall spec (aka. the GHCI) says that the MapGPA call is allowed to return partial progress in mapping this range and indicate that fact with a special error code. A guest that sees such partial progress is expected to retry the operation for the portion of the address range that was not completed. Hyper-V does this partial completion dance when set_memory_decrypted() is called to "decrypt" swiotlb bounce buffers that can be up to 1GB in size. It is evidently the only VMM that does this, which is why nobody noticed this until now. [ dhansen: rewrite changelog ] Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20230811021246.821-2-decui%40microsoft.com
2023-06-27Merge tag 'x86_sev_for_v6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Some SEV and CC platform helpers cleanup and simplifications now that the usage patterns are becoming apparent [ I'm sure I'm the only one that has gets confused by all the TLAs, but in case there are others: here SEV is AMD's "Secure Encrypted Virtualization" and CC is generic "Confidential Computing". There's also Intel SGX (Software Guard Extensions) and TDX (Trust Domain Extensions), along with all the vendor memory encryption extensions (SME, TSME, TME, and WTF). And then we have arm64 with RMA and CCA, and I probably forgot another dozen or so related acronyms - Linus ] * tag 'x86_sev_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/coco: Get rid of accessor functions x86/sev: Get rid of special sev_es_enable_key x86/coco: Mark cc_platform_has() and descendants noinstr
2023-06-26Merge tag 'x86_tdx_for_6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 tdx updates from Dave Hansen: - Fix a race window where load_unaligned_zeropad() could cause a fatal shutdown during TDX private<=>shared conversion The race has never been observed in practice but might allow load_unaligned_zeropad() to catch a TDX page in the middle of its conversion process which would lead to a fatal and unrecoverable guest shutdown. - Annotate sites where VM "exit reasons" are reused as hypercall numbers. * tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Fix enc_status_change_finish_noop() x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad() x86/mm: Allow guest.enc_status_change_prepare() to fail x86/tdx: Wrap exit reason with hcall_func()
2023-06-26Merge tag 'x86_cc_for_v6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 confidential computing update from Borislav Petkov: - Add support for unaccepted memory as specified in the UEFI spec v2.9. The gist of it all is that Intel TDX and AMD SEV-SNP confidential computing guests define the notion of accepting memory before using it and thus preventing a whole set of attacks against such guests like memory replay and the like. There are a couple of strategies of how memory should be accepted - the current implementation does an on-demand way of accepting. * tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: virt: sevguest: Add CONFIG_CRYPTO dependency x86/efi: Safely enable unaccepted memory in UEFI x86/sev: Add SNP-specific unaccepted memory support x86/sev: Use large PSC requests if applicable x86/sev: Allow for use of the early boot GHCB for PSC requests x86/sev: Put PSC struct on the stack in prep for unaccepted memory support x86/sev: Fix calculation of end address based on number of pages x86/tdx: Add unaccepted memory support x86/tdx: Refactor try_accept_one() x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory efi: Add unaccepted memory support x86/boot/compressed: Handle unaccepted memory efi/libstub: Implement support for unaccepted memory efi/x86: Get full memory map in allocate_e820() mm: Add support for unaccepted memory
2023-06-06x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()Kirill A. Shutemov
tl;dr: There is a race in the TDX private<=>shared conversion code which could kill the TDX guest. Fix it by changing conversion ordering to eliminate the window. TDX hardware maintains metadata to track which pages are private and shared. Additionally, TDX guests use the guest x86 page tables to specify whether a given mapping is intended to be private or shared. Bad things happen when the intent and metadata do not match. So there are two thing in play: 1. "the page" -- the physical TDX page metadata 2. "the mapping" -- the guest-controlled x86 page table intent For instance, an unrecoverable exit to VMM occurs if a guest touches a private mapping that points to a shared physical page. In summary: * Private mapping => Private Page == OK (obviously) * Shared mapping => Shared Page == OK (obviously) * Private mapping => Shared Page == BIG BOOM! * Shared mapping => Private Page == OK-ish (It will read generate a recoverable #VE via handle_mmio()) Enter load_unaligned_zeropad(). It can touch memory that is adjacent but otherwise unrelated to the memory it needs to touch. It will cause one of those unrecoverable exits (aka. BIG BOOM) if it blunders into a shared mapping pointing to a private page. This is a problem when __set_memory_enc_pgtable() converts pages from shared to private. It first changes the mapping and second modifies the TDX page metadata. It's moving from: * Shared mapping => Shared Page == OK to: * Private mapping => Shared Page == BIG BOOM! This means that there is a window with a shared mapping pointing to a private page where load_unaligned_zeropad() can strike. Add a TDX handler for guest.enc_status_change_prepare(). This converts the page from shared to private *before* the page becomes private. This ensures that there is never a private mapping to a shared page. Leave a guest.enc_status_change_finish() in place but only use it for private=>shared conversions. This will delay updating the TDX metadata marking the page private until *after* the mapping matches the metadata. This also ensures that there is never a private mapping to a shared page. [ dhansen: rewrite changelog ] Fixes: 7dbde7631629 ("x86/mm/cpa: Add support for TDX shared memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20230606095622.1939-3-kirill.shutemov%40linux.intel.com
2023-06-06x86/tdx: Add unaccepted memory supportKirill A. Shutemov
Hookup TDX-specific code to accept memory. Accepting the memory is done with ACCEPT_PAGE module call on every page in the range. MAP_GPA hypercall is not required as the unaccepted memory is considered private already. Extract the part of tdx_enc_status_changed() that does memory acceptance in a new helper. Move the helper tdx-shared.c. It is going to be used by both main kernel and decompressor. [ bp: Fix the INTEL_TDX_GUEST=y, KVM_GUEST=n build. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230606142637.5171-10-kirill.shutemov@linux.intel.com
2023-06-06x86/tdx: Refactor try_accept_one()Kirill A. Shutemov
Rework try_accept_one() to return accepted size instead of modifying 'start' inside the helper. It makes 'start' in-only argument and streamlines code on the caller side. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230606142637.5171-9-kirill.shutemov@linux.intel.com
2023-06-06x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stubKirill A. Shutemov
Memory acceptance requires a hypercall and one or multiple module calls. Make helpers for the calls available in boot stub. It has to accept memory where kernel image and initrd are placed. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230606142637.5171-8-kirill.shutemov@linux.intel.com
2023-05-31x86/smpboot: Fix the parallel bringup decisionThomas Gleixner
The decision to allow parallel bringup of secondary CPUs checks CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use parallel bootup because accessing the local APIC is intercepted and raises a #VC or #VE, which cannot be handled at that point. The check works correctly, but only for AMD encrypted guests. TDX does not set that flag. As there is no real connection between CC attributes and the inability to support parallel bringup, replace this with a generic control flag in x86_cpuinit and let SEV-ES and TDX init code disable it. Fixes: 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it") Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/87ilc9gd2d.ffs@tglx
2023-05-23x86/tdx: Wrap exit reason with hcall_func()Nikolay Borisov
TDX reuses VMEXIT "reasons" in its guest->host hypercall ABI. This is confusing because there might not be a VMEXIT involved at *all*. These instances are supposed to document situation and reduce confusion by wrapping VMEXIT reasons with hcall_func(). The decompression code does not follow this convention. Unify the TDX decompression code with the other TDX use of VMEXIT reasons. No functional changes. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20230505120332.1429957-1-nik.borisov%40suse.com
2023-05-09x86/coco: Get rid of accessor functionsBorislav Petkov (AMD)
cc_vendor is __ro_after_init and thus can be used directly. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230508121957.32341-1-bp@alien8.de
2023-03-22x86/tdx: Drop flags from __tdx_hypercall()Kirill A. Shutemov
After TDX_HCALL_ISSUE_STI got dropped, the only flag left is TDX_HCALL_HAS_OUTPUT. The flag indicates if the caller wants to see tdx_hypercall_args updated based on the hypercall output. Drop the flags and provide __tdx_hypercall_ret() that matches TDX_HCALL_HAS_OUTPUT semantics. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230321003511.9469-1-kirill.shutemov%40linux.intel.com
2023-02-25Merge tag 'x86_tdx_for_6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull Intel Trust Domain Extensions (TDX) updates from Dave Hansen: "Other than a minor fixup, the content here is to ensure that TDX guests never see virtualization exceptions (#VE's) that might be induced by the untrusted VMM. This is a highly desirable property. Without it, #VE exception handling would fall somewhere between NMIs, machine checks and total insanity. With it, #VE handling remains pretty mundane. Summary: - Fixup comment typo - Prevent unexpected #VE's from: - Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE) - Excessive #VE notifications (NOTIFY_ENABLES) which are delivered via a #VE" * tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall() x86/tdx: Disable NOTIFY_ENABLES x86/tdx: Relax SEPT_VE_DISABLE check for debug TD x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE x86/tdx: Expand __tdx_hypercall() to handle more arguments x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments x86/tdx: Add more registers to struct tdx_hypercall_args x86/tdx: Fix typo in comment in __tdx_hypercall()
2023-01-31Merge tag 'v6.2-rc6' into sched/core, to pick up fixesIngo Molnar
Pick up fixes before merging another batch of cpuidle updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-27x86/tdx: Disable NOTIFY_ENABLESKirill A. Shutemov
== Background == There is a class of side-channel attacks against SGX enclaves called "SGX Step"[1]. These attacks create lots of exceptions inside of enclaves. Basically, run an in-enclave instruction, cause an exception. Over and over. There is a concern that a VMM could attack a TDX guest in the same way by causing lots of #VE's. The TDX architecture includes new countermeasures for these attacks. It basically counts the number of exceptions and can send another *special* exception once the number of VMM-induced #VE's hits a critical threshold[2]. == Problem == But, these special exceptions are independent of any action that the guest takes. They can occur anywhere that the guest executes. This includes sensitive areas like the entry code. The (non-paranoid) #VE handler is incapable of handling exceptions in these areas. == Solution == Fortunately, the special exceptions can be disabled by the guest via write to NOTIFY_ENABLES TDCS field. NOTIFY_ENABLES is disabled by default, but might be enabled by a bootloader, firmware or an earlier kernel before the current kernel runs. Disable NOTIFY_ENABLES feature explicitly and unconditionally. Any NOTIFY_ENABLES-based #VE's that occur before this point will end up in the early #VE exception handler and die due to unexpected exit reason. [1] https://github.com/jovanbulck/sgx-step [2] https://intel.github.io/ccc-linux-guest-hardening-docs/security-spec.html#safety-against-ve-in-kernel-code Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-8-kirill.shutemov%40linux.intel.com
2023-01-27x86/tdx: Relax SEPT_VE_DISABLE check for debug TDKirill A. Shutemov
A "SEPT #VE" occurs when a TDX guest touches memory that is not properly mapped into the "secure EPT". This can be the result of hypervisor attacks or bugs, *OR* guest bugs. Most notably, buggy guests might touch unaccepted memory for lots of different memory safety bugs like buffer overflows. TDX guests do not want to continue in the face of hypervisor attacks or hypervisor bugs. They want to terminate as fast and safely as possible. SEPT_VE_DISABLE ensures that TDX guests *can't* continue in the face of these kinds of issues. But, that causes a problem. TDX guests that can't continue can't spit out oopses or other debugging info. In essence SEPT_VE_DISABLE=1 guests are not debuggable. Relax the SEPT_VE_DISABLE check to warning on debug TD and panic() in the #VE handler on EPT-violation on private memory. It will produce useful backtrace. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-7-kirill.shutemov%40linux.intel.com
2023-01-27x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLEKirill A. Shutemov
Linux TDX guests require that the SEPT_VE_DISABLE "attribute" be set. If it is not set, the kernel is theoretically required to handle exceptions anywhere that kernel memory is accessed, including places like NMI handlers and in the syscall entry gap. Rather than even try to handle these exceptions, the kernel refuses to run if SEPT_VE_DISABLE is unset. However, the SEPT_VE_DISABLE detection and refusal code happens very early in boot, even before earlyprintk runs. Calling panic() will effectively just hang the system. Instead, call a TDX-specific panic() function. This makes a very simple TDVMCALL which gets a short error string out to the hypervisor without any console infrastructure. Use TDG.VP.VMCALL<ReportFatalError> to report the error. The hypercall can encode message up to 64 bytes in eight registers. [ dhansen: tweak comment and remove while loop brackets. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-6-kirill.shutemov%40linux.intel.com
2023-01-13cpuidle, tdx: Make TDX code noinstr cleanPeter Zijlstra
objtool found a few cases where this code called out into instrumented code: vmlinux.o: warning: objtool: __halt+0x2c: call to hcall_func.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: __halt+0x3f: call to __tdx_hypercall() leaves .noinstr.text section vmlinux.o: warning: objtool: __tdx_hypercall+0x66: call to __tdx_hypercall_failed() leaves .noinstr.text section Fix it by: - moving TDX tdcall assembly methods into .noinstr.text (they are already noistr-clean) - marking __tdx_hypercall_failed() as 'noinstr' - annotating hcall_func() as __always_inline Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195541.111485720@infradead.org
2023-01-13x86/tdx: Remove TDX_HCALL_ISSUE_STIPeter Zijlstra
Now that arch_cpu_idle() is expected to return with IRQs disabled, avoid the useless STI/CLI dance. Per the specs this is supposed to work, but nobody has yet relied up this behaviour so broken implementations are possible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195540.682137572@infradead.org
2023-01-13arch/idle: Change arch_cpu_idle() behavior: always exit with IRQs disabledPeter Zijlstra
Current arch_cpu_idle() is called with IRQs disabled, but will return with IRQs enabled. However, the very first thing the generic code does after calling arch_cpu_idle() is raw_local_irq_disable(). This means that architectures that can idle with IRQs disabled end up doing a pointless 'enable-disable' dance. Therefore, push this IRQ disabling into the idle function, meaning that those architectures can avoid the pointless IRQ state flipping. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Acked-by: Mark Rutland <mark.rutland@arm.com> [arm64] Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195540.618076436@infradead.org
2023-01-03x86/insn: Avoid namespace clash by separating instruction decoder MMIO type ↵Jason A. Donenfeld
from MMIO trace type Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants, whose namespace overlaps. Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be used from the same source file. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com
2022-11-17x86/tdx: Add a wrapper to get TDREPORT0 from the TDX ModuleKuppuswamy Sathyanarayanan
To support TDX attestation, the TDX guest driver exposes an IOCTL interface to allow userspace to get the TDREPORT0 (a.k.a. TDREPORT subtype 0) from the TDX module via TDG.MR.TDREPORT TDCALL. In order to get the TDREPORT0 in the TDX guest driver, instead of using a low level function like __tdx_module_call(), add a tdx_mcall_get_report0() wrapper function to handle it. This is a preparatory patch for adding attestation support. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Wander Lairson Costa <wander@redhat.com> Link: https://lore.kernel.org/all/20221116223820.819090-2-sathyanarayanan.kuppuswamy%40linux.intel.com
2022-11-01x86/tdx: Panic on bad configs that #VE on "private" memory accessKirill A. Shutemov
All normal kernel memory is "TDX private memory". This includes everything from kernel stacks to kernel text. Handling exceptions on arbitrary accesses to kernel memory is essentially impossible because they can happen in horribly nasty places like kernel entry/exit. But, TDX hardware can theoretically _deliver_ a virtualization exception (#VE) on any access to private memory. But, it's not as bad as it sounds. TDX can be configured to never deliver these exceptions on private memory with a "TD attribute" called ATTR_SEPT_VE_DISABLE. The guest has no way to *set* this attribute, but it can check it. Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it is unset. There is no sane way for Linux to run with this attribute clear so a panic() is appropriate. There's small window during boot before the check where kernel has an early #VE handler. But the handler is only for port I/O and will also panic() as soon as it sees any other #VE, such as a one generated by a private memory access. [ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo(). Add Kirill's tested-by because I made changes since he wrote this. ] Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest") Reported-by: ruogui.ygr@alibaba-inc.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com
2022-11-01x86/tdx: Prepare for using "INFO" call for a second purposeDave Hansen
The TDG.VP.INFO TDCALL provides the guest with various details about the TDX system that the guest needs to run. Only one field is currently used: 'gpa_width' which tells the guest which PTE bits mark pages shared or private. A second field is now needed: the guest "TD attributes" to tell if virtualization exceptions are configured in a way that can harm the guest. Make the naming and calling convention more generic and discrete from the mask-centric one. Thanks to Sathya for the inspiration here, but there's no code, comments or changelogs left from where he started. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org
2022-06-17x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared pageKirill A. Shutemov
load_unaligned_zeropad() can lead to unwanted loads across page boundaries. The unwanted loads are typically harmless. But, they might be made to totally unrelated or even unmapped memory. load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now #VE) to recover from these unwanted loads. In TDX guests, the second page can be shared page and a VMM may configure it to trigger #VE. The kernel assumes that #VE on a shared page is an MMIO access and tries to decode instruction to handle it. In case of load_unaligned_zeropad() it may result in confusion as it is not MMIO access. Fix it by detecting split page MMIO accesses and failing them. load_unaligned_zeropad() will recover using exception fixups. The issue was discovered by analysis and reproduced artificially. It was not triggered during testing. [ dhansen: fix up changelogs and comments for grammar and clarity, plus incorporate Kirill's off-by-one fix] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-4-kirill.shutemov@linux.intel.com
2022-06-15x86/tdx: Clarify RIP adjustments in #VE handlerKirill A. Shutemov
After successful #VE handling, tdx_handle_virt_exception() has to move RIP to the next instruction. The handler needs to know the length of the instruction. If the #VE happened due to instruction execution, the GET_VEINFO TDX module call provides info on the instruction in R10, including its length. For #VE due to EPT violation, the info in R10 is not populand and the kernel must decode the instruction manually to find out its length. Restructure the code to make it explicit that the instruction length depends on the type of #VE. Make individual #VE handlers return the instruction length on success or -errno on failure. [ dhansen: fix up changelog and comments ] Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-3-kirill.shutemov@linux.intel.com
2022-06-15x86/tdx: Fix early #VE handlingKirill A. Shutemov
tdx_early_handle_ve() does not increment RIP after successfully handling the exception. That leads to infinite loop of exceptions. Move RIP when exceptions are successfully handled. [ dhansen: make problem statement more clear ] Fixes: 32e72854fa5f ("x86/tdx: Port I/O: Add early boot support") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-2-kirill.shutemov@linux.intel.com
2022-04-07x86/mm/cpa: Add support for TDX shared memoryKirill A. Shutemov
Intel TDX protects guest memory from VMM access. Any memory that is required for communication with the VMM must be explicitly shared. It is a two-step process: the guest sets the shared bit in the page table entry and notifies VMM about the change. The notification happens using MapGPA hypercall. Conversion back to private memory requires clearing the shared bit, notifying VMM with MapGPA hypercall following with accepting the memory with AcceptPage hypercall. Provide a TDX version of x86_platform.guest.* callbacks. It makes __set_memory_enc_pgtable() work right in TDX guest. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-27-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Wire up KVM hypercallsKuppuswamy Sathyanarayanan
KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI is similar, those instructions no longer function for TDX guests. Make vendor-specific TDVMCALLs instead of VMCALL. This enables TDX guests to run with KVM acting as the hypervisor. Among other things, KVM hypercall is used to send IPIs. Since the KVM driver can be built as a kernel module, export tdx_kvm_hypercall() to make the symbols visible to kvm.ko. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-20-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Port I/O: Add early boot supportAndi Kleen
TDX guests cannot do port I/O directly. The TDX module triggers a #VE exception to let the guest kernel emulate port I/O by converting them into TDCALLs to call the host. But before IDT handlers are set up, port I/O cannot be emulated using normal kernel #VE handlers. To support the #VE-based emulation during this boot window, add a minimal early #VE handler support in early exception handlers. This is similar to what AMD SEV does. This is mainly to support earlyprintk's serial driver, as well as potentially the VGA driver. The early handler only supports I/O-related #VE exceptions. Unhandled or failed exceptions will be handled via early_fixup_exceptions() (like normal exception failures). At runtime I/O-related #VE exceptions (along with other types) handled by virt_exception_kernel(). Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-19-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Port I/O: Add runtime hypercallsKuppuswamy Sathyanarayanan
TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Handle in-kernel MMIOKirill A. Shutemov
In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Handle CPUID via #VEKirill A. Shutemov
In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized by the TDX module while some trigger #VE. Implement the #VE handling for EXIT_REASON_CPUID by handing it through the hypercall, which in turn lets the TDX module handle it by invoking the host VMM. More details on CPUID Virtualization can be found in the TDX module specification, the section titled "CPUID Virtualization". Note that VMM that handles the hypercall is not trusted. It can return data that may steer the guest kernel in wrong direct. Only allow VMM to control range reserved for hypervisor communication. Return all-zeros for any CPUID outside the hypervisor range. It matches CPU behaviour for non-supported leaf. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-11-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Add MSR support for TDX guestsKirill A. Shutemov
Use hypercall to emulate MSR read/write for the TDX platform. There are two viable approaches for doing MSRs in a TD guest: 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal do. Some will succeed, others will cause a #VE. All of those that cause a #VE will be handled with a TDCALL. 2. Use paravirt infrastructure. The paravirt hook has to keep a list of which MSRs would cause a #VE and use a TDCALL. All other MSRs execute RDMSR/WRMSR instructions directly. The second option can be ruled out because the list of MSRs was challenging to maintain. That leaves option #1 as the only viable solution for the minimal TDX support. Kernel relies on the exception fixup machinery to handle MSR access errors. #VE handler uses the same exception fixup code as #GP. It covers MSR accesses along with other types of fixups. For performance-critical MSR writes (like TSC_DEADLINE), future patches will replace the WRMSR/#VE sequence with the direct TDCALL. RDMSR and WRMSR specification details can be found in Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec titled "TDG.VP. VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>". Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Add HLT support for TDX guestsKirill A. Shutemov
The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-07x86/traps: Add #VE support for TDX guestKirill A. Shutemov
Virtualization Exceptions (#VE) are delivered to TDX guests due to specific guest actions which may happen in either user space or the kernel: * Specific instructions (WBINVD, for example) * Specific MSR accesses * Specific CPUID leaf accesses * Access to specific guest physical addresses Syscall entry code has a critical window where the kernel stack is not yet set up. Any exception in this window leads to hard to debug issues and can be exploited for privilege escalation. Exceptions in the NMI entry code also cause issues. Returning from the exception handler with IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. For these reasons, the kernel avoids #VEs during the syscall gap and the NMI entry code. Entry code paths do not access TD-shared memory, MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves that might generate #VE. VMM can remove memory from TD at any point, but access to unaccepted (or missing) private memory leads to VM termination, not to #VE. Similarly to page faults and breakpoints, #VEs are allowed in NMI handlers once the kernel is ready to deal with nested NMIs. During #VE delivery, all interrupts, including NMIs, are blocked until TDGETVEINFO is called. It prevents #VE nesting until the kernel reads the VE info. TDGETVEINFO retrieves the #VE info from the TDX module, which also clears the "#VE valid" flag. This must be done before anything else as any #VE that occurs while the valid flag is set escalates to #DF by TDX module. It will result in an oops. Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be delivered until TDGETVEINFO is called. For now, convert unhandled #VE's (everything, until later in this series) so that they appear just like a #GP by calling the ve_raise_fault() directly. The ve_raise_fault() function is similar to #GP handler and is responsible for sending SIGSEGV to userspace and CPU die and notifying debuggers and other die chain users. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Exclude shared bit from __PHYSICAL_MASKKirill A. Shutemov
In TDX guests, by default memory is protected from host access. If a guest needs to communicate with the VMM (like the I/O use case), it uses a single bit in the physical address to communicate the protected/shared attribute of the given page. In the x86 ARCH code, __PHYSICAL_MASK macro represents the width of the physical address in the given architecture. It is used in creating physical PAGE_MASK for address bits in the kernel. Since in TDX guest, a single bit is used as metadata, it needs to be excluded from valid physical address bits to avoid using incorrect addresses bits in the kernel. Enable DYNAMIC_PHYSICAL_MASK to support updating the __PHYSICAL_MASK. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-6-kirill.shutemov@linux.intel.com
2022-04-07x86/tdx: Extend the confidential computing API to support TDX guestsKirill A. Shutemov
Confidential Computing (CC) features (like string I/O unroll support, memory encryption/decryption support, etc) are conditionally enabled in the kernel using cc_platform_has() API. Since TDX guests also need to use these CC features, extend cc_platform_has() API and add TDX guest-specific CC attributes support. CC API also provides an interface to deal with encryption mask. Extend it to cover TDX. Details about which bit in the page table entry to be used to indicate shared/private state is determined by using the TDINFO TDCALL. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20220405232939.73860-5-kirill.shutemov@linux.intel.com