summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-12-14x86/virt/tdx: Make TDX host depend on X86_MCEKai Huang
A build failure was reported that when INTEL_TDX_HOST is enabled but X86_MCE is not, the tdx_dump_mce_info() function fails to link: ld: vmlinux.o: in function `tdx_dump_mce_info': ...: undefined reference to `mce_is_memory_error' ...: undefined reference to `mce_usable_address' The reason is in such configuration, despite there's no caller of tdx_dump_mce_info() it is still built and there's no implementation for the two "mce_*()" functions. Make INTEL_TDX_HOST depend on X86_MCE to fix. It makes sense to enable MCE support for the TDX host anyway. Because the only way that TDX has to report integrity errors is an MCE, and it is not good to silently ignore such MCE. The TDX spec also suggests the host VMM is expected to implement the MCE handler. Note it also makes sense to make INTEL_TDX_HOST select X86_MCE but this generates "recursive dependency detected!" error in the Kconfig. Closes: https://lore.kernel.org/all/20231212214612.GHZXjUpBFa1IwVMTI7@fat_crate.local/T/ Fixes: 70060463cb2b ("x86/mce: Differentiate real hardware #MCs from TDX erratum ones") Reported-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231212214612.GHZXjUpBFa1IwVMTI7@fat_crate.local/T/#m1a109c29324b2bbd0b3b1d45c218012cd3a13be6 Link: https://lore.kernel.org/all/20231213222825.286809-1-kai.huang%40intel.com
2023-12-12x86/virt/tdx: Disable TDX host support when kexec is enabledDave Hansen
TDX host support currently lacks the ability to handle kexec. Disable TDX when kexec is enabled. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-20-dave.hansen%40intel.com
2023-12-12Documentation/x86: Add documentation for TDX host supportKai Huang
Add documentation for TDX host kernel support. There is already one file Documentation/x86/tdx.rst containing documentation for TDX guest internals. Also reuse it for TDX host kernel support. Introduce a new level menu "TDX Guest Support" and move existing materials under it, and add a new menu for TDX host kernel support. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-19-dave.hansen%40intel.com
2023-12-12x86/mce: Differentiate real hardware #MCs from TDX erratum onesKai Huang
The first few generations of TDX hardware have an erratum. Triggering it in Linux requires some kind of kernel bug involving relatively exotic memory writes to TDX private memory and will manifest via spurious-looking machine checks when reading the affected memory. Make an effort to detect these TDX-induced machine checks and spit out a new blurb to dmesg so folks do not think their hardware is failing. == Background == Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. == Problem == A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. To add insult to injury, the Linux machine code will present these as a literal "Hardware error" when they were, in fact, a software-triggered issue. == Solution == In the end, this issue is hard to trigger. Rather than do something rash (and incomplete) like unmap TDX private memory from the direct map, improve the machine check handler. Currently, the #MC handler doesn't distinguish whether the memory is TDX private memory or not but just dump, for instance, below message: [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0} ... [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii' [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] Kernel panic - not syncing: Fatal local machine check Which says "Hardware Error" and "Data load in unrecoverable area of kernel". Ideally, it's better for the log to say "software bug around TDX private memory" instead of "Hardware Error". But in reality the real hardware memory error can happen, and sadly such software-triggered #MC cannot be distinguished from the real hardware error. Also, the error message is used by userspace tool 'mcelog' to parse, so changing the output may break userspace. So keep the "Hardware Error". The "Data load in unrecoverable area of kernel" is also helpful, so keep it too. Instead of modifying above error log, improve the error log by printing additional TDX related message to make the log like: ... [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug. Adding this additional message requires determination of whether the memory page is TDX private memory. There is no existing infrastructure to do that. Add an interface to query the TDX module to fill this gap. == Impact == This issue requires some kind of kernel bug to trigger. TDX private memory should never be mapped UC/WC. A partial write originating from these mappings would require *two* bugs, first mapping the wrong page, then writing the wrong memory. It would also be detectable using traditional memory corruption techniques like DEBUG_PAGEALLOC. MOVNTI (and friends) could cause this issue with something like a simple buffer overrun or use-after-free on the direct map. It should also be detectable with normal debug techniques. The one place where this might get nasty would be if the CPU read data then wrote back the same data. That would trigger this problem but would not, for instance, set off mechanisms like slab redzoning because it doesn't actually corrupt data. With an IOMMU at least, the DMA exposure is similar to the UC/WC issue. TDX private memory would first need to be incorrectly mapped into the I/O space and then a later DMA to that mapping would actually cause the poisoning event. [ dhansen: changelog tweaks ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
2023-12-12x86/cpu: Detect TDX partial write machine check erratumKai Huang
TDX memory has integrity and confidentiality protections. Violations of this integrity protection are supposed to only affect TDX operations and are never supposed to affect the host kernel itself. In other words, the host kernel should never, itself, see machine checks induced by the TDX integrity hardware. Alas, the first few generations of TDX hardware have an erratum. A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. With this erratum, there are additional things need to be done. To prepare for those changes, add a CPU bug bit to indicate this erratum. Note this bug reflects the hardware thus it is detected regardless of whether the kernel is built with TDX support or not. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-17-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Handle TDX interaction with sleep and hibernationKai Huang
TDX is incompatible with hibernation and some ACPI sleep states. Users must disable hibernation to use TDX. Users must also disable TDX if they want to use ACPI S3 sleep. This feels a bit wonky and asymmetric, but it avoids adding any new command-line parameters for now. It can be improved if users hate it too much. Long version: TDX cannot survive from S3 and deeper states. The hardware resets and disables TDX completely when platform goes to S3 and deeper. Both TDX guests and the TDX module get destroyed permanently. The kernel uses S3 to support suspend-to-ram, and S4 or deeper states to support hibernation. The kernel also maintains TDX states to track whether it has been initialized and its metadata resource, etc. After resuming from S3 or hibernation, these TDX states won't be correct anymore. Theoretically, the kernel can do more complicated things like resetting TDX internal states and TDX module metadata before going to S3 or deeper, and re-initialize TDX module after resuming, etc, but there is no way to save/restore TDX guests for now. Until TDX supports full save and restore of TDX guests, there is no big value to handle TDX module in suspend and hibernation alone. To make things simple, just choose to make TDX mutually exclusive with S3 and hibernation. Note the TDX module is initialized at runtime. To avoid having to deal with the fuss of determining TDX state at runtime, just choose TDX vs S3 and hibernation at kernel early boot. It's a bad user experience if the choice of TDX and S3/hibernation is done at runtime anyway, i.e., the user can experience being able to do S3/hibernation but later becoming unable to due to TDX being enabled. Disable TDX in kernel early boot when hibernation support is available. Currently there's no mechanism exposed by the hibernation code to allow other kernel code to disable hibernation once for all. Users that want TDX must disable hibernation, like using hibername=no on the command line. Disable ACPI S3 when TDX is enabled by the BIOS. For now the user needs to disable TDX in the BIOS to use ACPI S3. A new kernel command line can be added in the future if there's a need to let user disable TDX host via kernel command line. Alternatively, the kernel could disable TDX when ACPI S3 is supported and request the user to disable S3 to use TDX. But there's no existing kernel command line to do that, and BIOS doesn't always have an option to disable S3. [ dhansen: subject / changelog tweaks ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-16-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Initialize all TDMRsKai Huang
After the global KeyID has been configured on all packages, initialize all TDMRs to make all TDX-usable memory regions that are passed to the TDX module become usable. This is the last step of initializing the TDX module. Initializing TDMRs can be time consuming on large memory systems as it involves initializing all metadata entries for all pages that can be used by TDX guests. Initializing different TDMRs can be parallelized. For now to keep it simple, just initialize all TDMRs one by one. It can be enhanced in the future. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-15-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Configure global KeyID on all packagesKai Huang
After the list of TDMRs and the global KeyID are configured to the TDX module, the kernel needs to configure the key of the global KeyID on all packages using TDH.SYS.KEY.CONFIG. This SEAMCALL cannot run parallel on different cpus. Loop all online cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of each package. To keep things simple, this implementation takes no affirmative steps to online cpus to make sure there's at least one cpu for each package. The callers (aka. KVM) can ensure success by ensuring sufficient CPUs are online for this to succeed. Intel hardware doesn't guarantee cache coherency across different KeyIDs. The PAMTs are transitioning from being used by the kernel mapping (KeyId 0) to the TDX module's "global KeyID" mapping. This means that the kernel must flush any dirty KeyID-0 PAMT cachelines before the TDX module uses the global KeyID to access the PAMTs. Otherwise, if those dirty cachelines were written back, they would corrupt the TDX module's metadata. Aside: This corruption would be detected by the memory integrity hardware on the next read of the memory with the global KeyID. The result would likely be fatal to the system but would not impact TDX security. Following the TDX module specification, flush cache before configuring the global KeyID on all packages. Given the PAMT size can be large (~1/256th of system RAM), just use WBINVD on all CPUs to flush. If TDH.SYS.KEY.CONFIG fails, the TDX module may already have "converted" some memory for TDX module use. Convert the memory back so that it can be safely used by the kernel again. Note that this is slower than it should be because of the "partial write machine check" erratum which affects TDX-capable hardware. Also refactor and introduce a new helper: tdmr_do_pamt_func(). This takes a TDMR and runs a function on its PAMT. It looks a _bit_ odd to pass a function pointer around like this, but its use is pretty narrow and it does eliminate what would otherwise be some copying and pasting. [ dhansen: * munge changelog as usual * remove weird (*pamd_func)() syntax ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-14-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Configure TDX module with the TDMRs and global KeyIDKai Huang
The TDX module uses a private KeyID as the "global KeyID" for mapping things like the PAMT and other TDX metadata. This KeyID has already been reserved when detecting TDX during the kernel early boot. Now that the "TD Memory Regions" (TDMRs) are fully built, pass them to the TDX module together with the global KeyID. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-13-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Designate reserved areas for all TDMRsKai Huang
As the last step of constructing TDMRs, populate reserved areas for all TDMRs. Cover all memory holes and PAMTs with a TMDR reserved area. [ dhansen: trim down chagnelog ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-12-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Allocate and set up PAMTs for TDMRsKai Huang
The TDX module uses additional metadata to record things like which guest "owns" a given page of memory. This metadata, referred as Physical Address Metadata Table (PAMT), essentially serves as the 'struct page' for the TDX module. PAMTs are not reserved by hardware up front. They must be allocated by the kernel and then given to the TDX module during module initialization. TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must be a physically contiguous area from a Convertible Memory Region (CMR). However, the PAMTs which track pages in one TDMR do not need to reside within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with any TDMR, the overlapping part must be reported as a reserved area in that particular TDMR. Use alloc_contig_pages() since PAMT must be a physically contiguous area and it may be potentially large (~1/256th of the size of the given TDMR). The downside is alloc_contig_pages() may fail at runtime. One (bad) mitigation is to launch a TDX guest early during system boot to get those PAMTs allocated at early time, but the only way to fix is to add a boot option to allocate or reserve PAMTs during kernel boot. It is imperfect but will be improved on later. TDX only supports a limited number of reserved areas per TDMR to cover both PAMTs and memory holes within the given TDMR. If many PAMTs are allocated within a single TDMR, the reserved areas may not be sufficient to cover all of them. Adopt the following policies when allocating PAMTs for a given TDMR: - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize the total number of reserved areas consumed for PAMTs. - Try to first allocate PAMT from the local node of the TDMR for better NUMA locality. Also dump out how many pages are allocated for PAMTs when the TDX module is initialized successfully. This helps answer the eternal "where did all my memory go?" questions. [ dhansen: merge in error handling cleanup ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-11-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Fill out TDMRs to cover all TDX memory regionsKai Huang
Start to transit out the "multi-steps" to construct a list of "TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions. The kernel configures TDX-usable memory regions by passing a list of TDMRs "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the information of the base/size of a memory region, the base/size of the associated Physical Address Metadata Table (PAMT) and a list of reserved areas in the region. Do the first step to fill out a number of TDMRs to cover all TDX memory regions. To keep it simple, always try to use one TDMR for each memory region. As the first step only set up the base/size for each TDMR. Each TDMR must be 1G aligned and the size must be in 1G granularity. This implies that one TDMR could cover multiple memory regions. If a memory region spans the 1GB boundary and the former part is already covered by the previous TDMR, just use a new TDMR for the remaining part. TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs are consumed but there is more memory region to cover. There are fancier things that could be done like trying to merge adjacent TDMRs. This would allow more pathological memory layouts to be supported. But, current systems are not even close to exhausting the existing TDMR resources in practice. For now, keep it simple. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-10-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regionsKai Huang
After the kernel selects all TDX-usable memory regions, the kernel needs to pass those regions to the TDX module via data structure "TD Memory Region" (TDMR). Add a placeholder to construct a list of TDMRs (in multiple steps) to cover all TDX-usable memory regions. === Long Version === TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. The TDX architecture needs additional metadata to record things like which TD guest "owns" a given page of memory. This metadata essentially serves as the 'struct page' for the TDX module. The space for this metadata is not reserved by the hardware up front and must be allocated by the kernel and given to the TDX module. Since this metadata consumes space, the VMM can choose whether or not to allocate it for a given area of convertible memory. If it chooses not to, the memory cannot receive TDX protections and can not be used by TDX guests as private memory. For every memory region that the VMM wants to use as TDX memory, it sets up a "TD Memory Region" (TDMR). Each TDMR represents a physically contiguous convertible range and must also have its own physically contiguous metadata table, referred to as a Physical Address Metadata Table (PAMT), to track status for each page in the TDMR range. Unlike a CMR, each TDMR requires 1G granularity and alignment. To support physical RAM areas that don't meet those strict requirements, each TDMR permits a number of internal "reserved areas" which can be placed over memory holes. If PAMT metadata is placed within a TDMR it must be covered by one of these reserved areas. Let's summarize the concepts: CMR - Firmware-enumerated physical ranges that support TDX. CMRs are 4K aligned. TDMR - Physical address range which is chosen by the kernel to support TDX. 1G granularity and alignment required. Each TDMR has reserved areas where TDX memory holes and overlapping PAMTs can be represented. PAMT - Physically contiguous TDX metadata. One table for each page size per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G PAMT. As one step of initializing the TDX module, the kernel configures TDX-usable memory regions by passing a list of TDMRs to the TDX module. Constructing the list of TDMRs consists below steps: 1) Fill out TDMRs to cover all memory regions that the TDX module will use for TD memory. 2) Allocate and set up PAMT for each TDMR. 3) Designate reserved areas for each TDMR. Add a placeholder to construct TDMRs to do the above steps. To keep things simple, just allocate enough space to hold maximum number of TDMRs up front. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-9-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Get module global metadata for module initializationKai Huang
The TDX module global metadata provides system-wide information about the module. TL;DR: Use the TDH.SYS.RD SEAMCALL to tell if the module is good or not. Long Version: 1) Only initialize TDX module with version 1.5 and later TDX module 1.0 has some compatibility issues with the later versions of module, as documented in the "Intel TDX module ABI incompatibilities between TDX1.0 and TDX1.5" spec. Don't bother with module versions that do not have a stable ABI. 2) Get the essential global metadata for module initialization TDX reports a list of "Convertible Memory Region" (CMR) to tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. The kernel does this by constructing a list of "TD Memory Regions" (TDMRs) to cover all these memory regions and passing them to the TDX module. Each TDMR is a TDX architectural data structure containing the memory region that the TDMR covers, plus the information to track (within this TDMR): a) the "Physical Address Metadata Table" (PAMT) to track each TDX memory page's status (such as which TDX guest "owns" a given page, and b) the "reserved areas" to tell memory holes that cannot be used as TDX memory. The kernel needs to get below metadata from the TDX module to build the list of TDMRs: a) the maximum number of supported TDMRs b) the maximum number of supported reserved areas per TDMR and, c) the PAMT entry size for each TDX-supported page size. == Implementation == The TDX module has two modes of fetching the metadata: a one field at a time, or all in one blob. Use the field at a time for now. It is slower, but there just are not enough fields now to justify the complexity of extra unpacking. The err_free_tdxmem=>out_put_tdxmem goto looks wonky by itself. But it is the first of a bunch of error handling that will get stuck at its site. [ dhansen: clean up changelog and add a struct to map between the TDX module fields and 'struct tdx_tdmr_sysinfo' ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-8-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Use all system memory when initializing TDX module as TDX memoryKai Huang
Start to transit out the "multi-steps" to initialize the TDX module. TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. CMRs tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime. To keep things simple, assume that all TDX-protected memory will come from the page allocator. Make sure all pages in the page allocator *are* TDX-usable memory. As TDX-usable memory is a fixed configuration, take a snapshot of the memory configuration from memblocks at the time of module initialization (memblocks are modified on memory hotplug). This snapshot is used to enable TDX support for *this* memory configuration only. Use a memory hotplug notifier to ensure that no other RAM can be added outside of this configuration. This approach requires all memblock memory regions at the time of module initialization to be TDX convertible memory to work, otherwise module initialization will fail in a later SEAMCALL when passing those regions to the module. This approach works when all boot-time "system RAM" is TDX convertible memory and no non-TDX-convertible memory is hot-added to the core-mm before module initialization. For instance, on the first generation of TDX machines, both CXL memory and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add any CXL memory or NVDIMM to the core-mm before module initialization will result in failure to initialize the module. The SEAMCALL error code will be available in the dmesg to help user to understand the failure. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-7-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Add skeleton to enable TDX on demandKai Huang
There are essentially two steps to get the TDX module ready: 1) Get each CPU ready to run TDX 2) Set up the shared TDX module data structures Introduce and export (to KVM) the infrastructure to do both of these pieces at runtime. == Per-CPU TDX Initialization == Track the initialization status of each CPU with a per-cpu variable. This avoids failures in the case of KVM module reloads and handles cases where CPUs come online later. Generally, the per-cpu SEAMCALLs happen first. But there's actually one global call that has to happen before _any_ others (TDH_SYS_INIT). It's analogous to the boot CPU having to do a bit of extra work just because it happens to be the first one. Track if _any_ CPU has done this call and then only actually do it during the first per-cpu init. == Shared TDX Initialization == Create the global state function (tdx_enable()) as a simple placeholder. The TODO list will be pared down as functionality is added. Use a state machine protected by mutex to make sure the work in tdx_enable() will only be done once. This avoids failures if the KVM module is reloaded. A CPU must be made ready to run TDX before it can participate in initializing the shared parts of the module. Any caller of tdx_enable() need to ensure that it can never run on a CPU which is not ready to run TDX. It needs to be wary of CPU hotplug, preemption and the VMX enabling state of any CPU on which it might run. == Why runtime instead of boot time? == The TDX module can be initialized only once in its lifetime. Instead of always initializing it at boot time, this implementation chooses an "on demand" approach to initialize TDX until there is a real need (e.g when requested by KVM). This approach has below pros: 1) It avoids consuming the memory that must be allocated by kernel and given to the TDX module as metadata (~1/256th of the TDX-usable memory), and also saves the CPU cycles of initializing the TDX module (and the metadata) when TDX is not used at all. 2) The TDX module design allows it to be updated while the system is running. The update procedure shares quite a few steps with this "on demand" initialization mechanism. The hope is that much of "on demand" mechanism can be shared with a future "update" mechanism. A boot-time TDX module implementation would not be able to share much code with the update mechanism. 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM code mucks with VMX enabling. If the TDX module were to be initialized separately from KVM (like at boot), the boot code would need to be taught how to muck with VMX enabling and KVM would need to be taught how to cope with that. Making KVM itself responsible for TDX initialization lets the rest of the kernel stay blissfully unaware of VMX. [ dhansen: completely reorder/rewrite changelog ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-6-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Add SEAMCALL error printing for module initializationKai Huang
The SEAMCALLs involved during the TDX module initialization are not expected to fail. In fact, they are not expected to return any non-zero code (except the "running out of entropy error", which can be handled internally already). Add yet another set of SEAMCALL wrappers, which treats all non-zero return code as error, to support printing SEAMCALL error upon failure for module initialization. Note the TDX module initialization doesn't use the _saved_ret() variant thus no wrapper is added for it. SEAMCALL assembly can also return kernel-defined error codes for three special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't loaded; 3) CPU isn't in VMX operation. Whether they can legally happen depends on the caller, so leave to the caller to print error message when desired. Also convert the SEAMCALL error codes to the kernel error codes in the new wrappers so that each SEAMCALL caller doesn't have to repeat the conversion. [ dhansen: Align the register dump with show_regs(). Zero-pad the contents, split on two lines and use consistent spacing. ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-5-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Handle SEAMCALL no entropy error in common codeKai Huang
Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons as RDRAND. Use the kernel RDRAND retry logic for them. There are three __seamcall*() variants. Do the SEAMCALL retry in common code and add a wrapper for each of them. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirll.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-4-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APICKai Huang
TDX capable platforms are locked to X2APIC mode and cannot fall back to the legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support requires x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/ Link: https://lore.kernel.org/all/20231208170740.53979-3-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Define TDX supported page sizes as macrosKai Huang
TDX supports 4K, 2M and 1G page sizes. The corresponding values are defined by the TDX module spec and used as TDX module ABI. Currently, they are used in try_accept_one() when the TDX guest tries to accept a page. However currently try_accept_one() uses hard-coded magic values. Define TDX supported page sizes as macros and get rid of the hard-coded values in try_accept_one(). TDX host support will need to use them too. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/20231208170740.53979-2-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Detect TDX during kernel bootKai Huang
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. Pre-TDX Intel hardware has support for a memory encryption architecture called MKTME. The memory encryption hardware underpinning MKTME is also used for Intel TDX. TDX ends up "stealing" some of the physical address space from the MKTME architecture for crypto-protection to VMs. The BIOS is responsible for partitioning the "KeyID" space between legacy MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs' for short. During machine boot, TDX microcode verifies that the BIOS programmed TDX private KeyIDs consistently and correctly programmed across all CPU packages. The MSRs are locked in this state after verification. This is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration: it indicates not just that the hardware supports TDX, but that all the boot-time security checks passed. The TDX module is expected to be loaded by the BIOS when it enables TDX, but the kernel needs to properly initialize it before it can be used to create and run any TDX guests. The TDX module will be initialized by the KVM subsystem when KVM wants to use TDX. Detect platform TDX support by detecting TDX private KeyIDs. The TDX module itself requires one TDX KeyID as the 'TDX global KeyID' to protect its metadata. Each TDX guest also needs a TDX KeyID for its own protection. Just use the first TDX KeyID as the global KeyID and leave the rest for TDX guests. If no TDX KeyID is left for TDX guests, disable TDX as initializing the TDX module alone is useless. [ dhansen: add X86_FEATURE, replace helper function ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com
2023-12-03Linux 6.7-rc4Linus Torvalds
2023-12-03Merge tag 'v6.7-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - Two fallocate fixes - Fix warnings from new gcc - Two symlink fixes * tag 'v6.7-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client, common: fix fortify warnings cifs: Fix FALLOC_FL_INSERT_RANGE by setting i_size after EOF moved cifs: Fix FALLOC_FL_ZERO_RANGE by setting i_size if EOF moved smb: client: report correct st_size for SMB and NFS symlinks smb: client: fix missing mode bits for SMB symlinks
2023-12-03Merge tag 'firewire-fixes-6.7-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 Pull firewire fix from Takashi Sakamoto: "A single patch to fix long-standing issue of memory leak at failure of device registration for fw_unit. We rarely encounter the issue, but it should be applied to stable releases, since it fixes inappropriate API usage" * tag 'firewire-fixes-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: firewire: core: fix possible memory leak in create_units()
2023-12-03Merge tag 'powerpc-6.7-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix corruption of f0/vs0 during FP/Vector save, seen as userspace crashes when using io-uring workers (in particular with MariaDB) - Fix KVM_RUN potentially clobbering all host userspace FP/Vector registers Thanks to Timothy Pearson, Jens Axboe, and Nicholas Piggin. * tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: KVM: PPC: Book3S HV: Fix KVM_RUN clobbering FP/VEC user registers powerpc: Don't clobber f0/vs0 during fp|altivec register save
2023-12-03Merge tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfioLinus Torvalds
Pull vfio fixes from Alex Williamson: - Fix the lifecycle of a mutex in the pds variant driver such that a reset prior to opening the device won't find it uninitialized. Implement the release path to symmetrically destroy the mutex. Also switch a different lock from spinlock to mutex as the code path has the potential to sleep and doesn't need the spinlock context otherwise (Brett Creeley) - Fix an issue detected via randconfig where KVM tries to symbol_get an undeclared function. The symbol is temporarily declared unconditionally here, which resolves the problem and avoids churn relative to a series pending for the next merge window which resolves some of this symbol ugliness, but also fixes Kconfig dependencies (Sean Christopherson) * tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio: vfio: Drop vfio_file_iommu_group() stub to fudge around a KVM wart vfio/pds: Fix possible sleep while in atomic context vfio/pds: Fix mutex lock->magic != lock warning
2023-12-03Merge tag 'for-linus-6.7a-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - A fix for the Xen event driver setting the correct return value when experiencing an allocation failure - A fix for allocating space for a struct in the percpu area to not cross page boundaries (this one is for x86, a similar one for Arm was already in the pull request for rc3) * tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/events: fix error code in xen_bind_pirq_msi_to_irq() x86/xen: fix percpu vcpu_info allocation
2023-12-03Merge tag 'probes-fixes-v6.7-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - objpool: Fix objpool overrun case on memory/cache access delay especially on the big.LITTLE SoC. The objpool uses a copy of object slot index internal loop, but the slot index can be changed on another processor in parallel. In that case, the difference of 'head' local copy and the 'slot->last' index will be bigger than local slot size. In that case, we need to re-read the slot::head to update it. - kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since kretprobe_holder::rp is RCU managed, it should use rcu_assign_pointer() and rcu_dereference_check() correctly. Also adding __rcu tag for finding wrong usage by sparse. - rethook: Fix to use appropriate rcu API for rethook::handler. The same as kretprobe, rethook::handler is RCU managed and it should use rcu_assign_pointer() and rcu_dereference_check(). This also adds __rcu tag for finding wrong usage by sparse. * tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rethook: Use __rcu pointer for rethook::handler kprobes: consistent rcu api usage for kretprobe holder lib: objpool: fix head overrun on RK3588 SBC
2023-12-02Merge tag 'pm-6.7-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix issues in two cpufreq drivers, in the AMD P-state driver and in the power-capping DTPM framework. Specifics: - Fix the AMD P-state driver's EPP sysfs interface in the cases when the performance governor is in use (Ayush Jain) - Make the ->fast_switch() callback in the AMD P-state driver return the target frequency as expected (Gautham R. Shenoy) - Allow user space to control the range of frequencies to use via scaling_min_freq and scaling_max_freq when AMD P-state driver is in use (Wyes Karny) - Prevent power domains needed for wakeup signaling from being turned off during system suspend on Qualcomm systems and prevent performance states votes from runtime-suspended devices from being lost across a system suspend-resume cycle in qcom-cpufreq-nvmem (Stephan Gerhold) - Fix disabling the 792 Mhz OPP in the imx6q cpufreq driver for the i.MX6ULL types that can run at that frequency (Christoph Niedermaier) - Eliminate unnecessary and harmful conversions to uW from the DTPM (dynamic thermal and power management) framework (Lukasz Luba)" * tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq/amd-pstate: Only print supported EPP values for performance governor cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update powercap: DTPM: Fix unneeded conversions to micro-Watts cpufreq/amd-pstate: Fix the return value of amd_pstate_fast_switch() pmdomain: qcom: rpmpd: Set GENPD_FLAG_ACTIVE_WAKEUP cpufreq: qcom-nvmem: Preserve PM domain votes in system suspend cpufreq: qcom-nvmem: Enable virtual power domain devices cpufreq: imx6q: Don't disable 792 Mhz OPP unnecessarily
2023-12-02Merge tag 'acpi-6.7-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "This fixes a recently introduced build issue on ARM32 and a NULL pointer dereference in the ACPI backlight driver due to a design issue exposed by a recent change in the ACPI bus type code. Specifics: - Fix a recently introduced build issue on ARM32 platforms caused by an inadvertent header file breakage (Dave Jiang) - Eliminate questionable usage of acpi_driver_data() in the ACPI backlight cooling device code that leads to NULL pointer dereferences after recent ACPI core changes (Hans de Goede)" * tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: video: Use acpi_video_device for cooling-dev driver data ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
2023-12-02Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fix from Catalin Marinas: "Fix a regression where the arm64 KPTI ends up enabled even on systems that don't need it" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: Avoid enabling KPTI unnecessarily
2023-12-02Merge tag 'iommu-fixes-v6.7-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - Fix race conditions in device probe path - Handle ERR_PTR() returns in __iommu_domain_alloc() path - Update MAINTAINERS entry for Qualcom IOMMUs - Printk argument fix in device tree specific code - Several Intel VT-d fixes from Lu Baolu: - Do not support enforcing cache coherency for non-empty domains - Avoid devTLB invalidation if iommu is off - Disable PCI ATS in legacy passthrough mode - Support non-PCI devices when clearing context - Fix incorrect cache invalidation for mm notification - Add MTL to quirk list to skip TE disabling - Set variable intel_dirty_ops to static * tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu: Fix printk arg in of_iommu_get_resv_regions() iommu/vt-d: Set variable intel_dirty_ops to static iommu/vt-d: Fix incorrect cache invalidation for mm notification iommu/vt-d: Add MTL to quirk list to skip TE disabling iommu/vt-d: Make context clearing consistent with context mapping iommu/vt-d: Disable PCI ATS in legacy passthrough mode iommu/vt-d: Omit devTLB invalidation requests when TES=0 iommu/vt-d: Support enforce_cache_coherency only for empty domains iommu: Avoid more races around device probe MAINTAINERS: list all Qualcomm IOMMU drivers in the QUALCOMM IOMMU entry iommu: Flow ERR_PTR out from __iommu_domain_alloc()
2023-12-02Merge tag 'sound-6.7-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "No surprise here, including only a collection of HD-audio device-specific small fixes" * tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: hda: Disable power-save on KONTRON SinglePC ALSA: hda/realtek: Add supported ALC257 for ChromeOS ALSA: hda/realtek: Headset Mic VREF to 100% ALSA: hda: intel-nhlt: Ignore vbps when looking for DMIC 32 bps format ALSA: hda: cs35l56: Enable low-power hibernation mode on SPI ALSA: cs35l41: Fix for old systems which do not support command ALSA: hda: cs35l41: Remove unnecessary boolean state variable firmware_running ALSA: hda - Fix speaker and headset mic pin config for CHUWI CoreBook XPro
2023-12-02Merge tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm fixes from Dave Airlie: "Weekly fixes, mostly amdgpu fixes with a scattering of nouveau, i915, and a couple of reverts. Hopefully it will quieten down in coming weeks. drm: - Revert unexport of prime helpers for fd/handle conversion dma_resv: - Do not double add fences in dma_resv_add_fence. gpuvm: - Fix GPUVM license identifier. i915: - Mark internal GSC engine with reserved uabi class - Take VGA converters into account in eDP probe - Fix intel_pre_plane_updates() call to ensure workarounds get applied panel: - Revert panel fixes as they require exporting device_is_dependent. nouveau: - fix oversized allocations in new vm path - fix zero-length array - remove a stray lock nt36523: - Fix error check for nt36523. amdgpu: - DMUB fix - DCN 3.5 fixes - XGMI fix - DCN 3.2 fixes - Vangogh suspend fix - NBIO 7.9 fix - GFX11 golden register fix - Backlight fix - NBIO 7.11 fix - IB test overflow fix - DCN 3.1.4 fixes - fix a runtime pm ref count - Retimer fix - ABM fix - DCN 3.1.5 fix - Fix AGP addressing - Fix possible memory leak in SMU error path - Make sure PME is enabled in D3 - Fix possible NULL pointer dereference in debugfs - EEPROM fix - GC 9.4.3 fix amdkfd: - IP version check fix - Fix memory leak in pqm_uninit()" * tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm: (53 commits) Revert "drm/prime: Unexport helpers for fd/handle conversion" drm/amdgpu: Use another offset for GC 9.4.3 remap drm/amd/display: Fix some HostVM parameters in DML drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit drm/amdgpu: Update EEPROM I2C address for smu v13_0_0 drm/amd/display: Allow DTBCLK disable for DCN35 drm/amdgpu: Fix cat debugfs amdgpu_regs_didt causes kernel null pointer drm/amd: Enable PCIe PME from D3 drm/amd/pm: fix a memleak in aldebaran_tables_init drm/amdgpu: fix AGP addressing when GART is not at 0 drm/amd/display: update dcn315 lpddr pstate latency drm/amd/display: fix ABM disablement drm/amd/display: Fix black screen on video playback with embedded panel drm/amd/display: Fix conversions between bytes and KB drm/amdkfd: Use common function for IP version check drm/amd/display: Remove config update drm/amd/display: Update DCN35 clock table policy drm/amd/display: force toggle rate wa for first link training for a retimer drm/amdgpu: correct the amdgpu runtime dereference usage count drm/amd/display: Update min Z8 residency time to 2100 for DCN314 ...
2023-12-02Merge tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix an issue with discontig page checking for IORING_SETUP_NO_MMAP - Fix an issue with not allowing IORING_SETUP_NO_MMAP also disallowing mmap'ed buffer rings - Fix an issue with deferred release of memory mapped pages - Fix a lockdep issue with IORING_SETUP_NO_MMAP - Use fget/fput consistently, even from our sync system calls. No real issue here, but if we were ever to allow closing io_uring descriptors it would be required. Let's play it safe and just use the full ref counted versions upfront. Most uses of io_uring are threaded anyway, and hence already doing the full version underneath. * tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux: io_uring: use fget/fput consistently io_uring: free io_buffer_list entries via RCU io_uring/kbuf: prune deferred locked cache when tearing down io_uring/kbuf: recycle freed mapped buffer ring entries io_uring/kbuf: defer release of mapped buffer rings io_uring: enable io_mem_alloc/free to be used in other parts io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAP io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAP
2023-12-02Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Invalid namespace identification error handling (Marizio Ewan, Keith) - Fabrics keep-alive tuning (Mark) - Fix for a bad error check regression in bcache (Markus) - Fix for a performance regression with O_DIRECT (Ming) - Fix for a flush related deadlock (Ming) - Make the read-only warn on per-partition (Yu) * tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux: nvme-core: check for too small lba shift blk-mq: don't count completed flush data request as inflight in case of quiesce block: Document the role of the two attribute groups block: warn once for each partition in bio_check_ro() block: move .bd_inode into 1st cacheline of block_device nvme: check for valid nvme_identify_ns() before using it nvme-core: fix a memory leak in nvme_ns_info_from_identify() nvme: fine-tune sending of first keep-alive bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
2023-12-02Merge tag 'dm-6.7/dm-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM verity target's FEC support to always initialize IO before it frees it. Also fix alignment of struct dm_verity_fec_io within the per-bio-data - Fix DM verity target to not FEC failed readahead IO - Update DM flakey target to use MAX_ORDER rather than MAX_ORDER - 1 * tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-flakey: start allocating with MAX_ORDER dm-verity: align struct dm_verity_fec_io properly dm verity: don't perform FEC for failed readahead IO dm verity: initialize fec io before freeing it
2023-12-02Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Three small fixes, one in drivers. The core changes are to the internal representation of flags in scsi_devices which removes space wasting bools in favour of single bit flags and to add a flag to force a runtime resume which is used by ATA devices" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: sd: Fix system start for ATA devices scsi: Change SCSI device boolean fields to single bit flags scsi: ufs: core: Clear cmd if abort succeeds in MCQ mode
2023-12-02Merge tag 'fs_for_v6.7-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2 fix from Jan Kara: "Fix an ext2 bug introduced by changes in ext2 & iomap stepping on each other toes (apparently ext2 driver does not get much testing in linux-next)" * tag 'fs_for_v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: Fix ki_pos update for DIO buffered-io fallback case
2023-12-02Merge tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefsLinus Torvalds
Pull more bcachefs bugfixes from Kent Overstreet: - bcache & bcachefs were broken with CFI enabled; patch for closures to fix type punning - mark erasure coding as extra-experimental; there are incompatible disk space accounting changes coming for erasure coding, and I'm still seeing checksum errors in some tests - several fixes for durability-related issues (durability is a device specific setting where we can tell bcachefs that data on a given device should be counted as replicated x times) - a fix for a rare livelock when a btree node merge then updates a parent node that is almost full - fix a race in the device removal path, where dropping a pointer in a btree node to a device would be clobbered by an in flight btree write updating the btree node key on completion - fix one SRCU lock hold time warning in the btree gc code - ther's still a bunch more of these to fix - fix a rare race where we'd start copygc before initializing the "are we rw" percpu refcount; copygc would think we were already ro and die immediately * tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits) bcachefs: Extra kthread_should_stop() calls for copygc bcachefs: Convert gc_alloc_start() to for_each_btree_key2() bcachefs: Fix race between btree writes and metadata drop bcachefs: move journal seq assertion bcachefs: -EROFS doesn't count as move_extent_start_fail bcachefs: trace_move_extent_start_fail() now includes errcode bcachefs: Fix split_race livelock bcachefs: Fix bucket data type for stripe buckets bcachefs: Add missing validation for jset_entry_data_usage bcachefs: Fix zstd compress workspace size bcachefs: bpos is misaligned on big endian bcachefs: Fix ec + durability calculation bcachefs: Data update path won't accidentaly grow replicas bcachefs: deallocate_extra_replicas() bcachefs: Proper refcounting for journal_keys bcachefs: preserve device path as device name bcachefs: Fix an endianness conversion bcachefs: Start gc, copygc, rebalance threads after initing writes ref bcachefs: Don't stop copygc thread on device resize bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops ...
2023-12-01Merge branch 'acpi-tables'Rafael J. Wysocki
Merge a fix for a recently introduced build issue on ARM32 platforms caused by an inadvertent header file breakage (Dave Jiang). * acpi-tables: ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
2023-12-01Merge branch 'powercap'Rafael J. Wysocki
Merge a power capping fix for 6.7-rc4 which eliminates unnecessary and harmful conversions to uW from the DTPM (dynamic thermal and power management) framework (Lukasz Luba). * powercap: powercap: DTPM: Fix unneeded conversions to micro-Watts
2023-12-01Merge tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme into block-6.7Jens Axboe
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.7 - Invalid namespace identification error handling (Marizio Ewan, Keith) - Fabrics keep-alive tuning (Mark)" * tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme: nvme-core: check for too small lba shift nvme: check for valid nvme_identify_ns() before using it nvme-core: fix a memory leak in nvme_ns_info_from_identify() nvme: fine-tune sending of first keep-alive
2023-12-01nvme-core: check for too small lba shiftKeith Busch
The block layer doesn't support logical block sizes smaller than 512 bytes. The nvme spec doesn't support that small either, but the driver isn't checking to make sure the device responded with usable data. Failing to catch this will result in a kernel bug, either from a division by zero when stacking, or a zero length bio. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-01blk-mq: don't count completed flush data request as inflight in case of quiesceMing Lei
Request queue quiesce may interrupt flush sequence, and the original request may have been marked as COMPLETE, but can't get finished because of queue quiesce. This way is fine from driver viewpoint, because flush sequence is block layer concept, and it isn't related with driver. However, driver(such as dm-rq) can call blk_mq_queue_inflight() to count & drain inflight requests, then the wait & drain never gets done because the completed & not-finished flush request is counted as inflight. Fix this issue by not counting completed flush data request as inflight in case of quiesce. Cc: Mike Snitzer <snitzer@kernel.org> Cc: David Jeffery <djeffery@redhat.com> Cc: John Pittman <jpittman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-01iommu: Fix printk arg in of_iommu_get_resv_regions()Daniel Mentz
The variable phys is defined as (struct resource *) which aligns with the printk format specifier %pr. Taking the address of it results in a value of type (struct resource **) which is incompatible with the format specifier %pr. Therefore, remove the address of operator (&). Fixes: a5bf3cfce8cb ("iommu: Implement of_iommu_get_resv_regions()") Signed-off-by: Daniel Mentz <danielmentz@google.com> Acked-by: Thierry Reding <treding@nvidia.com> Link: https://lore.kernel.org/r/20231108062226.928985-1-danielmentz@google.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-12-01rethook: Use __rcu pointer for rethook::handlerMasami Hiramatsu (Google)
Since the rethook::handler is an RCU-maganged pointer so that it will notice readers the rethook is stopped (unregistered) or not, it should be an __rcu pointer and use appropriate functions to be accessed. This will use appropriate memory barrier when accessing it. OTOH, rethook::data is never changed, so we don't need to check it in get_kretprobe(). NOTE: To avoid sparse warning, rethook::handler is defined by a raw function pointer type with __rcu instead of rethook_handler_t. Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/ Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook") Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/ Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01kprobes: consistent rcu api usage for kretprobe holderJP Kobryn
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is RCU-managed, based on the (non-rethook) implementation of get_kretprobe(). The thought behind this patch is to make use of the RCU API where possible when accessing this pointer so that the needed barriers are always in place and to self-document the code. The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes done to the "rp" pointer are changed to make use of the RCU macro for assignment. For the single read, the implementation of get_kretprobe() is simplified by making use of an RCU macro which accomplishes the same, but note that the log warning text will be more generic. I did find that there is a difference in assembly generated between the usage of the RCU macros vs without. For example, on arm64, when using rcu_assign_pointer(), the corresponding store instruction is a store-release (STLR) which has an implicit barrier. When normal assignment is done, a regular store (STR) is found. In the macro case, this seems to be a result of rcu_assign_pointer() using smp_store_release() when the value to write is not NULL. Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/ Fixes: d741bf41d7c7 ("kprobes: Remove kretprobe hash") Cc: stable@vger.kernel.org Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01lib: objpool: fix head overrun on RK3588 SBCwuqiang.matt
objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the following kernel warnings: WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100 This message is from objpool.c:168: WARN_ON_ONCE(tail - head > pool->nr_objs); The overrun test case is to validate the case that pre-allocated objects are insufficient: 8 objects are pre-allocated for each node and consumer thread per node tries to grab 16 objects in a row. The testing system is OrangePI 5+, with RK3588, a big.LITTLE SOC with 4x A76 and 4x A55. When disabling either all 4 big or 4 little cores, the overrun tests run well, and once with big and little cores mixed together, the overrun test would always cause an overrun loop. It's likely the memory timing differences of big and little cores cause this trouble. Here are the debugging data of objpool_try_get_slot after try_cmpxchg_release: objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278 The local copies of 'head' and 'last' were 278 and 276, and reloading of 'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release 'slot->head' became 'head + 1', which is correct. But what's wrong here is the stale value of 'last', and that stale value of 'last' finally led the overrun of 'head'. Memory updating of 'last' and 'head' are performed in push() and pop() independently, which could be the culprit leading this out of order visibility of 'last' and 'head'. So for objpool_try_get_slot(), it's not enough only checking the condition of 'head != slot', the implicit condition 'last - head <= nr_objs' must also be explicitly asserted to guarantee 'last' is always behind 'head' before the object retrieving. This patch will check and try reloading of 'head' and 'last' to ensure 'last' is behind 'head' at the time of object retrieving. Performance testings show the average impact is about 0.1% for X86_64 and 1.12% for ARM64. Here are the results: OS: Debian 10 X86_64, Linux 6.6rc HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s 1T 2T 4T 8T 16T native: 49543304 99277826 199017659 399070324 795185848 objpool: 29909085 59865637 119692073 239750369 478005250 objpool+: 29879313 59230743 119609856 239067773 478509029 32T 48T 64T 96T 128T native: 1596927073 2390099988 2929397330 3183875848 3257546602 objpool: 957553042 1435814086 1680872925 2043126796 2165424198 objpool+: 956476281 1434491297 1666055740 2041556569 2157415622 OS: Debian 11 AARCH64, Linux 6.6rc HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s 1T 2T 4T 8T 16T native: 30890508 60399915 123111980 242257008 494002946 objpool: 14742531 28883047 57739948 115886644 232455421 objpool+: 14107220 29032998 57286084 113730493 232232850 24T 32T 48T 64T 96T native: 746406039 1000174750 1493236240 1998318364 2942911180 objpool: 349164852 467284332 702296756 934459713 1387898285 objpool+: 348388180 462750976 696606096 927865887 1368402195 Link: https://lore.kernel.org/all/20231114115148.298821-1-wuqiang.matt@bytedance.com/ Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC") Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01Merge tag 'hardening-v6.7-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - struct_group: propagate attributes to top-level union (Dmitry Antipov) - gcc-plugins: randstruct: Update code comment in relayout_struct (Gustavo A. R. Silva) - MAINTAINERS: refresh LLVM support (Nick Desaulniers) * tag 'hardening-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: randstruct: Update code comment in relayout_struct() uapi: propagate __struct_group() attributes to the container union MAINTAINERS: refresh LLVM support