summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu/sgx
AgeCommit message (Collapse)Author
2022-07-07x86/sgx: Export sgx_encl_page_alloc()Jarkko Sakkinen
Move sgx_encl_page_alloc() to encl.c and export it so that it can be used in the implementation for support of adding pages to initialized enclaves, which requires to allocate new enclave pages. Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/57ae71b4ea17998467670232e12d6617b95c6811.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Export sgx_encl_{grow,shrink}()Reinette Chatre
In order to use sgx_encl_{grow,shrink}() in the page augmentation code located in encl.c, export these functions. Suggested-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/d51730acf54b6565710b2261b3099517b38c2ec4.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Keep record of SGX page typeReinette Chatre
SGX2 functions are not allowed on all page types. For example, ENCLS[EMODPR] is only allowed on regular SGX enclave pages and ENCLS[EMODPT] is only allowed on TCS and regular pages. If these functions are attempted on another type of page the hardware would trigger a fault. Keep a record of the SGX page type so that there is more certainty whether an SGX2 instruction can succeed and faults can be treated as real failures. The page type is a property of struct sgx_encl_page and thus does not cover the VA page type. VA pages are maintained in separate structures and their type can be determined in a different way. The SGX2 instructions needing the page type do not operate on VA pages and this is thus not a scenario needing to be covered at this time. struct sgx_encl_page hosting this information is maintained for each enclave page so the space consumed by the struct is important. The existing sgx_encl_page->vm_max_prot_bits is already unsigned long while only using three bits. Transition to a bitfield for the two members to support the additional information without increasing the space consumed by the struct. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/a0a6939eefe7ba26514f6c49723521cde372de64.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Create utility to validate user provided offset and lengthReinette Chatre
User provided offset and length is validated when parsing the parameters of the SGX_IOC_ENCLAVE_ADD_PAGES ioctl(). Extract this validation (with consistent use of IS_ALIGNED) into a utility that can be used by the SGX2 ioctl()s that will also provide these values. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/767147bc100047abed47fe27c592901adfbb93a2.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Make sgx_ipi_cb() available internallyReinette Chatre
The ETRACK function followed by an IPI to all CPUs within an enclave is a common pattern with more frequent use in support of SGX2. Make the (empty) IPI callback function available internally in preparation for usage by SGX2. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/1179ed4a9c3c1c2abf49d51bfcf2c30b493181cc.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes()Reinette Chatre
The SGX reclaimer removes page table entries pointing to pages that are moved to swap. SGX2 enables changes to pages belonging to an initialized enclave, thus enclave pages may have their permission or type changed while the page is being accessed by an enclave. Supporting SGX2 requires page table entries to be removed so that any cached mappings to changed pages are removed. For example, with the ability to change enclave page types a regular enclave page may be changed to a Thread Control Structure (TCS) page that may not be accessed by an enclave. Factor out the code removing page table entries to a separate function sgx_zap_enclave_ptes(), fixing accuracy of comments in the process, and make it available to the upcoming SGX2 code. Place sgx_zap_enclave_ptes() with the rest of the enclave code in encl.c interacting with the page table since this code is no longer unique to the reclaimer. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/b010cdf01d7ce55dd0f00e883b7ccbd9db57160a.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Rename sgx_encl_ewb_cpumask() as sgx_encl_cpumask()Reinette Chatre
sgx_encl_ewb_cpumask() is no longer unique to the reclaimer where it is used during the EWB ENCLS leaf function when EPC pages are written out to main memory and sgx_encl_ewb_cpumask() is used to learn which CPUs might have executed the enclave to ensure that TLBs are cleared. Upcoming SGX2 enabling will use sgx_encl_ewb_cpumask() during the EMODPR and EMODT ENCLS leaf functions that make changes to enclave pages. The function is needed for the same reason it is used now: to learn which CPUs might have executed the enclave to ensure that TLBs no longer point to the changed pages. Rename sgx_encl_ewb_cpumask() to sgx_encl_cpumask() to reflect the broader usage. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/d4d08c449450a13d8dd3bb6c2b1af03895586d4f.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Export sgx_encl_ewb_cpumask()Reinette Chatre
Using sgx_encl_ewb_cpumask() to learn which CPUs might have executed an enclave is useful to ensure that TLBs are cleared when changes are made to enclave pages. sgx_encl_ewb_cpumask() is used within the reclaimer when an enclave page is evicted. The upcoming SGX2 support enables changes to be made to enclave pages and will require TLBs to not refer to the changed pages and thus will be needing sgx_encl_ewb_cpumask(). Relocate sgx_encl_ewb_cpumask() to be with the rest of the enclave code in encl.c now that it is no longer unique to the reclaimer. Take care to ensure that any future usage maintains the current context requirement that ETRACK has been called first. Expand the existing comments to highlight this while moving them to a more prominent location before the function. No functional change. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/05b60747fd45130cf9fc6edb1c373a69a18a22c5.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Support loading enclave page without VMA permissions checkReinette Chatre
sgx_encl_load_page() is used to find and load an enclave page into enclave (EPC) memory, potentially loading it from the backing storage. Both usages of sgx_encl_load_page() are during an access to the enclave page from a VMA and thus the permissions of the VMA are considered before the enclave page is loaded. SGX2 functions operating on enclave pages belonging to an initialized enclave requiring the page to be in EPC. It is thus required to support loading enclave pages into the EPC independent from a VMA. Split the current sgx_encl_load_page() to support the two usages: A new call, sgx_encl_load_page_in_vma(), behaves exactly like the current sgx_encl_load_page() that takes VMA permissions into account, while sgx_encl_load_page() just loads an enclave page into EPC. VMA, PTE, and EPCM permissions continue to dictate whether the pages can be accessed from within an enclave. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/d4393513c1f18987c14a490bcf133bfb71a5dc43.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Add wrapper for SGX2 EAUG functionReinette Chatre
Add a wrapper for the EAUG ENCLS leaf function used to add a page to an initialized enclave. EAUG: 1) Stores all properties of the new enclave page in the SGX hardware's Enclave Page Cache Map (EPCM). 2) Sets the PENDING bit in the EPCM entry of the enclave page. This bit is cleared by the enclave by invoking ENCLU leaf function EACCEPT or EACCEPTCOPY. Access from within the enclave to the new enclave page is not possible until the PENDING bit is cleared. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/97a46754fe4764e908651df63694fb760f783d6e.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Add wrapper for SGX2 EMODT functionReinette Chatre
Add a wrapper for the EMODT ENCLS leaf function used to change the type of an enclave page as maintained in the SGX hardware's Enclave Page Cache Map (EPCM). EMODT: 1) Updates the EPCM page type of the enclave page. 2) Sets the MODIFIED bit in the EPCM entry of the enclave page. This bit is reset by the enclave by invoking ENCLU leaf function EACCEPT or EACCEPTCOPY. Access from within the enclave to the enclave page is not possible while the MODIFIED bit is set. After changing the enclave page type by issuing EMODT the kernel needs to collaborate with the hardware to ensure that no logical processor continues to hold a reference to the changed page. This is required to ensure no required security checks are circumvented and is required for the enclave's EACCEPT/EACCEPTCOPY to succeed. Ensuring that no references to the changed page remain is accomplished with the ETRACK flow. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/dba63a8c0db1d510b940beee1ba2a8207efeb1f1.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Add wrapper for SGX2 EMODPR functionReinette Chatre
Add a wrapper for the EMODPR ENCLS leaf function used to restrict enclave page permissions as maintained in the SGX hardware's Enclave Page Cache Map (EPCM). EMODPR: 1) Updates the EPCM permissions of an enclave page by treating the new permissions as a mask. Supplying a value that attempts to relax EPCM permissions has no effect on EPCM permissions (PR bit, see below, is changed). 2) Sets the PR bit in the EPCM entry of the enclave page to indicate that permission restriction is in progress. The bit is reset by the enclave by invoking ENCLU leaf function EACCEPT or EACCEPTCOPY. The enclave may access the page throughout the entire process if conforming to the EPCM permissions for the enclave page. After performing the permission restriction by issuing EMODPR the kernel needs to collaborate with the hardware to ensure that all logical processors sees the new restricted permissions. This is required for the enclave's EACCEPT/EACCEPTCOPY to succeed and is accomplished with the ETRACK flow. Expand enum sgx_return_code with the possible EMODPR return values. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/d15e7a769e13e4ca671fa2d0a0d3e3aec5aedbd4.1652137848.git.reinette.chatre@intel.com
2022-07-07x86/sgx: Add short descriptions to ENCLS wrappersReinette Chatre
The SGX ENCLS instruction uses EAX to specify an SGX function and may require additional registers, depending on the SGX function. ENCLS invokes the specified privileged SGX function for managing and debugging enclaves. Macros are used to wrap the ENCLS functionality and several wrappers are used to wrap the macros to make the different SGX functions accessible in the code. The wrappers of the supported SGX functions are cryptic. Add short descriptions of each as a comment. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/5e78a1126711cbd692d5b8132e0683873398f69e.1652137848.git.reinette.chatre@intel.com
2022-06-02x86/sgx: Set active memcg prior to shmem allocationKristen Carlson Accardi
When the system runs out of enclave memory, SGX can reclaim EPC pages by swapping to normal RAM. These backing pages are allocated via a per-enclave shared memory area. Since SGX allows unlimited over commit on EPC memory, the reclaimer thread can allocate a large number of backing RAM pages in response to EPC memory pressure. When the shared memory backing RAM allocation occurs during the reclaimer thread context, the shared memory is charged to the root memory control group, and the shmem usage of the enclave is not properly accounted for, making cgroups ineffective at limiting the amount of RAM an enclave can consume. For example, when using a cgroup to launch a set of test enclaves, the kernel does not properly account for 50% - 75% of shmem page allocations on average. In the worst case, when nearly all allocations occur during the reclaimer thread, the kernel accounts less than a percent of the amount of shmem used by the enclave's cgroup to the correct cgroup. SGX stores a list of mm_structs that are associated with an enclave. Pick one of them during reclaim and charge that mm's memcg with the shmem allocation. The one that gets picked is arbitrary, but this list almost always only has one mm. The cases where there is more than one mm with different memcg's are not worth considering. Create a new function - sgx_encl_alloc_backing(). This function is used whenever a new backing storage page needs to be allocated. Previously the same function was used for page allocation as well as retrieving a previously allocated page. Prior to backing page allocation, if there is a mm_struct associated with the enclave that is requesting the allocation, it is set as the active memory control group. [ dhansen: - fix merge conflict with ELDU fixes - check against actual ksgxd_tsk, not ->mm ] Cc: stable@vger.kernel.org Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lkml.kernel.org/r/20220520174248.4918-1-kristen@linux.intel.com
2022-05-16x86/sgx: Ensure no data in PCMD page after truncateReinette Chatre
A PCMD (Paging Crypto MetaData) page contains the PCMD structures of enclave pages that have been encrypted and moved to the shmem backing store. When all enclave pages sharing a PCMD page are loaded in the enclave, there is no need for the PCMD page and it can be truncated from the backing store. A few issues appeared around the truncation of PCMD pages. The known issues have been addressed but the PCMD handling code could be made more robust by loudly complaining if any new issue appears in this area. Add a check that will complain with a warning if the PCMD page is not actually empty after it has been truncated. There should never be data in the PCMD page at this point since it is was just checked to be empty and truncated with enclave mutex held and is updated with the enclave mutex held. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Link: https://lkml.kernel.org/r/6495120fed43fafc1496d09dd23df922b9a32709.1652389823.git.reinette.chatre@intel.com
2022-05-16x86/sgx: Fix race between reclaimer and page fault handlerReinette Chatre
Haitao reported encountering a WARN triggered by the ENCLS[ELDU] instruction faulting with a #GP. The WARN is encountered when the reclaimer evicts a range of pages from the enclave when the same pages are faulted back right away. Consider two enclave pages (ENCLAVE_A and ENCLAVE_B) sharing a PCMD page (PCMD_AB). ENCLAVE_A is in the enclave memory and ENCLAVE_B is in the backing store. PCMD_AB contains just one entry, that of ENCLAVE_B. Scenario proceeds where ENCLAVE_A is being evicted from the enclave while ENCLAVE_B is faulted in. sgx_reclaim_pages() { ... /* * Reclaim ENCLAVE_A */ mutex_lock(&encl->lock); /* * Get a reference to ENCLAVE_A's * shmem page where enclave page * encrypted data will be stored * as well as a reference to the * enclave page's PCMD data page, * PCMD_AB. * Release mutex before writing * any data to the shmem pages. */ sgx_encl_get_backing(...); encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED; mutex_unlock(&encl->lock); /* * Fault ENCLAVE_B */ sgx_vma_fault() { mutex_lock(&encl->lock); /* * Get reference to * ENCLAVE_B's shmem page * as well as PCMD_AB. */ sgx_encl_get_backing(...) /* * Load page back into * enclave via ELDU. */ /* * Release reference to * ENCLAVE_B' shmem page and * PCMD_AB. */ sgx_encl_put_backing(...); /* * PCMD_AB is found empty so * it and ENCLAVE_B's shmem page * are truncated. */ /* Truncate ENCLAVE_B backing page */ sgx_encl_truncate_backing_page(); /* Truncate PCMD_AB */ sgx_encl_truncate_backing_page(); mutex_unlock(&encl->lock); ... } mutex_lock(&encl->lock); encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED; /* * Write encrypted contents of * ENCLAVE_A to ENCLAVE_A shmem * page and its PCMD data to * PCMD_AB. */ sgx_encl_put_backing(...) /* * Reference to PCMD_AB is * dropped and it is truncated. * ENCLAVE_A's PCMD data is lost. */ mutex_unlock(&encl->lock); } What happens next depends on whether it is ENCLAVE_A being faulted in or ENCLAVE_B being evicted - but both end up with ENCLS[ELDU] faulting with a #GP. If ENCLAVE_A is faulted then at the time sgx_encl_get_backing() is called a new PCMD page is allocated and providing the empty PCMD data for ENCLAVE_A would cause ENCLS[ELDU] to #GP If ENCLAVE_B is evicted first then a new PCMD_AB would be allocated by the reclaimer but later when ENCLAVE_A is faulted the ENCLS[ELDU] instruction would #GP during its checks of the PCMD value and the WARN would be encountered. Noting that the reclaimer sets SGX_ENCL_PAGE_BEING_RECLAIMED at the time it obtains a reference to the backing store pages of an enclave page it is in the process of reclaiming, fix the race by only truncating the PCMD page after ensuring that no page sharing the PCMD page is in the process of being reclaimed. Cc: stable@vger.kernel.org Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") Reported-by: Haitao Huang <haitao.huang@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Link: https://lkml.kernel.org/r/ed20a5db516aa813873268e125680041ae11dfcf.1652389823.git.reinette.chatre@intel.com
2022-05-16x86/sgx: Obtain backing storage page with enclave mutex heldReinette Chatre
Haitao reported encountering a WARN triggered by the ENCLS[ELDU] instruction faulting with a #GP. The WARN is encountered when the reclaimer evicts a range of pages from the enclave when the same pages are faulted back right away. The SGX backing storage is accessed on two paths: when there are insufficient free pages in the EPC the reclaimer works to move enclave pages to the backing storage and as enclaves access pages that have been moved to the backing storage they are retrieved from there as part of page fault handling. An oversubscribed SGX system will often run the reclaimer and page fault handler concurrently and needs to ensure that the backing store is accessed safely between the reclaimer and the page fault handler. This is not the case because the reclaimer accesses the backing store without the enclave mutex while the page fault handler accesses the backing store with the enclave mutex. Consider the scenario where a page is faulted while a page sharing a PCMD page with the faulted page is being reclaimed. The consequence is a race between the reclaimer and page fault handler, the reclaimer attempting to access a PCMD at the same time it is truncated by the page fault handler. This could result in lost PCMD data. Data may still be lost if the reclaimer wins the race, this is addressed in the following patch. The reclaimer accesses pages from the backing storage without holding the enclave mutex and runs the risk of concurrently accessing the backing storage with the page fault handler that does access the backing storage with the enclave mutex held. In the scenario below a PCMD page is truncated from the backing store after all its pages have been loaded in to the enclave at the same time the PCMD page is loaded from the backing store when one of its pages are reclaimed: sgx_reclaim_pages() { sgx_vma_fault() { ... mutex_lock(&encl->lock); ... __sgx_encl_eldu() { ... if (pcmd_page_empty) { /* * EPC page being reclaimed /* * shares a PCMD page with an * PCMD page truncated * enclave page that is being * while requested from * faulted in. * reclaimer. */ */ sgx_encl_get_backing() <----------> sgx_encl_truncate_backing_page() } mutex_unlock(&encl->lock); } } In this scenario there is a race between the reclaimer and the page fault handler when the reclaimer attempts to get access to the same PCMD page that is being truncated. This could result in the reclaimer writing to the PCMD page that is then truncated, causing the PCMD data to be lost, or in a new PCMD page being allocated. The lost PCMD data may still occur after protecting the backing store access with the mutex - this is fixed in the next patch. By ensuring the backing store is accessed with the mutex held the enclave page state can be made accurate with the SGX_ENCL_PAGE_BEING_RECLAIMED flag accurately reflecting that a page is in the process of being reclaimed. Consistently protect the reclaimer's backing store access with the enclave's mutex to ensure that it can safely run concurrently with the page fault handler. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Haitao Huang <haitao.huang@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Link: https://lkml.kernel.org/r/fa2e04c561a8555bfe1f4e7adc37d60efc77387b.1652389823.git.reinette.chatre@intel.com
2022-05-16x86/sgx: Mark PCMD page as dirty when modifying contentsReinette Chatre
Recent commit 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") expanded __sgx_encl_eldu() to clear an enclave page's PCMD (Paging Crypto MetaData) from the PCMD page in the backing store after the enclave page is restored to the enclave. Since the PCMD page in the backing store is modified the page should be marked as dirty to ensure the modified data is retained. Cc: stable@vger.kernel.org Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Link: https://lkml.kernel.org/r/00cd2ac480db01058d112e347b32599c1a806bc4.1652389823.git.reinette.chatre@intel.com
2022-05-16x86/sgx: Disconnect backing page references from dirty statusReinette Chatre
SGX uses shmem backing storage to store encrypted enclave pages and their crypto metadata when enclave pages are moved out of enclave memory. Two shmem backing storage pages are associated with each enclave page - one backing page to contain the encrypted enclave page data and one backing page (shared by a few enclave pages) to contain the crypto metadata used by the processor to verify the enclave page when it is loaded back into the enclave. sgx_encl_put_backing() is used to release references to the backing storage and, optionally, mark both backing store pages as dirty. Managing references and dirty status together in this way results in both backing store pages marked as dirty, even if only one of the backing store pages are changed. Additionally, waiting until the page reference is dropped to set the page dirty risks a race with the page fault handler that may load outdated data into the enclave when a page is faulted right after it is reclaimed. Consider what happens if the reclaimer writes a page to the backing store and the page is immediately faulted back, before the reclaimer is able to set the dirty bit of the page: sgx_reclaim_pages() { sgx_vma_fault() { ... sgx_encl_get_backing(); ... ... sgx_reclaimer_write() { mutex_lock(&encl->lock); /* Write data to backing store */ mutex_unlock(&encl->lock); } mutex_lock(&encl->lock); __sgx_encl_eldu() { ... /* * Enclave backing store * page not released * nor marked dirty - * contents may not be * up to date. */ sgx_encl_get_backing(); ... /* * Enclave data restored * from backing store * and PCMD pages that * are not up to date. * ENCLS[ELDU] faults * because of MAC or PCMD * checking failure. */ sgx_encl_put_backing(); } ... /* set page dirty */ sgx_encl_put_backing(); ... mutex_unlock(&encl->lock); } } Remove the option to sgx_encl_put_backing() to set the backing pages as dirty and set the needed pages as dirty right after receiving important data while enclave mutex is held. This ensures that the page fault handler can get up to date data from a page and prepares the code for a following change where only one of the backing pages need to be marked as dirty. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Haitao Huang <haitao.huang@intel.com> Link: https://lore.kernel.org/linux-sgx/8922e48f-6646-c7cc-6393-7c78dcf23d23@intel.com/ Link: https://lkml.kernel.org/r/fa9f98986923f43e72ef4c6702a50b2a0b3c42e3.1652389823.git.reinette.chatre@intel.com
2022-03-11x86/sgx: Free backing memory after faulting the enclave pageJarkko Sakkinen
There is a limited amount of SGX memory (EPC) on each system. When that memory is used up, SGX has its own swapping mechanism which is similar in concept but totally separate from the core mm/* code. Instead of swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM comes from a shared memory pseudo-file and can itself be swapped by the core mm code. There is a hierarchy like this: EPC <-> shmem <-> disk After data is swapped back in from shmem to EPC, the shmem backing storage needs to be freed. Currently, the backing shmem is not freed. This effectively wastes the shmem while the enclave is running. The memory is recovered when the enclave is destroyed and the backing storage freed. Sort this out by freeing memory with shmem_truncate_range(), as soon as a page is faulted back to the EPC. In addition, free the memory for PCMD pages as soon as all PCMD's in a page have been marked as unused by zeroing its contents. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220303223859.273187-1-jarkko@kernel.org
2022-02-17x86/sgx: Fix missing poison handling in reclaimerReinette Chatre
The SGX reclaimer code lacks page poison handling in its main free path. This can lead to avoidable machine checks if a poisoned page is freed and reallocated instead of being isolated. A troublesome scenario is: 1. Machine check (#MC) occurs (asynchronous, !MF_ACTION_REQUIRED) 2. arch_memory_failure() is eventually called 3. (SGX) page->poison set to 1 4. Page is reclaimed 5. Page added to normal free lists by sgx_reclaim_pages() ^ This is the bug (poison pages should be isolated on the sgx_poison_page_list instead) 6. Page is reallocated by some innocent enclave, a second (synchronous) in-kernel #MC is induced, probably during EADD instruction. ^ This is the fallout from the bug (6) is unfortunate and can be avoided by replacing the open coded enclave page freeing code in the reclaimer with sgx_free_epc_page() to obtain support for poison page handling that includes placing the poisoned page on the correct list. Fixes: d6d261bded8a ("x86/sgx: Add new sgx_epc_page flag bit to mark free pages") Fixes: 992801ae9243 ("x86/sgx: Initial poison handling for dirty and free pages") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com
2022-02-10x86/sgx: Silence softlockup detection when releasing large enclavesReinette Chatre
Vijay reported that the "unclobbered_vdso_oversubscribed" selftest triggers the softlockup detector. Actual SGX systems have 128GB of enclave memory or more. The "unclobbered_vdso_oversubscribed" selftest creates one enclave which consumes all of the enclave memory on the system. Tearing down such a large enclave takes around a minute, most of it in the loop where the EREMOVE instruction is applied to each individual 4k enclave page. Spending one minute in a loop triggers the softlockup detector. Add a cond_resched() to give other tasks a chance to run and placate the softlockup detector. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> (kselftest as sanity check) Link: https://lkml.kernel.org/r/ced01cac1e75f900251b0a4ae1150aa8ebd295ec.1644345232.git.reinette.chatre@intel.com
2022-01-12Merge tag 'x86_core_for_v5.17_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Borislav Petkov: - Get rid of all the .fixup sections because this generates misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and LIVEPATCH as the backtrace misses the function which is being fixed up. - Add Straight Line Speculation mitigation support which uses a new compiler switch -mharden-sls= which sticks an INT3 after a RET or an indirect branch in order to block speculation after them. Reportedly, CPUs do speculate behind such insns. - The usual set of cleanups and improvements * tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) x86/entry_32: Fix segment exceptions objtool: Remove .fixup handling x86: Remove .fixup section x86/word-at-a-time: Remove .fixup usage x86/usercopy: Remove .fixup usage x86/usercopy_32: Simplify __copy_user_intel_nocache() x86/sgx: Remove .fixup usage x86/checksum_32: Remove .fixup usage x86/vmx: Remove .fixup usage x86/kvm: Remove .fixup usage x86/segment: Remove .fixup usage x86/fpu: Remove .fixup usage x86/xen: Remove .fixup usage x86/uaccess: Remove .fixup usage x86/futex: Remove .fixup usage x86/msr: Remove .fixup usage x86/extable: Extend extable functionality x86/entry_32: Remove .fixup usage x86/entry_64: Remove .fixup usage x86/copy_mc_64: Remove .fixup usage ...
2022-01-07x86/sgx: Fix NULL pointer dereference on non-SGX systemsDave Hansen
== Problem == Nathan Chancellor reported an oops when aceessing the 'sgx_total_bytes' sysfs file: https://lore.kernel.org/all/YbzhBrimHGGpddDM@archlinux-ax161/ The sysfs output code accesses the sgx_numa_nodes[] array unconditionally. However, this array is allocated during SGX initialization, which only occurs on systems where SGX is supported. If the sysfs file is accessed on systems without SGX support, sgx_numa_nodes[] is NULL and an oops occurs. == Solution == To fix this, hide the entire nodeX/x86/ attribute group on systems without SGX support using the ->is_visible attribute group callback. Unfortunately, SGX is initialized via a device_initcall() which occurs _after_ the ->is_visible() callback. Instead of moving SGX initialization earlier, call sysfs_update_group() during SGX initialization to update the group visiblility. This update requires moving the SGX sysfs code earlier in sgx/main.c. There are no code changes other than the addition of arch_update_sysfs_visibility() and a minor whitespace fixup to arch_node_attr_is_visible() which checkpatch caught. CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-sgx@vger.kernel.org Cc: x86@kernel.org Fixes: 50468e431335 ("x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node") Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/20220104171527.5E8416A8@davehans-spike.ostc.intel.com
2021-12-11x86/sgx: Remove .fixup usagePeter Zijlstra
Create EX_TYPE_FAULT_SGX which does as EX_TYPE_FAULT does, except adds this extra bit that SGX really fancies having. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20211110101325.961246679@infradead.org
2021-12-09x86/sgx: Add an attribute for the amount of SGX memory in a NUMA nodeJarkko Sakkinen
== Problem == The amount of SGX memory on a system is determined by the BIOS and it varies wildly between systems. It can be as small as dozens of MB's and as large as many GB's on servers. Just like how applications need to know how much regular RAM is available, enclave builders need to know how much SGX memory an enclave can consume. == Solution == Introduce a new sysfs file: /sys/devices/system/node/nodeX/x86/sgx_total_bytes to enumerate the amount of SGX memory available in each NUMA node. This serves the same function for SGX as /proc/meminfo or /sys/devices/system/node/nodeX/meminfo does for normal RAM. 'sgx_total_bytes' is needed today to help drive the SGX selftests. SGX-specific swap code is exercised by creating overcommitted enclaves which are larger than the physical SGX memory on the system. They currently use a CPUID-based approach which can diverge from the actual amount of SGX memory available. 'sgx_total_bytes' ensures that the selftests can work efficiently and do not attempt stupid things like creating a 100,000 MB enclave on a system with 128 MB of SGX memory. == Implementation Details == Introduce CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an arch specific attribute group, and add an attribute for the amount of SGX memory in bytes to each NUMA node: == ABI Design Discussion == As opposed to the per-node ABI, a single, global ABI was considered. However, this would prevent enclaves from being able to size themselves so that they fit on a single NUMA node. Essentially, a single value would rule out NUMA optimizations for enclaves. Create a new "x86/" directory inside each "nodeX/" sysfs directory. 'sgx_total_bytes' is expected to be the first of at least a few sgx-specific files to be placed in the new directory. Just scanning /proc/meminfo, these are the no-brainers that we have for RAM, but we need for SGX: MemTotal: xxxx kB // sgx_total_bytes (implemented here) MemFree: yyyy kB // sgx_free_bytes SwapTotal: zzzz kB // sgx_swapped_bytes So, at *least* three. I think we will eventually end up needing something more along the lines of a dozen. A new directory (as opposed to being in the nodeX/ "root") directory avoids cluttering the root with several "sgx_*" files. Place the new file in a new "nodeX/x86/" directory because SGX is highly x86-specific. It is very unlikely that any other architecture (or even non-Intel x86 vendor) will ever implement SGX. Using "sgx/" as opposed to "x86/" was also considered. But, there is a real chance this can get used for other arch-specific purposes. [ dhansen: rewrite changelog ] Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org
2021-11-19Merge branch 'x86/urgent' into x86/sgx, to resolve conflictIngo Molnar
Conflicts: arch/x86/kernel/cpu/sgx/main.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-11-16x86/sgx: Fix free page accountingReinette Chatre
The SGX driver maintains a single global free page counter, sgx_nr_free_pages, that reflects the number of free pages available across all NUMA nodes. Correspondingly, a list of free pages is associated with each NUMA node and sgx_nr_free_pages is updated every time a page is added or removed from any of the free page lists. The main usage of sgx_nr_free_pages is by the reclaimer that runs when it (sgx_nr_free_pages) goes below a watermark to ensure that there are always some free pages available to, for example, support efficient page faults. With sgx_nr_free_pages accessed and modified from a few places it is essential to ensure that these accesses are done safely but this is not the case. sgx_nr_free_pages is read without any protection and updated with inconsistent protection by any one of the spin locks associated with the individual NUMA nodes. For example: CPU_A CPU_B ----- ----- spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); ... ... sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--; spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); Since sgx_nr_free_pages may be protected by different spin locks while being modified from different CPUs, the following scenario is possible: CPU_A CPU_B ----- ----- {sgx_nr_free_pages = 100} spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); sgx_nr_free_pages--; sgx_nr_free_pages--; /* LOAD sgx_nr_free_pages = 100 */ /* LOAD sgx_nr_free_pages = 100 */ /* sgx_nr_free_pages-- */ /* sgx_nr_free_pages-- */ /* STORE sgx_nr_free_pages = 99 */ /* STORE sgx_nr_free_pages = 99 */ spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); In the above scenario, sgx_nr_free_pages is decremented from two CPUs but instead of sgx_nr_free_pages ending with a value that is two less than it started with, it was only decremented by one while the number of free pages were actually reduced by two. The consequence of sgx_nr_free_pages not being protected is that its value may not accurately reflect the actual number of free pages on the system, impacting the availability of free pages in support of many flows. The problematic scenario is when the reclaimer does not run because it believes there to be sufficient free pages while any attempt to allocate a page fails because there are no free pages available. In the SGX driver the reclaimer's watermark is only 32 pages so after encountering the above example scenario 32 times a user space hang is possible when there are no more free pages because of repeated page faults caused by no free pages made available. The following flow was encountered: asm_exc_page_fault ... sgx_vma_fault() sgx_encl_load_page() sgx_encl_eldu() // Encrypted page needs to be loaded from backing // storage into newly allocated SGX memory page sgx_alloc_epc_page() // Allocate a page of SGX memory __sgx_alloc_epc_page() // Fails, no free SGX memory ... if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer wake_up(&ksgxd_waitq); return -EBUSY; // Return -EBUSY giving reclaimer time to run return -EBUSY; return -EBUSY; return VM_FAULT_NOPAGE; The reclaimer is triggered in above flow with the following code: static bool sgx_should_reclaim(unsigned long watermark) { return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list); } In the problematic scenario there were no free pages available yet the value of sgx_nr_free_pages was above the watermark. The allocation of SGX memory thus always failed because of a lack of free pages while no free pages were made available because the reclaimer is never started because of sgx_nr_free_pages' incorrect value. The consequence was that user space kept encountering VM_FAULT_NOPAGE that caused the same address to be accessed repeatedly with the same result. Change the global free page counter to an atomic type that ensures simultaneous updates are done safely. While doing so, move the updating of the variable outside of the spin lock critical section to which it does not belong. Cc: stable@vger.kernel.org Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()") Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com
2021-11-15x86/sgx: Add SGX infrastructure to recover from poisonTony Luck
Provide a recovery function sgx_memory_failure(). If the poison was consumed synchronously then send a SIGBUS. Note that the virtual address of the access is not included with the SIGBUS as is the case for poison outside of SGX enclaves. This doesn't matter as addresses of code/data inside an enclave is of little to no use to code executing outside the (now dead) enclave. Poison found in a free page results in the page being moved from the free list to the per-node poison page list. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/20211026220050.697075-5-tony.luck@intel.com
2021-11-15x86/sgx: Initial poison handling for dirty and free pagesTony Luck
A memory controller patrol scrubber can report poison in a page that isn't currently being used. Add "poison" field in the sgx_epc_page that can be set for an sgx_epc_page. Check for it: 1) When sanitizing dirty pages 2) When freeing epc pages Poison is a new field separated from flags to avoid having to make all updates to flags atomic, or integrate poison state changes into some other locking scheme to protect flags (Currently just sgx_reclaimer_lock which protects the SGX_EPC_PAGE_RECLAIMER_TRACKED bit in page->flags). In both cases place the poisoned page on a per-node list of poisoned epc pages to make sure it will not be reallocated. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/20211026220050.697075-4-tony.luck@intel.com
2021-11-15x86/sgx: Add infrastructure to identify SGX EPC pagesTony Luck
X86 machine check architecture reports a physical address when there is a memory error. Handling that error requires a method to determine whether the physical address reported is in any of the areas reserved for EPC pages by BIOS. SGX EPC pages do not have Linux "struct page" associated with them. Keep track of the mapping from ranges of EPC pages to the sections that contain them using an xarray. N.B. adds CONFIG_XARRAY_MULTI to the SGX dependecies. So "select" that in arch/x86/Kconfig for X86/SGX. Create a function arch_is_platform_page() that simply reports whether an address is an EPC page for use elsewhere in the kernel. The ACPI error injection code needs this function and is typically built as a module, so export it. Note that arch_is_platform_page() will be slower than other similar "what type is this page" functions that can simply check bits in the "struct page". If there is some future performance critical user of this function it may need to be implemented in a more efficient way. Note also that the current implementation of xarray allocates a few hundred kilobytes for this usage on a system with 4GB of SGX EPC memory configured. This isn't ideal, but worth it for the code simplicity. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/20211026220050.697075-3-tony.luck@intel.com
2021-11-15x86/sgx: Add new sgx_epc_page flag bit to mark free pagesTony Luck
SGX EPC pages go through the following life cycle: DIRTY ---> FREE ---> IN-USE --\ ^ | \-----------------/ Recovery action for poison for a DIRTY or FREE page is simple. Just make sure never to allocate the page. IN-USE pages need some extra handling. Add a new flag bit SGX_EPC_PAGE_IS_FREE that is set when a page is added to a free list and cleared when the page is allocated. Notes: 1) These transitions are made while holding the node->lock so that future code that checks the flags while holding the node->lock can be sure that if the SGX_EPC_PAGE_IS_FREE bit is set, then the page is on the free list. 2) Initially while the pages are on the dirty list the SGX_EPC_PAGE_IS_FREE bit is cleared. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/20211026220050.697075-2-tony.luck@intel.com
2021-10-22x86/sgx/virt: implement SGX_IOC_VEPC_REMOVE ioctlPaolo Bonzini
For bare-metal SGX on real hardware, the hardware provides guarantees SGX state at reboot. For instance, all pages start out uninitialized. The vepc driver provides a similar guarantee today for freshly-opened vepc instances, but guests such as Windows expect all pages to be in uninitialized state on startup, including after every guest reboot. Some userspace implementations of virtual SGX would rather avoid having to close and reopen the /dev/sgx_vepc file descriptor and re-mmap the virtual EPC. For example, they could sandbox themselves after the guest starts and forbid further calls to open(), in order to mitigate exploits from untrusted guests. Therefore, add a ioctl that does this with EREMOVE. Userspace can invoke the ioctl to bring its vEPC pages back to uninitialized state. There is a possibility that some pages fail to be removed if they are SECS pages, and the child and SECS pages could be in separate vEPC regions. Therefore, the ioctl returns the number of EREMOVE failures, telling userspace to try the ioctl again after it's done with all vEPC regions. A more verbose description of the correct usage and the possible error conditions is documented in sgx.rst. Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20211021201155.1523989-3-pbonzini@redhat.com
2021-10-22x86/sgx/virt: extract sgx_vepc_remove_pagePaolo Bonzini
For bare-metal SGX on real hardware, the hardware provides guarantees SGX state at reboot. For instance, all pages start out uninitialized. The vepc driver provides a similar guarantee today for freshly-opened vepc instances, but guests such as Windows expect all pages to be in uninitialized state on startup, including after every guest reboot. One way to do this is to simply close and reopen the /dev/sgx_vepc file descriptor and re-mmap the virtual EPC. However, this is problematic because it prevents sandboxing the userspace (for example forbidding open() after the guest starts; this is doable with heavy use of SCM_RIGHTS file descriptor passing). In order to implement this, we will need a ioctl that performs EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor: other possibilities, such as closing and reopening the device, are racy. Start the implementation by creating a separate function with just the __eremove wrapper. Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20211021201155.1523989-2-pbonzini@redhat.com
2021-06-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "191 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization, pagealloc, and memory-failure)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits) mm,hwpoison: make get_hwpoison_page() call get_any_page() mm,hwpoison: send SIGBUS with error virutal address mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes mm/page_alloc: allow high-order pages to be stored on the per-cpu lists mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA docs: remove description of DISCONTIGMEM arch, mm: remove stale mentions of DISCONIGMEM mm: remove CONFIG_DISCONTIGMEM m68k: remove support for DISCONTIGMEM arc: remove support for DISCONTIGMEM arc: update comment about HIGHMEM implementation alpha: remove DISCONTIGMEM and NUMA mm/page_alloc: move free_the_page mm/page_alloc: fix counting of managed_pages mm/page_alloc: improve memmap_pages dbg msg mm: drop SECTION_SHIFT in code comments mm/page_alloc: introduce vm.percpu_pagelist_high_fraction mm/page_alloc: limit the number of pages on PCP lists when reclaim is active mm/page_alloc: scale the number of pages that are batch freed ...
2021-06-29x86/sgx: use vma_lookup() in sgx_encl_find()Liam Howlett
Use vma_lookup() to find the VMA at a specific address. As vma_lookup() will return NULL if the address is not within any VMA, the start address no longer needs to be validated. Link: https://lkml.kernel.org/r/20210521174745.2219620-10-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-28Merge tag 'x86-cleanups-2021-06-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups & removal of obsolete code" * tag 'x86-cleanups-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Correct kernel-doc's arg name in sgx_encl_release() doc: Remove references to IBM Calgary x86/setup: Document that Windows reserves the first MiB x86/crash: Remove crash_reserve_low_1M() x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options x86/alternative: Align insn bytes vertically x86: Fix leftover comment typos x86/asm: Simplify __smp_mb() definition x86/alternatives: Make the x86nops[] symbol static
2021-06-15x86/sgx: Add missing xa_destroy() when virtual EPC is destroyedKai Huang
xa_destroy() needs to be called to destroy a virtual EPC's page array before calling kfree() to free the virtual EPC. Currently it is not called so add the missing xa_destroy(). Fixes: 540745ddbc70 ("x86/sgx: Introduce virtual EPC for use by KVM guests") Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Yang Zhong <yang.zhong@intel.com> Link: https://lkml.kernel.org/r/20210615101639.291929-1-kai.huang@intel.com
2021-06-11x86/sgx: Correct kernel-doc's arg name in sgx_encl_release()ChenXiaoSong
Fix the following kernel-doc warning: arch/x86/kernel/cpu/sgx/encl.c:392: warning: Function parameter \ or member 'ref' not described in 'sgx_encl_release' [ bp: Massage commit message. ] Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210609035510.2083694-1-chenxiaosong2@huawei.com
2021-04-26Merge tag 'x86_cleanups_for_v5.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 cleanups from Borislav Petkov: "Trivial cleanups and fixes all over the place" * tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Remove me from IDE/ATAPI section x86/pat: Do not compile stubbed functions when X86_PAT is off x86/asm: Ensure asm/proto.h can be included stand-alone x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files x86/msr: Make locally used functions static x86/cacheinfo: Remove unneeded dead-store initialization x86/process/64: Move cpu_current_top_of_stack out of TSS tools/turbostat: Unmark non-kernel-doc comment x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL() x86/fpu/math-emu: Fix function cast warning x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes x86: Fix various typos in comments, take #2 x86: Remove unusual Unicode characters from comments x86/kaslr: Return boolean values from a function returning bool x86: Fix various typos in comments x86/setup: Remove unused RESERVE_BRK_ARRAY() stacktrace: Move documentation for arch_stack_walk_reliable() to header x86: Remove duplicate TSC DEADLINE MSR definitions
2021-04-12x86/sgx: Mark sgx_vepc_vm_ops staticWei Yongjun
Fix the following sparse warning: arch/x86/kernel/cpu/sgx/virt.c:95:35: warning: symbol 'sgx_vepc_vm_ops' was not declared. Should it be static? This symbol is not used outside of virt.c so mark it static. [ bp: Massage commit message. ] Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210412160023.193850-1-weiyongjun1@huawei.com
2021-04-08x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section()Jarkko Sakkinen
The commit in Fixes: changed the SGX EPC page sanitization to end up in sgx_free_epc_page() which puts clean and sanitized pages on the free list. This was done for the reason that it is best to keep the logic to assign available-for-use EPC pages to the correct NUMA lists in a single location. sgx_nr_free_pages is also incremented by sgx_free_epc_pages() but those pages which are being added there per EPC section do not belong to the free list yet because they haven't been sanitized yet - they land on the dirty list first and the sanitization happens later when ksgxd starts massaging them. So remove that addition there and have sgx_free_epc_page() do that solely. [ bp: Sanitize commit message too. ] Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210408092924.7032-1-jarkko@kernel.org
2021-04-06x86/sgx: Move provisioning device creation out of SGX driverSean Christopherson
And extract sgx_set_attribute() out of sgx_ioc_enclave_provision() and export it as symbol for KVM to use. The provisioning key is sensitive. The SGX driver only allows to create an enclave which can access the provisioning key when the enclave creator has permission to open /dev/sgx_provision. It should apply to a VM as well, as the provisioning key is platform-specific, thus an unrestricted VM can also potentially compromise the provisioning key. Move the provisioning device creation out of sgx_drv_init() to sgx_init() as a preparation for adding SGX virtualization support, so that even if the SGX driver is not enabled due to flexible launch control not being available, SGX virtualization can still be enabled, and use it to restrict a VM's capability of being able to access the provisioning key. [ bp: Massage commit message. ] Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lkml.kernel.org/r/0f4d044d621561f26d5f4ef73e8dc6cd18cc7e79.1616136308.git.kai.huang@intel.com
2021-04-06x86/sgx: Add helpers to expose ECREATE and EINIT to KVMSean Christopherson
The host kernel must intercept ECREATE to impose policies on guests, and intercept EINIT to be able to write guest's virtual SGX_LEPUBKEYHASH MSR values to hardware before running guest's EINIT so it can run correctly according to hardware behavior. Provide wrappers around __ecreate() and __einit() to hide the ugliness of overloading the ENCLS return value to encode multiple error formats in a single int. KVM will trap-and-execute ECREATE and EINIT as part of SGX virtualization, and reflect ENCLS execution result to guest by setting up guest's GPRs, or on an exception, injecting the correct fault based on return value of __ecreate() and __einit(). Use host userspace addresses (provided by KVM based on guest physical address of ENCLS parameters) to execute ENCLS/EINIT when possible. Accesses to both EPC and memory originating from ENCLS are subject to segmentation and paging mechanisms. It's also possible to generate kernel mappings for ENCLS parameters by resolving PFN but using __uaccess_xx() is simpler. [ bp: Return early if the __user memory accesses fail, use cpu_feature_enabled(). ] Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/20e09daf559aa5e9e680a0b4b5fba940f1bad86e.1616136308.git.kai.huang@intel.com
2021-04-06x86/sgx: Add helper to update SGX_LEPUBKEYHASHn MSRsKai Huang
Add a helper to update SGX_LEPUBKEYHASHn MSRs. SGX virtualization also needs to update those MSRs based on guest's "virtual" SGX_LEPUBKEYHASHn before EINIT from guest. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/dfb7cd39d4dd62ea27703b64afdd8bccb579f623.1616136308.git.kai.huang@intel.com
2021-04-06x86/sgx: Add encls_faulted() helperSean Christopherson
Add a helper to extract the fault indicator from an encoded ENCLS return value. SGX virtualization will also need to detect ENCLS faults. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lkml.kernel.org/r/c1f955898110de2f669da536fc6cf62e003dff88.1616136308.git.kai.huang@intel.com
2021-04-06x86/sgx: Move ENCLS leaf definitions to sgx.hSean Christopherson
Move the ENCLS leaf definitions to sgx.h so that they can be used by KVM. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lkml.kernel.org/r/2e6cd7c5c1ced620cfcd292c3c6c382827fde6b2.1616136308.git.kai.huang@intel.com
2021-04-06x86/sgx: Expose SGX architectural definitions to the kernelSean Christopherson
Expose SGX architectural structures, as KVM will use many of the architectural constants and structs to virtualize SGX. Name the new header file as asm/sgx.h, rather than asm/sgx_arch.h, to have single header to provide SGX facilities to share with other kernel componments. Also update MAINTAINERS to include asm/sgx.h. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lkml.kernel.org/r/6bf47acd91ab4d709e66ad1692c7803e4c9063a0.1616136308.git.kai.huang@intel.com
2021-04-06x86/sgx: Initialize virtual EPC driver even when SGX driver is disabledKai Huang
Modify sgx_init() to always try to initialize the virtual EPC driver, even if the SGX driver is disabled. The SGX driver might be disabled if SGX Launch Control is in locked mode, or not supported in the hardware at all. This allows (non-Linux) guests that support non-LC configurations to use SGX. [ bp: De-silli-fy the test. ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lkml.kernel.org/r/d35d17a02bbf8feef83a536cec8b43746d4ea557.1616136308.git.kai.huang@intel.com
2021-04-06x86/sgx: Introduce virtual EPC for use by KVM guestsSean Christopherson
Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw" Enclave Page Cache (EPC) without an associated enclave. The intended and only known use case for raw EPC allocation is to expose EPC to a KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM Kconfig. The SGX driver uses the misc device /dev/sgx_enclave to support userspace in creating an enclave. Each file descriptor returned from opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver, KVM doesn't control how the guest uses the EPC, therefore EPC allocated to a KVM guest is not associated with an enclave, and /dev/sgx_enclave is not suitable for allocating EPC for a KVM guest. Having separate device nodes for the SGX driver and KVM virtual EPC also allows separate permission control for running host SGX enclaves and KVM SGX guests. To use /dev/sgx_vepc to allocate a virtual EPC instance with particular size, the hypervisor opens /dev/sgx_vepc, and uses mmap() with the intended size to get an address range of virtual EPC. Then it may use the address range to create one KVM memory slot as virtual EPC for a guest. Implement the "raw" EPC allocation in the x86 core-SGX subsystem via /dev/sgx_vepc rather than in KVM. Doing so has two major advantages: - Does not require changes to KVM's uAPI, e.g. EPC gets handled as just another memory backend for guests. - EPC management is wholly contained in the SGX subsystem, e.g. SGX does not have to export any symbols, changes to reclaim flows don't need to be routed through KVM, SGX's dirty laundry doesn't have to get aired out for the world to see, and so on and so forth. The virtual EPC pages allocated to guests are currently not reclaimable. Reclaiming an EPC page used by enclave requires a special reclaim mechanism separate from normal page reclaim, and that mechanism is not supported for virutal EPC pages. Due to the complications of handling reclaim conflicts between guest and host, reclaiming virtual EPC pages is significantly more complex than basic support for SGX virtualization. [ bp: - Massage commit message and comments - use cpu_feature_enabled() - vertically align struct members init - massage Virtual EPC clarification text - move Kconfig prompt to Virtualization ] Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com