Age | Commit message (Collapse) | Author |
|
When Sub-NUMA Cluster (SNC) mode is enabled, the legacy monitor reporting files
must report the sum of the data from all of the SNC nodes that share the L3
cache that is referenced by the monitor file.
Resctrl squeezes all the attributes of these files into 32 bits so they can be
stored in the "priv" field of struct kernfs_node.
Currently, only three monitor events are defined by enum resctrl_event_id so
reducing it from 8 bits to 7 bits still provides more than enough space to
represent all the known event types.
But note that this choice was arbitrary. The "rid" field is also far wider
than needed for the current number of resource id types. This structure is
purely internal to resctrl, no ABI issues with modifying it. Subsequent changes
may rearrange the allocation of bits between each of the fields as needed.
Give the bit to a new "sum" field that indicates that reading this file must
sum across SNC nodes. This bit also indicates that the domid field is the id of
an L3 cache (instead of a domain id) to find which domains must be summed.
Fix up other issues in the kerneldoc description for mon_data_bits.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-13-tony.luck@intel.com
|
|
In Sub-NUMA Cluster (SNC) mode Linux must create the monitor
files in the original "mon_L3_XX" directories and also in each
of the "mon_sub_L3_YY" directories.
Refactor mkdir_mondata_subdir() to move the creation of monitoring files
into a helper function to avoid the need to duplicate code later.
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-12-tony.luck@intel.com
|
|
New semantics rely on some struct rmid_read members having NULL values to
distinguish between the SNC and non-SNC scenarios. resctrl can thus no longer
rely on this struct not being initialized properly.
Initialize all on-stack declarations of struct rmid_read:
rdtgroup_mondata_show()
mbm_update()
mkdir_mondata_subdir()
to ensure that garbage values from the stack are not passed down to other
functions.
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-11-tony.luck@intel.com
|
|
When a user reads a monitor file rdtgroup_mondata_show() calls mon_event_read()
to package up all the required details into an rmid_read structure which is
passed across the smp_call*() infrastructure to code that will read data from
hardware and return the value (or error status) in the rmid_read structure.
Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require the
smp_call-ed code to sum event data from all domains that share an L3 cache.
Add a pointer to the L3 "cacheinfo" structure to struct rmid_read for the data
collection routines to use to pick the domains to be summed.
[ Reinette: the rmid_read structure has become complex enough so document each
of its fields and provide the kerneldoc documentation for struct rmid_read. ]
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-10-tony.luck@intel.com
|
|
When SNC is enabled, monitoring data is collected at the SNC node granularity,
but must be reported at L3-cache granularity for backwards compatibility in
addition to reporting at the node level.
Add a "ci" field to the rdt_mon_domain structure to save the cache information
about the enclosing L3 cache for the domain. This provides:
1) The cache id which is needed to compose the name of the legacy monitoring
directory, and to determine which domains should be summed to provide
L3-scoped data.
2) The shared_cpu_map which is needed to determine which CPUs can be used to
read the RMID counters with the MSR interface.
This is the first step to an eventual goal of monitor reporting files like this
(for a system with two SNC nodes per L3):
$ cd /sys/fs/resctrl/mon_data
$ tree mon_L3_00
mon_L3_00 <- 00 here is L3 cache id
├── llc_occupancy \ These files provide legacy support
├── mbm_local_bytes > for non-SNC aware monitor apps
├── mbm_total_bytes / that expect data at L3 cache level
├── mon_sub_L3_00 <- 00 here is SNC node id
│ ├── llc_occupancy \ These files are finer grained
│ ├── mbm_local_bytes > data from each SNC node
│ └── mbm_total_bytes /
└── mon_sub_L3_01
├── llc_occupancy \
├── mbm_local_bytes > As above, but for node 1.
└── mbm_total_bytes /
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-9-tony.luck@intel.com
|
|
systems
When SNC is enabled there is a mismatch between the MBA control function
which operates at L3 cache scope and the MBM monitor functions which
measure memory bandwidth on each SNC node.
Block use of the mba_MBps when scopes for MBA/MBM do not match.
Improve user diagnostics by adding invalfc() message when mba_MBps
is not supported.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-8-tony.luck@intel.com
|
|
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.
This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This cost may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.
Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.
The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.
RMID sharing mode divides the physical RMIDs evenly between SNC nodes
but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system
with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two
SNC nodes per L3 cache instance would have 100 logical RMIDs available
for Linux to use. A task running on SNC node 0 with RMID 5 would
accumulate LLC occupancy and MBM bandwidth data in physical RMID 5.
Another task using RMID 5, but running on SNC node 1 would accumulate
data in physical RMID 105.
Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.
Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
how many SNC domains share an L3 cache instance. Initialize this to
"1". Runtime detection of SNC mode will adjust this value.
Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) Add a function to convert from logical RMID values (assigned to
tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
to physical RMID values to load into IA32_QM_EVTSEL MSR when
reading counters on each SNC node.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-7-tony.luck@intel.com
|
|
Currently supported resctrl features are all domain scoped the same as the
scope of the L2 or L3 caches.
Add RESCTRL_L3_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are divided between
nodes that share an L3 cache.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-6-tony.luck@intel.com
|
|
The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.
Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.
Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-5-tony.luck@intel.com
|
|
Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.
Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).
Create separate domain lists for control and monitor operations.
Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now the domains are
allocated independently it is no longer required to disable both control
and monitor operations if either fail.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-4-tony.luck@intel.com
|
|
The rdt_domain structure is used for both control and monitor features.
It is about to be split into separate structures for these two usages
because the scope for control and monitoring features for a resource
will be different for future resources.
To allow for common code that scans a list of domains looking for a
specific domain id, move all the common fields ("list", "id", "cpu_mask")
into their own structure within the rdt_domain structure.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-3-tony.luck@intel.com
|
|
Resctrl resources operate on subsets of CPUs in the system with the
defining attribute of each subset being an instance of a particular
level of cache. E.g. all CPUs sharing an L3 cache would be part of the
same domain.
In preparation for features that are scoped at the NUMA node level,
change the code from explicit references to "cache_level" to a more
generic scope. At this point the only options for this scope are groups
of CPUs that share an L2 cache or L3 cache.
Clean up the error handling when looking up domains. Report invalid ids
before calling rdt_find_domain() in preparation for better messages when
scope can be other than cache scope. This means that rdt_find_domain()
will never return an error. So remove checks for error from the call sites.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20240628215619.76401-2-tony.luck@intel.com
|
|
Between kexec and confidential VM support, handling the EFI memory maps
correctly on x86 is already proving to be rather difficult (as opposed
to other EFI architectures which manage to never modify the EFI memory
map to begin with)
EFI fake memory map support is essentially a development hack (for
testing new support for the 'special purpose' and 'more reliable' EFI
memory attributes) that leaked into production code. The regions marked
in this manner are not actually recognized as such by the firmware
itself or the EFI stub (and never have), and marking memory as 'more
reliable' seems rather futile if the underlying memory is just ordinary
RAM.
Marking memory as 'special purpose' in this way is also dubious, but may
be in use in production code nonetheless. However, the same should be
achievable by using the memmap= command line option with the ! operator.
EFI fake memmap support is not enabled by any of the major distros
(Debian, Fedora, SUSE, Ubuntu) and does not exist on other
architectures, so let's drop support for it.
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
objtool complains:
arch/x86/kvm/kvm.o: warning: objtool: .altinstr_replacement+0xc5: call without frame pointer save/setup
vmlinux.o: warning: objtool: .altinstr_replacement+0x2eb: call without frame pointer save/setup
Make sure %rSP is an output operand to the respective asm() statements.
The test_cc() hunk and ALT_OUTPUT_SP() courtesy of peterz. Also from him
add some helpful debugging info to the documentation.
Now on to the explanations:
tl;dr: The alternatives macros are pretty fragile.
If I do ALT_OUTPUT_SP(output) in order to be able to package in a %rsp
reference for objtool so that a stack frame gets properly generated, the
inline asm input operand with positional argument 0 in clear_page():
"0" (page)
gets "renumbered" due to the added
: "+r" (current_stack_pointer), "=D" (page)
and then gcc says:
./arch/x86/include/asm/page_64.h:53:9: error: inconsistent operand constraints in an ‘asm’
The fix is to use an explicit "D" constraint which points to a singleton
register class (gcc terminology) which ends up doing what is expected
here: the page pointer - input and output - should be in the same %rdi
register.
Other register classes have more than one register in them - example:
"r" and "=r" or "A":
‘A’
The ‘a’ and ‘d’ registers. This class is used for
instructions that return double word results in the ‘ax:dx’
register pair. Single word values will be allocated either in
‘ax’ or ‘dx’.
so using "D" and "=D" just works in this particular case.
And yes, one would say, sure, why don't you do "+D" but then:
: "+r" (current_stack_pointer), "+D" (page)
: [old] "i" (clear_page_orig), [new1] "i" (clear_page_rep), [new2] "i" (clear_page_erms),
: "cc", "memory", "rax", "rcx")
now find the Waldo^Wcomma which throws a wrench into all this.
Because that silly macro has an "input..." consume-all last macro arg
and in it, one is supposed to supply input *and* clobbers, leading to
silly syntax snafus.
Yap, they need to be cleaned up, one fine day...
Closes: https://lore.kernel.org/oe-kbuild-all/202406141648.jO9qNGLa-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240625112056.GDZnqoGDXgYuWBDUwu@fat_crate.local
|
|
The outer if () should have been dropped when switching to c->x86_vfm.
Fixes: 6568fc18c2f6 ("x86/cpu/intel: Switch to new Intel CPU model defines")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240529183605.17520-1-andrew.cooper3@citrix.com
|
|
The 'profile_pc()' function is used for timer-based profiling, which
isn't really all that relevant any more to begin with, but it also ends
up making assumptions based on the stack layout that aren't necessarily
valid.
Basically, the code tries to account the time spent in spinlocks to the
caller rather than the spinlock, and while I support that as a concept,
it's not worth the code complexity or the KASAN warnings when no serious
profiling is done using timers anyway these days.
And the code really does depend on stack layout that is only true in the
simplest of cases. We've lost the comment at some point (I think when
the 32-bit and 64-bit code was unified), but it used to say:
Assume the lock function has either no stack frame or a copy
of eflags from PUSHF.
which explains why it just blindly loads a word or two straight off the
stack pointer and then takes a minimal look at the values to just check
if they might be eflags or the return pc:
Eflags always has bits 22 and up cleared unlike kernel addresses
but that basic stack layout assumption assumes that there isn't any lock
debugging etc going on that would complicate the code and cause a stack
frame.
It causes KASAN unhappiness reported for years by syzkaller [1] and
others [2].
With no real practical reason for this any more, just remove the code.
Just for historical interest, here's some background commits relating to
this code from 2006:
0cb91a229364 ("i386: Account spinlocks to the caller during profiling for !FP kernels")
31679f38d886 ("Simplify profile_pc on x86-64")
and a code unification from 2009:
ef4512882dbe ("x86: time_32/64.c unify profile_pc")
but the basics of this thing actually goes back to before the git tree.
Link: https://syzkaller.appspot.com/bug?extid=84fe685c02cd112a2ac3 [1]
Link: https://lore.kernel.org/all/CAK55_s7Xyq=nh97=K=G1sxueOFrJDAvPOJAL4TPTCAYvmxO9_A@mail.gmail.com/ [2]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In cloud environments it can be useful to *only* enable the vmexit
mitigation and leave syscalls vulnerable. Add that as an option.
This is similar to the old spectre_bhi=auto option which was removed
with the following commit:
36d4fe147c87 ("x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto")
with the main difference being that this has a more descriptive name and
is disabled by default.
Mitigation switch requested by Maksim Davydov <davydov-max@yandex-team.ru>.
[ bp: Massage. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/2cbad706a6d5e1da2829e5e123d8d5c80330148c.1719381528.git.jpoimboe@kernel.org
|
|
|
|
VMware hypercalls use I/O port, VMCALL or VMMCALL instructions. Add a call to
__tdx_hypercall() in order to support TDX guests.
No change in high bandwidth hypercalls, as only low bandwidth ones are supported
for TDX guests.
[ bp: Massage, clear on-stack struct tdx_module_args variable. ]
Co-developed-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Tim Merrifield <tim.merrifield@broadcom.com>
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-9-alexey.makhalov@broadcom.com
|
|
VCPU_RESERVED and LEGACY_X2APIC are not VMware hypercall commands. These are
bits in the return value of the VMWARE_CMD_GETVCPU_INFO command. Change
VMWARE_CMD_ prefix to GETVCPU_INFO_ one. And move the bit-shift
operation into the macro body.
Fixes: 4cca6ea04d31c ("x86/apic: Allow x2apic without IR on VMware platform")
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-7-alexey.makhalov@broadcom.com
|
|
Remove VMWARE_CMD macro and move to vmware_hypercall API.
No functional changes intended.
Use u32/u64 instead of uint32_t/uint64_t across the file.
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-6-alexey.makhalov@broadcom.com
|
|
Introduce a vmware_hypercall family of functions. It is a common implementation
to be used by the VMware guest code and virtual device drivers in architecture
independent manner.
The API consists of vmware_hypercallX and vmware_hypercall_hb_{out,in}
set of functions analogous to KVM's hypercall API. Architecture-specific
implementation is hidden inside.
It will simplify future enhancements in VMware hypercalls such as SEV-ES and
TDX related changes without needs to modify a caller in device drivers code.
Current implementation extends an idea from
bac7b4e84323 ("x86/vmware: Update platform detection code for VMCALL/VMMCALL hypercalls")
to have a slow, but safe path vmware_hypercall_slow() earlier during the boot
when alternatives are not yet applied. The code inherits VMWARE_CMD logic from
the commit mentioned above.
Move common macros from vmware.c to vmware.h.
[ bp: Fold in a fix:
https://lore.kernel.org/r/20240625083348.2299-1-alexey.makhalov@broadcom.com ]
Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613191650.9913-2-alexey.makhalov@broadcom.com
|
|
x86_of_pci_irq_enable() returns PCIBIOS_* code received from
pci_read_config_byte() directly and also -EINVAL which are not
compatible error types. x86_of_pci_irq_enable() is used as
(*pcibios_enable_irq) function which should not return PCIBIOS_* codes.
Convert the PCIBIOS_* return code from pci_read_config_byte() into
normal errno using pcibios_err_to_errno().
Fixes: 96e0a0797eba ("x86: dtb: Add support for PCI devices backed by dtb nodes")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240527125538.13620-1-ilpo.jarvinen@linux.intel.com
|
|
In a TDX VM without paravisor, currently the default timer is the Hyper-V
timer, which depends on the slow VM Reference Counter MSR: the Hyper-V TSC
page is not enabled in such a VM because the VM uses Invariant TSC as a
better clocksource and it's challenging to mark the Hyper-V TSC page shared
in very early boot.
Lower the rating of the Hyper-V timer so the local APIC timer becomes the
the default timer in such a VM, and print a warning in case Invariant TSC
is unavailable in such a VM. This change should cause no perceivable
performance difference.
Cc: stable@vger.kernel.org # 6.6+
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240621061614.8339-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240621061614.8339-1-decui@microsoft.com>
|
|
I'm getting tired of telling people to put a magic "" in the
#define X86_FEATURE /* "" ... */
comment to hide the new feature flag from the user-visible
/proc/cpuinfo.
Flip the logic to make it explicit: an explicit "<name>" in the comment
adds the flag to /proc/cpuinfo and otherwise not, by default.
Add the "<name>" of all the existing flags to keep backwards
compatibility with userspace.
There should be no functional changes resulting from this.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240618113840.24163-1-bp@kernel.org
|
|
Since FineIBT performs checking at the destination, it is weaker against
attacks that can construct arbitrary executable memory contents. As such,
some system builders want to run with FineIBT disabled by default. Allow
the "cfi=kcfi" boot param mode to be selectable through Kconfig via the
newly introduced CONFIG_CFI_AUTO_DEFAULT.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240501000218.work.998-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
This implements the runtime constant infrastructure for x86, allowing
the dcache d_hash() function to be generated using as a constant for
hash table address followed by shift by a constant of the hash index.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit
6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")
adds logic to map individual monitoring groups into a global index space used
for tracking allocated RMIDs.
Attempts to free the default RMID are ignored in free_rmid(), and this works
fine on x86.
With arm64 MPAM, there is a latent bug here however: on platforms with no
monitors exposed through resctrl, each control group still gets a different
monitoring group ID as seen by the hardware, since the CLOSID always forms part
of the monitoring group ID.
This means that when removing a control group, the code may try to free this
group's default monitoring group RMID for real. If there are no monitors
however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is
never allocated, leading to a splat when free_rmid() tries to dereference the
table.
One option would be to treat RMID 0 as special for every CLOSID, but this would
be ugly since bookkeeping still needs to be done for these monitoring group IDs
when there are monitors present in the hardware.
Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and
just do nothing if the hardware doesn't have monitors.
This fix mirrors the gating checks already present in
mkdir_rdt_prepare_rmid_alloc() and elsewhere.
No functional change on x86.
[ bp: Massage commit message. ]
Fixes: 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240618140152.83154-1-Dave.Martin@arm.com
|
|
Sync to v6.10-rc3.
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
|
|
To allow execution at a level other than VMPL0, an SVSM must be present.
Allow the SEV-SNP guest to continue booting if an SVSM is detected and
the hypervisor supports the SVSM feature as indicated in the GHCB
hypervisor features bitmap.
[ bp: Massage a bit. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/2ce7cf281cce1d0cba88f3f576687ef75dc3c953.1717600736.git.thomas.lendacky@amd.com
|
|
When an SVSM is present, the guest can also request attestation reports
from it. These SVSM attestation reports can be used to attest the SVSM
and any services running within the SVSM.
Extend the config-fs attestation support to provide such. This involves
creating four new config-fs attributes:
- 'service-provider' (input)
This attribute is used to determine whether the attestation request
should be sent to the specified service provider or to the SEV
firmware. The SVSM service provider is represented by the value
'svsm'.
- 'service_guid' (input)
Used for requesting the attestation of a single service within the
service provider. A null GUID implies that the SVSM_ATTEST_SERVICES
call should be used to request the attestation report. A non-null
GUID implies that the SVSM_ATTEST_SINGLE_SERVICE call should be used.
- 'service_manifest_version' (input)
Used with the SVSM_ATTEST_SINGLE_SERVICE call, the service version
represents a specific service manifest version be used for the
attestation report.
- 'manifestblob' (output)
Used to return the service manifest associated with the attestation
report.
Only display these new attributes when running under an SVSM.
[ bp: Massage.
- s/svsm_attestation_call/svsm_attest_call/g ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/965015dce3c76bb8724839d50c5dea4e4b5d598f.1717600736.git.thomas.lendacky@amd.com
|
|
Currently, the sev-guest driver uses the vmpck-0 key by default. When an
SVSM is present, the kernel is running at a VMPL other than 0 and the
vmpck-0 key is no longer available. If a specific vmpck key has not be
requested by the user via the vmpck_id module parameter, choose the
vmpck key based on the active VMPL level.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/b88081c5d88263176849df8ea93e90a404619cab.1717600736.git.thomas.lendacky@amd.com
|
|
Requesting an attestation report from userspace involves providing the VMPL
level for the report. Currently any value from 0-3 is valid because Linux
enforces running at VMPL0.
When an SVSM is present, though, Linux will not be running at VMPL0 and only
VMPL values starting at the VMPL level Linux is running at to 3 are valid. In
order to allow userspace to determine the minimum VMPL value that can be
supplied to an attestation report, create a sysfs entry that can be used to
retrieve the current VMPL level of the kernel.
[ bp: Add CONFIG_SYSFS ifdeffery. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/fff846da0d8d561f9fdaf297dcf8cd907545a25b.1717600736.git.thomas.lendacky@amd.com
|
|
The SVSM specification documents an alternative method of discovery for
the SVSM using a reserved CPUID bit and a reserved MSR. This is intended
for guest components that do not have access to the secrets page in
order to be able to call the SVSM (e.g. UEFI runtime services).
For the MSR support, a new reserved MSR 0xc001f000 has been defined. A #VC
should be generated when accessing this MSR. The #VC handler is expected
to ignore writes to this MSR and return the physical calling area address
(CAA) on reads of this MSR.
While the CPUID leaf is updated, allowing the creation of a CPU feature,
the code will continue to use the VMPL level as an indication of the
presence of an SVSM. This is because the SVSM can be called well before
the CPU feature is in place and a non-zero VMPL requires that an SVSM be
present.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/4f93f10a2ff3e9f368fd64a5920d51bf38d0c19e.1717600736.git.thomas.lendacky@amd.com
|
|
Using the RMPADJUST instruction, the VMSA attribute can only be changed
at VMPL0. An SVSM will be present when running at VMPL1 or a lower
privilege level.
In that case, use the SVSM_CORE_CREATE_VCPU call or the
SVSM_CORE_DESTROY_VCPU call to perform VMSA attribute changes. Use the
VMPL level supplied by the SVSM for the VMSA when starting the AP.
[ bp: Fix typo + touchups. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/bcdd95ecabe9723673b9693c7f1533a2b8f17781.1717600736.git.thomas.lendacky@amd.com
|
|
The PVALIDATE instruction can only be performed at VMPL0. If an SVSM is
present, it will be running at VMPL0 while the guest itself is then
running at VMPL1 or a lower privilege level.
In that case, use the SVSM_CORE_PVALIDATE call to perform memory
validation instead of issuing the PVALIDATE instruction directly.
The validation of a single 4K page is now explicitly identified as such
in the function name, pvalidate_4k_page(). The pvalidate_pages()
function is used for validating 1 or more pages at either 4K or 2M in
size. Each function, however, determines whether it can issue the
PVALIDATE directly or whether the SVSM needs to be invoked.
[ bp: Touchups. ]
[ Tom: fold in a fix for Coconut SVSM:
https://lore.kernel.org/r/234bb23c-d295-76e5-a690-7ea68dc1118b@amd.com ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/4c4017d8b94512d565de9ccb555b1a9f8983c69c.1717600736.git.thomas.lendacky@amd.com
|
|
MADT Multiprocessor Wakeup structure version 1 brings support for CPU offlining:
BIOS provides a reset vector where the CPU has to jump to for offlining itself.
The new TEST mailbox command can be used to test whether the CPU offlined itself
which means the BIOS has control over the CPU and can online it again via the
ACPI MADT wakeup method.
Add CPU offlining support for the ACPI MADT wakeup method by implementing custom
cpu_die(), play_dead() and stop_this_cpu() SMP operations.
CPU offlining makes it possible to hand over secondary CPUs over kexec, not
limiting the second kernel to a single CPU.
The change conforms to the approved ACPI spec change proposal. See the Link.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
Link: https://lore.kernel.org/r/20240614095904.1345461-19-kirill.shutemov@linux.intel.com
|
|
If the helper is defined, it is called instead of halt() to stop the CPU at the
end of stop_this_cpu() and on crash CPU shutdown.
ACPI MADT will use it to hand over the CPU to BIOS in order to be able to wake
it up again after kexec.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-17-kirill.shutemov@linux.intel.com
|
|
ACPI MADT doesn't allow to offline a CPU after it was onlined. This limits
kexec: the second kernel won't be able to use more than one CPU.
To prevent a kexec kernel from onlining secondary CPUs, invalidate the mailbox
address in the ACPI MADT wakeup structure which prevents a kexec kernel to use
it.
This is safe as the booting kernel has the mailbox address cached already and
acpi_wakeup_cpu() uses the cached value to bring up the secondary CPUs.
Note: This is a Linux specific convention and not covered by the ACPI
specification.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-16-kirill.shutemov@linux.intel.com
|
|
In order to support MADT wakeup structure version 1, provide more appropriate
names for the fields in the structure.
Rename 'mailbox_version' to 'version'. This field signifies the version of the
structure and the related protocols, rather than the version of the mailbox.
This field has not been utilized in the code thus far.
Rename 'base_address' to 'mailbox_address' to clarify the kind of address it
represents. In version 1, the structure includes the reset vector address. Clear
and distinct naming helps to prevent any confusion.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-15-kirill.shutemov@linux.intel.com
|
|
e820__end_of_ram_pfn() is used to calculate max_pfn which, among other things,
guides where direct mapping ends. Any memory above max_pfn is not going to be
present in the direct mapping.
e820__end_of_ram_pfn() finds the end of the RAM based on the highest
E820_TYPE_RAM range. But it doesn't includes E820_TYPE_ACPI ranges into
calculation.
Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI tables
and might be required by kernel to function properly.
Usually the problem is hidden because there is some E820_TYPE_RAM memory above
E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash memory as
E820_TYPE_RAM on boot. If the pre-allocated range is small, it can fit under the
last E820_TYPE_ACPI range.
Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.
The problem was discovered during debugging kexec for TDX guest. TDX guest uses
E820_TYPE_ACPI to store the unaccepted memory bitmap and pass it between the
kernels on kexec.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-13-kirill.shutemov@linux.intel.com
|
|
AMD SEV and Intel TDX guests allocate shared buffers for performing I/O.
This is done by allocating pages normally from the buddy allocator and
then converting them to shared using set_memory_decrypted().
On kexec, the second kernel is unaware of which memory has been
converted in this manner. It only sees E820_TYPE_RAM. Accessing shared
memory as private is fatal.
Therefore, the memory state must be reset to its original state before
starting the new kernel with kexec.
The process of converting shared memory back to private occurs in two
steps:
- enc_kexec_begin() stops new conversions.
- enc_kexec_finish() unshares all existing shared memory, reverting it
back to private.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-11-kirill.shutemov@linux.intel.com
|
|
TDX is going to have more than one reason to fail enc_status_change_prepare().
Change the callback to return errno instead of assuming -EIO. Change
enc_status_change_finish() too to keep the interface symmetric.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-8-kirill.shutemov@linux.intel.com
|
|
TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
that bit is cleared during CR4 register reprogramming during boot or kexec
flows, a #VE exception will be raised which the guest kernel cannot handle.
Therefore, make sure the CR4.MCE setting is preserved over kexec too and avoid
raising any #VEs.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240614095904.1345461-7-kirill.shutemov@linux.intel.com
|
|
That identity_mapped() function was loving that "1" label to the point of
completely confusing its readers.
Use named labels in each place for clarity.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-6-kirill.shutemov@linux.intel.com
|
|
ACPI MADT doesn't allow to offline a CPU after it has been woken up.
Currently, CPU hotplug is prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible user of
the wake up method. Any platform that uses ACPI MADT wakeup method cannot
offline CPU.
Disable CPU offlining on ACPI MADT wakeup enumeration.
This has no visible effects for users: currently, TDX guest is the only platform
that uses the ACPI MADT wakeup method.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-5-kirill.shutemov@linux.intel.com
|
|
acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox are initialized once during
ACPI MADT init and never changed.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-3-kirill.shutemov@linux.intel.com
|
|
In order to prepare for the expansion of support for the ACPI MADT
wakeup method, move the relevant code into a separate file.
Introduce a new configuration option to clearly indicate dependencies
without the use of ifdefs.
There have been no functional changes.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-2-kirill.shutemov@linux.intel.com
|
|
This seemingly straightforward JMP was introduced in the initial version
of the the 64bit kexec code without any explanation.
It turns out (check accompanying Link) it's likely a copy/paste artefact
from 32-bit code, where such a JMP could be used as a serializing
instruction for the 486's prefetch queue. On x86_64 that's not needed
because there's already a preceding write to cr4 which itself is
a serializing operation.
[ bp: Typos. Let's try this and see what cries out. If it does,
reverting it is trivial. ]
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/55bc0649-c017-49ab-905d-212f140a403f@citrix.com/
|
|
The routine is used on syscall exit and on non-AMD CPUs is guaranteed to
be empty.
It probably does not need to be a function call even on CPUs which do need the
mitigation.
[ bp: Make sure it is always inlined so that noinstr marking works. ]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240613082637.659133-1-mjguzik@gmail.com
|