Age | Commit message (Collapse) | Author |
|
Sync with upstream fixes before applying EEVDF.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There was never a doubt in my mind that they would not fit into a single
u32 eventually.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
Visible glitches have been observed when running graphics applications on
Linux under Xen hypervisor. Those observations have been confirmed with
failures from kms_pwrite_crc Intel GPU test that verifies data coherency
of DRM frame buffer objects using hardware CRC checksums calculated by
display controllers, exposed to userspace via debugfs. Affected
processing paths have then been identified with new IGT test variants that
mmap the objects using different methods and caching modes [1].
When running as a Xen PV guest, Linux uses Xen provided PAT configuration
which is different from its native one. In particular, Xen specific PTE
encoding of write-combining caching, likely used by graphics applications,
differs from the Linux default one found among statically defined minimal
set of supported modes. Since Xen defines PTE encoding of the WC mode as
_PAGE_PAT, it no longer belongs to the minimal set, depends on correct
handling of _PAGE_PAT bit, and can be mismatched with write-back caching.
When a user calls mmap() for a DRM buffer object, DRM device specific
.mmap file operation, called from mmap_region(), takes care of setting PTE
encoding bits in a vm_page_prot field of an associated virtual memory area
structure. Unfortunately, _PAGE_PAT bit is not preserved when the vma's
.vm_flags are then applied to .vm_page_prot via vm_set_page_prot(). Bits
to be preserved are determined with _PAGE_CHG_MASK symbol that doesn't
cover _PAGE_PAT. As a consequence, WB caching is requested instead of WC
when running under Xen (also, WP is silently changed to WT, and UC
downgraded to UC_MINUS). When running on bare metal, WC is not affected,
but WP and WT extra modes are unintentionally replaced with WC and UC,
respectively.
WP and WT modes, encoded with _PAGE_PAT bit set, were introduced by commit
281d4078bec3 ("x86: Make page cache mode a real type"). Care was taken
to extend _PAGE_CACHE_MASK symbol with that additional bit, but that
symbol has never been used for identification of bits preserved when
applying page protection flags. Support for all cache modes under Xen,
including the problematic WC mode, was then introduced by commit
47591df50512 ("xen: Support Xen pv-domains using PAT").
The issue needs to be fixed by including _PAGE_PAT bit into a bitmask used
by pgprot_modify() for selecting bits to be preserved. We can do that
either internally to pgprot_modify() (as initially proposed), or by making
_PAGE_PAT a part of _PAGE_CHG_MASK. If we go for the latter then, since
_PAGE_PAT is the same as _PAGE_PSE, we need to note that _HPAGE_CHG_MASK
-- a huge pmds' counterpart of _PAGE_CHG_MASK, introduced by commit
c489f1257b8c ("thp: add pmd_modify"), defined as (_PAGE_CHG_MASK |
_PAGE_PSE) -- will no longer differ from _PAGE_CHG_MASK. If such
modification of _PAGE_CHG_MASK was irrelevant to its users then one might
wonder why that new _HPAGE_CHG_MASK symbol was introduced instead of
reusing the existing one with that otherwise irrelevant bit (_PAGE_PSE in
that case) added.
Add _PAGE_PAT to _PAGE_CHG_MASK and _PAGE_PAT_LARGE to _HPAGE_CHG_MASK for
symmetry. Split out common bits from both symbols to a common symbol for
clarity.
[ dhansen: tweak the solution changelog description ]
[1] https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/commit/0f0754413f14
Fixes: 281d4078bec3 ("x86: Make page cache mode a real type")
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Link: https://gitlab.freedesktop.org/drm/intel/-/issues/7648
Link: https://lore.kernel.org/all/20230710073613.8006-2-janusz.krzysztofik%40linux.intel.com
|
|
Add a fix for the Zen2 VZEROUPPER data corruption bug where under
certain circumstances executing VZEROUPPER can cause register
corruption or leak data.
The optimal fix is through microcode but in the case the proper
microcode revision has not been applied, enable a fallback fix using
a chicken bit.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
Avoid new and remove old forward declarations.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Fix a lockdep warning when the event given is the first one, no event
group exists yet but the code still goes and iterates over event
siblings
* tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CFI fixes from Peter Zijlstra:
"Fix kCFI/FineIBT weaknesses
The primary bug Alyssa noticed was that with FineIBT enabled function
prologues have a spurious ENDBR instruction:
__cfi_foo:
endbr64
subl $hash, %r10d
jz 1f
ud2
nop
1:
foo:
endbr64 <--- *sadface*
This means that any indirect call that fails to target the __cfi
symbol and instead targets (the regular old) foo+0, will succeed due
to that second ENDBR.
Fixing this led to the discovery of a single indirect call that was
still doing this: ret_from_fork(). Since that's an assembly stub the
compiler would not generate the proper kCFI indirect call magic and it
would not get patched.
Brian came up with the most comprehensive fix -- convert the thing to
C with only a very thin asm wrapper. This ensures the kernel thread
boostrap is a proper kCFI call.
While discussing all this, Kees noted that kCFI hashes could/should be
poisoned to seal all functions whose address is never taken, further
limiting the valid kCFI targets -- much like we already do for IBT.
So what was a 'simple' observation and fix cascaded into a bunch of
inter-related CFI infrastructure fixes"
* tag 'x86_urgent_for_6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y
x86/fineibt: Poison ENDBR at +0
x86: Rewrite ret_from_fork() in C
x86/32: Remove schedule_tail_wrapper()
x86/cfi: Extend ENDBR sealing to kCFI
x86/alternative: Rename apply_ibt_endbr()
x86/cfi: Extend {JMP,CAKK}_NOSPEC comment
|
|
There were reports again that the tsc clocksource on 4 sockets x86
servers was wrongly judged as 'unstable' by 'jiffies' and other
watchdogs, and disabled [1][2].
Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
on qualified platorms") was introduce to deal with these false
alarms of tsc unstable issues, covering qualified platforms for 2
sockets or smaller ones. And from history of chasing TSC issues,
Thomas and Peter only saw real TSC synchronization issue on 8 socket
machines.
So extend the exemption to 4 sockets to fix the issue.
Rui also proposed another way to disable 'jiffies' as clocksource
watchdog [3], which can also solve problem in [1]. in an architecture
independent way, but can't cure the problem in [2]. whose watchdog
is HPET or PMTIMER, while 'jiffies' is mostly used as watchdog in
boot phase.
'nr_online_nodes' has known inaccurate problem for cases like
platform with cpu-less memory nodes, sub numa cluster enabled,
fakenuma, kernel cmdline parameter 'maxcpus=', etc. The harmful case
is the 'maxcpus' one which could possibly under estimates the package
number, and disable the watchdog, but bright side is it is mostly
for debug usage. All these will be addressed in other patches, as
discussed in thread [4].
[1]. https://lore.kernel.org/all/9d3bf570-3108-0336-9c52-9bee15767d29@huawei.com/
[2]. https://lore.kernel.org/lkml/06df410c-2177-4671-832f-339cff05b1d9@paulmck-laptop/
[3]. https://lore.kernel.org/all/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/
[4]. https://lore.kernel.org/all/20221021062131.1826810-1-feng.tang@intel.com/
Reported-by: Yu Liao <liaoyu15@huawei.com>
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The processor _OSC method is already used for a workaround introduced
in commit a21211672c9a ("ACPI / processor: Request native thermal
interrupt handling via _OSC"), but in accordance with ACPI 6.5 (and
earlier), it should be used for negotiating all of the processor
capabilities instead of _PDC (which has been deprecated since ACPI 3.0
and got removed from ACPI 6.5 entirely).
Create a new callback function called acpi_processor_osc() to be invoked
for every processor object and processor device in the ACPI namespace, in
analogy with the already existing acpi_hwp_native_thermal_lvt_osc().
Make this function implement the workaround mentioned above and convey
all of the OSPM processor support information to the platform firmware
by setting all of the appropriate processor capabilities bits before
evaluating _OSC for the given processor. For this purpose, make it
call arch_acpi_set_proc_cap_bits() and modify the latter to set
ACPI_PROC_CAP_COLLAB_PROC_PERF along with the other processor
capabilities bits.
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[ rjw: Subject and changelog edits, whitespace fixup ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Currently, ACPI_PROC_CAP_SMP_T_SWCOORD is set in acpi_set_pdc_bits(), but
it is not _PDC-specific. It should be set along with the other processor
capability bits.
Move the setting of ACPI_PROC_CAP_SMP_T_SWCOORD to
arch_acpi_set_proc_cap_bits().
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Currently arch_acpi_set_proc_cap_bits() clears ACPI_PDC_C_C2C3_FFH bit in
case MWAIT instruction is not supported.
It should also clear ACPI_PDC_C_C1_FFH, as when MWAIT is not supported,
C1 is entered by executing the HLT instruction.
Quote from the C_C1_FFH description:
"If set, OSPM is capable of performing native C State instructions (beyond
halt) for the C1 handler in multi-processor configurations". As without
MWAIT there is no native C-state instructions beyond HALT, this bit
should be toggled off."
Clear ACPI_PDC_C_C1_FFH and ACPI_PDC_C_C2C3_FFH in
arch_acpi_set_proc_cap_bits() in case MWAIT is not supported or
overridden.
Remove setting those bits from the processor_pdc.c code.
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The prefix in the names of the ACPI_PDC symbols suggests that they are
only relevant for _PDC, but in fact they can also be used in the _OSC.
Change that prefix to a more generic ACPI_PROC_CAP that will better
reflect the purpose of those symbols as they represent bits in a general
processor capabilities buffer.
Rename pdc_intel.h to proc_cap_intel.h to follow the change of the
symbol name prefix.
No intentional functional impact.
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The capabilities buffer modified by the arch_acpi_set_pdc_bits() is not
_PDC-specific, as it can be used by _OSC too.
Change the name of that function to better reflect its independence from
_PDC and make it take the capabilities buffer address as the argument
directly, without any offset, as _OSC and _PDC use different capabilities
buffer offsets.
No intentional functional impact.
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
aes_set_key_common() performs runtime alignment to the void *raw_ctx
pointer. This facilitates consistent access to the 16byte-aligned
address during key extension.
However, the alignment is already handlded in the GCM-related setkey
functions before invoking the common function. Consequently, the
alignment in the common function is unnecessary for those functions.
To establish a consistent approach throughout the glue code, remove
the aes_ctx() call from its current location. Instead, place it at
each call site where the runtime alignment is currently absent.
Link: https://lore.kernel.org/lkml/20230605024623.GA4653@quark.localdomain/
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-07-13
We've added 67 non-merge commits during the last 15 day(s) which contain
a total of 106 files changed, 4444 insertions(+), 619 deletions(-).
The main changes are:
1) Fix bpftool build in presence of stale vmlinux.h,
from Alexander Lobakin.
2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress,
from Alexei Starovoitov.
3) Teach verifier actual bounds of bpf_get_smp_processor_id()
and fix perf+libbpf issue related to custom section handling,
from Andrii Nakryiko.
4) Introduce bpf map element count, from Anton Protopopov.
5) Check skb ownership against full socket, from Kui-Feng Lee.
6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong.
7) Export rcu_request_urgent_qs_task, from Paul E. McKenney.
8) Fix BTF walking of unions, from Yafang Shao.
9) Extend link_info for kprobe_multi and perf_event links,
from Yafang Shao.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits)
selftests/bpf: Add selftest for PTR_UNTRUSTED
bpf: Fix an error in verifying a field in a union
selftests/bpf: Add selftests for nested_trust
bpf: Fix an error around PTR_UNTRUSTED
selftests/bpf: add testcase for TRACING with 6+ arguments
bpf, x86: allow function arguments up to 12 for TRACING
bpf, x86: save/restore regs with BPF_DW size
bpftool: Use "fallthrough;" keyword instead of comments
bpf: Add object leak check.
bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
selftests/bpf: Improve test coverage of bpf_mem_alloc.
rcu: Export rcu_request_urgent_qs_task()
bpf: Allow reuse from waiting_for_gp_ttrace list.
bpf: Add a hint to allocated objects.
bpf: Change bpf_mem_cache draining process.
bpf: Further refactor alloc_bulk().
bpf: Factor out inc/dec of active flag into helpers.
bpf: Refactor alloc_bulk().
bpf: Let free_all() return the number of freed elements.
...
====================
Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For now, the BPF program of type BPF_PROG_TYPE_TRACING can only be used
on the kernel functions whose arguments count less than or equal to 6, if
not considering '> 8 bytes' struct argument. This is not friendly at all,
as too many functions have arguments count more than 6.
According to the current kernel version, below is a statistics of the
function arguments count:
argument count | function count
7 | 704
8 | 270
9 | 84
10 | 47
11 | 47
12 | 27
13 | 22
14 | 5
15 | 0
16 | 1
Therefore, let's enhance it by increasing the function arguments count
allowed in arch_prepare_bpf_trampoline(), for now, only x86_64.
For the case that we don't need to call origin function, which means
without BPF_TRAMP_F_CALL_ORIG, we need only copy the function arguments
that stored in the frame of the caller to current frame. The 7th and later
arguments are stored in "$rbp + 0x18", and they will be copied to the
stack area following where register values are saved.
For the case with BPF_TRAMP_F_CALL_ORIG, we need prepare the arguments
in stack before call origin function, which means we need alloc extra
"8 * (arg_count - 6)" memory in the top of the stack. Note, there should
not be any data be pushed to the stack before calling the origin function.
So 'rbx' value will be stored on a stack position higher than where stack
arguments are stored for BPF_TRAMP_F_CALL_ORIG.
According to the research of Yonghong, struct members should be all in
register or all on the stack. Meanwhile, the compiler will pass the
argument on regs if the remaining regs can hold the argument. Therefore,
we need save the arguments in order. Otherwise, disorder of the args can
happen. For example:
struct foo_struct {
long a;
int b;
};
int foo(char, char, char, char, char, struct foo_struct,
char);
the arg1-5,arg7 will be passed by regs, and arg6 will by stack. Therefore,
we should save/restore the arguments in the same order with the
declaration of foo(). And the args used as ctx in stack will be like this:
reg_arg6 -- copy from regs
stack_arg2 -- copy from stack
stack_arg1
reg_arg5 -- copy from regs
reg_arg4
reg_arg3
reg_arg2
reg_arg1
We use EMIT3_off32() or EMIT4() for "lea" and "sub". The range of the
imm in "lea" and "sub" is [-128, 127] if EMIT4() is used. Therefore,
we use EMIT3_off32() instead if the imm out of the range.
It works well for the FENTRY/FEXIT/MODIFY_RETURN.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230713040738.1789742-3-imagedong@tencent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
As we already reserve 8 byte in the stack for each reg, it is ok to
store/restore the regs in BPF_DW size. This will make the code in
save_regs()/restore_regs() simpler.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230713040738.1789742-2-imagedong@tencent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix some missing-prototype warnings
- Fix user events struct args (did not include size of struct)
When creating a user event, the "struct" keyword is to denote that
the size of the field will be passed in. But the parsing failed to
handle this case.
- Add selftest to struct sizes for user events
- Fix sample code for direct trampolines.
The sample code for direct trampolines attached to handle_mm_fault().
But the prototype changed and the direct trampoline sample code was
not updated. Direct trampolines needs to have the arguments correct
otherwise it can fail or crash the system.
- Remove unused ftrace_regs_caller_ret() prototype.
- Quiet false positive of FORTIFY_SOURCE
Due to backward compatibility, the structure used to save stack
traces in the kernel had a fixed size of 8. This structure is
exported to user space via the tracing format file. A change was made
to allow more than 8 functions to be recorded, and user space now
uses the size field to know how many functions are actually in the
stack.
But the structure still has size of 8 (even though it points into the
ring buffer that has the required amount allocated to hold a full
stack.
This was fine until the fortifier noticed that the
memcpy(&entry->caller, stack, size) was greater than the 8 functions
and would complain at runtime about it.
Hide this by using a pointer to the stack location on the ring buffer
instead of using the address of the entry structure caller field.
- Fix a deadloop in reading trace_pipe that was caused by a mismatch
between ring_buffer_empty() returning false which then asked to read
the data, but the read code uses rb_num_of_entries() that returned
zero, and causing a infinite "retry".
- Fix a warning caused by not using all pages allocated to store ftrace
functions, where this can happen if the linker inserts a bunch of
"NULL" entries, causing the accounting of how many pages needed to be
off.
- Fix histogram synthetic event crashing when the start event is
removed and the end event is still using a variable from it
- Fix memory leak in freeing iter->temp in tracing_release_pipe()
* tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix memory leak of iter->temp when reading trace_pipe
tracing/histograms: Add histograms to hist_vars if they have referenced variables
tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
ring-buffer: Fix deadloop issue on reading trace_pipe
tracing: arm64: Avoid missing-prototype warnings
selftests/user_events: Test struct size match cases
tracing/user_events: Fix struct arg size match check
x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
arm64: ftrace: Add direct call trampoline samples support
samples: ftrace: Save required argument registers in sample trampolines
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- a cleanup of the Xen related ELF-notes
- a fix for virtio handling in Xen dom0 when running Xen in a VM
* tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parent
x86/Xen: tidy xen-head.S
|
|
Use struct_size() instead of hand-writing it. It is less verbose, more
robust and more informative.
Link: https://lore.kernel.org/r/00a5cc2cd322e7dea26579916ac6dda9c637aa57.1684518118.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
With the SMT vs non-SMT balancing issues sorted, also enable the
cluster domain for Hybrid machines.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So
when it is made writable with pte_mkwrite(), it should create shadow
stack memory, not conventionally writable memory. Now that all the places
where shadow stack memory might be created pass a VMA into pte_mkwrite(),
it can know when it should do this.
So make pte_mkwrite() create shadow stack memory when the VMA has the
VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite().
This requires referencing VM_SHADOW_STACK in these functions, which are
currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK is
located) can't be pulled in without causing problems for files that
reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where
they can safely reference VM_SHADOW_STACK.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Deepak Gupta <debug@rivosinc.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-22-rick.p.edgecombe%40intel.com
|
|
The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.
One of the properties is that the shadow stack pointer (SSP), which is a
CPU register that points to the shadow stack like the stack pointer points
to the stack, can't be pointing outside of the 32 bit address space when
the CPU is executing in 32 bit mode. It is desirable to prevent executing
in 32 bit mode when shadow stack is enabled because the kernel can't easily
support 32 bit signals.
On x86 it is possible to transition to 32 bit mode without any special
interaction with the kernel, by doing a "far call" to a 32 bit segment.
So the shadow stack implementation can use this address space behavior
as a feature, by enforcing that shadow stack memory is always mapped
outside of the 32 bit address space. This way userspace will trigger a
general protection fault which will in turn trigger a segfault if it
tries to transition to 32 bit mode with shadow stack enabled.
This provides a clean error generating border for the user if they try
attempt to do 32 bit mode shadow stack, rather than leave the kernel in a
half working state for userspace to be surprised by.
So to allow future shadow stack enabling patches to map shadow stacks
out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior
is pretty much like MAP_32BIT, except that it has the opposite address
range. The are a few differences though.
If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
syscall.
Since the default search behavior is top down, the normal kaslr base can
be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its
own randomization in the bottom up case.
For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G
both are potentially valid, so both are used. In the bottomup search
path, the default behavior is already consistent with MAP_ABOVE4G since
mmap base should be above 4GB.
Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB.
So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode
with shadow stack enabled would usually segfault anyway. This is already
pretty decent guard rails. But the addition of MAP_ABOVE4G is some small
complexity spent to make it make it more complete.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-21-rick.p.edgecombe%40intel.com
|
|
When user shadow stack is in use, Write=0,Dirty=1 is treated by the CPU as
shadow stack memory. So for shadow stack memory this bit combination is
valid, but when Dirty=1,Write=1 (conventionally writable) memory is being
write protected, the kernel has been taught to transition the Dirty=1
bit to SavedDirty=1, to avoid inadvertently creating shadow stack
memory. It does this inside pte_wrprotect() because it knows the PTE is
not intended to be a writable shadow stack entry, it is supposed to be
write protected.
However, when a PTE is created by a raw prot using mk_pte(), mk_pte()
can't know whether to adjust Dirty=1 to SavedDirty=1. It can't
distinguish between the caller intending to create a shadow stack PTE or
needing the SavedDirty shift.
The kernel has been updated to not do this, and so Write=0,Dirty=1
memory should only be created by the pte_mkfoo() helpers. Add a warning
to make sure no new mk_pte() start doing this, like, for example,
set_memory_rox() did.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-19-rick.p.edgecombe%40intel.com
|
|
The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.
One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
treated as shadow by the CPU, but this combination used to be created by
the kernel on x86. Previous patches have changed the kernel to now avoid
creating these PTEs unless they are for shadow stack memory. In case any
missed corners of the kernel are still creating PTEs like this for
non-shadow stack memory, and to catch any re-introductions of the logic,
warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
stack VMAs when they are being zapped. This won't catch transient cases
but should have decent coverage.
In order to check if a PTE is shadow stack in core mm code, add two arch
breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack
specific code to be kept in arch/x86.
Only do the check if shadow stack is supported by the CPU and configured
because in rare cases older CPUs may write Dirty=1 to a Write=0 CPU on
older CPUs. This check is handled in pte_shstk()/pmd_shstk().
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-18-rick.p.edgecombe%40intel.com
|
|
The CPU performs "shadow stack accesses" when it expects to encounter
shadow stack mappings. These accesses can be implicit (via CALL/RET
instructions) or explicit (instructions like WRSS).
Shadow stack accesses to shadow-stack mappings can result in faults in
normal, valid operation just like regular accesses to regular mappings.
Shadow stacks need some of the same features like delayed allocation, swap
and copy-on-write. The kernel needs to use faults to implement those
features.
The architecture has concepts of both shadow stack reads and shadow stack
writes. Any shadow stack access to non-shadow stack memory will generate
a fault with the shadow stack error code bit set.
This means that, unlike normal write protection, the fault handler needs
to create a type of memory that can be written to (with instructions that
generate shadow stack writes), even to fulfill a read access. So in the
case of COW memory, the COW needs to take place even with a shadow stack
read. Otherwise the page will be left (shadow stack) writable in
userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
for shadow stack accesses, even if the access was a shadow stack read.
For the purpose of making this clearer, consider the following example.
If a process has a shadow stack, and forks, the shadow stack PTEs will
become read-only due to COW. If the CPU in one process performs a shadow
stack read access to the shadow stack, for example executing a RET and
causing the CPU to read the shadow stack copy of the return address, then
in order for the fault to be resolved the PTE will need to be set with
shadow stack permissions. But then the memory would be changeable from
userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
COW, otherwise the shared page would be changeable from both processes.
Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping. Also, generate the errors for invalid shadow stack accesses.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-16-rick.p.edgecombe%40intel.com
|
|
New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages.
In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
if HW dirty tracking is not needed, because if the Dirty bit is not already
set the CPU has to set Dirty=1 when the memory gets written to. This
creates additional work for the CPU. So traditional wisdom was to simply
set the Dirty bit whenever you didn't care about it. However, it was never
really very helpful for read-only kernel memory.
When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
any functional benefit either, so to reduce ambiguity between shadow stack
and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-14-rick.p.edgecombe%40intel.com
|
|
The recently introduced _PAGE_SAVED_DIRTY should be used instead of the
HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently
create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply
the same changes to pmd and pud. Since there is no x86 version of
pte_mkwrite() to hold this arch specific logic, create one. Add it to
x86/mm/pgtable.c instead of x86/asm/include/pgtable.h as future patches
will require it to live in pgtable.c and it will make the diff easier
for reviewers.
Since CPUs without shadow stack support could create Write=0,Dirty=1
PTEs, only return true for pte_shstk() if the CPU also supports shadow
stack. This will prevent these HW creates PTEs as showing as true for
pte_write().
For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which
was not necessarily created with any of the existing PTE bit helpers.
That means that it can return a pte_t with Write=0,Dirty=1, a shadow
stack PTE, when it did not intend to create one.
Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid
creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
1. Marking Write=0 PTEs Dirty=1
2. Marking Dirty=1 PTEs Write=0
The first case cannot happen as the existing behavior of pte_modify() is to
filter out any Dirty bit passed in newprot. Handle the second case by
shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1 if the PTE was write
protected by the pte_modify() call. Apply the same changes to pmd_modify().
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-13-rick.p.edgecombe%40intel.com
|
|
When shadow stack is in use, Write=0,Dirty=1 PTE are preserved for
shadow stack. Copy-on-write PTEs then have Write=0,SavedDirty=1.
When a PTE goes from Write=1,Dirty=1 to Write=0,SavedDirty=1, it could
become a transient shadow stack PTE in two cases:
1. Some processors can start a write but end up seeing a Write=0 PTE by
the time they get to the Dirty bit, creating a transient shadow stack
PTE. However, this will not occur on processors supporting shadow
stack, and a TLB flush is not necessary.
2. When _PAGE_DIRTY is replaced with _PAGE_SAVED_DIRTY non-atomically, a
transient shadow stack PTE can be created as a result.
Prevent the second case when doing a write protection and Dirty->SavedDirty
shift at the same time with a CMPXCHG loop. The first case
Note, in the PAE case CMPXCHG will need to operate on 8 byte, but
try_cmpxchg() will not use CMPXCHG8B, so it cannot operate on a full PAE
PTE. However the exiting logic is not operating on a full 8 byte region
either, and relies on the fact that the Write bit is in the first 4
bytes when doing the clear_bit(). Since both the Dirty, SavedDirty and
Write bits are in the first 4 bytes, casting to a long will be similar to
the existing behavior which also casts to a long.
Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue. Jann Horn provided the CMPXCHG solution.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-12-rick.p.edgecombe%40intel.com
|
|
Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.
The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.
So that leaves Write=0,Dirty=1 PTEs created in software. To avoid
inadvertently created shadow stack memory, in places where Linux normally
creates Write=0,Dirty=1, it can use the software-defined _PAGE_SAVED_DIRTY
in place of the hardware _PAGE_DIRTY. In other words, whenever Linux needs
to create Write=0,Dirty=1, it instead creates Write=0,SavedDirty=1 except
for shadow stack, which is Write=0,Dirty=1.
There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_SAVED_DIRTY. For 32 bit, the same bit as
_PAGE_BIT_UFFD_WP is used, since user fault fd is not supported on 32
bit. This leaves one unused software bit on 32 bit (_PAGE_BIT_SOFT_DIRTY,
as this is also not supported on 32 bit).
Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to
actually begin creating _PAGE_SAVED_DIRTY PTEs will follow once other
pieces are in place.
Since this SavedDirty shifting is done for all x86 CPUs, this leaves
the possibility for the hardware oddity to still create Write=0,Dirty=1
PTEs in rare cases. Since these CPUs also don't support shadow stack, this
will be harmless as it was before the introduction of SavedDirty.
Implement the shifting logic to be branchless. Embed the logic of whether
to do the shifting (including checking the Write bits) so that it can be
called by future callers that would otherwise need additional branching
logic. This efficiency allows the logic of when to do the shifting to be
centralized, making the code easier to reason about.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-11-rick.p.edgecombe%40intel.com
|
|
To prepare the introduction of _PAGE_SAVED_DIRTY, move pmd_write() and
pud_write() up in the file, so that they can be used by other
helpers below. No functional changes.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-10-rick.p.edgecombe%40intel.com
|
|
The Control-Flow Enforcement Technology contains two related features,
one of which is Shadow Stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add a CPU feature flags for Shadow
Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.
The shadow stack feature, enumerated by the CPUID bit described above,
encompasses both supervisor and userspace support for shadow stack. In
near future patches, only userspace shadow stack will be enabled. In
expectation of future supervisor shadow stack support, create a software
CPU capability to enumerate kernel utilization of userspace shadow stack
support. This user shadow stack bit should depend on the HW "shstk"
capability and that logic will be implemented in future patches.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-9-rick.p.edgecombe%40intel.com
|
|
Today the control protection handler is defined in traps.c and used only
for the kernel IBT feature. To reduce ifdeffery, move it to it's own file.
In future patches, functionality will be added to make this handler also
handle user shadow stack faults. So name the file cet.c.
No functional change.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-8-rick.p.edgecombe%40intel.com
|
|
Shadow stack provides protection for applications against function return
address corruption. It is active when the processor supports it, the
kernel has CONFIG_X86_SHADOW_STACK enabled, and the application is built
for the feature. This is only implemented for the 64-bit kernel. When it
is enabled, legacy non-shadow stack applications continue to work, but
without protection.
Since there is another feature that utilizes CET (Kernel IBT) that will
share implementation with shadow stacks, create CONFIG_CET to signify
that at least one CET feature is configured.
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-7-rick.p.edgecombe%40intel.com
|
|
The x86 Shadow stack feature includes a new type of memory called shadow
stack. This shadow stack memory has some unusual properties, which requires
some core mm changes to function properly.
One of these unusual properties is that shadow stack memory is writable,
but only in limited ways. These limits are applied via a specific PTE
bit combination. Nevertheless, the memory is writable, and core mm code
will need to apply the writable permissions in the typical paths that
call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so
that the x86 implementation of it can know whether to create regular
writable or shadow stack mappings.
But there are a couple of challenges to this. Modifying the signatures of
each arch pte_mkwrite() implementation would be error prone because some
are generated with macros and would need to be re-implemented. Also, some
pte_mkwrite() callers operate on kernel memory without a VMA.
So this can be done in a three step process. First pte_mkwrite() can be
renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
added that just calls pte_mkwrite_novma(). Next callers without a VMA can
be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
can be changed to take/pass a VMA.
Earlier work did the first step, so next move the callers that don't have
a VMA to pte_mkwrite_novma(). Also do the same for pmd_mkwrite(). This
will be ok for the shadow stack feature, as these callers are on kernel
memory which will not need to be made shadow stack, and the other
architectures only currently support one type of memory in pte_mkwrite()
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/all/20230613001108.3040476-3-rick.p.edgecombe%40intel.com
|
|
The x86 Shadow stack feature includes a new type of memory called shadow
stack. This shadow stack memory has some unusual properties, which requires
some core mm changes to function properly.
One of these unusual properties is that shadow stack memory is writable,
but only in limited ways. These limits are applied via a specific PTE
bit combination. Nevertheless, the memory is writable, and core mm code
will need to apply the writable permissions in the typical paths that
call pte_mkwrite(). The goal is to make pte_mkwrite() take a VMA, so
that the x86 implementation of it can know whether to create regular
writable or shadow stack mappings.
But there are a couple of challenges to this. Modifying the signatures of
each arch pte_mkwrite() implementation would be error prone because some
are generated with macros and would need to be re-implemented. Also, some
pte_mkwrite() callers operate on kernel memory without a VMA.
So this can be done in a three step process. First pte_mkwrite() can be
renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite()
added that just calls pte_mkwrite_novma(). Next callers without a VMA can
be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers
can be changed to take/pass a VMA.
Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and
adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same
pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(),
create a new arch config HAS_HUGE_PAGE that can be used to tell if
pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the
compiler would not be able to find pmd_mkwrite_novma().
No functional change.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/
Link: https://lore.kernel.org/all/20230613001108.3040476-2-rick.p.edgecombe%40intel.com
|
|
poison_cfi() was introduced in:
9831c6253ace ("x86/cfi: Extend ENDBR sealing to kCFI")
... but it's only ever used under CONFIG_X86_KERNEL_IBT=y,
and if that option is disabled, we get:
arch/x86/kernel/alternative.c:1243:13: error: ‘poison_cfi’ defined but not used [-Werror=unused-function]
Guard the definition with CONFIG_X86_KERNEL_IBT.
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This is now unused, so can remove it.
Link: https://lore.kernel.org/linux-trace-kernel/20230623091640.21952-1-yuehaibing@huawei.com
Cc: <mark.rutland@arm.com>
Cc: <tglx@linutronix.de>
Cc: <mingo@redhat.com>
Cc: <bp@alien8.de>
Cc: <dave.hansen@linux.intel.com>
Cc: <x86@kernel.org>
Cc: <hpa@zytor.com>
Cc: <peterz@infradead.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Several architectures define arch_try_local_cmpxchg macro using
internal temporary variables named ___old, __old or _old. Remove
temporary varible in local_try_cmpxchg to avoid variable shadowing.
No functional change intended.
Fixes: d994f2c8e241 ("locking/arch: Wire up local_try_cmpxchg()")
Closes: https://lore.kernel.org/lkml/CAFGhKbyxtuk=LoW-E3yLXgcmR93m+Dfo5-u9oQA_YC5Fcy_t9g@mail.gmail.com/
Reported-by: Charlemagne Lasse <charlemagnelasse@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230708090048.63046-1-ubizjak@gmail.com
|
|
Use local64_try_cmpxchg instead of local64_cmpxchg (*ptr, old, new) == old.
x86 CMPXCHG instruction returns success in ZF flag, so this change saves a
compare after cmpxchg (and related move instruction in front of cmpxchg).
Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.
No functional change intended.
Cc. "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230706141720.2672-1-ubizjak@gmail.com
|
|
IBS PMUs can have only one event active at any point in time. Restrict
grouping of multiple IBS events.
Reported-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230620091603.269-1-ravi.bangoria@amd.com
|
|
Alyssa noticed that when building the kernel with CFI_CLANG+IBT and
booting on IBT enabled hardware to obtain FineIBT, the indirect
functions look like:
__cfi_foo:
endbr64
subl $hash, %r10d
jz 1f
ud2
nop
1:
foo:
endbr64
This is because the compiler generates code for kCFI+IBT. In that case
the caller does the hash check and will jump to +0, so there must be
an ENDBR there. The compiler doesn't know about FineIBT at all; also
it is possible to actually use kCFI+IBT when booting with 'cfi=kcfi'
on IBT enabled hardware.
Having this second ENDBR however makes it possible to elide the CFI
check. Therefore, we should poison this second ENDBR when switching to
FineIBT mode.
Fixes: 931ab63664f0 ("x86/ibt: Implement FineIBT")
Reported-by: "Milburn, Alyssa" <alyssa.milburn@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20230615193722.194131053@infradead.org
|
|
When kCFI is enabled, special handling is needed for the indirect call
to the kernel thread function. Rewrite the ret_from_fork() function in
C so that the compiler can properly handle the indirect call.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20230623225529.34590-3-brgerst@gmail.com
|
|
The unwinder expects a return address at the very top of the kernel
stack just below pt_regs and before any stack frame is created. Instead
of calling a wrapper, set up a return address as if ret_from_fork()
was called from the syscall entry code.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20230623225529.34590-2-brgerst@gmail.com
|
|
Kees noted that IBT sealing could be extended to kCFI.
Fundamentally it is the list of functions that do not have their
address taken and are thus never called indirectly. It doesn't matter
that objtool uses IBT infrastructure to determine this list, once we
have it it can also be used to clobber kCFI hashes and avoid kCFI
indirect calls.
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20230622144321.494426891%40infradead.org
|
|
The current name doesn't reflect what it does very well.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20230622144321.427441595%40infradead.org
|
|
With the introduction of kCFI these helpers are no longer equivalent
to C indirect calls and should be used with care.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20230622144321.360957723%40infradead.org
|
|
On SPR, the load latency event needs an auxiliary event in the same
group to work properly. There's a check in intel_pmu_hw_config()
for this to iterate sibling events and find a mem-loads-aux event.
The for_each_sibling_event() has a lockdep assert to make sure if it
disabled hardirq or hold leader->ctx->mutex. This works well if the
given event has a separate leader event since perf_try_init_event()
grabs the leader->ctx->mutex to protect the sibling list. But it can
cause a problem when the event itself is a leader since the event is
not initialized yet and there's no ctx for the event.
Actually I got a lockdep warning when I run the below command on SPR,
but I guess it could be a NULL pointer dereference.
$ perf record -d -e cpu/mem-loads/uP true
The code path to the warning is:
sys_perf_event_open()
perf_event_alloc()
perf_init_event()
perf_try_init_event()
x86_pmu_event_init()
hsw_hw_config()
intel_pmu_hw_config()
for_each_sibling_event()
lockdep_assert_event_ctx()
We don't need for_each_sibling_event() when it's a standalone event.
Let's return the error code directly.
Fixes: f3c0eba28704 ("perf: Add a few assertions")
Reported-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20230704181516.3293665-1-namhyung@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu fix from Borislav Petkov:
- Do FPU AP initialization on Xen PV too which got missed by the recent
boot reordering work
* tag 'x86_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen: Fix secondary processors' FPU initialization
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for the mechanism to park CPUs with an INIT IPI.
On shutdown or kexec, the kernel tries to park the non-boot CPUs with
an INIT IPI. But the same code path is also used by the crash utility.
If the CPU which panics is not the boot CPU then it sends an INIT IPI
to the boot CPU which resets the machine.
Prevent this by validating that the CPU which runs the stop mechanism
is the boot CPU. If not, leave the other CPUs in HLT"
* tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/smp: Don't send INIT to boot CPU
|