Age | Commit message (Collapse) | Author |
|
Call set_sev_encryption_mask() while still on the stage 1 #VC-handler
because the stage 2 handler needs the kernel's own page tables to be
set up, to which calling set_sev_encryption_mask() is a prerequisite.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-21-joro@8bytes.org
|
|
Add the first handler for #VC exceptions. At stage 1 there is no GHCB
yet because the kernel might still be running on the EFI page table.
The stage 1 handler is limited to the MSR-based protocol to talk to the
hypervisor and can only support CPUID exit-codes, but that is enough to
get to stage 2.
[ bp: Zap superfluous newlines after rd/wrmsr instruction mnemonics. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-20-joro@8bytes.org
|
|
Changing the function to take start and end as parameters instead of
start and size simplifies the callers which don't need to calculate the
size if they already have start and end.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-19-joro@8bytes.org
|
|
With the page-fault handler in place, he identity mapping can be built
on-demand. So remove the code which manually creates the mappings and
unexport/remove the functions used for it.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-18-joro@8bytes.org
|
|
When booted through startup_64(), the kernel keeps running on the EFI
page table until the KASLR code sets up its own page table. Without
KASLR, the pre-decompression boot code never switches off the EFI page
table. Change that by unconditionally switching to a kernel-controlled
page table after relocation.
This makes sure the kernel can make changes to the mapping when
necessary, for example map pages unencrypted in SEV and SEV-ES guests.
Also, remove the debug_putstr() calls in initialize_identity_maps()
because the function now runs before console_init() is called.
[ bp: Massage commit message. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-17-joro@8bytes.org
|
|
Install a page-fault handler to add an identity mapping to addresses
not yet mapped. Also do some checking whether the error code is sane.
This makes non SEV-ES machines use the exception handling
infrastructure in the pre-decompressions boot code too, making it less
likely to break in the future.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-16-joro@8bytes.org
|
|
The file contains only code related to identity-mapped page tables.
Rename the file and compile it always in.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-15-joro@8bytes.org
|
|
Add code needed to setup an IDT in the early pre-decompression
boot-code. The IDT is loaded first in startup_64, which is after
EfiExitBootServices() has been called, and later reloaded when the
kernel image has been relocated to the end of the decompression area.
This allows to setup different IDT handlers before and after the
relocation.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-14-joro@8bytes.org
|
|
The x86-64 ABI defines a red-zone on the stack:
The 128-byte area beyond the location pointed to by %rsp is considered
to be reserved and shall not be modified by signal or interrupt
handlers. Therefore, functions may use this area for temporary data
that is not needed across function calls. In particular, leaf
functions may use this area for their entire stack frame, rather than
adjusting the stack pointer in the prologue and epilogue. This area is
known as the red zone.
This is not compatible with exception handling, because the IRET frame
written by the hardware at the stack pointer and the functions to handle
the exception will overwrite the temporary variables of the interrupted
function, causing undefined behavior. So disable red-zones for the
pre-decompression boot code.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-13-joro@8bytes.org
|
|
Add a function to check whether an instruction has a REP prefix.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-12-joro@8bytes.org
|
|
Use the shorthand to make it more readable.
No functional changes.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-5-joro@8bytes.org
|
|
Add a function to the instruction decoder which returns the pt_regs
offset of the register specified in the reg field of the modrm byte.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-11-joro@8bytes.org
|
|
Building a correct GHCB for the hypervisor requires setting valid bits
in the GHCB. Simplify that process by providing accessor functions to
set values and to update the valid bitmap and to check the valid bitmap
in KVM.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-4-joro@8bytes.org
|
|
Factor out the code used to decode an instruction with the correct
address and operand sizes to a helper function.
No functional changes.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-10-joro@8bytes.org
|
|
Extend the vmcb_safe_area with SEV-ES fields and add a new
'struct ghcb' which will be used for guest-hypervisor communication.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-3-joro@8bytes.org
|
|
Factor out the code to fetch the instruction from user-space to a helper
function.
No functional changes.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-9-joro@8bytes.org
|
|
Do not allocate a vmcb_control_area and a vmcb_save_area on the stack,
as these structures will become larger with future extenstions of
SVM and thus the svm_set_nested_state() function will become a too large
stack frame.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-2-joro@8bytes.org
|
|
The inat-tables.c file has some arrays in it that contain pointers to
other arrays. These pointers need to be relocated when the kernel
image is moved to a different location.
The pre-decompression boot-code has no support for applying ELF
relocations, so initialize these arrays at runtime in the
pre-decompression code to make sure all pointers are correctly
initialized.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-8-joro@8bytes.org
|
|
Move the definition of the x86 page-fault error code bits to a new
header file asm/trap_pf.h. This makes it easier to include them into
pre-decompression boot code. No functional changes.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-7-joro@8bytes.org
|
|
Add CPU feature detection for Secure Encrypted Virtualization with
Encrypted State. This feature enhances SEV by also encrypting the
guest register state, making it in-accessible to the hypervisor.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-6-joro@8bytes.org
|
|
Pick up work happening in parallel to avoid nasty merge conflicts later.
Signed-off-by: Borislav Petkov <bp@suse.de>
|
|
Signed-off-by: Borislav Petkov <bp@suse.de>
|
|
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.
Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444a4 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix spelling mistake "Could't" -> "Couldn't" in user-visible warning
messages.
[ bp: Massage commit message; s/cpu/CPU/g ]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200810075508.46490-1-colin.king@canonical.com
|
|
...because future AMD systems will support up to 64 MCA banks per CPU.
MAX_NR_BANKS is used to allocate a number of data structures, and it is
used as a ceiling for values read from MCG_CAP[Count]. Therefore, this
change will have no functional effect on existing systems with 32 or
fewer MCA banks per CPU.
However, this will increase the size of the following structures:
Global bitmaps:
- core.c / mce_banks_ce_disabled
- core.c / all_banks
- core.c / valid_banks
- core.c / toclear
- Total: 32 new bits * 4 bitmaps = 16 new bytes
Per-CPU bitmaps:
- core.c / mce_poll_banks
- intel.c / mce_banks_owned
- Total: 32 new bits * 2 bitmaps = 8 new bytes
The bitmaps are arrays of longs. So this change will only affect 32-bit
execution, since there will be one additional long used. There will be
no additional memory use on 64-bit execution, because the size of long
is 64 bits.
Global structs:
- amd.c / struct smca_bank smca_banks[]: 16 bytes per bank
- core.c / struct mce_bank_dev mce_bank_devs[]: 56 bytes per bank
- Total: 32 new banks * (16 + 56) bytes = 2304 new bytes
Per-CPU structs:
- core.c / struct mce_bank mce_banks_array[]: 16 bytes per bank
- Total: 32 new banks * 16 bytes = 512 new bytes
32-bit
Total global size increase: 2320 bytes
Total per-CPU size increase: 520 bytes
64-bit
Total global size increase: 2304 bytes
Total per-CPU size increase: 512 bytes
This additional memory should still fit within the existing .data
section of the kernel binary. However, in the case where it doesn't
fit, an additional page (4kB) of memory will be added to the binary to
accommodate the extra data which will be the maximum size increase of
vmlinux.
Signed-off-by: Akshay Gupta <Akshay.Gupta@amd.com>
[ Adjust commit message and code comment. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200828192412.320052-1-Yazen.Ghannam@amd.com
|
|
Andy reported that the syscall treacing for 32bit fast syscall fails:
# ./tools/testing/selftests/x86/ptrace_syscall_32
...
[RUN] SYSEMU
[FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
...
[RUN] SYSCALL
[FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
The eason is that the conversion to generic entry code moved the retrieval
of the sixth argument (EBP) after the point where the syscall entry work
runs, i.e. ptrace, seccomp, audit...
Unbreak it by providing a split up version of syscall_enter_from_user_mode().
- syscall_enter_from_user_mode_prepare() establishes state and enables
interrupts
- syscall_enter_from_user_mode_work() runs the entry work
Replace the call to syscall_enter_from_user_mode() in the 32bit fast
syscall C-entry with the split functions and stick the EBP retrieval
between them.
Fixes: 27d6b4d14f5c ("x86/entry: Use generic syscall entry function")
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87k0xdjbtt.fsf@nanos.tec.linutronix.de
|
|
Current usage of thread.debugreg6 is convoluted at best. It starts life as
a copy of the hardware DR6 value, but then various bits are cleared and
set.
Replace this with a new variable thread.virtual_dr6 that is initialized to
0 when DR6 is read and only gains bits, at the same time the actual (on
stack) dr6 value which is read from the hardware only gets bits cleared.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.415372940@infradead.org
|
|
DR6 has a whole bunch of bits that have negative polarity; they were
architecturally reserved and defined to be 1 and are now getting used.
Since they're 1 by default, 0 becomes the signal value.
Handle this by xor'ing the read DR6 value by the reserved mask, this
will flip them around such that 1 is the signal value (positive
polarity).
Current Linux doesn't yet support any of these bits, but there's two
defined:
- DR6[11] Bus Lock Debug Exception (ISEr39)
- DR6[16] Restricted Transactional Memory (SDM)
Update ptrace_{set,get}_debugreg() to provide/consume the value in
architectural polarity. Although afaict ptrace_set_debugreg(6) is
pointless, the value is not consumed anywhere.
Change hw_breakpoint_restore() to alway write the DR6_RESERVED value
to DR6, again, no consumer for that write.
Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.354220797@infradead.org
|
|
This is called with interrupts disabled, there's no point in using
get_cpu() and per_cpu().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.292906672@infradead.org
|
|
Unused remnants for the bit-bucket.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.233022474@infradead.org
|
|
Remove the historical junk and replace it with a WARN and a comment.
The problem is that even though the kernel only uses TF single-step in
kprobes and KGDB, both of which consume the event before this, QEMU/KVM has
bugs in this area that can trigger this state so it has to be dealt with.
Suggested-by: Brian Gerst <brgerst@gmail.com>
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.170216274@infradead.org
|
|
The cond_local_irq_enable() block, dealing with vm86 and sending
signals is only relevant for #DB-from-user, move it there.
This then reduces handle_debug() to only the notifier call, so rename
it to notify_debug().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.094265982@infradead.org
|
|
The historical SYSENTER junk is explicitly for from-kernel, so move it
to the #DB-from-kernel handler.
It is ordered after the notifier, which is important for KGDB which uses TF
single-step and needs to consume the event before that point.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.031099736@infradead.org
|
|
There's no point in calculating si_code if it's not going to be used.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133200.967434217@infradead.org
|
|
The handle_debug(.user) argument is used to terminate the #DB handler early
for the INT1-from-kernel case, since the kernel doesn't use INT1.
Remove the argument and handle this explicitly in #DB-from-kernel.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200902133200.907020598@infradead.org
|
|
Kprobes are on kernel text, and thus only matter for #DB-from-kernel.
Kprobes are ordered before the generic notifier, preserve that order.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200902133200.847465360@infradead.org
|
|
Move the BTF sync near the DR6 load, as this will be the only common
code guaranteed to run on every #DB.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200902133200.786888252@infradead.org
|
|
Trying to clear DR7 around a #DB from usermode malfunctions if the tasks
schedules when delivering SIGTRAP.
Rather than trying to define a special no-recursion region, just allow a
single level of recursion. The same mechanism is used for NMI, and it
hasn't caused any problems yet.
Fixes: 9f58fdde95c9 ("x86/db: Split out dr6/7 handling")
Reported-by: Kyle Huey <me@kylehuey.com>
Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/8b9bd05f187231df008d48cf818a6a311cbd5c98.1597882384.git.luto@kernel.org
Link: https://lore.kernel.org/r/20200902133200.726584153@infradead.org
|
|
The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further
improve user entry sanity checks") unconditionally triggers on a IVB
machine because it does not support SMAP.
For !SMAP hardware the CLAC/STAC instructions are patched out and thus if
userspace sets AC, it is still have set after entry.
Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200902133200.666781610@infradead.org
|
|
On i386, the order of parameters passed on regs is eax,edx,and ecx
(as per regparm(3) calling conventions).
Change the mapping in regs_get_kernel_argument(), so that arg1=ax
arg2=dx, and arg3=cx.
Running the selftests testcase kprobes_args_use.tc shows the result
as passed.
Fixes: 3c88ee194c28 ("x86: ptrace: Add function argument access API")
Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200828113242.GA1424@cosmos
|
|
Add MTE-specific SIGSEGV codes to siginfo.h and update the x86
BUILD_BUG_ON(NSIGSEGV != 7) compile check.
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
[catalin.marinas@arm.com: renamed precise/imprecise to sync/async]
[catalin.marinas@arm.com: dropped #ifdef __aarch64__, renumbered]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
|
|
inst.h was included in calling.h solely to instantiate the RDPID macro.
The usage of RDPID was removed in
6a3ea3e68b8a ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM")
so remove the include.
Fixes: 6a3ea3e68b8a ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200827171735.93825-1-ubizjak@gmail.com
|
|
Commit:
cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability")
uses "-1" as the starting node ID, which causes the strange kernel log as
follows, when "numa=fake=32G" is added to the kernel command line:
Faking node -1 at [mem 0x0000000000000000-0x0000000893ffffff] (35136MB)
Faking node 0 at [mem 0x0000001840000000-0x000000203fffffff] (32768MB)
Faking node 1 at [mem 0x0000000894000000-0x000000183fffffff] (64192MB)
Faking node 2 at [mem 0x0000002040000000-0x000000283fffffff] (32768MB)
Faking node 3 at [mem 0x0000002840000000-0x000000303fffffff] (32768MB)
And finally the kernel crashes:
BUG: Bad page state in process swapper pfn:00011
page:(____ptrval____) refcount:0 mapcount:1 mapping:(____ptrval____) index:0x55cd7e44b270 pfn:0x11
failed to read mapping contents, not a valid kernel address?
flags: 0x5(locked|uptodate)
raw: 0000000000000005 000055cd7e44af30 000055cd7e44af50 0000000100000006
raw: 000055cd7e44b270 000055cd7e44b290 0000000000000000 000055cd7e44b510
page dumped because: page still charged to cgroup
page->mem_cgroup:000055cd7e44b510
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1
Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
Call Trace:
dump_stack+0x57/0x80
bad_page.cold+0x63/0x94
__free_pages_ok+0x33f/0x360
memblock_free_all+0x127/0x195
mem_init+0x23/0x1f5
start_kernel+0x219/0x4f5
secondary_startup_64+0xb6/0xc0
Fix this bug via using 0 as the starting node ID. This restores the
original behavior before cc9aec03e58f.
[ mingo: Massaged the changelog. ]
Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200904061047.612950-1-ying.huang@intel.com
|
|
XORL %0,%0 is equivalent to XORQ %0,%0 as both will zero the entire
register. Use XORL %0,%0 for all operand sizes to avoid REX prefix byte
when legacy registers are used and to avoid size prefix byte when 16bit
registers are used.
Zeroing the full register is OK in this use case.
As a result, the size of the .fixup section decreases by 20 bytes.
[ bp: Massage commit message. ]
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20200827180904.96399-1-ubizjak@gmail.com
|
|
We found that callers of dma_get_seg_boundary mostly do an ALIGN
with page mask and then do a page shift to get number of pages:
ALIGN(boundary + 1, 1 << shift) >> shift
However, the boundary might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1" or
passing it to ALIGN() would potentially overflow.
According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)
We can simplify the logic here into a helper function doing:
ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1
This patch introduces and applies dma_get_seg_boundary_nr_pages()
as an overflow-free helper for the dma_get_seg_boundary() callers
to get numbers of pages. It also takes care of the NULL dev case
for non-DMA API callers.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Acked-by: Niklas Schnelle <schnelle@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
One can not simply remove vmalloc faulting on x86-32. Upstream
commit: 7f0a002b5a21 ("x86/mm: remove vmalloc faulting")
removed it on x86 alltogether because previously the
arch_sync_kernel_mappings() interface was introduced. This interface
added synchronization of vmalloc/ioremap page-table updates to all
page-tables in the system at creation time and was thought to make
vmalloc faulting obsolete.
But that assumption was incredibly naive.
It turned out that there is a race window between the time the vmalloc
or ioremap code establishes a mapping and the time it synchronizes
this change to other page-tables in the system.
During this race window another CPU or thread can establish a vmalloc
mapping which uses the same intermediate page-table entries (e.g. PMD
or PUD) and does no synchronization in the end, because it found all
necessary mappings already present in the kernel reference page-table.
But when these intermediate page-table entries are not yet
synchronized, the other CPU or thread will continue with a vmalloc
address that is not yet mapped in the page-table it currently uses,
causing an unhandled page fault and oops like below:
BUG: unable to handle page fault for address: fe80c000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pde = 33183067 *pte = a8648163
Oops: 0002 [#1] SMP
CPU: 1 PID: 13514 Comm: cve-2017-17053 Tainted: G
...
Call Trace:
ldt_dup_context+0x66/0x80
dup_mm+0x2b3/0x480
copy_process+0x133b/0x15c0
_do_fork+0x94/0x3e0
__ia32_sys_clone+0x67/0x80
__do_fast_syscall_32+0x3f/0x70
do_fast_syscall_32+0x29/0x60
do_SYSENTER_32+0x15/0x20
entry_SYSENTER_32+0x9f/0xf2
EIP: 0xb7eef549
So the arch_sync_kernel_mappings() interface is racy, but removing it
would mean to re-introduce the vmalloc_sync_all() interface, which is
even more awful. Keep arch_sync_kernel_mappings() in place and catch
the race condition in the page-fault handler instead.
Do a partial revert of above commit to get vmalloc faulting on x86-32
back in place.
Fixes: 7f0a002b5a21 ("x86/mm: remove vmalloc faulting")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200902155904.17544-1-joro@8bytes.org
|
|
When CONFIG_RETPOLINE is disabled, Clang uses a jump table for the
switch statement in cmdline_find_option (jump tables are disabled when
CONFIG_RETPOLINE is enabled). This function is called very early in boot
from sme_enable() if CONFIG_AMD_MEM_ENCRYPT is enabled. At this time,
the kernel is still executing out of the identity mapping, but the jump
table will contain virtual addresses.
Fix this by disabling jump tables for cmdline.c when AMD_MEM_ENCRYPT is
enabled.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200903023056.3914690-1-nivedita@alum.mit.edu
|
|
We don't want to depend on the linker's orphan section placement
heuristics as these can vary between linkers, and may change between
versions. All sections need to be explicitly handled in the linker
script.
Now that all sections are explicitly handled, enable orphan section
warnings.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20200902025347.2504702-6-keescook@chromium.org
|
|
We don't want to depend on the linker's orphan section placement
heuristics as these can vary between linkers, and may change between
versions. All sections need to be explicitly handled in the linker script.
Now that all sections are explicitly handled, enable orphan section
warnings.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20200902025347.2504702-5-keescook@chromium.org
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-09-01
The following pull-request contains BPF updates for your *net-next* tree.
There are two small conflicts when pulling, resolve as follows:
1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120282b ("libbpf: Factor
out common ELF operations and improve logging") in bpf-next and 1e891e513e16
("libbpf: Fix map index used in error message") in net-next. Resolve by taking
the hunk in bpf-next:
[...]
scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
data = elf_sec_data(obj, scn);
if (!scn || !data) {
pr_warn("elf: failed to get %s map definitions for %s\n",
MAPS_ELF_SEC, obj->path);
return -EINVAL;
}
[...]
2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
9647c57b11e5 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
better performance") in bpf-next and e20f0dbf204f ("net/mlx5e: RX, Add a prefetch
command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:
[...]
xdp_set_data_meta_invalid(xdp);
xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
net_prefetch(xdp->data);
[...]
We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).
The main changes are:
1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
for tracing to reliably access user memory, from Alexei Starovoitov.
2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.
3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.
4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.
5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.
6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.
7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.
8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.
9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.
10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.
11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.
12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.
13) Various smaller cleanups and improvements all over the place.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|