From afc18069a2cb7ead5f86623a5f3d4ad6e21f940d Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Wed, 14 Oct 2020 17:24:28 +0800 Subject: x86/kexec: Use up-to-dated screen_info copy to fill boot params kexec_file_load() currently reuses the old boot_params.screen_info, but if drivers have change the hardware state, boot_param.screen_info could contain invalid info. For example, the video type might be no longer VGA, or the frame buffer address might be changed. If the kexec kernel keeps using the old screen_info, kexec'ed kernel may attempt to write to an invalid framebuffer memory region. There are two screen_info instances globally available, boot_params.screen_info and screen_info. Later one is a copy, and is updated by drivers. So let kexec_file_load use the updated copy. [ mingo: Tidied up the changelog. ] Signed-off-by: Kairui Song Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20201014092429.1415040-2-kasong@redhat.com --- arch/x86/kernel/kexec-bzimage64.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c index 57c2ecf43134..ce831f9448e7 100644 --- a/arch/x86/kernel/kexec-bzimage64.c +++ b/arch/x86/kernel/kexec-bzimage64.c @@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params, params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch; /* Copying screen_info will do? */ - memcpy(¶ms->screen_info, &boot_params.screen_info, - sizeof(struct screen_info)); + memcpy(¶ms->screen_info, &screen_info, sizeof(struct screen_info)); /* Fill in memsize later */ params->screen_info.ext_mem_k = 0; -- cgit From f2ac57a4c49d40409c21c82d23b5706df9b438af Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Wed, 14 Oct 2020 07:30:51 +0200 Subject: x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels GCC 10 optimizes the scheduler code differently than its predecessors. When CONFIG_DEBUG_SECTION_MISMATCH=y, the Makefile forces GCC not to inline some functions (-fno-inline-functions-called-once). Before GCC 10, "no-inlined" __schedule() starts with the usual prologue: push %bp mov %sp, %bp So the ORC unwinder simply picks stack pointer from %bp and unwinds from __schedule() just perfectly: $ cat /proc/1/stack [<0>] ep_poll+0x3e9/0x450 [<0>] do_epoll_wait+0xaa/0xc0 [<0>] __x64_sys_epoll_wait+0x1a/0x20 [<0>] do_syscall_64+0x33/0x40 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 But now, with GCC 10, there is no %bp prologue in __schedule(): $ cat /proc/1/stack The ORC entry of the point in __schedule() is: sp:sp+88 bp:last_sp-48 type:call end:0 In this case, nobody subtracts sizeof "struct inactive_task_frame" in __unwind_start(). The struct is put on the stack by __switch_to_asm() and only then __switch_to_asm() stores %sp to task->thread.sp. But we start unwinding from a point in __schedule() (stored in frame->ret_addr by 'call') and not in __switch_to_asm(). So for these example values in __unwind_start(): sp=ffff94b50001fdc8 bp=ffff8e1f41d29340 ip=__schedule+0x1f0 The stack is: ffff94b50001fdc8: ffff8e1f41578000 # struct inactive_task_frame ffff94b50001fdd0: 0000000000000000 ffff94b50001fdd8: ffff8e1f41d29340 ffff94b50001fde0: ffff8e1f41611d40 # ... ffff94b50001fde8: ffffffff93c41920 # bx ffff94b50001fdf0: ffff8e1f41d29340 # bp ffff94b50001fdf8: ffffffff9376cad0 # ret_addr (and end of the struct) 0xffffffff9376cad0 is __schedule+0x1f0 (after the call to __switch_to_asm). Now follow those 88 bytes from the ORC entry (sp+88). The entry is correct, __schedule() really pushes 48 bytes (8*7) + 32 bytes via subq to store some local values (like 4U below). So to unwind, look at the offset 88-sizeof(long) = 0x50 from here: ffff94b50001fe00: ffff8e1f41578618 ffff94b50001fe08: 00000cc000000255 ffff94b50001fe10: 0000000500000004 ffff94b50001fe18: 7793fab6956b2d00 # NOTE (see below) ffff94b50001fe20: ffff8e1f41578000 ffff94b50001fe28: ffff8e1f41578000 ffff94b50001fe30: ffff8e1f41578000 ffff94b50001fe38: ffff8e1f41578000 ffff94b50001fe40: ffff94b50001fed8 ffff94b50001fe48: ffff8e1f41577ff0 ffff94b50001fe50: ffffffff9376cf12 Here ^^^^^^^^^^^^^^^^ is the correct ret addr from __schedule(). It translates to schedule+0x42 (insn after a call to __schedule()). BUT, unwind_next_frame() tries to take the address starting from 0xffff94b50001fdc8. That is exactly from thread.sp+88-sizeof(long) = 0xffff94b50001fdc8+88-8 = 0xffff94b50001fe18, which is garbage marked as NOTE above. So this quits the unwinding as 7793fab6956b2d00 is obviously not a kernel address. There was a fix to skip 'struct inactive_task_frame' in unwind_get_return_address_ptr in the following commit: 187b96db5ca7 ("x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasks") But we need to skip the struct already in the unwinder proper. So subtract the size (increase the stack pointer) of the structure in __unwind_start() directly. This allows for removal of the code added by commit 187b96db5ca7 completely, as the address is now at '(unsigned long *)state->sp - 1', the same as in the generic case. [ mingo: Cleaned up the changelog a bit, for better readability. ] Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Bug: https://bugzilla.suse.com/show_bug.cgi?id=1176907 Signed-off-by: Jiri Slaby Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20201014053051.24199-1-jslaby@suse.cz --- arch/x86/kernel/unwind_orc.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c index ec88bbe08a32..4a96aa3de7d8 100644 --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -320,19 +320,12 @@ EXPORT_SYMBOL_GPL(unwind_get_return_address); unsigned long *unwind_get_return_address_ptr(struct unwind_state *state) { - struct task_struct *task = state->task; - if (unwind_done(state)) return NULL; if (state->regs) return &state->regs->ip; - if (task != current && state->sp == task->thread.sp) { - struct inactive_task_frame *frame = (void *)task->thread.sp; - return &frame->ret_addr; - } - if (state->sp) return (unsigned long *)state->sp - 1; @@ -662,7 +655,7 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task, } else { struct inactive_task_frame *frame = (void *)task->thread.sp; - state->sp = task->thread.sp; + state->sp = task->thread.sp + sizeof(*frame); state->bp = READ_ONCE_NOCHECK(frame->bp); state->ip = READ_ONCE_NOCHECK(frame->ret_addr); state->signal = (void *)state->ip == ret_from_fork; -- cgit From abee7c494d8c41bb388839bccc47e06247f0d7de Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Fri, 9 Oct 2020 16:42:25 +0200 Subject: x86/alternative: Don't call text_poke() in lazy TLB mode When running in lazy TLB mode the currently active page tables might be the ones of a previous process, e.g. when running a kernel thread. This can be problematic in case kernel code is being modified via text_poke() in a kernel thread, and on another processor exit_mmap() is active for the process which was running on the first cpu before the kernel thread. As text_poke() is using a temporary address space and the former address space (obtained via cpu_tlbstate.loaded_mm) is restored afterwards, there is a race possible in case the cpu on which exit_mmap() is running wants to make sure there are no stale references to that address space on any cpu active (this e.g. is required when running as a Xen PV guest, where this problem has been observed and analyzed). In order to avoid that, drop off TLB lazy mode before switching to the temporary address space. Fixes: cefa929c034eb5d ("x86/mm: Introduce temporary mm structs") Signed-off-by: Juergen Gross Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20201009144225.12019-1-jgross@suse.com --- arch/x86/kernel/alternative.c | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index cdaab30880b9..cd6be6f143e8 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -807,6 +807,15 @@ static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm) temp_mm_state_t temp_state; lockdep_assert_irqs_disabled(); + + /* + * Make sure not to be in TLB lazy mode, as otherwise we'll end up + * with a stale address space WITHOUT being in lazy mode after + * restoring the previous mm. + */ + if (this_cpu_read(cpu_tlbstate.is_lazy)) + leave_mm(smp_processor_id()); + temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm); switch_mm_irqs_off(NULL, mm, current); -- cgit From 2a9baf5ad4884108b3c6d56a50e8105ccf8a4ee7 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 27 Oct 2020 10:15:05 +0100 Subject: x86/debug: Fix BTF handling The SDM states that #DB clears DEBUGCTLMSR_BTF, this means that when the bit is set for userspace (TIF_BLOCKSTEP) and a kernel #DB happens first, the BTF bit meant for userspace execution is lost. Have the kernel #DB handler restore the BTF bit when it was requested for userspace. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Tested-by: Kyle Huey Link: https://lore.kernel.org/r/20201027093607.956147736@infradead.org --- arch/x86/kernel/traps.c | 28 +++++++++++++++++++++------- 1 file changed, 21 insertions(+), 7 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 3c70fb34028b..b5208aa7351f 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -799,13 +799,6 @@ static __always_inline unsigned long debug_read_clear_dr6(void) */ current->thread.virtual_dr6 = 0; - /* - * The SDM says "The processor clears the BTF flag when it - * generates a debug exception." Clear TIF_BLOCKSTEP to keep - * TIF_BLOCKSTEP in sync with the hardware BTF flag. - */ - clear_thread_flag(TIF_BLOCKSTEP); - return dr6; } @@ -873,6 +866,20 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs, */ WARN_ON_ONCE(user_mode(regs)); + if (test_thread_flag(TIF_BLOCKSTEP)) { + /* + * The SDM says "The processor clears the BTF flag when it + * generates a debug exception." but PTRACE_BLOCKSTEP requested + * it for userspace, but we just took a kernel #DB, so re-set + * BTF. + */ + unsigned long debugctl; + + rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl); + debugctl |= DEBUGCTLMSR_BTF; + wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl); + } + /* * Catch SYSENTER with TF set and clear DR_STEP. If this hit a * watchpoint at the same time then that will still be handled. @@ -935,6 +942,13 @@ static __always_inline void exc_debug_user(struct pt_regs *regs, irqentry_enter_from_user_mode(regs); instrumentation_begin(); + /* + * The SDM says "The processor clears the BTF flag when it + * generates a debug exception." Clear TIF_BLOCKSTEP to keep + * TIF_BLOCKSTEP in sync with the hardware BTF flag. + */ + clear_thread_flag(TIF_BLOCKSTEP); + /* * If dr6 has no reason to give us about the origin of this trap, * then it's very likely the result of an icebp/int01 trap. -- cgit From a195f3d4528a2f88d6f986f6b1101775ad4891cf Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 27 Oct 2020 10:15:06 +0100 Subject: x86/debug: Only clear/set ->virtual_dr6 for userspace #DB The ->virtual_dr6 is the value used by ptrace_{get,set}_debugreg(6). A kernel #DB clearing it could mean spurious malfunction of ptrace() expectations. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Tested-by: Kyle Huey Link: https://lore.kernel.org/r/20201027093608.028952500@infradead.org --- arch/x86/kernel/traps.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index b5208aa7351f..32b2d39272e0 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -793,12 +793,6 @@ static __always_inline unsigned long debug_read_clear_dr6(void) set_debugreg(DR6_RESERVED, 6); dr6 ^= DR6_RESERVED; /* Flip to positive polarity */ - /* - * Clear the virtual DR6 value, ptrace routines will set bits here for - * things we want signals for. - */ - current->thread.virtual_dr6 = 0; - return dr6; } @@ -942,6 +936,12 @@ static __always_inline void exc_debug_user(struct pt_regs *regs, irqentry_enter_from_user_mode(regs); instrumentation_begin(); + /* + * Clear the virtual DR6 value, ptrace() routines will set bits here + * for things it wants signals for. + */ + current->thread.virtual_dr6 = 0; + /* * The SDM says "The processor clears the BTF flag when it * generates a debug exception." Clear TIF_BLOCKSTEP to keep -- cgit From cb05143bdf428f280a5d519c82abf196d7871c11 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 27 Oct 2020 19:33:30 +0100 Subject: x86/debug: Fix DR_STEP vs ptrace_get_debugreg(6) Commit d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6") changed the semantics of the variable from random collection of bits, to exactly only those bits that ptrace() needs. Unfortunately this lost DR_STEP for PTRACE_{BLOCK,SINGLE}STEP. Furthermore, it turns out that userspace expects DR_STEP to be unconditionally available, even for manual TF usage outside of PTRACE_{BLOCK,SINGLE}_STEP. Fixes: d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6") Reported-by: Kyle Huey Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Tested-by: Kyle Huey Link: https://lore.kernel.org/r/20201027183330.GM2628@hirez.programming.kicks-ass.net --- arch/x86/kernel/traps.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'arch/x86/kernel') diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 32b2d39272e0..e19df6cde35d 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -937,10 +937,13 @@ static __always_inline void exc_debug_user(struct pt_regs *regs, instrumentation_begin(); /* - * Clear the virtual DR6 value, ptrace() routines will set bits here - * for things it wants signals for. + * Start the virtual/ptrace DR6 value with just the DR_STEP mask + * of the real DR6. ptrace_triggered() will set the DR_TRAPn bits. + * + * Userspace expects DR_STEP to be visible in ptrace_get_debugreg(6) + * even if it is not the result of PTRACE_SINGLESTEP. */ - current->thread.virtual_dr6 = 0; + current->thread.virtual_dr6 = (dr6 & DR_STEP); /* * The SDM says "The processor clears the BTF flag when it -- cgit