summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2024-01-05x86/crash: use SZ_1M macro instead of hardcoded valueYuntao Wang
Use SZ_1M macro instead of hardcoded 1<<20 to make code more readable. Link: https://lkml.kernel.org/r/20240102144905.110047-3-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05x86/crash: remove the unused image parameter from prepare_elf_headers()Yuntao Wang
Patch series "crash: Some cleanups and fixes", v2. This patchset includes two cleanups and one fix. This patch (of 3): The image parameter is no longer in use, remove it. Also, tidy up the code formatting. Link: https://lkml.kernel.org/r/20240102144905.110047-1-ytcoode@gmail.com Link: https://lkml.kernel.org/r/20240102144905.110047-2-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c e009b2efb7a8 ("bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()") 0f2b21477988 ("bnxt_en: Fix compile error without CONFIG_RFS_ACCEL") https://lore.kernel.org/all/20240105115509.225aa8a2@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-04x86/kprobes: fix incorrect return address calculation in ↵Jinghao Jia
kprobe_emulate_call_indirect kprobe_emulate_call_indirect currently uses int3_emulate_call to emulate indirect calls. However, int3_emulate_call always assumes the size of the call to be 5 bytes when calculating the return address. This is incorrect for register-based indirect calls in x86, which can be either 2 or 3 bytes depending on whether REX prefix is used. At kprobe runtime, the incorrect return address causes control flow to land onto the wrong place after return -- possibly not a valid instruction boundary. This can lead to a panic like the following: [ 7.308204][ C1] BUG: unable to handle page fault for address: 000000000002b4d8 [ 7.308883][ C1] #PF: supervisor read access in kernel mode [ 7.309168][ C1] #PF: error_code(0x0000) - not-present page [ 7.309461][ C1] PGD 0 P4D 0 [ 7.309652][ C1] Oops: 0000 [#1] SMP [ 7.309929][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.7.0-rc5-trace-for-next #6 [ 7.310397][ C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 7.311068][ C1] RIP: 0010:__common_interrupt+0x52/0xc0 [ 7.311349][ C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3 [ 7.312512][ C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046 [ 7.312899][ C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001 [ 7.313334][ C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4 [ 7.313702][ C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482 [ 7.314146][ C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023 [ 7.314509][ C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000 [ 7.314951][ C1] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000 [ 7.315396][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7.315691][ C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0 [ 7.316153][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7.316508][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 7.316948][ C1] Call Trace: [ 7.317123][ C1] <IRQ> [ 7.317279][ C1] ? __die_body+0x64/0xb0 [ 7.317482][ C1] ? page_fault_oops+0x248/0x370 [ 7.317712][ C1] ? __wake_up+0x96/0xb0 [ 7.317964][ C1] ? exc_page_fault+0x62/0x130 [ 7.318211][ C1] ? asm_exc_page_fault+0x22/0x30 [ 7.318444][ C1] ? __cfi_native_send_call_func_single_ipi+0x10/0x10 [ 7.318860][ C1] ? default_idle+0xb/0x10 [ 7.319063][ C1] ? __common_interrupt+0x52/0xc0 [ 7.319330][ C1] common_interrupt+0x78/0x90 [ 7.319546][ C1] </IRQ> [ 7.319679][ C1] <TASK> [ 7.319854][ C1] asm_common_interrupt+0x22/0x40 [ 7.320082][ C1] RIP: 0010:default_idle+0xb/0x10 [ 7.320309][ C1] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 66 90 0f 00 2d 09 b9 3b 00 fb f4 <fa> c3 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 e9 [ 7.321449][ C1] RSP: 0018:ffffc9000009bee8 EFLAGS: 00000256 [ 7.321808][ C1] RAX: ffff88813bca8b68 RBX: 0000000000000001 RCX: 000000000001ef0c [ 7.322227][ C1] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000001ef0c [ 7.322656][ C1] RBP: ffffc9000009bef8 R08: 8000000000000000 R09: 00000000000008c2 [ 7.323083][ C1] R10: 0000000000000000 R11: ffffffff81058e70 R12: 0000000000000000 [ 7.323530][ C1] R13: ffff8881002b30c0 R14: 0000000000000000 R15: 0000000000000000 [ 7.323948][ C1] ? __cfi_lapic_next_deadline+0x10/0x10 [ 7.324239][ C1] default_idle_call+0x31/0x50 [ 7.324464][ C1] do_idle+0xd3/0x240 [ 7.324690][ C1] cpu_startup_entry+0x25/0x30 [ 7.324983][ C1] start_secondary+0xb4/0xc0 [ 7.325217][ C1] secondary_startup_64_no_verify+0x179/0x17b [ 7.325498][ C1] </TASK> [ 7.325641][ C1] Modules linked in: [ 7.325906][ C1] CR2: 000000000002b4d8 [ 7.326104][ C1] ---[ end trace 0000000000000000 ]--- [ 7.326354][ C1] RIP: 0010:__common_interrupt+0x52/0xc0 [ 7.326614][ C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3 [ 7.327570][ C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046 [ 7.327910][ C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001 [ 7.328273][ C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4 [ 7.328632][ C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482 [ 7.329223][ C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023 [ 7.329780][ C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000 [ 7.330193][ C1] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000 [ 7.330632][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7.331050][ C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0 [ 7.331454][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7.331854][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 7.332236][ C1] Kernel panic - not syncing: Fatal exception in interrupt [ 7.332730][ C1] Kernel Offset: disabled [ 7.333044][ C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- The relevant assembly code is (from objdump, faulting address highlighted): ffffffff8102ed9d: 41 ff d3 call *%r11 ffffffff8102eda0: 65 48 <8b> 05 30 c7 ff mov %gs:0x7effc730(%rip),%rax The emulation incorrectly sets the return address to be ffffffff8102ed9d + 0x5 = ffffffff8102eda2, which is the 8b byte in the middle of the next mov. This in turn causes incorrect subsequent instruction decoding and eventually triggers the page fault above. Instead of invoking int3_emulate_call, perform push and jmp emulation directly in kprobe_emulate_call_indirect. At this point we can obtain the instruction size from p->ainsn.size so that we can calculate the correct return address. Link: https://lore.kernel.org/all/20240102233345.385475-1-jinghao7@illinois.edu/ Fixes: 6256e668b7af ("x86/kprobes: Use int3 instead of debug trap for single-step") Cc: stable@vger.kernel.org Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-01-03arch/x86: Fix typosBjorn Helgaas
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2024-01-03Merge branches 'apple/dart', 'arm/rockchip', 'arm/smmu', 'virtio', ↵Joerg Roedel
'x86/vt-d', 'x86/amd' and 'core' into next
2023-12-30x86/alternative: Correct feature bit debug outputBorislav Petkov (AMD)
In https://lore.kernel.org/r/20231206110636.GBZXBVvCWj2IDjVk4c@fat_crate.local I wanted to adjust the alternative patching debug output to the new changes introduced by da0fe6e68e10 ("x86/alternative: Add indirect call patching") but removed the '*' which denotes the ->x86_capability word. The correct output should be, for example: [ 0.230071] SMP alternatives: feat: 11*32+15, old: (entry_SYSCALL_64_after_hwframe+0x5a/0x77 (ffffffff81c000c2) len: 16), repl: (ffffffff89ae896a, len: 5) flags: 0x0 while the incorrect one says "... 1132+15" currently. Add back the '*'. Fixes: da0fe6e68e10 ("x86/alternative: Add indirect call patching") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231206110636.GBZXBVvCWj2IDjVk4c@fat_crate.local
2023-12-29x86/kexec: fix incorrect end address passed to kernel_ident_mapping_init()Yuntao Wang
kernel_ident_mapping_init() takes an exclusive memory range [pstart, pend) where pend is not included in the range, while res represents an inclusive memory range [start, end] where end is considered part of the range. Passing [start, end] rather than [start, end+1) to kernel_ident_mapping_init() may result in the identity mapping for the end address not being set up. For example, when res->start is equal to res->end, kernel_ident_mapping_init() will not establish any identity mapping. Similarly, when the value of res->end is a multiple of 2M and the page table maps 2M pages, kernel_ident_mapping_init() will also not set up identity mapping for res->end. Therefore, passing res->end directly to kernel_ident_mapping_init() is incorrect, the correct end address should be `res->end + 1`. Link: https://lkml.kernel.org/r/20231221101702.20956-1-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29x86/kexec: fix incorrect argument passed to kexec_dprintk()Yuntao Wang
kexec_dprintk() expects the last argument to be kbuf.memsz, but the actual argument being passed is kbuf.bufsz. Although these two values are currently equal, it is better to pass the correct one, in case these two values become different in the future. Link: https://lkml.kernel.org/r/20231220154105.215610-1-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29x86/kexec: use pr_err() instead of kexec_dprintk() when an error occursYuntao Wang
When detecting an error, the current code uses kexec_dprintk() to output log message. This is not quite appropriate as kexec_dprintk() is mainly used for outputting debugging messages, rather than error messages. Replace kexec_dprintk() with pr_err(). This also makes the output method for this error log align with the output method for other error logs in this function. Additionally, the last return statement in set_page_address() is unnecessary, remove it. Link: https://lkml.kernel.org/r/20231220030124.149160-1-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-27rseq: Split out rseq.h from sched.hKent Overstreet
We're trying to get sched.h down to more or less just types only, not code - rseq can live in its own header. This helps us kill the dependency on preempt.h in sched.h. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20x86/kexec: simplify the logic of mem_region_callback()Yuntao Wang
The expression `mstart + resource_size(res) - 1` is actually equivalent to `res->end`, simplify the logic of this function to improve readability. Link: https://lkml.kernel.org/r/20231212150506.31711-1-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20kexec_file, x86: print out debugging message if requiredBaoquan He
Then when specifying '-d' for kexec_file_load interface, loaded locations of kernel/initrd/cmdline etc can be printed out to help debug. Here replace pr_debug() with the newly added kexec_dprintk() in kexec_file loading related codes. And also print out e820 memmap passed to 2nd kernel just as kexec_load interface has been doing. Link: https://lkml.kernel.org/r/20231213055747.61826-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Conor Dooley <conor@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20x86: fix missing includes/forward declarationsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20x86/asm: Add DB flag to 32-bit percpu GDT entryVegard Nossum
The D/B size flag for the 32-bit percpu GDT entry was not set. The Intel manual (vol 3, section 3.4.5) only specifies the meaning of this flag for three cases: 1) code segments used for %cs -- doesn't apply here 2) stack segments used for %ss -- doesn't apply 3) expand-down data segments -- but we don't have the expand-down flag set, so it also doesn't apply here The flag likely doesn't do anything here, although the manual does also say: "This flag should always be set to 1 for 32-bit code and data segments [...]" so we should probably do it anyway. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231219151200.2878271-6-vegard.nossum@oracle.com
2023-12-20x86/asm: Always set A (accessed) flag in GDT descriptorsVegard Nossum
We have no known use for having the CPU track whether GDT descriptors have been accessed or not. Simplify the code by adding the flag to the common flags and removing it everywhere else. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231219151200.2878271-5-vegard.nossum@oracle.com
2023-12-20x86/asm: Replace magic numbers in GDT descriptors, script-generated changeVegard Nossum
Actually replace the numeric values by the new symbolic values. I used this to find all the existing users of the GDT_ENTRY*() macros: $ git grep -P 'GDT_ENTRY(_INIT)?\(' Some of the lines will exceed 80 characters, but some of them will be shorter again in the next couple of patches. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231219151200.2878271-4-vegard.nossum@oracle.com
2023-12-20x86/asm: Replace magic numbers in GDT descriptors, preparationsVegard Nossum
We'd like to replace all the magic numbers in various GDT descriptors with new, semantically meaningful, symbolic values. In order to be able to verify that the change doesn't cause any actual changes to the compiled binary code, I've split the change into two patches: - Part 1 (this commit): everything _but_ actually replacing the numbers - Part 2 (the following commit): _only_ replacing the numbers The reason we need this split for verification is that including new headers causes some spurious changes to the object files, mostly line number changes in the debug info but occasionally other subtle codegen changes. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231219151200.2878271-3-vegard.nossum@oracle.com
2023-12-19x86/alternatives: Move apply_relocation() out of init sectionArnd Bergmann
This function is now called from a few places that are no __init_or_module, resulting a link time warning: WARNING: modpost: vmlinux: section mismatch in reference: patch_dest+0x8a (section: .text) -> apply_relocation (section: .init.text) Remove the annotation here. [ mingo: Also sync up add_nop() with these changes. ] Fixes: 17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20231204072856.1033621-1-arnd@kernel.org
2023-12-18Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-12-18 This PR is larger than usual and contains changes in various parts of the kernel. The main changes are: 1) Fix kCFI bugs in BPF, from Peter Zijlstra. End result: all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. 2) Introduce BPF token object, from Andrii Nakryiko. It adds an ability to delegate a subset of BPF features from privileged daemon (e.g., systemd) through special mount options for userns-bound BPF FS to a trusted unprivileged application. The design accommodates suggestions from Christian Brauner and Paul Moore. Example: $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp 3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei. - Complete precision tracking support for register spills - Fix verification of possibly-zero-sized stack accesses - Fix access to uninit stack slots - Track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs. - Fix verifier retval logic 4) Support for VLAN tag in XDP hints, from Larysa Zaremba. 5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu. End result: better memory utilization and lower I$ miss for calls to BPF via BPF trampoline. 6) Fix race between BPF prog accessing inner map and parallel delete, from Hou Tao. 7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu. It allows BPF interact with IPSEC infra. The intent is to support software RSS (via XDP) for the upcoming ipsec pcpu work. Experiments on AWS demonstrate single tunnel pcpu ipsec reaching line rate on 100G ENA nics. 8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao. 9) BPF file verification via fsverity, from Song Liu. It allows BPF progs get fsverity digest. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits) bpf: Ensure precise is reset to false in __mark_reg_const_zero() selftests/bpf: Add more uprobe multi fail tests bpf: Fail uprobe multi link with negative offset selftests/bpf: Test the release of map btf s390/bpf: Fix indirect trampoline generation selftests/bpf: Temporarily disable dummy_struct_ops test on s390 x86/cfi,bpf: Fix bpf_exception_cb() signature bpf: Fix dtor CFI cfi: Add CFI_NOSEAL() x86/cfi,bpf: Fix bpf_struct_ops CFI x86/cfi,bpf: Fix bpf_callback_t CFI x86/cfi,bpf: Fix BPF JIT call cfi: Flip headers selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment bpf: Limit the number of kprobes when attaching program to multiple kprobes bpf: Limit the number of uprobes when attaching program to multiple uprobes bpf: xdp: Register generic_kfunc_set with XDP programs selftests/bpf: utilize string values for delegate_xxx mount options ... ==================== Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-18x86/acpi: Handle bogus MADT APIC tables gracefullyThomas Gleixner
The recent fix to ignore invalid x2APIC entries inadvertently broke systems with creative MADT APIC tables. The affected systems have APIC MADT tables where all entries have invalid APIC IDs (0xFF), which means they register exactly zero CPUs. But the condition to ignore the entries of APIC IDs < 255 in the X2APIC MADT table is solely based on the count of MADT APIC table entries. As a consequence, the affected machines enumerate no secondary CPUs at all because the APIC table has entries and therefore the X2APIC table entries with APIC IDs < 255 are ignored. Change the condition so that the APIC table preference for APIC IDs < 255 only becomes effective when the APIC table has valid APIC ID entries. IOW, an APIC table full of invalid APIC IDs is considered to be empty which in consequence enables the X2APIC table entries with a APIC ID < 255 and restores the expected behaviour. Fixes: ec9aedb2aa1a ("x86/acpi: Ignore invalid x2APIC entries") Reported-by: John Sperbeck <jsperbeck@google.com> Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/169953729188.3135.6804572126118798018.tip-bot2@tip-bot2
2023-12-17rtc: Extend timeout for waiting for UIP to clear to 1sMario Limonciello
Specs don't say anything about UIP being cleared within 10ms. They only say that UIP won't occur for another 244uS. If a long NMI occurs while UIP is still updating it might not be possible to get valid data in 10ms. This has been observed in the wild that around s2idle some calls can take up to 480ms before UIP is clear. Adjust callers from outside an interrupt context to wait for up to a 1s instead of 10ms. Cc: <stable@vger.kernel.org> # 6.1.y Fixes: ec5895c0f2d8 ("rtc: mc146818-lib: extract mc146818_avoid_UIP") Reported-by: Carsten Hatger <xmb8dsv4@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217626 Tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Reviewed-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Acked-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20231128053653.101798-5-mario.limonciello@amd.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2023-12-17rtc: Add support for configuring the UIP timeout for RTC readsMario Limonciello
The UIP timeout is hardcoded to 10ms for all RTC reads, but in some contexts this might not be enough time. Add a timeout parameter to mc146818_get_time() and mc146818_get_time_callback(). If UIP timeout is configured by caller to be >=100 ms and a call takes this long, log a warning. Make all callers use 10ms to ensure no functional changes. Cc: <stable@vger.kernel.org> # 6.1.y Fixes: ec5895c0f2d8 ("rtc: mc146818-lib: extract mc146818_avoid_UIP") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Reviewed-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Acked-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Link: https://lore.kernel.org/r/20231128053653.101798-4-mario.limonciello@amd.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2023-12-15x86/cfi,bpf: Fix bpf_struct_ops CFIPeter Zijlstra
BPF struct_ops uses __arch_prepare_bpf_trampoline() to write trampolines for indirect function calls. These tramplines much have matching CFI. In order to obtain the correct CFI hash for the various methods, add a matching structure that contains stub functions, the compiler will generate correct CFI which we can pilfer for the trampolines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15x86/cfi,bpf: Fix bpf_callback_t CFIPeter Zijlstra
Where the main BPF program is expected to match bpf_func_t, sub-programs are expected to match bpf_callback_t. This fixes things like: tools/testing/selftests/bpf/progs/bloom_filter_bench.c: bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0); Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.451956710@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15x86/cfi,bpf: Fix BPF JIT callPeter Zijlstra
The current BPF call convention is __nocfi, except when it calls !JIT things, then it calls regular C functions. It so happens that with FineIBT the __nocfi and C calling conventions are incompatible. Specifically __nocfi will call at func+0, while FineIBT will have endbr-poison there, which is not a valid indirect target. Causing #CP. Notably this only triggers on IBT enabled hardware, which is probably why this hasn't been reported (also, most people will have JIT on anyway). Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for x86. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15cfi: Flip headersPeter Zijlstra
Normal include order is that linux/foo.h should include asm/foo.h, CFI has it the wrong way around. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20231215092707.231038174@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15x86/alternatives: Disable interrupts and sync when optimizing NOPs in placeThomas Gleixner
apply_alternatives() treats alternatives with the ALT_FLAG_NOT flag set special as it optimizes the existing NOPs in place. Unfortunately, this happens with interrupts enabled and does not provide any form of core synchronization. So an interrupt hitting in the middle of the update and using the affected code path will observe a half updated NOP and crash and burn. The following 3 NOP sequence was observed to expose this crash halfway reliably under QEMU 32bit: 0x90 0x90 0x90 which is replaced by the optimized 3 byte NOP: 0x8d 0x76 0x00 So an interrupt can observe: 1) 0x90 0x90 0x90 nop nop nop 2) 0x8d 0x90 0x90 undefined 3) 0x8d 0x76 0x90 lea -0x70(%esi),%esi 4) 0x8d 0x76 0x00 lea 0x0(%esi),%esi Where only #1 and #4 are true NOPs. The same problem exists for 64bit obviously. Disable interrupts around this NOP optimization and invoke sync_core() before re-enabling them. Fixes: 270a69c4485d ("x86/alternative: Support relocations in alternatives") Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com
2023-12-15x86/alternatives: Sync core before enabling interruptsThomas Gleixner
text_poke_early() does: local_irq_save(flags); memcpy(addr, opcode, len); local_irq_restore(flags); sync_core(); That's not really correct because the synchronization should happen before interrupts are re-enabled to ensure that a pending interrupt observes the complete update of the opcodes. It's not entirely clear whether the interrupt entry provides enough serialization already, but moving the sync_core() invocation into interrupt disabled region does no harm and is obviously correct. Fixes: 6fffacb30349 ("x86/alternatives, jumplabel: Use text_poke_early() before mm_init()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com
2023-12-15x86/smpboot/64: Handle X2APIC BIOS inconsistency gracefullyThomas Gleixner
Chris reported that a Dell PowerEdge T340 system stopped to boot when upgrading to a kernel which contains the parallel hotplug changes. Disabling parallel hotplug on the kernel command line makes it boot again. It turns out that the Dell BIOS has x2APIC enabled and the boot CPU comes up in X2APIC mode, but the APs come up inconsistently in xAPIC mode. Parallel hotplug requires that the upcoming CPU reads out its APIC ID from the local APIC in order to map it to the Linux CPU number. In this particular case the readout on the APs uses the MMIO mapped registers because the BIOS failed to enable x2APIC mode. That readout results in a page fault because the kernel does not have the APIC MMIO space mapped when X2APIC mode was enabled by the BIOS on the boot CPU and the kernel switched to X2APIC mode early. That page fault can't be handled on the upcoming CPU that early and results in a silent boot failure. If parallel hotplug is disabled the system boots because in that case the APIC ID read is not required as the Linux CPU number is provided to the AP in the smpboot control word. When the kernel uses x2APIC mode then the APs are switched to x2APIC mode too slightly later in the bringup process, but there is no reason to do it that late. Cure the BIOS bogosity by checking in the parallel bootup path whether the kernel uses x2APIC mode and if so switching over the APs to x2APIC mode before the APIC ID readout. Fixes: 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it") Reported-by: Chris Lindee <chris.lindee@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Tested-by: Chris Lindee <chris.lindee@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/CA%2B2tU59853R49EaU_tyvOZuOTDdcU0RshGyydccp9R1NX9bEeQ@mail.gmail.com
2023-12-15x86/mce: Handle Intel threshold interrupt stormsTony Luck
Add an Intel specific hook into machine_check_poll() to keep track of per-CPU, per-bank corrected error logs (with a stub for the CONFIG_MCE_INTEL=n case). When a storm is observed the rate of interrupts is reduced by setting a large threshold value for this bank in IA32_MCi_CTL2. This bank is added to the bitmap of banks for this CPU to poll. The polling rate is increased to once per second. When a storm ends reset the threshold in IA32_MCi_CTL2 back to 1, remove the bank from the bitmap for polling, and change the polling rate back to the default. If a CPU with banks in storm mode is taken offline, the new CPU that inherits ownership of those banks takes over management of storm(s) in the inherited bank(s). The cmci_discover() function was already very large. These changes pushed it well over the top. Refactor with three helper functions to bring it back under control. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231115195450.12963-4-tony.luck@intel.com
2023-12-15x86/mce: Add per-bank CMCI storm mitigationTony Luck
This is the core functionality to track CMCI storms at the machine check bank granularity. Subsequent patches will add the vendor specific hooks to supply input to the storm detection and take actions on the start/end of a storm. machine_check_poll() is called both by the CMCI interrupt code, and for periodic polls from a timer. Add a hook in this routine to maintain a bitmap history for each bank showing whether the bank logged an corrected error or not each time it is polled. In normal operation the interval between polls of these banks determines how far to shift the history. The 64 bit width corresponds to about one second. When a storm is observed a CPU vendor specific action is taken to reduce or stop CMCI from the bank that is the source of the storm. The bank is added to the bitmap of banks for this CPU to poll. The polling rate is increased to once per second. During a storm each bit in the history indicates the status of the bank each time it is polled. Thus the history covers just over a minute. Declare a storm for that bank if the number of corrected interrupts seen in that history is above some threshold (defined as 5 in this series, could be tuned later if there is data to suggest a better value). A storm on a bank ends if enough consecutive polls of the bank show no corrected errors (defined as 30, may also change). That calls the CPU vendor specific function to revert to normal operational mode, and changes the polling rate back to the default. [ bp: Massage. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231115195450.12963-3-tony.luck@intel.com
2023-12-15x86/mce: Remove old CMCI storm mitigation codeTony Luck
When a "storm" of corrected machine check interrupts (CMCI) is detected this code mitigates by disabling CMCI interrupt signalling from all of the banks owned by the CPU that saw the storm. There are problems with this approach: 1) It is very coarse grained. In all likelihood only one of the banks was generating the interrupts, but CMCI is disabled for all. This means Linux may delay seeing and processing errors logged from other banks. 2) Although CMCI stands for Corrected Machine Check Interrupt, it is also used to signal when an uncorrected error is logged. This is a problem because these errors should be handled in a timely manner. Delete all this code in preparation for a finer grained solution. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20231115195450.12963-2-tony.luck@intel.com
2023-12-13x86/sev: Do the C-bit verification only on the BSPBorislav Petkov (AMD)
There's no need to do it on every AP. The C-bit value read on the BSP and also verified there, is used everywhere from now on. No functional changes - just a bit faster booting APs. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20231130132601.10317-1-bp@alien8.de
2023-12-13x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify()Uros Bizjak
There is no need to use TESTL when checking the least-significant bit with a TEST instruction. Use TESTB, which is three bytes shorter: f6 05 00 00 00 00 01 testb $0x1,0x0(%rip) vs: f7 05 00 00 00 00 01 testl $0x1,0x0(%rip) 00 00 00 for the same effect. No functional changes intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231109201032.4439-1-ubizjak@gmail.com
2023-12-12x86/mce: Differentiate real hardware #MCs from TDX erratum onesKai Huang
The first few generations of TDX hardware have an erratum. Triggering it in Linux requires some kind of kernel bug involving relatively exotic memory writes to TDX private memory and will manifest via spurious-looking machine checks when reading the affected memory. Make an effort to detect these TDX-induced machine checks and spit out a new blurb to dmesg so folks do not think their hardware is failing. == Background == Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. == Problem == A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. To add insult to injury, the Linux machine code will present these as a literal "Hardware error" when they were, in fact, a software-triggered issue. == Solution == In the end, this issue is hard to trigger. Rather than do something rash (and incomplete) like unmap TDX private memory from the direct map, improve the machine check handler. Currently, the #MC handler doesn't distinguish whether the memory is TDX private memory or not but just dump, for instance, below message: [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0} ... [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii' [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] Kernel panic - not syncing: Fatal local machine check Which says "Hardware Error" and "Data load in unrecoverable area of kernel". Ideally, it's better for the log to say "software bug around TDX private memory" instead of "Hardware Error". But in reality the real hardware memory error can happen, and sadly such software-triggered #MC cannot be distinguished from the real hardware error. Also, the error message is used by userspace tool 'mcelog' to parse, so changing the output may break userspace. So keep the "Hardware Error". The "Data load in unrecoverable area of kernel" is also helpful, so keep it too. Instead of modifying above error log, improve the error log by printing additional TDX related message to make the log like: ... [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug. Adding this additional message requires determination of whether the memory page is TDX private memory. There is no existing infrastructure to do that. Add an interface to query the TDX module to fill this gap. == Impact == This issue requires some kind of kernel bug to trigger. TDX private memory should never be mapped UC/WC. A partial write originating from these mappings would require *two* bugs, first mapping the wrong page, then writing the wrong memory. It would also be detectable using traditional memory corruption techniques like DEBUG_PAGEALLOC. MOVNTI (and friends) could cause this issue with something like a simple buffer overrun or use-after-free on the direct map. It should also be detectable with normal debug techniques. The one place where this might get nasty would be if the CPU read data then wrote back the same data. That would trigger this problem but would not, for instance, set off mechanisms like slab redzoning because it doesn't actually corrupt data. With an IOMMU at least, the DMA exposure is similar to the UC/WC issue. TDX private memory would first need to be incorrectly mapped into the I/O space and then a later DMA to that mapping would actually cause the poisoning event. [ dhansen: changelog tweaks ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
2023-12-12x86/CPU/AMD: Add X86_FEATURE_ZEN1Borislav Petkov (AMD)
Add a synthetic feature flag specifically for first generation Zen machines. There's need to have a generic flag for all Zen generations so make X86_FEATURE_ZEN be that flag. Fixes: 30fa92832f40 ("x86/CPU/AMD: Add ZenX generations flags") Suggested-by: Brian Gerst <brgerst@gmail.com> Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/dc3835e3-0731-4230-bbb9-336bbe3d042b@amd.com
2023-12-12iommu: Add mm_get_enqcmd_pasid() helper functionTina Zhang
mm_get_enqcmd_pasid() should be used by architecture code and closely related to learn the PASID value that the x86 ENQCMD operation should use for the mm. For the moment SMMUv3 uses this without any connection to ENQCMD, it will be cleaned up similar to how the prior patch made VT-d use the PASID argument of set_dev_pasid(). The motivation is to replace mm->pasid with an iommu private data structure that is introduced in a later patch. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Tina Zhang <tina.zhang@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231027000525.1278806-4-tina.zhang@intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-12-12iommu: Change kconfig around IOMMU_SVAJason Gunthorpe
Linus suggested that the kconfig here is confusing: https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/ Let's break it into three kconfigs controlling distinct things: - CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional fields for the IOMMU. Currently only PASID, but later patches store a struct iommu_mm_data * - CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit for keeping track of the ENQCMD instruction. x86 will select this if IOMMU_SVA is enabled - IOMMU_SVA controls if the IOMMU core compiles in the SVA support code for iommu driver use and the IOMMU exported API This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231027000525.1278806-2-tina.zhang@intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-12-11x86/traps: Use current_top_of_stack() helper in traps.cUros Bizjak
Use current_top_of_stack() helper in sync_regs() and vc_switch_off_ist() instead of open-coding the reading of the top_of_stack percpu variable explicitly. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231204210320.114429-2-ubizjak@gmail.com
2023-12-10x86/paravirt: Remove no longer needed paravirt patching codeJuergen Gross
Now that paravirt is using the alternatives patching infrastructure, remove the paravirt patching code. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231210062138.2417-6-jgross@suse.com
2023-12-10x86/paravirt: Switch mixed paravirt/alternative calls to alternativesJuergen Gross
Instead of stacking alternative and paravirt patching, use the new ALT_FLAG_CALL flag to switch those mixed calls to pure alternative handling. Eliminate the need to be careful regarding the sequence of alternative and paravirt patching. [ bp: Touch up commit message. ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231210062138.2417-5-jgross@suse.com
2023-12-10x86/alternative: Add indirect call patchingJuergen Gross
In order to prepare replacing of paravirt patching with alternative patching, add the capability to replace an indirect call with a direct one. This is done via a new flag ALT_FLAG_CALL as the target of the CALL instruction needs to be evaluated using the value of the location addressed by the indirect call. For convenience, add a macro for a default CALL instruction. In case it is being used without the new flag being set, it will result in a BUG() when being executed. As in most cases, the feature used will be X86_FEATURE_ALWAYS so add another macro ALT_CALL_ALWAYS usable for the flags parameter of the ALTERNATIVE macros. For a complete replacement, handle the special cases of calling a nop function and an indirect call of NULL the same way as paravirt does. [ bp: Massage commit message, fixup the debug output and clarify flow more. ] Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231210062138.2417-4-jgross@suse.com
2023-12-10x86/paravirt: Move some functions and defines to alternative.cJuergen Gross
As a preparation for replacing paravirt patching completely by alternative patching, move some backend functions and #defines to the alternatives code and header. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231129133332.31043-3-jgross@suse.com
2023-12-10Merge tag 'x86_urgent_for_v6.7_rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add a forgotten CPU vendor check in the AMD microcode post-loading callback so that the callback runs only on AMD - Make sure SEV-ES protocol negotiation happens only once and on the BSP * tag 'x86_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/CPU/AMD: Check vendor in the AMD microcode callback x86/sev: Fix kernel crash due to late update to read-only ghcb_version
2023-12-08x86/virt/tdx: Use all system memory when initializing TDX module as TDX memoryKai Huang
Start to transit out the "multi-steps" to initialize the TDX module. TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. CMRs tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime. To keep things simple, assume that all TDX-protected memory will come from the page allocator. Make sure all pages in the page allocator *are* TDX-usable memory. As TDX-usable memory is a fixed configuration, take a snapshot of the memory configuration from memblocks at the time of module initialization (memblocks are modified on memory hotplug). This snapshot is used to enable TDX support for *this* memory configuration only. Use a memory hotplug notifier to ensure that no other RAM can be added outside of this configuration. This approach requires all memblock memory regions at the time of module initialization to be TDX convertible memory to work, otherwise module initialization will fail in a later SEAMCALL when passing those regions to the module. This approach works when all boot-time "system RAM" is TDX convertible memory and no non-TDX-convertible memory is hot-added to the core-mm before module initialization. For instance, on the first generation of TDX machines, both CXL memory and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add any CXL memory or NVDIMM to the core-mm before module initialization will result in failure to initialize the module. The SEAMCALL error code will be available in the dmesg to help user to understand the failure. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-7-dave.hansen%40intel.com
2023-12-08x86/virt/tdx: Detect TDX during kernel bootKai Huang
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. Pre-TDX Intel hardware has support for a memory encryption architecture called MKTME. The memory encryption hardware underpinning MKTME is also used for Intel TDX. TDX ends up "stealing" some of the physical address space from the MKTME architecture for crypto-protection to VMs. The BIOS is responsible for partitioning the "KeyID" space between legacy MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs' for short. During machine boot, TDX microcode verifies that the BIOS programmed TDX private KeyIDs consistently and correctly programmed across all CPU packages. The MSRs are locked in this state after verification. This is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration: it indicates not just that the hardware supports TDX, but that all the boot-time security checks passed. The TDX module is expected to be loaded by the BIOS when it enables TDX, but the kernel needs to properly initialize it before it can be used to create and run any TDX guests. The TDX module will be initialized by the KVM subsystem when KVM wants to use TDX. Detect platform TDX support by detecting TDX private KeyIDs. The TDX module itself requires one TDX KeyID as the 'TDX global KeyID' to protect its metadata. Each TDX guest also needs a TDX KeyID for its own protection. Just use the first TDX KeyID as the global KeyID and leave the rest for TDX guests. If no TDX KeyID is left for TDX guests, disable TDX as initializing the TDX module alone is useless. [ dhansen: add X86_FEATURE, replace helper function ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com
2023-12-07x86/entry: Convert INT 0x80 emulation to IDTENTRYThomas Gleixner
There is no real reason to have a separate ASM entry point implementation for the legacy INT 0x80 syscall emulation on 64-bit. IDTENTRY provides all the functionality needed with the only difference that it does not: - save the syscall number (AX) into pt_regs::orig_ax - set pt_regs::ax to -ENOSYS Both can be done safely in the C code of an IDTENTRY before invoking any of the syscall related functions which depend on this convention. Aside of ASM code reduction this prepares for detecting and handling a local APIC injected vector 0x80. [ kirill.shutemov: More verbose comments ] Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@vger.kernel.org> # v6.0+
2023-12-06x86/topology: convert to use arch_cpu_is_hotpluggable()Russell King (Oracle)
Convert x86 to use the arch_cpu_is_hotpluggable() helper rather than arch_register_cpu(). Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R3w-00Cszy-6k@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-06x86/topology: use weak version of arch_unregister_cpu()Russell King (Oracle)
Since the x86 version of arch_unregister_cpu() is the same as the weak version, drop the x86 specific version. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R3r-00Cszs-2R@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>