Age | Commit message (Collapse) | Author |
|
Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address used
to link the kernel object files and for storing it has to be known
at compile time and is represented by a Kconfig option.
XIP on RISC-V will for the time being only work on MMU-enabled
kernels.
Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
[Alex: Rebase on top of "Move kernel mapping outside the linear mapping" ]
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
[Palmer: disable XIP for allyesconfig]
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
This patch allows Linux to act as a crash kernel for use with
kdump. Userspace will let the crash kernel know about the
memory region it can use through linux,usable-memory property
on the /memory node (overriding its reg property), and about the
memory region where the elf core header of the previous kernel
is saved, through a reserved-memory node with a compatible string
of "linux,elfcorehdr". This approach is the least invasive and
re-uses functionality already present.
I tested this on riscv64 qemu and it works as expected, you
may test it by retrieving the dmesg of the previous kernel
through /proc/vmcore, using the vmcore-dmesg utility from
kexec-tools.
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
This patch adds support for kdump, the kernel will reserve a
region for the crash kernel and jump there on panic. In order
for userspace tools (kexec-tools) to prepare the crash kernel
kexec image, we also need to expose some information on
/proc/iomem for the memory regions used by the kernel and for
the region reserved for crash kernel. Note that on userspace
the device tree is used to determine the system's memory
layout so the "System RAM" on /proc/iomem is ignored.
I tested this on riscv64 qemu and works as expected, you may
test it by triggering a crash through /proc/sysrq_trigger:
echo c > /proc/sysrq_trigger
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
The kernel region is always present and we know where it is, no need to
look for it inside the loop, just ignore it like the rest of the
reserved regions within system's memory.
Additionally, we don't need to call memblock_free inside the loop, as if
called it'll split the region of pre-allocated resources in two parts,
messing things up, just re-use the previous pre-allocated resource and
free any unused resources after both loops finish.
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
[Palmer: commit text]
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
This patch adds support for kexec on RISC-V. On SMP systems it depends
on HOTPLUG_CPU in order to be able to bring up all harts after kexec.
It also needs a recent OpenSBI version that supports the HSM extension.
I tested it on riscv64 QEMU on both an smp and a non-smp system.
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Running "make" on an already compiled kernel tree will rebuild the
kernel even without any modifications:
CALL linux/scripts/checksyscalls.sh
CALL linux/scripts/atomic/check-atomics.sh
CHK include/generated/compile.h
SO2S arch/riscv/kernel/vdso/vdso-syms.S
AS arch/riscv/kernel/vdso/vdso-syms.o
AR arch/riscv/kernel/vdso/built-in.a
AR arch/riscv/kernel/built-in.a
AR arch/riscv/built-in.a
GEN .version
CHK include/generated/compile.h
UPD include/generated/compile.h
CC init/version.o
AR init/built-in.a
LD vmlinux.o
The reason is "Any target that utilizes if_changed must be listed in
$(targets), otherwise the command line check will fail, and the target
will always be built" as explained by Documentation/kbuild/makefiles.rst
Fix this build bug by adding vdso-syms.S to $(targets)
At the same time, there are two trivial clean up modifications:
- the vdso-dummy.o is not needed any more after so remove it.
- vdso.lds is a generated file, so it should be prefixed with
$(obj)/ instead of $(src)/
Fixes: c2c81bb2f691 ("RISC-V: Fix the VDSO symbol generaton for binutils-2.35+")
Cc: stable@vger.kernel.org
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
BUG_ON() uses unlikely in if(), which can be optimized at compile time.
Signed-off-by: zhouchuangao <zhouchuangao@vivo.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
The execution of sys_read end up hitting a BUG_ON() in __find_get_block
after installing kprobe at sys_read, the BUG message like the following:
[ 65.708663] ------------[ cut here ]------------
[ 65.709987] kernel BUG at fs/buffer.c:1251!
[ 65.711283] Kernel BUG [#1]
[ 65.712032] Modules linked in:
[ 65.712925] CPU: 0 PID: 51 Comm: sh Not tainted 5.12.0-rc4 #1
[ 65.714407] Hardware name: riscv-virtio,qemu (DT)
[ 65.715696] epc : __find_get_block+0x218/0x2c8
[ 65.716835] ra : __getblk_gfp+0x1c/0x4a
[ 65.717831] epc : ffffffe00019f11e ra : ffffffe00019f56a sp : ffffffe002437930
[ 65.719553] gp : ffffffe000f06030 tp : ffffffe0015abc00 t0 : ffffffe00191e038
[ 65.721290] t1 : ffffffe00191e038 t2 : 000000000000000a s0 : ffffffe002437960
[ 65.723051] s1 : ffffffe00160ad00 a0 : ffffffe00160ad00 a1 : 000000000000012a
[ 65.724772] a2 : 0000000000000400 a3 : 0000000000000008 a4 : 0000000000000040
[ 65.726545] a5 : 0000000000000000 a6 : ffffffe00191e000 a7 : 0000000000000000
[ 65.728308] s2 : 000000000000012a s3 : 0000000000000400 s4 : 0000000000000008
[ 65.730049] s5 : 000000000000006c s6 : ffffffe00240f800 s7 : ffffffe000f080a8
[ 65.731802] s8 : 0000000000000001 s9 : 000000000000012a s10: 0000000000000008
[ 65.733516] s11: 0000000000000008 t3 : 00000000000003ff t4 : 000000000000000f
[ 65.734434] t5 : 00000000000003ff t6 : 0000000000040000
[ 65.734613] status: 0000000000000100 badaddr: 0000000000000000 cause: 0000000000000003
[ 65.734901] Call Trace:
[ 65.735076] [<ffffffe00019f11e>] __find_get_block+0x218/0x2c8
[ 65.735417] [<ffffffe00020017a>] __ext4_get_inode_loc+0xb2/0x2f6
[ 65.735618] [<ffffffe000201b6c>] ext4_get_inode_loc+0x3a/0x8a
[ 65.735802] [<ffffffe000203380>] ext4_reserve_inode_write+0x2e/0x8c
[ 65.735999] [<ffffffe00020357a>] __ext4_mark_inode_dirty+0x4c/0x18e
[ 65.736208] [<ffffffe000206bb0>] ext4_dirty_inode+0x46/0x66
[ 65.736387] [<ffffffe000192914>] __mark_inode_dirty+0x12c/0x3da
[ 65.736576] [<ffffffe000180dd2>] touch_atime+0x146/0x150
[ 65.736748] [<ffffffe00010d762>] filemap_read+0x234/0x246
[ 65.736920] [<ffffffe00010d834>] generic_file_read_iter+0xc0/0x114
[ 65.737114] [<ffffffe0001f5d7a>] ext4_file_read_iter+0x42/0xea
[ 65.737310] [<ffffffe000163f2c>] new_sync_read+0xe2/0x15a
[ 65.737483] [<ffffffe000165814>] vfs_read+0xca/0xf2
[ 65.737641] [<ffffffe000165bae>] ksys_read+0x5e/0xc8
[ 65.737816] [<ffffffe000165c26>] sys_read+0xe/0x16
[ 65.737973] [<ffffffe000003972>] ret_from_syscall+0x0/0x2
[ 65.738858] ---[ end trace fe93f985456c935d ]---
A simple reproducer looks like:
echo 'p:myprobe sys_read fd=%a0 buf=%a1 count=%a2' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
cat /sys/kernel/debug/tracing/trace
Here's what happens to hit that BUG_ON():
1) After installing kprobe at entry of sys_read, the first instruction
is replaced by 'ebreak' instruction on riscv64 platform.
2) Once kernel reach the 'ebreak' instruction at the entry of sys_read,
it trap into the riscv breakpoint handler, where it do something to
setup for coming single-step of origin instruction, including backup
the 'sstatus' in pt_regs, followed by disable interrupt during single
stepping via clear 'SIE' bit of 'sstatus' in pt_regs.
3) Then kernel restore to the instruction slot contains two instructions,
one is original instruction at entry of sys_read, the other is 'ebreak'.
Here it trigger a 'Instruction page fault' exception (value at 'scause'
is '0xc'), if PF is not filled into PageTabe for that slot yet.
4) Again kernel trap into page fault exception handler, where it choose
different policy according to the state of running kprobe. Because
afte 2) the state is KPROBE_HIT_SS, so kernel reset the current kprobe
and 'pc' points back to the probe address.
5) Because 'epc' point back to 'ebreak' instrution at sys_read probe,
kernel trap into breakpoint handler again, and repeat the operations
at 2), however 'sstatus' without 'SIE' is keep at 4), it cause the
real 'sstatus' saved at 2) is overwritten by the one withou 'SIE'.
6) When kernel cross the probe the 'sstatus' CSR restore with value
without 'SIE', and reach __find_get_block where it requires the
interrupt must be enabled.
Fix this is very trivial, just restore the value of 'sstatus' in pt_regs
with backup one at 2) when the instruction being single stepped cause a
page fault.
Fixes: c22b0bcb1dd02 ("riscv: Add kprobes supported")
Signed-off-by: Liao Chang <liaochang1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Now we can set ARCH_HAS_STRICT_MODULE_RWX for MMU riscv platforms, this
is good from security perspective.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
The core code manages the executable permissions of code regions of
modules explicitly, it is not necessary to create the module vmalloc
regions with RWX permissions. Create them with RW- permissions instead.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
We allocate Non-executable pages, then call bpf_jit_binary_lock_ro()
to enable executable permission after mapping them read-only. This is
to prepare for STRICT_MODULE_RWX in following patch.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
We will drop the executable permissions of the code pages from the
mapping at allocation time soon. Move bpf_jit_alloc_exec() and
bpf_jit_free_exec() to bpf_jit_core.c so that they can be shared by
both RV64I and RV32I.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Acked-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Allocate PAGE_KERNEL_READ_EXEC(read only, executable) page for kprobes
insn page. This is to prepare for STRICT_MODULE_RWX.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Constify the sbi_ipi_ops so that it will be placed in the .rodata
section. This will cause attempts to modify it to fail when strict
page permissions are in place.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Constify the sys_call_table so that it will be placed in the .rodata
section. This will cause attempts to modify the table to fail when
strict page permissions are in place.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
All of these are never modified after init, so they can be
__ro_after_init.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
They are not needed after booting, so mark them as __init to move them
to the __init section.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
This is a preparatory patch for sv48 support that will introduce
dynamic PAGE_OFFSET.
Dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic and can't be used
to statically initialize the array used by ptdump to identify the
different zones of the vm layout.
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
This is a preparatory patch for relocatable kernel and sv48 support.
The kernel used to be linked at PAGE_OFFSET address therefore we could use
the linear mapping for the kernel mapping. But the relocated kernel base
address will be different from PAGE_OFFSET and since in the linear mapping,
two different virtual addresses cannot point to the same physical address,
the kernel mapping needs to lie outside the linear mapping so that we don't
have to copy it at the same physical offset.
The kernel mapping is moved to the last 2GB of the address space, BPF
is now always after the kernel and modules use the 2GB memory range right
before the kernel, so BPF and modules regions do not overlap. KASLR
implementation will simply have to move the kernel in the last 2GB range
and just take care of leaving enough space for BPF.
In addition, by moving the kernel to the end of the address space, both
sv39 and sv48 kernels will be exactly the same without needing to be
relocated at runtime.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
[Palmer: Squash the STRICT_RWX fix, and a !MMU fix]
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
clang prior to 13.0.0 does not support -fpatchable-function-entry for
RISC-V.
clang: error: unsupported option '-fpatchable-function-entry=8' for target 'riscv64-unknown-linux-gnu'
To avoid this error, only select HAVE_DYNAMIC_FTRACE when this option is
not available.
Fixes: afc76b8b8011 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead of MCOUNT")
Link: https://github.com/ClangBuiltLinux/linux/issues/1268
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Prior to clang 13.0.0, the RISC-V name for the mcount symbol was
"mcount", which differs from the GCC version of "_mcount", which results
in the following errors:
riscv64-linux-gnu-ld: init/main.o: in function `__traceiter_initcall_level':
main.c:(.text+0xe): undefined reference to `mcount'
riscv64-linux-gnu-ld: init/main.o: in function `__traceiter_initcall_start':
main.c:(.text+0x4e): undefined reference to `mcount'
riscv64-linux-gnu-ld: init/main.o: in function `__traceiter_initcall_finish':
main.c:(.text+0x92): undefined reference to `mcount'
riscv64-linux-gnu-ld: init/main.o: in function `.LBB32_28':
main.c:(.text+0x30c): undefined reference to `mcount'
riscv64-linux-gnu-ld: init/main.o: in function `free_initmem':
main.c:(.text+0x54c): undefined reference to `mcount'
This has been corrected in https://reviews.llvm.org/D98881 but the
minimum supported clang version is 10.0.1. To avoid build errors and to
gain a working function tracer, adjust the name of the mcount symbol for
older versions of clang in mount.S and recordmcount.pl.
Link: https://github.com/ClangBuiltLinux/linux/issues/1331
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Currently, the VDSO is being linked through $(CC). This does not match
how the rest of the kernel links objects, which is through the $(LD)
variable.
When linking with clang, there are a couple of warnings about flags that
will not be used during the link:
clang-12: warning: argument unused during compilation: '-no-pie' [-Wunused-command-line-argument]
clang-12: warning: argument unused during compilation: '-pg' [-Wunused-command-line-argument]
'-no-pie' was added in commit 85602bea297f ("RISC-V: build vdso-dummy.o
with -no-pie") to override '-pie' getting added to the ld command from
distribution versions of GCC that enable PIE by default. It is
technically no longer needed after commit c2c81bb2f691 ("RISC-V: Fix the
VDSO symbol generaton for binutils-2.35+"), which removed vdso-dummy.o
in favor of generating vdso-syms.S from vdso.so with $(NM) but this also
resolves the issue in case it ever comes back due to having full control
over the $(LD) command. '-pg' is for function tracing, it is not used
during linking as clang states.
These flags could be removed/filtered to fix the warnings but it is
easier to just match the rest of the kernel and use $(LD) directly for
linking. See commits
fe00e50b2db8 ("ARM: 8858/1: vdso: use $(LD) instead of $(CC) to link VDSO")
691efbedc60d ("arm64: vdso: use $(LD) instead of $(CC) to link VDSO")
2ff906994b6c ("MIPS: VDSO: Use $(LD) instead of $(CC) to link VDSO")
2b2a25845d53 ("s390/vdso: Use $(LD) instead of $(CC) to link vDSO")
for more information.
The flags are converted to linker flags and '--eh-frame-hdr' is added to
match what is added by GCC implicitly, which can be seen by adding '-v'
to GCC's invocation.
Additionally, since this area is being modified, use the $(OBJCOPY)
variable instead of an open coded $(CROSS_COMPILE)objcopy so that the
user's choice of objcopy binary is respected.
Link: https://github.com/ClangBuiltLinux/linux/issues/803
Link: https://github.com/ClangBuiltLinux/linux/issues/970
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
For certain SiFive CPUs, "sfence.vma addr" cannot exactly flush addr
from TLB in the particular cases. The details could be found here:
https://sifive.cdn.prismic.io/sifive/167a1a56-03f4-4615-a79e-b2a86153148f_FU740_errata_20210205.pdf
In order to ensure the functionality, this patch uses the Alternative
scheme to replace all "sfence.vma addr" with "sfence.vma" at runtime.
Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Add sign extension to the $badaddr before addressing the instruction page
fault and instruction access fault to workaround the issue "cip-453".
To avoid affecting the existing code sequence, this patch will creates two
trampolines to add sign extension to the $badaddr. By the "alternative"
mechanism, these two trampolines will replace the original exception
handler of instruction page fault and instruction access fault in the
excp_vect_table. In this case, only the specific SiFive CPU core jumps to
the do_page_fault and do_trap_insn_fault through these two trampolines.
Other CPUs are not affected.
Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Add required ports of the Alternative scheme for SiFive.
Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Introduce the "alternative" mechanism from ARM64 and x86 to apply the CPU
vendors' errata solution at runtime. The main purpose of this patch is
to provide a framework. Therefore, the implementation is quite basic for
now so that some scenarios could not use this schemei, such as patching
code to a module, relocating the patching code and heterogeneous CPU
topology.
Users could use the macro ALTERNATIVE to apply an errata to the existing
code flow. In the macro ALTERNATIVE, users need to specify the manufacturer
information(vendorid, archid, and impid) for this errata. Therefore, kernel
will know this errata is suitable for which CPU core. During the booting
procedure, kernel will select the errata required by the CPU core and then
patch it. It means that the kernel only applies the errata to the specified
CPU core. In this case, the vendor's errata does not affect each other at
runtime. The above patching procedure only occurs during the booting phase,
so we only take the overhead of the "alternative" mechanism once.
This "alternative" mechanism is enabled by default to ensure that all
required errata will be applied. However, users can disable this feature by
the Kconfig "CONFIG_RISCV_ERRATA_ALTERNATIVE".
Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Add 3 wrapper functions to get vendor id, architecture id and implement id
from M-mode
Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
When KASAN vmalloc region is populated, there is no userspace process and
the page table in use is swapper_pg_dir, so there is no need to read
SATP. Then we can use the same scheme used by kasan_populate_p*d
functions to go through the page table, which harmonizes the code.
In addition, make use of set_pgd that goes through all unused page table
levels, contrary to p*d_populate functions, which makes this function work
whatever the number of page table levels.
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
The sbi_init() already prints SBI version before detecting
various SBI extensions so we don't need to print SBI version
for all detected SBI extensions.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Reviewed-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
When percpu-timers are stopped by deep power saving mode, we
need system timer help to broadcast IPI_TIMER.
This is first introduced by broken x86 hardware, where the local apic
timer stops in C3 state. But many other architectures(powerpc, mips,
arm, hexagon, openrisc, sh) have supported the infrastructure to
deal with Power Management issues.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
FORTIFY_SOURCE could detect various overflows at compile and run time.
ARCH_HAS_FORTIFY_SOURCE means that the architecture can be built and
run with CONFIG_FORTIFY_SOURCE. Select it in RISCV.
See more about this feature from commit 6974f0c4555e
("include/linux/string.h: add the option of fortified string.h functions").
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
The riscv [rv32_]defconfig enabled CONFIG_MEMTEST,
but memtest feature is not supported in RISCV.
Add early_memtest() to support for memtest.
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Pull KVM fixes from Paolo Bonzini:
- Doc fixes
- selftests fixes
- Add runstate information to the new Xen support
- Allow compiling out the Xen interface
- 32-bit PAE without EPT bugfix
- NULL pointer dereference bugfix
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Clear the CR4 register on reset
KVM: x86/xen: Add support for vCPU runstate information
KVM: x86/xen: Fix return code when clearing vcpu_info and vcpu_time_info
selftests: kvm: Mmap the entire vcpu mmap area
KVM: Documentation: Fix index for KVM_CAP_PPC_DAWR1
KVM: x86: allow compiling out the Xen hypercall interface
KVM: xen: flush deferred static key before checking it
KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML is enabled
KVM: x86: hyper-v: Fix Hyper-V context null-ptr-deref
KVM: x86: remove misplaced comment on active_mmu_pages
KVM: Documentation: rectify rst markup in kvm_run->flags
Documentation: kvm: fix messy conversion from .txt to .rst
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two security issues (XSA-367 and XSA-369)"
* tag 'for-linus-5.12b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: fix p2m size in dom0 for disabled memory hotplug case
xen-netback: respect gnttab_map_refs()'s return value
Xen/gnttab: handle p2m update errors on a per-slot basis
|
|
Since commit 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated
memory") foreign mappings are using guest physical addresses allocated
via ZONE_DEVICE functionality.
This will result in problems for the case of no balloon memory hotplug
being configured, as the p2m list will only cover the initial memory
size of the domain. Any ZONE_DEVICE allocated address will be outside
the p2m range and thus a mapping can't be established with that memory
address.
Fix that by extending the p2m size for that case. At the same time add
a check for a to be created mapping to be within the p2m limits in
order to detect errors early.
While changing a comment, remove some 32-bit leftovers.
This is XSA-369.
Fixes: 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated memory")
Cc: <stable@vger.kernel.org> # 5.9
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Bailing immediately from set_foreign_p2m_mapping() upon a p2m updating
error leaves the full batch in an ambiguous state as far as the caller
is concerned. Instead flags respective slots as bad, unmapping what
was mapped there right away.
HYPERVISOR_grant_table_op()'s return value and the individual unmap
slots' status fields get used only for a one-time - there's not much we
can do in case of a failure.
Note that there's no GNTST_enomem or alike, so GNTST_general_error gets
used.
The map ops' handle fields get overwritten just to be on the safe side.
This is part of XSA-367.
Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/96cccf5d-e756-5f53-b91a-ea269bfb9be0@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Sergei and John both reported that ia64 failed to boot in 5.11, and it
was related to signals. Turns out the ia64 signal handling is a bit odd,
it doesn't check the return value of get_signal() for whether there's a
signal to deliver or not. With the introduction of TIF_NOTIFY_SIGNAL,
then task_work could trigger it.
Fix it by only calling handle_signal() if we actually have a real signal
to deliver. This brings it in line with all other archs, too.
Fixes: b269c229b0e8 ("ia64: add support for TIF_NOTIFY_SIGNAL")
Reported-by: Sergei Trofimovich <slyich@gmail.com>
Reported-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: Sergei Trofimovich <slyich@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This problem was reported on a SVM guest while executing kexec.
Kexec fails to load the new kernel when the PCID feature is enabled.
When kexec starts loading the new kernel, it starts the process by
resetting the vCPU's and then bringing each vCPU online one by one.
The vCPU reset is supposed to reset all the register states before the
vCPUs are brought online. However, the CR4 register is not reset during
this process. If this register is already setup during the last boot,
all the flags can remain intact. The X86_CR4_PCIDE bit can only be
enabled in long mode. So, it must be enabled much later in SMP
initialization. Having the X86_CR4_PCIDE bit set during SMP boot can
cause a boot failures.
Fix the issue by resetting the CR4 register in init_vmcb().
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <161471109108.30811.6392805173629704166.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This is how Xen guests do steal time accounting. The hypervisor records
the amount of time spent in each of running/runnable/blocked/offline
states.
In the Xen accounting, a vCPU is still in state RUNSTATE_running while
in Xen for a hypercall or I/O trap, etc. Only if Xen explicitly schedules
does the state become RUNSTATE_blocked. In KVM this means that even when
the vCPU exits the kvm_run loop, the state remains RUNSTATE_running.
The VMM can explicitly set the vCPU to RUNSTATE_blocked by using the
KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT attribute, and can also use
KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST to retrospectively add a given
amount of time to the blocked state and subtract it from the running
state.
The state_entry_time corresponds to get_kvmclock_ns() at the time the
vCPU entered the current state, and the total times of all four states
should always add up to state_entry_time.
Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20210301125309.874953-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When clearing the per-vCPU shared regions, set the return value to zero
to indicate success. This was causing spurious errors to be returned to
userspace on soft reset.
Also add a paranoid BUILD_BUG_ON() for compat structure compatibility.
Fixes: 0c165b3c01fe ("KVM: x86/xen: Allow reset of Xen attributes")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20210301125309.874953-1-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The Xen hypercall interface adds to the attack surface of the hypervisor
and will be used quite rarely. Allow compiling it out.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix UNUSED_KSYMS_WHITELIST for Clang LTO
- Make -s builds really silent irrespective of V= option
- Fix build error when SUBLEVEL or PATCHLEVEL is empty
* tag 'kbuild-fixes-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: Fix <linux/version.h> for empty SUBLEVEL or PATCHLEVEL again
kbuild: make -s option take precedence over V=1
ia64: remove redundant READELF from arch/ia64/Makefile
kbuild: do not include include/config/auto.conf from adjust_autoksyms.sh
kbuild: fix UNUSED_KSYMS_WHITELIST for Clang LTO
kbuild: lto: add _mcount to list of used symbols
|
|
Pull arch/csky updates from Guo Ren:
"Features:
- add new memory layout 2.5G(user):1.5G(kernel)
- add kmemleak support
- reconstruct VDSO framework: add VDSO with GENERIC_GETTIMEOFDAY,
GENERIC_TIME_VSYSCALL, HAVE_GENERIC_VDSO
- add faulthandler_disabled() check
- support (fix) swapon
- add (fix) _PAGE_ACCESSED for default pgprot
- abort uaccess retries upon fatal signal (from arm)
Fixes and optimizations:
- fix perf probe failure
- fix show_regs doesn't contain regs->usp
- remove custom asm/atomic.h implementation
- fix barrier design
- fix futex SMP implementation
- fix asm/cmpxchg.h with correct ordering barrier
- cleanup asm/spinlock.h
- fix PTE global for 2.5:1.5 virtual memory
- remove prologue of page fault handler in entry.S
- fix TLB maintenance synchronization problem
- add show_tlb for CPU_CK860 debug
- fix FAULT_FLAG_XXX param for handle_mm_fault
- fix update_mmu_cache called with user io mapping
- fix do_page_fault parent irq status
- fix a size determination in gpr_get()
- pgtable.h: Coding convention
- kprobe: Fix code in simulate without 'long'
- fix pfn_valid error with wrong max_mapnr
- use free_initmem_default() in free_initmem()
- fix compile error"
* tag 'csky-for-linus-5.12-rc1' of git://github.com/c-sky/csky-linux: (30 commits)
csky: Fixup compile error
csky: use free_initmem_default() in free_initmem()
csky: Fixup pfn_valid error with wrong max_mapnr
csky: Add VDSO with GENERIC_GETTIMEOFDAY, GENERIC_TIME_VSYSCALL, HAVE_GENERIC_VDSO
csky: kprobe: Fixup code in simulate without 'long'
csky: Fixup swapon
csky: pgtable.h: Coding convention
csky: Fixup _PAGE_ACCESSED for default pgprot
csky: remove unused including <linux/version.h>
csky: Fix a size determination in gpr_get()
csky: Reconstruct VDSO framework
csky: mm: abort uaccess retries upon fatal signal
csky: Sync riscv mm/fault.c for easy maintenance
csky: Fixup do_page_fault parent irq status
csky: Add faulthandler_disabled() check
csky: Fixup update_mmu_cache called with user io mapping
csky: Fixup FAULT_FLAG_XXX param for handle_mm_fault
csky: Add show_tlb for CPU_CK860 debug
csky: Fix TLB maintenance synchronization problem
csky: Add kmemleak support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull more RISC-V updates from Palmer Dabbelt:
"A pair of patches that slipped through the cracks:
- enable CPU hotplug in the defconfigs
- some cleanups to setup_bootmem
There's also a single fix for some randconfig build failures:
- make NUMA depend on SMP"
* tag 'riscv-for-linus-5.12-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Cleanup setup_bootmem()
RISC-V: Enable CPU Hotplug in defconfigs
RISC-V: Make NUMA depend on SMP
|
|
READELF is defined by the top Makefile.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
Pull io_uring thread rewrite from Jens Axboe:
"This converts the io-wq workers to be forked off the tasks in question
instead of being kernel threads that assume various bits of the
original task identity.
This kills > 400 lines of code from io_uring/io-wq, and it's the worst
part of the code. We've had several bugs in this area, and the worry
is always that we could be missing some pieces for file types doing
unusual things (recent /dev/tty example comes to mind, userfaultfd
reads installing file descriptors is another fun one... - both of
which need special handling, and I bet it's not the last weird oddity
we'll find).
With these identical workers, we can have full confidence that we're
never missing anything. That, in itself, is a huge win. Outside of
that, it's also more efficient since we're not wasting space and code
on tracking state, or switching between different states.
I'm sure we're going to find little things to patch up after this
series, but testing has been pretty thorough, from the usual
regression suite to production. Any issue that may crop up should be
manageable.
There's also a nice series of further reductions we can do on top of
this, but I wanted to get the meat of it out sooner rather than later.
The general worry here isn't that it's fundamentally broken. Most of
the little issues we've found over the last week have been related to
just changes in how thread startup/exit is done, since that's the main
difference between using kthreads and these kinds of threads. In fact,
if all goes according to plan, I want to get this into the 5.10 and
5.11 stable branches as well.
That said, the changes outside of io_uring/io-wq are:
- arch setup, simple one-liner to each arch copy_thread()
implementation.
- Removal of net and proc restrictions for io_uring, they are no
longer needed or useful"
* tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block: (30 commits)
io-wq: remove now unused IO_WQ_BIT_ERROR
io_uring: fix SQPOLL thread handling over exec
io-wq: improve manager/worker handling over exec
io_uring: ensure SQPOLL startup is triggered before error shutdown
io-wq: make buffered file write hashed work map per-ctx
io-wq: fix race around io_worker grabbing
io-wq: fix races around manager/worker creation and task exit
io_uring: ensure io-wq context is always destroyed for tasks
arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread()
io_uring: cleanup ->user usage
io-wq: remove nr_process accounting
io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
net: remove cmsg restriction from io_uring based send/recvmsg calls
Revert "proc: don't allow async path resolution of /proc/self components"
Revert "proc: don't allow async path resolution of /proc/thread-self components"
io_uring: move SQPOLL thread io-wq forked worker
io-wq: make io_wq_fork_thread() available to other users
io-wq: only remove worker from free_list, if it was there
io_uring: remove io_identity
io_uring: remove any grabbing of context
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
"Assorted stuff pile - no common topic here"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
whack-a-mole: don't open-code iminor/imajor
9p: fix misuse of sscanf() in v9fs_stat2inode()
audit_alloc_mark(): don't open-code ERR_CAST()
fs/inode.c: make inode_init_always() initialize i_ino to 0
vfs: don't unnecessarily clone write access for writable fds
|
|
: error: C++ style comments are not allowed in ISO C90
// Copyright (C) 2018 Hangzhou C-SKY Microsystems co.,ltd.
^
error: (this will be reported only once per input file)
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
|
|
The existing code is essentially
free_initmem_default()->free_reserved_area() without poisoning.
Note that existing code missed to update the managed page count of the
zone.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Tested-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Guo Ren <guoren@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
|
|
The max_mapnr is the number of PFNs, not absolute PFN offset.
Using set_max_mapnr API instead of setting the value directly.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
|