Intel Sapphire Rapids servers will introduce 8 metrics events, while
Intel Ice Lake only supports 4. A perf tool user may mistakenly use the
unsupported events via the RAW format on Ice Lake, and would still get a
value from an unsupported Topdown metrics event once the following
Sapphire Rapids enabling patch is applied.
To enable the 8 metrics events on Intel Sapphire Rapids,
INTEL_TD_METRIC_MAX has to be updated, which impacts is_metric_event(),
a generic function. On Ice Lake, the newly added SPR metrics events
would then be mistakenly accepted as metric events on creation, and the
unsupported Topdown metrics events would be updated at runtime.
Add a variable num_topdown_events to x86_pmu to indicate the number of
Topdown metrics events available on the platform, and apply it in
is_metric_event() so that only the supported Topdown metrics events are
created as metric events.
Apply num_topdown_events in icl_update_topdown_event() as well, so the
function can be reused by the following patch.
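A hedged sketch of the resulting check; the macro and field names follow
the commit text, but the exact comparison is an assumption rather than
the verbatim patch:
  static inline bool is_metric_event(struct perf_event *event)
  {
          u64 config = event->attr.config;

          /* Metric events are encoded at INTEL_TD_METRIC_RETIRING + n * 0x100;
           * only accept the first num_topdown_events of them. */
          return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
                 ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) &&
                 ((config & INTEL_ARCH_EVENT_MASK) <
                  INTEL_TD_METRIC_RETIRING + x86_pmu.num_topdown_events * 0x100);
  }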
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-4-git-send-email-kan.liang@linux.intel.com
|
|
Similar to Ice Lake, Intel Sapphire Rapids server also supports the
topdown performance metrics feature. The difference is that Intel
Sapphire Rapids server extends the PERF_METRICS MSR to feature TMA
method level-two metrics, which will introduce 8 metrics events. The
current icl_update_topdown_event() only checks the 4 level-one metrics
events.
Factor out intel_update_topdown_event() to facilitate the code sharing
between Ice Lake and Sapphire Rapids.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-3-git-send-email-kan.liang@linux.intel.com
|
|
The current PERF_SAMPLE_WEIGHT sample type is very useful for expressing
the cost of an action represented by the sample. This allows the profiler
to scale the samples to be more informative to the programmer. It can
also help to locate a hotspot, e.g., when profiling by memory latencies,
expensive loads appear higher up in the histograms. But the current
PERF_SAMPLE_WEIGHT sample type is determined by a single factor. This
can be a problem if users want two or more factors to contribute to
the weight. For example, the Golden Cove core PMU can provide both the
instruction latency and the cache latency information as factors for
memory profiling.
On current X86 platforms, although meminfo::latency is defined as a
u64, only the lower 32 bits contain valid data in practice (no memory
access could last longer than 4G cycles). The higher 32 bits can be used
to store new factors.
Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new
sample weight structure. It shares the same space as the
PERF_SAMPLE_WEIGHT sample type.
Users can apply either the PERF_SAMPLE_WEIGHT sample type or the
PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but
they cannot apply both sample types simultaneously.
Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type.
- For PowerPC, there is nothing changed for the PERF_SAMPLE_WEIGHT
sample type, and there is no effect from the new PERF_SAMPLE_WEIGHT_STRUCT
sample type. PowerPC can restructure the weight field similarly later.
- For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT
sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now.
The following patches will apply the new factors for the
PERF_SAMPLE_WEIGHT_STRUCT sample type.
The field in the union perf_sample_weight should be shared among
different architectures. A generic name is required, but it's hard to
abstract a name that applies to all architectures. For example, on X86
the fields store all kinds of latency, while on PowerPC they store
MMCRA[TECX/TECM], which is not a latency. So the general name prefix
'var$NUM' is used here.
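For illustration, a self-contained userspace sketch of such a shared
layout; the field widths here are assumptions for the example, not the
final uapi:
  #include <stdint.h>
  #include <stdio.h>

  union sample_weight {
          uint64_t full;                  /* legacy PERF_SAMPLE_WEIGHT view */
          struct {                        /* little-endian layout */
                  uint32_t var1_dw;       /* e.g. memory access latency on X86 */
                  uint16_t var2_w;        /* e.g. instruction latency on X86 */
                  uint16_t var3_w;        /* room for one more factor */
          };
  };

  int main(void)
  {
          union sample_weight w = { .full = 0 };

          w.var1_dw = 240;                /* cache latency, in cycles */
          w.var2_w = 310;                 /* instruction latency, in cycles */

          /* A legacy consumer reads the very same storage as one u64. */
          printf("full=%#llx var1=%u var2=%u\n",
                 (unsigned long long)w.full, w.var1_dw, (unsigned)w.var2_w);
          return 0;
  }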
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix for objtool to generate proper unwind info for newer
toolchains which do not generate section symbols anymore, and a
cleanup on top.
This was originally going to go during the next merge window but
people can already trigger a build error with binutils-2.36 which
doesn't emit section symbols - something which objtool relies on - so
let's expedite it"
* tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Remove put_ret_addr_in_rdi THUNK macro argument
x86/entry: Emit a symbol for register restoring thunk
|
|
free_ldt_pgtables() uses the MMU gather API for batching TLB flushes
over the call to free_pgd_range(). However, tlb_gather_mmu() expects
to operate on user addresses and so passing LDT_{BASE,END}_ADDR will
confuse the range setting logic in __tlb_adjust_range(), causing the
gather to identify a range starting at TASK_SIZE. Such a large range
will be converted into a 'fullmm' flush by the low-level invalidation
code, so change the caller to invoke tlb_gather_mmu_fullmm() directly.
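Sketched, the resulting caller looks roughly like this (the PTI feature
check is omitted here):
  static void free_ldt_pgtables(struct mm_struct *mm)
  {
          struct mmu_gather tlb;

          /* LDT_BASE_ADDR..LDT_END_ADDR lies above TASK_SIZE, so skip the
           * user-range tracking and flush the whole mm directly. */
          tlb_gather_mmu_fullmm(&tlb, mm);
          free_pgd_range(&tlb, LDT_BASE_ADDR, LDT_END_ADDR,
                         LDT_BASE_ADDR, LDT_END_ADDR);
          tlb_finish_mmu(&tlb);
  }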
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-7-will@kernel.org
|
|
If __tlb_remove_tlb_entry() is not defined by the architecture then
we provide an empty definition in asm-generic/tlb.h.
Remove the redundant empty definitions for sparc64 and x86.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-6-will@kernel.org
|
|
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer
needed now that there is a separate function for 'fullmm' flushing.
Remove the unused arguments and update all callers.
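In short, the interface change looks like this:
  /* before */
  void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
                      unsigned long start, unsigned long end);

  /* after: the range starts empty; 'fullmm' gets its own entry point */
  void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
  void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);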
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com
|
|
Since commit 7a30df49f63a ("mm: mmu_gather: remove __tlb_reset_range()
for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu()
are no longer used, since we flush the whole mm in case of a nested
invalidation.
Remove the unused arguments and update all callers.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org
|
|
With retpolines disabled, some configurations of GCC, specifically
GCC versions 9 and 10 in Ubuntu, will add Intel CET instrumentation
to the kernel by default. That breaks certain tracing scenarios by
adding a superfluous ENDBR64 instruction before the fentry call, for
functions which can be called indirectly.
CET instrumentation isn't currently necessary in the kernel, as CET is
only supported in user space. Disable it unconditionally and move it
into the x86 Makefile, as CET/CFI enablement should be a per-arch
decision anyway.
[ bp: Massage and extend commit message. ]
Fixes: 29be86d7f9cb ("kbuild: add -fcf-protection=none when using retpoline flags")
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Cc: <stable@vger.kernel.org>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Link: https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble
|
|
Use sizeof() instead of a constant in fpstate_sanitize_xstate().
Remove use of the address of the 0th array element of ->st_space and
->xmm_space, which is equivalent to the array address itself:
No code changed:
# arch/x86/kernel/fpu/xstate.o:
text data bss dec hex filename
9694 899 4 10597 2965 xstate.o.before
9694 899 4 10597 2965 xstate.o.after
md5:
5a43fc70bad8e2a1784f67f01b71aabb xstate.o.before.asm
5a43fc70bad8e2a1784f67f01b71aabb xstate.o.after.asm
[ bp: Massage commit message. ]
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210122071925.41285-1-yejune.deng@gmail.com
|
|
The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
instructions, so there's no need to sanitize the FPU state. Skip it to get
most of the performance we lost back.
Reported-by: Krzysztof Olędzki <ole@ans.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/57f8841ccbf9f3c25a23196c888f5f6ec5887577.1611205691.git.luto@kernel.org
|
|
EFI uses kernel_fpu_begin() to conform to the UEFI calling convention.
This specifically requires initializing FCW (FPU Control Word), whereas
no sane 64-bit kernel code should use legacy 387 operations that
reference FCW.
This should make it possible to safely change the default semantics of
kernel_fpu_begin() to stop initializing FCW on 64-bit kernels.
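A minimal sketch of the explicit convention, assuming the
kernel_fpu_begin_mask() helper introduced alongside this series:
  /* EFI needs a fully initialized FPU state, including FCW, so request
   * that explicitly instead of relying on kernel_fpu_begin() defaults. */
  static inline void efi_fpu_begin(void)
  {
          kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
  }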
[ bp: Massage commit message a little. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/25d392fff64680e0f4bb8cf0b1003314dc29eafe.1611205691.git.luto@kernel.org
|
|
The "oprofile" user-space tools don't use the kernel OPROFILE support
any more, and haven't in a long time. User-space has been converted to
the perf interfaces.
Remove the old oprofile's architecture specific support.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Robert Richter <rric@kernel.org>
Acked-by: William Cohen <wcohen@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
|
|
drivers/net/can/dev.c
b552766c872f ("can: dev: prevent potential information leak in can_fill_info()")
3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir")
0a042c6ec991 ("can: dev: move netlink related code into seperate file")
Code move.
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
57ac4a31c483 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down")
214baf22870c ("net/mlx5e: Support HTB offload")
Adjacent code changes.
net/switchdev/switchdev.c
20776b465c0c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP")
ffb68fc58e96 ("net: switchdev: remove the transaction structure from port object notifiers")
bae33f2b5afe ("net: switchdev: remove the transaction structure from port attributes")
The transaction parameter gets dropped; otherwise, keep the fix.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A fix for a regression introduced in 5.11 resulting in Xen dom0
having problems correctly initializing Xenstore.
- A fix for avoiding WARN splats when booting as Xen dom0 with
CONFIG_AMD_MEM_ENCRYPT enabled due to a missing trap handler for the
#VC exception (even if the handler should never be called).
- A fix for the Xen blkfront driver adapting to the correct but
unexpected behavior of new qemu.
* tag 'for-linus-5.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled
xen: Fix XenStore initialisation for XS_LOCAL
xen-blkfront: allow discard-* nodes to be optional
|
|
Grab kvm->lock before pinning memory when registering an encrypted
region; sev_pin_memory() relies on kvm->lock being held to ensure
correctness when checking and updating the number of pinned pages.
Add a lockdep assertion to help prevent future regressions.
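The assertion itself is a one-liner at the top of sev_pin_memory()
(sketch):
  lockdep_assert_held(&kvm->lock);        /* pinned-page accounting needs it */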
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: 1e80fdc09d12 ("KVM: SVM: Pin guest memory when SEV is active")
Signed-off-by: Peter Gonda <pgonda@google.com>
V2
- Fix up patch description
- Correct file paths svm.c -> sev.c
- Add unlock of kvm->lock on sev_pin_memory error
V1
- https://lore.kernel.org/kvm/20210126185431.1824530-1-pgonda@google.com/
Message-Id: <20210127161524.2832400-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Collect the scattered SME/SEV related feature flags into a dedicated
word. There are now five recognized features in CPUID.0x8000001F.EAX,
with at least one more on the horizon (SEV-SNP). Using a dedicated word
allows KVM to use its automagic CPUID adjustment logic when reporting
the set of supported features to userspace.
No functional change intended.
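Schematically, the scattered bits become plain entries in the new word;
the word number and comments below are illustrative, not the verbatim
patch:
  /* CPUID.0x8000001F.EAX collected into one feature word */
  #define X86_FEATURE_SME                 (19*32 + 0) /* Secure Memory Encryption */
  #define X86_FEATURE_SEV                 (19*32 + 1) /* Secure Encrypted Virtualization */
  #define X86_FEATURE_VM_PAGE_FLUSH       (19*32 + 2) /* VM Page Flush MSR */
  #define X86_FEATURE_SEV_ES              (19*32 + 3) /* SEV Encrypted State */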
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Link: https://lkml.kernel.org/r/20210122204047.2860075-2-seanjc@google.com
|
|
This is similar to commit
b21ebf2fb4cd ("x86: Treat R_X86_64_PLT32 as R_X86_64_PC32")
but for i386. As far as the kernel is concerned, R_386_PLT32 can be
treated the same as R_386_PC32.
R_386_PLT32/R_X86_64_PLT32 are PC-relative relocation types which
can only be used by branches. If the referenced symbol is defined
externally, a PLT will be used.
R_386_PC32/R_X86_64_PC32 are PC-relative relocation types which can be
used by address taking operations and branches. If the referenced symbol
is defined externally, a copy relocation/canonical PLT entry will be
created in the executable.
On x86-64, there is no PIC vs non-PIC PLT distinction and an
R_X86_64_PLT32 relocation is produced for both `call/jmp foo` and
`call/jmp foo@PLT` with newer (2018) GNU as/LLVM integrated assembler.
This avoids canonical PLT entries (st_shndx=0, st_value!=0).
On i386, there are 2 types of PLTs, PIC and non-PIC. Currently,
the GCC/GNU as convention is to use R_386_PC32 for non-PIC PLT and
R_386_PLT32 for PIC PLT. Copy relocations/canonical PLT entries
are possible ABI issues, but GCC/GNU as will likely keep the status
quo because (1) the ABI is legacy and (2) the change would drop a GNU
ld diagnostic for non-default-visibility ifuncs in shared objects.
clang-12 -fno-pic (since [1]) can emit R_386_PLT32 for compiler
generated function declarations, because preventing canonical PLT
entries is weighed over the rare ifunc diagnostic.
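A self-contained sketch of the loader-side equivalence (relocation type
numbers are from the i386 psABI; this is illustrative, not the exact
diff):
  #include <stdint.h>

  enum { R_386_PC32 = 2, R_386_PLT32 = 4 };

  /* Both types resolve as "symbol value minus place", so a module
   * loader can share one case for them. */
  static void apply_pcrel32(uint32_t *location, uint32_t sym_value, int type)
  {
          switch (type) {
          case R_386_PC32:
          case R_386_PLT32:               /* treated as PC32 */
                  *location += sym_value - (uint32_t)(uintptr_t)location;
                  break;
          }
  }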
Further info for the more interested:
https://github.com/ClangBuiltLinux/linux/issues/1210
https://sourceware.org/bugzilla/show_bug.cgi?id=27169
https://github.com/llvm/llvm-project/commit/a084c0388e2a59b9556f2de0083333232da3f1d6 [1]
[ bp: Massage commit message. ]
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Link: https://lkml.kernel.org/r/20210127205600.1227437-1-maskray@google.com
|
|
Recent commit 255cbecfe0 modified struct kvm_vcpu_arch to make
'cpuid_entries' a pointer to an array of kvm_cpuid_entry2 entries
rather than embedding the array in the struct. KVM_SET_CPUID and
KVM_SET_CPUID2 were updated accordingly, but KVM_GET_CPUID2 was missed.
As a result, KVM_GET_CPUID2 currently returns random fields from struct
kvm_vcpu_arch to userspace rather than the expected CPUID values. Fix
this by treating 'cpuid_entries' as a pointer when copying its
contents to the userspace buffer.
Fixes: 255cbecfe0c9 ("KVM: x86: allocate vcpu->arch.cpuid_entries dynamically")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20210128024451.1816770-1-michael.roth@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Commit
a7e1f67ed29f ("x86/msr: Filter MSR writes")
introduced a module parameter to disable writing to the MSR device file
and tainted the kernel upon writing. As MSR registers can be written by
the X86_IOC_WRMSR_REGS ioctl too, the same filtering and tainting should
be applied to the ioctl as well.
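Condensed sketch of the ioctl-side change, mirroring the existing
write() path (not the verbatim diff):
  case X86_IOC_WRMSR_REGS:
          if (copy_from_user(&regs, uregs, sizeof(regs))) {
                  err = -EFAULT;
                  break;
          }
          err = filter_write(regs[1]);    /* regs[1] = ecx = MSR index */
          if (err)
                  break;
          add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
          err = wrmsr_safe_regs_on_cpu(cpu, regs);
          break;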
[ bp: Massage commit message and space out statements. ]
Fixes: a7e1f67ed29f ("x86/msr: Filter MSR writes")
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210127122456.13939-1-misono.tomohiro@jp.fujitsu.com
|
|
Perfmon-v4 counter freezing is fundamentally broken; remove this
default-disabled code to make sure nobody uses it.
The feature is called Freeze-on-PMI in the SDM, and if it did that,
there wouldn't actually be a problem; *however*, it does something subtly
different. It globally disables the whole PMU when it raises the PMI,
not when the PMI hits.
This means there's a window between the PMI getting raised and the PMI
actually getting served where we lose events, and this violates the
perf counter independence. That is, a counting event should not result
in a different event count when there is a sampling event co-scheduled.
This is known to break existing software (RR).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Clean up that CONFIG_RETPOLINE crud and replace the
indirect call x86_pmu.guest_get_msrs with static_call().
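A sketch of the pattern, using the generic static_call API (not
necessarily the exact patch):
  DEFINE_STATIC_CALL_NULL(x86_pmu_guest_get_msrs, *x86_pmu.guest_get_msrs);

  static void x86_pmu_static_call_update(void)
  {
          /* once the active PMU is known at init time */
          static_call_update(x86_pmu_guest_get_msrs, x86_pmu.guest_get_msrs);
  }

  /* the hot path: no retpoline thunk needed anymore */
  struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
  {
          return static_call(x86_pmu_guest_get_msrs)(nr);
  }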
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210125121458.181635-1-like.xu@linux.intel.com
|
|
When booting a kernel which has been built with CONFIG_AMD_MEM_ENCRYPT
enabled as a Xen pv guest a warning is issued for each processor:
[ 5.964347] ------------[ cut here ]------------
[ 5.968314] WARNING: CPU: 0 PID: 1 at /home/gross/linux/head/arch/x86/xen/enlighten_pv.c:660 get_trap_addr+0x59/0x90
[ 5.972321] Modules linked in:
[ 5.976313] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 5.11.0-rc5-default #75
[ 5.980313] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A05 12/05/2013
[ 5.984313] RIP: e030:get_trap_addr+0x59/0x90
[ 5.988313] Code: 42 10 83 f0 01 85 f6 74 04 84 c0 75 1d b8 01 00 00 00 c3 48 3d 00 80 83 82 72 08 48 3d 20 81 83 82 72 0c b8 01 00 00 00 eb db <0f> 0b 31 c0 c3 48 2d 00 80 83 82 48 ba 72 1c c7 71 1c c7 71 1c 48
[ 5.992313] RSP: e02b:ffffc90040033d38 EFLAGS: 00010202
[ 5.996313] RAX: 0000000000000001 RBX: ffffffff82a141d0 RCX: ffffffff8222ec38
[ 6.000312] RDX: ffffffff8222ec38 RSI: 0000000000000005 RDI: ffffc90040033d40
[ 6.004313] RBP: ffff8881003984a0 R08: 0000000000000007 R09: ffff888100398000
[ 6.008312] R10: 0000000000000007 R11: ffffc90040246000 R12: ffff8884082182a8
[ 6.012313] R13: 0000000000000100 R14: 000000000000001d R15: ffff8881003982d0
[ 6.016316] FS: 0000000000000000(0000) GS:ffff888408200000(0000) knlGS:0000000000000000
[ 6.020313] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.024313] CR2: ffffc900020ef000 CR3: 000000000220a000 CR4: 0000000000050660
[ 6.028314] Call Trace:
[ 6.032313] cvt_gate_to_trap.part.7+0x3f/0x90
[ 6.036313] ? asm_exc_double_fault+0x30/0x30
[ 6.040313] xen_convert_trap_info+0x87/0xd0
[ 6.044313] xen_pv_cpu_up+0x17a/0x450
[ 6.048313] bringup_cpu+0x2b/0xc0
[ 6.052313] ? cpus_read_trylock+0x50/0x50
[ 6.056313] cpuhp_invoke_callback+0x80/0x4c0
[ 6.060313] _cpu_up+0xa7/0x140
[ 6.064313] cpu_up+0x98/0xd0
[ 6.068313] bringup_nonboot_cpus+0x4f/0x60
[ 6.072313] smp_init+0x26/0x79
[ 6.076313] kernel_init_freeable+0x103/0x258
[ 6.080313] ? rest_init+0xd0/0xd0
[ 6.084313] kernel_init+0xa/0x110
[ 6.088313] ret_from_fork+0x1f/0x30
[ 6.092313] ---[ end trace be9ecf17dceeb4f3 ]---
Reason is that there is no Xen pv trap entry for X86_TRAP_VC.
Fix that by adding a generic trap handler for unknown traps and wiring
all unknown bare-metal handlers to this generic handler, which will just
crash the system in case such a trap ever happens.
Fixes: 0786138c78e793 ("x86/sev-es: Add a Runtime #VC Exception Handler")
Cc: <stable@vger.kernel.org> # v5.10
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Pull kvm fixes from Paolo Bonzini:
- x86 bugfixes
- Documentation fixes
- Avoid performance regression due to SEV-ES patches
- ARM:
- Don't allow tagged pointers to point to memslots
- Filter out ARMv8.1+ PMU events on v8.0 hardware
- Hide PMU registers from userspace when no PMU is configured
- More PMU cleanups
- Don't try to handle broken PSCI firmware
- More sys_reg() to reg_to_encoding() conversions
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
KVM: x86: Revert "KVM: x86: Mark GPRs dirty when written"
KVM: SVM: Unconditionally sync GPRs to GHCB on VMRUN of SEV-ES guest
KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
kvm: tracing: Fix unmatched kvm_entry and kvm_exit events
KVM: Documentation: Update description of KVM_{GET,CLEAR}_DIRTY_LOG
KVM: x86: get smi pending status correctly
KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]
KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
KVM: x86: Add more protection against undefined behavior in rsvd_bits()
KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM
KVM: Forbid the use of tagged userspace addresses for memslots
KVM: arm64: Filter out v8.1+ events on v8.0 HW
KVM: arm64: Compute TPIDR_EL2 ignoring MTE tag
KVM: arm64: Use the reg_to_encoding() macro instead of sys_reg()
KVM: arm64: Allow PSCI SYSTEM_OFF/RESET to return
KVM: arm64: Simplify handling of absent PMU system registers
KVM: arm64: Hide PMU registers from userspace when not available
|
|
The OBJECT_FILES_NON_STANDARD annotation is used to tell objtool to
ignore a file. File-level ignores won't work when validating vmlinux.o.
Instead, convert restore_image() and core_restore_code() to be ELF
functions. Their code is conventional enough for objtool to be able to
understand them.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/974f8ceb5385e470f72e93974c70ab5c894bb0dc.1611263462.git.jpoimboe@redhat.com
|
|
Because restore_registers() is page-aligned, the assembler inexplicably
adds an unreachable jump from after the end of the previous function to
the beginning of restore_registers().
That confuses objtool, understandably. It also creates significant text
fragmentation. As a result, most of the object file is wasted text
(nops).
Move restore_registers() to the beginning of the file to both prevent
the text fragmentation and avoid the dead jump instruction.
$ size /tmp/hibernate_asm_64.before.o /tmp/hibernate_asm_64.after.o
text data bss dec hex filename
4415 0 0 4415 113f /tmp/hibernate_asm_64.before.o
524 0 0 524 20c /tmp/hibernate_asm_64.after.o
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/8c7f634201d26453d73fe55032cbbdc05d004387.1611263462.git.jpoimboe@redhat.com
|
|
These indirect jumps are harmless; annotate them to make objtool's
retpoline validation happy.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/ba7a141c98f2c09c255b19bf78ee4a5f45d4ecb6.1611263462.git.jpoimboe@redhat.com
|
|
The OBJECT_FILES_NON_STANDARD annotation is used to tell objtool to
ignore a file. File-level ignores won't work when validating vmlinux.o.
Instead, tell objtool to ignore do_suspend_lowlevel() directly with the
STACK_FRAME_NON_STANDARD annotation.
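The annotation is a single line next to the function's declaration
(sketch):
  #include <linux/objtool.h>

  /* Exempt just this function from objtool's stack validation,
   * instead of ignoring the whole object file. */
  STACK_FRAME_NON_STANDARD(do_suspend_lowlevel);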
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/269eda576c53bc9ecc8167c211989111013a67aa.1611263462.git.jpoimboe@redhat.com
|
|
This indirect jump is harmless; annotate it to keep objtool's retpoline
validation happy.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/a7288e7043265d95c1a5d64f9fd751ead4854bdc.1611263462.git.jpoimboe@redhat.com
|
|
With objtool vmlinux.o validation of return_to_handler(), now that
objtool has visibility inside the retpoline, jumping from EMPTY state to
a proper function state results in a stack state mismatch.
return_to_handler() is actually quite normal despite the underlying
magic. Just annotate it as a normal function.
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/14f48e623f61dbdcd84cf27a56ed8ccae73199ef.1611263462.git.jpoimboe@redhat.com
|
|
This indirect jump is harmless; annotate it to keep objtool's retpoline
validation happy.
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/4797c72a258b26e06741c58ccd4a75c42db39c1d.1611263462.git.jpoimboe@redhat.com
|
|
The Xen hypercall page is filled with zeros, causing objtool to fall
through all the empty hypercall functions until it reaches a real
function, resulting in a stack state mismatch.
The build-time contents of the hypercall page don't matter because the
page gets rewritten by the hypervisor. Make it more palatable to
objtool by making each hypercall function a true empty function, with
nops and a return.
Cc: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/0883bde1d7a1fb3b6a4c952bc0200e873752f609.1611263462.git.jpoimboe@redhat.com
|
|
The OBJECT_FILES_NON_STANDARD annotation is used to tell objtool to
ignore a file. File-level ignores won't work when validating vmlinux.o.
Tweak the ELF metadata and unwind hints to allow objtool to follow the
code.
Cc: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/8b042a09c69e8645f3b133ef6653ba28f896807d.1611263462.git.jpoimboe@redhat.com
|
|
The ORC metadata generated for UNWIND_HINT_FUNC isn't actually very
func-like. With certain usages it can cause stack state mismatches
because it doesn't set the return address (CFI_RA).
Also, users of UNWIND_HINT_RET_OFFSET no longer need to set a custom
return stack offset. Instead they just need to specify a func-like
situation, so the current ret_offset code is hacky for no good reason.
Solve both problems by simplifying the RET_OFFSET handling and
converting it into a more useful UNWIND_HINT_FUNC.
If we end up needing the old 'ret_offset' functionality again in the
future, we should be able to support it pretty easily with the addition
of a custom 'sp_offset' in UNWIND_HINT_FUNC.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/db9d1f5d79dddfbb3725ef6d8ec3477ad199948d.1611263462.git.jpoimboe@redhat.com
|
|
Prevent an unreachable objtool warning after the sibling call detection
gets improved. ftrace_stub() is basically a function, annotate it as
such.
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/6845e1b2fb0723a95740c6674e548ba38c5ea489.1611263461.git.jpoimboe@redhat.com
|
|
VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
which may need to be loaded outside guest mode. Therefore we cannot
WARN in that case.
However, that part of nested_get_vmcs12_pages is _not_ needed at
vmentry time. Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
so that both vmentry and migration (and in the latter case, independently
of is_guest_mode) do the parts that are needed.
Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3ba: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
Cc: <stable@vger.kernel.org> # 5.10.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Revert the dirty/available tracking of GPRs now that KVM copies the GPRs
to the GHCB on any post-VMGEXIT VMRUN, even if a GPR is not dirty. Per
commit de3cd117ed2f ("KVM: x86: Omit caching logic for always-available
GPRs"), tracking for GPRs noticeably impacts KVM's code footprint.
This reverts commit 1c04d8c986567c27c56c05205dceadc92efb14ff.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Drop the per-GPR dirty checks when synchronizing GPRs to the GHCB; the
GPRs' dirty bits are set from time zero and never cleared, i.e. they will
always be seen as dirty. The obvious alternative would be to clear
the dirty bits when appropriate, but removing the dirty checks is
desirable as it allows reverting GPR dirty+available tracking, which
adds overhead to all flavors of x86 VMs.
Note, unconditionally writing the GPRs in the GHCB is tacitly allowed
by the GHCB spec, which allows the hypervisor (or guest) to provide
unnecessary info; it's the guest's responsibility to consume only what
it needs (the hypervisor is untrusted after all).
The guest and hypervisor can supply additional state if desired but
must not rely on that additional state being provided.
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Even when we are outside the nested guest, some vmcs02 fields
may not be in sync with vmcs12. This is intentional, even across
nested VM-exit, because the sync can be delayed until the nested
hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those
rarely accessed fields.
However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to
be able to restore it. To fix that, call copy_vmcs02_to_vmcs12_rare()
before the vmcs12 contents are copied to userspace.
Fixes: 7952d769c29ca ("KVM: nVMX: Sync rarely accessed guest fields only when needed")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
On VMX, if we exit and then re-enter immediately without leaving
the vmx_vcpu_run() function, the kvm_entry event is not logged.
That means we will see one (or more) kvm_exit, without its (their)
corresponding kvm_entry, as shown here:
CPU-1979 [002] 89.871187: kvm_entry: vcpu 1
CPU-1979 [002] 89.871218: kvm_exit: reason MSR_WRITE
CPU-1979 [002] 89.871259: kvm_exit: reason MSR_WRITE
It also seems possible for a kvm_entry event to be logged, but then
we leave vmx_vcpu_run() right away (if vmx->emulation_required is
true). In this case, we will have a spurious kvm_entry event in the
trace.
Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
(where trace_kvm_exit() already is).
A trace obtained with this patch applied looks like this:
CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395392: kvm_exit: reason MSR_WRITE
CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395503: kvm_exit: reason EXTERNAL_INTERRUPT
Of course, not calling trace_kvm_entry() in common x86 code any
longer means that we need to adjust the SVM side of things too.
Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The injection process of an SMI has two steps:
Step 1: Qemu clears CPU_INTERRUPT_SMI from cpu->interrupt_request and
issues kvm_vcpu_ioctl(cpu, KVM_SMI); KVM handles it in
kvm_vcpu_ioctl_smi(), which calls kvm_make_request(KVM_REQ_SMI, vcpu).
Step 2: Qemu issues kvm_vcpu_ioctl(cpu, KVM_RUN, 0); KVM calls
process_smi() if kvm_check_request(KVM_REQ_SMI, vcpu) is true, which
marks vcpu->arch.smi_pending = true.
So vcpu->arch.smi_pending is only set in step 2. Unfortunately, if the
vcpu is paused between step 1 and step 2, kvm_run->immediate_exit will
be set and the vcpu has to exit to Qemu immediately during step 2,
before vcpu->arch.smi_pending is marked true.
During VM migration, Qemu gets the SMI pending status from KVM using
the KVM_GET_VCPU_EVENTS ioctl during the downtime, so the SMI pending
status will be lost.
Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com>
Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The HW_REF_CPU_CYCLES event on the fixed counter 2 is pseudo-encoded as
0x0300 in the intel_perfmon_event_map[]. Correct its usage.
Fixes: 62079d8a4312 ("KVM: PMU: add proper support for fixed counter 2")
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20201230081916.63417-1-like.xu@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
We know vPMU will not work properly when (1) the guest bit_width(s) of
the [gp|fixed] counters are greater than the host's, or (2) the guest's
requested architectural events exceed the range supported by the host.
In those cases, set up a smaller left shift value and refresh the guest
cpuid entry, thus fixing the following UBSAN shift-out-of-bounds
warning:
shift exponent 197 is too large for 64-bit type 'long long unsigned int'
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
intel_pmu_refresh.cold+0x75/0x99 arch/x86/kvm/vmx/pmu_intel.c:348
kvm_vcpu_after_set_cpuid+0x65a/0xf80 arch/x86/kvm/cpuid.c:177
kvm_vcpu_ioctl_set_cpuid2+0x160/0x440 arch/x86/kvm/cpuid.c:308
kvm_arch_vcpu_ioctl+0x11b6/0x2d70 arch/x86/kvm/x86.c:4709
kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: syzbot+ae488dc136a4cc6ba32b@syzkaller.appspotmail.com
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210118025800.34620-1-like.xu@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add compile-time asserts in rsvd_bits() to guard against KVM passing in
garbage hardcoded values, and cap the upper bound at '63' for dynamic
values to prevent generating a mask that would overflow a u64.
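A self-contained sketch of the hardened helper; the kernel version
additionally uses BUILD_BUG_ON() for compile-time constants, only the
dynamic clamp is shown here:
  #include <stdint.h>

  /* Build a mask with bits s..e set, inclusive. Capping e at 63 keeps
   * the shift defined even for a garbage dynamic upper bound. */
  static inline uint64_t rsvd_bits(int s, int e)
  {
          e &= 63;
          if (e < s)
                  return 0;
          return ((2ULL << (e - s)) - 1) << s;
  }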
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210113204515.3473079-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The ACPI-enabled Intel MID platforms have neither a WDAT table nor proper
IDs to instantiate the watchdog device. In order to keep them working,
move the board code from arch/x86 to drivers/platform/x86.
Note: the complete SFI support is going to be removed, which is why PDx86
has been chosen as the new home for it. This is the only device which
needs additional code so far.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
|
|
Merge misc fixes from Andrew Morton:
"18 patches.
Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
memory-failure, and highmem), ubsan, proc, and MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add a couple more files to the Clang/LLVM section
proc_sysctl: fix oops caused by incorrect command parameters
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
mips/mm/highmem: use set_pte() for kmap_local()
mm/highmem: prepare for overriding set_pte_at()
sparc/mm/highmem: flush cache and TLB
mm: fix page reference leak in soft_offline_page()
ubsan: disable unsigned-overflow check for i386
kasan, mm: fix resetting page_alloc tags for HW_TAGS
kasan, mm: fix conflicts with init_on_alloc/free
kasan: fix HW_TAGS boot parameters
kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
mm: fix numa stats for thp migration
mm: memcg: fix memcg file_dirty numa stat
mm: memcg/slab: optimize objcg stock draining
mm: fix initialization of struct page for holes in memory layout
x86/setup: don't remove E820_TYPE_RAM for pfn 0
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Correct the marking of kthreads which are supposed to run on a
specific, single CPU vs such which are affine to only one CPU, mark
per-cpu workqueue threads as such and make sure that marking
"survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.
- A fix to not push away tasks on CPUs coming online.
- Have workqueue CPU hotplug code use cpu_possible_mask when breaking
affinity on CPU offlining so that pending workers can finish on newly
arrived onlined CPUs too.
- Dump tasks which haven't vacated a CPU which is currently being
unplugged.
- Register a special scale invariance callback which gets called on
resume from RAM to read out APERF/MPERF after resume and thus make
the schedutil scaling governor more precise.
* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Relax the set_cpus_allowed_ptr() semantics
sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
sched: Prepare to use balance_push in ttwu()
workqueue: Restrict affinity change to rescuer
workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
kthread: Extract KTHREAD_IS_PER_CPU
sched: Don't run cpu-online with balance_push() enabled
workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
sched/core: Print out straggler tasks in sched_cpu_dying()
x86: PM: Register syscore_ops for scale invariance
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a new Intel model number for Alder Lake
- Differentiate which aspects of the FPU state get saved/restored when
the FPU is used in-kernel and fix a boot crash on K7 due to early
MXCSR access before CR4.OSFXSR is even set.
- A couple of noinstr annotation fixes
- Correct die ID setting on AMD for users of topology information which
need the correct die ID
- A SEV-ES fix to handle string port IO to/from kernel memory properly
* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add another Alder Lake CPU to the Intel family
x86/mmx: Use KFPU_387 for MMX string operations
x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
x86/topology: Make __max_die_per_package available unconditionally
x86: __always_inline __{rd,wr}msr()
x86/mce: Remove explicit/superfluous tracing
locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
locking/lockdep: Cure noinstr fail
x86/sev: Fix nonistr violation
x86/entry: Fix noinstr fail
x86/cpu/amd: Set __max_die_per_package on AMD
x86/sev-es: Handle string port IO to kernel memory properly
|
|
Patch series "mm: fix initialization of struct page for holes in memory layout", v3.
Commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
rather that check each PFN") exposed several issues with the memory map
initialization and these patches fix those issues.
Initially there were crashes during compaction that Qian Cai reported
back in April [1]. It seemed back then that the problem was fixed, but
a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an
additional discussion at [3].
[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org
This patch (of 2):
The first 4Kb of memory is a BIOS-owned area and, to avoid its allocation
for the kernel, it was not listed in e820 tables as memory. As a result,
pfn 0 was never recognised by the generic memory management and is part
of neither node 0 nor ZONE_DMA.
If set_pfnblock_flags_mask() were ever called for the pageblock
corresponding to the first 2Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
Along with reserving the first 4Kb in e820 tables, the first several pages
are reserved with memblock in several places during setup_arch(). These
reservations are enough to ensure the kernel does not touch the BIOS area,
and it is not necessary to remove E820_TYPE_RAM for pfn 0.
Remove the update of e820 table that changes the type of pfn 0 and move
the comment describing why it was done to trim_low_memory_range() that
reserves the beginning of the memory.
Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This implements the missing mount_setattr() syscall. While the new mount
api allows changing the properties of a superblock, there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.
The new mount_setattr() syscall allows clearing and setting mount
options recursively in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:
int mount_setattr(int dfd, const char *path, unsigned flags,
                  struct mount_attr *uattr, size_t usize);
Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.
The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:
struct mount_attr {
        __u64 attr_set;
        __u64 attr_clr;
        __u64 propagation;
        __u64 userns_fd;
};
The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.
Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.
The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.
The @userns_fd field lets the user specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.
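For instance, remounting a whole tree read-only while clearing noexec
could look like the sketch below; the constants are redefined
defensively in case libc headers predate the syscall, and the syscall
number is the x86-64 one:
  #define _GNU_SOURCE
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  struct mount_attr {                     /* local copy of the uapi struct */
          uint64_t attr_set;
          uint64_t attr_clr;
          uint64_t propagation;
          uint64_t userns_fd;
  };

  #ifndef MOUNT_ATTR_RDONLY
  #define MOUNT_ATTR_RDONLY 0x00000001
  #endif
  #ifndef MOUNT_ATTR_NOEXEC
  #define MOUNT_ATTR_NOEXEC 0x00000008
  #endif
  #ifndef AT_RECURSIVE
  #define AT_RECURSIVE 0x8000
  #endif
  #ifndef __NR_mount_setattr
  #define __NR_mount_setattr 442          /* x86-64 */
  #endif

  int main(void)
  {
          struct mount_attr attr = {
                  .attr_set = MOUNT_ATTR_RDONLY,  /* raise rdonly ... */
                  .attr_clr = MOUNT_ATTR_NOEXEC,  /* ... lower noexec */
          };

          if (syscall(__NR_mount_setattr, -1, "/mnt", AT_RECURSIVE,
                      &attr, sizeof(attr)) < 0) {
                  perror("mount_setattr");
                  return 1;
          }
          return 0;
  }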
[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|