summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-02-02RISC-V: KVM: Fix SBI implementation versionAnup Patel
The SBI implementation version returned by KVM RISC-V should be the Host Linux version code. Fixes: c62a76859723 ("RISC-V: KVM: Add SBI v0.2 base extension") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-02-02RISC-V: KVM: make CY, TM, and IR counters accessible in VU modeMayuresh Chitale
Those applications that run in VU mode and access the time CSR cause a virtual instruction trap as Guest kernel currently does not initialize the scounteren CSR. To fix this, we should make CY, TM, and IR counters accessibile by default in VU mode (similar to OpenSBI). Fixes: a33c72faf2d73 ("RISC-V: KVM: Implement VCPU create, init and destroy functions") Cc: stable@vger.kernel.org Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-02-02kvm/riscv: rework guest entry logicMark Rutland
In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state (EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to exiting the EQS by calling guest_exit(). As the IRQ entry code will not wake RCU in this case, we may run the core IRQ code and IRQ handler without RCU watching, leading to various potential problems. Additionally, we do not inform lockdep or tracing that interrupts will be enabled during guest execution, which caan lead to misleading traces and warnings that interrupts have been enabled for overly-long periods. This patch fixes these issues by using the new timing and context entry/exit helpers to ensure that interrupts are handled during guest vtime but with RCU watching, with a sequence: guest_timing_enter_irqoff(); guest_state_enter_irqoff(); < run the vcpu > guest_state_exit_irqoff(); < take any pending IRQs > guest_timing_exit_irqoff(); Since instrumentation may make use of RCU, we must also ensure that no instrumented code is run during the EQS. I've split out the critical section into a new kvm_riscv_enter_exit_vcpu() helper which is marked noinstr. Fixes: 99cdc6c18c2d815e ("RISC-V: Add initial skeletal KVM support") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Anup Patel <anup@brainfault.org> Cc: Atish Patra <atishp@atishpatra.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Tested-by: Anup Patel <anup@brainfault.org> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-02-02perf/x86/intel/pt: Fix crash with stop filters in single-range modeTristan Hume
Add a check for !buf->single before calling pt_buffer_region_size in a place where a missing check can cause a kernel crash. Fixes a bug introduced by commit 670638477aed ("perf/x86/intel/pt: Opportunistically use single range output mode"), which added a support for PT single-range output mode. Since that commit if a PT stop filter range is hit while tracing, the kernel will crash because of a null pointer dereference in pt_handle_status due to calling pt_buffer_region_size without a ToPA configured. The commit which introduced single-range mode guarded almost all uses of the ToPA buffer variables with checks of the buf->single variable, but missed the case where tracing was stopped by the PT hardware, which happens when execution hits a configured stop filter. Tested that hitting a stop filter while PT recording successfully records a trace with this patch but crashes without this patch. Fixes: 670638477aed ("perf/x86/intel/pt: Opportunistically use single range output mode") Signed-off-by: Tristan Hume <tristan@thume.ca> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@kernel.org Link: https://lkml.kernel.org/r/20220127220806.73664-1-tristan@thume.ca
2022-02-02perf: uapi: Document perf_event_attr::sig_data truncation on 32 bit ↵Marco Elver
architectures Due to the alignment requirements of siginfo_t, as described in 3ddb3fd8cdb0 ("signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures"), siginfo_t::si_perf_data is limited to an unsigned long. However, perf_event_attr::sig_data is an u64, to avoid having to deal with compat conversions. Due to being an u64, it may not immediately be clear to users that sig_data is truncated on 32 bit architectures. Add a comment to explicitly point this out, and hopefully help some users save time by not having to deduce themselves what's happening. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/r/20220131103407.1971678-3-elver@google.com
2022-02-02selftests/perf_events: Test modification of perf_event_attr::sig_dataMarco Elver
Test that PERF_EVENT_IOC_MODIFY_ATTRIBUTES correctly modifies perf_event_attr::sig_data as well. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/r/20220131103407.1971678-2-elver@google.com
2022-02-02perf: Copy perf_event_attr::sig_data on modificationMarco Elver
The intent has always been that perf_event_attr::sig_data should also be modifiable along with PERF_EVENT_IOC_MODIFY_ATTRIBUTES, because it is observable by user space if SIGTRAP on events is requested. Currently only PERF_TYPE_BREAKPOINT is modifiable, and explicitly copies relevant breakpoint-related attributes in hw_breakpoint_copy_attr(). This misses copying perf_event_attr::sig_data. Since sig_data is not specific to PERF_TYPE_BREAKPOINT, introduce a helper to copy generic event-type-independent attributes on modification. Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/r/20220131103407.1971678-1-elver@google.com
2022-02-02x86/perf: Default set FREEZE_ON_SMI for allPeter Zijlstra
Kyle reported that rr[0] has started to malfunction on Comet Lake and later CPUs due to EFI starting to make use of CPL3 [1] and the PMU event filtering not distinguishing between regular CPL3 and SMM CPL3. Since this is a privilege violation, default disable SMM visibility where possible. Administrators wanting to observe SMM cycles can easily change this using the sysfs attribute while regular users don't have access to this file. [0] https://rr-project.org/ [1] See the Intel white paper "Trustworthy SMM on the Intel vPro Platform" at https://bugzilla.kernel.org/attachment.cgi?id=300300, particularly the end of page 5. Reported-by: Kyle Huey <me@kylehuey.com> Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Link: https://lkml.kernel.org/r/YfKChjX61OW4CkYm@hirez.programming.kicks-ass.net
2022-02-02sched: move autogroup sysctls into its own fileZhen Ni
move autogroup sysctls to autogroup.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220128095025.8745-1-nizhen@uniontech.com
2022-02-02selftests/rseq: x86-32: use %gs segment selector for accessing rseq thread areaMathieu Desnoyers
Rather than use rseq_get_abi() and pass its result through a register to the inline assembler, directly access the per-thread rseq area through a memory reference combining the %gs segment selector, the constant offset of the field in struct rseq, and the rseq_offset value (in a register). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-16-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: x86-64: use %fs segment selector for accessing rseq thread areaMathieu Desnoyers
Rather than use rseq_get_abi() and pass its result through a register to the inline assembler, directly access the per-thread rseq area through a memory reference combining the %fs segment selector, the constant offset of the field in struct rseq, and the rseq_offset value (in a register). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-15-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Fix: work-around asm goto compiler bugsMathieu Desnoyers
gcc and clang each have their own compiler bugs with respect to asm goto. Implement a work-around for compiler versions known to have those bugs. gcc prior to 4.8.2 miscompiles asm goto. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 gcc prior to 8.1.0 miscompiles asm goto at O1. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103908 clang prior to version 13.0.1 miscompiles asm goto at O2. https://github.com/llvm/llvm-project/issues/52735 Work around these issues by adding a volatile inline asm with memory clobber in the fallthrough after the asm goto and at each label target. Emit this for all compilers in case other similar issues are found in the future. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-14-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Remove arm/mips asm goto compiler work-aroundMathieu Desnoyers
The arm and mips work-around for asm goto size guess issues are not properly documented, and lack reference to specific compiler versions, upstream compiler bug tracker entry, and reproducer. I can only find a loosely documented patch in my original LKML rseq post refering to gcc < 7 on ARM, but it does not appear to be sufficient to track the exact issue. Also, I am not sure MIPS really has the same limitation. Therefore, remove the work-around until we can properly document this. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/lkml/20171121141900.18471-17-mathieu.desnoyers@efficios.com/
2022-02-02selftests/rseq: Fix warnings about #if checks of undefined tokensMathieu Desnoyers
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-12-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Fix ppc32 offsets by using long rather than off_tMathieu Desnoyers
The semantic of off_t is for file offsets. We mean to use it as an offset from a pointer. We really expect it to fit in a single register, and not use a 64-bit type on 32-bit architectures. Fix runtime issues on ppc32 where the offset is always 0 due to inconsistency between the argument type (off_t -> 64-bit) and type expected by the inline assembler (32-bit). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-11-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Fix ppc32 missing instruction selection "u" and "x" for ↵Mathieu Desnoyers
load/store Building the rseq basic test with gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12) Target: powerpc-linux-gnu leads to these errors: /tmp/ccieEWxU.s: Assembler messages: /tmp/ccieEWxU.s:118: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:118: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:121: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:121: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:626: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:626: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:629: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:629: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:735: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:735: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:738: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:738: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:741: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:741: Error: junk at end of line: `,8' Makefile:581: recipe for target 'basic_percpu_ops_test.o' failed Based on discussion with Linux powerpc maintainers and review of the use of the "m" operand in powerpc kernel code, add the missing %Un%Xn (where n is operand number) to the lwz, stw, ld, and std instructions when used with "m" operands. Using "WORD" to mean either a 32-bit or 64-bit type depending on the architecture is misleading. The term "WORD" really means a 32-bit type in both 32-bit and 64-bit powerpc assembler. The intent here is to wrap load/store to intptr_t into common macros for both 32-bit and 64-bit. Rename the macros with a RSEQ_ prefix, and use the terms "INT" for always 32-bit type, and "LONG" for architecture bitness-sized type. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-10-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Fix ppc32: wrong rseq_cs 32-bit field pointer on big endianMathieu Desnoyers
ppc32 incorrectly uses padding as rseq_cs pointer field. Fix this by using the rseq_cs.arch.ptr field. Use this field across all architectures. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-9-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Uplift rseq selftests for compatibility with glibc-2.35Mathieu Desnoyers
glibc-2.35 (upcoming release date 2022-02-01) exposes the rseq per-thread data in the TCB, accessible at an offset from the thread pointer, rather than through an actual Thread-Local Storage (TLS) variable, as the Linux kernel selftests initially expected. The __rseq_abi TLS and glibc-2.35's ABI for per-thread data cannot actively coexist in a process, because the kernel supports only a single rseq registration per thread. Here is the scheme introduced to ensure selftests can work both with an older glibc and with glibc-2.35+: - librseq exposes its own "rseq_offset, rseq_size, rseq_flags" ABI. - librseq queries for glibc rseq ABI (__rseq_offset, __rseq_size, __rseq_flags) using dlsym() in a librseq library constructor. If those are found, copy their values into rseq_offset, rseq_size, and rseq_flags. - Else, if those glibc symbols are not found, handle rseq registration from librseq and use its own IE-model TLS to implement the rseq ABI per-thread storage. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-8-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Introduce thread pointer gettersMathieu Desnoyers
This is done in preparation for the selftest uplift to become compatible with glibc-2.35. glibc-2.35 exposes the rseq per-thread data in the TCB, accessible at an offset from the thread pointer. The toolchains do not implement accessing the thread pointer on all architectures. Provide thread pointer getters for ppc and x86 which lack (or lacked until recently) toolchain support. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-7-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Introduce rseq_get_abi() helperMathieu Desnoyers
This is done in preparation for the selftest uplift to become compatible with glibc-2.35. glibc-2.35 exposes the rseq per-thread data in the TCB, accessible at an offset from the thread pointer, rather than through an actual Thread-Local Storage (TLS) variable, as the kernel selftests initially expected. Introduce a rseq_get_abi() helper, initially using the __rseq_abi TLS variable, in preparation for changing this userspace ABI for one which is compatible with glibc-2.35. Note that the __rseq_abi TLS and glibc-2.35's ABI for per-thread data cannot actively coexist in a process, because the kernel supports only a single rseq registration per thread. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-6-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Remove volatile from __rseq_abiMathieu Desnoyers
This is done in preparation for the selftest uplift to become compatible with glibc-2.35. All accesses to the __rseq_abi fields are volatile, but remove the volatile from the TLS variable declaration, otherwise we are stuck with volatile for the upcoming rseq_get_abi() helper. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-5-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: Remove useless assignment to cpu variableMathieu Desnoyers
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-4-mathieu.desnoyers@efficios.com
2022-02-02rseq: Remove broken uapi field layout on 32-bit little endianMathieu Desnoyers
The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is entirely wrong on 32-bit little endian: a preprocessor logic mistake wrongly uses the big endian field layout on 32-bit little endian architectures. Fortunately, those ptr32 accessors were never used within the kernel, and only meant as a convenience for user-space. Remove those and replace the whole rseq_cs union by a __u64 type, as this is the only thing really needed to express the ABI. Document how 32-bit architectures are meant to interact with this field. Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220127152720.25898-1-mathieu.desnoyers@efficios.com
2022-02-02selftests/rseq: introduce own copy of rseq uapi headerMathieu Desnoyers
The Linux kernel rseq uapi header has a broken layout for the rseq_cs.ptr field on 32-bit little endian architectures. The entire rseq_cs.ptr field is planned for removal, leaving only the 64-bit rseq_cs.ptr64 field available. Both glibc and librseq use their own copy of the Linux kernel uapi header, where they introduce proper union fields to access to the 32-bit low order bits of the rseq_cs pointer on 32-bit architectures. Introduce a copy of the Linux kernel uapi headers in the Linux kernel selftests. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220124171253.22072-2-mathieu.desnoyers@efficios.com
2022-02-02gpio: aggregator: Fix calling into sleeping GPIO controllersGeert Uytterhoeven
If the parent GPIO controller is a sleeping controller (e.g. a GPIO controller connected to I2C), getting or setting a GPIO triggers a might_sleep() warning. This happens because the GPIO Aggregator takes the can_sleep flag into account only for its internal locking, not for calling into the parent GPIO controller. Fix this by using the gpiod_[gs]et*_cansleep() APIs when calling into a sleeping GPIO controller. Reported-by: Mikko Salomäki <ms@datarespons.se> Fixes: 828546e24280f721 ("gpio: Add GPIO Aggregator") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
2022-02-02irqchip/sifive-plic: Add missing thead,c900-plic match stringGuo Ren
The thead,c900-plic has been used in opensbi to distinguish PLIC [1]. Although PLICs have the same behaviors in Linux, they are different hardware with some custom initializing in firmware(opensbi). Qute opensbi patch commit-msg by Samuel: The T-HEAD PLIC implementation requires setting a delegation bit to allow access from S-mode. Now that the T-HEAD PLIC has its own compatible string, set this bit automatically from the PLIC driver, instead of reaching into the PLIC's MMIO space from another driver. [1]: https://github.com/riscv-software-src/opensbi/commit/78c2b19218bd62653b9fb31623a42ced45f38ea6 Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Cc: Anup Patel <anup@brainfault.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Samuel Holland <samuel@sholland.org> Cc: Thomas Gleixner <tglx@linutronix.de> Tested-by: Samuel Holland <samuel@sholland.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220130135634.1213301-3-guoren@kernel.org
2022-02-02dt-bindings: update riscv plic compatible stringGuo Ren
Add the compatible string "thead,c900-plic" to the riscv plic bindings to support allwinner d1 SOC which contains c906 core. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Cc: Anup Patel <anup@brainfault.org> Cc: Heiko Stuebner <heiko@sntech.de> Cc: Rob Herring <robh@kernel.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Samuel Holland <samuel@sholland.org> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220130135634.1213301-2-guoren@kernel.org
2022-02-02irqchip/gic-v3-its: Skip HP notifier when no ITS is registeredMarc Zyngier
We have some systems out there that have both LPI support and an ITS, but that don't expose the ITS in their firmware tables (either because it is broken or because they run under a hypervisor that hides it...). Is such a configuration, we still register the HP notifier to free the allocated tables if needed, resulting in a warning as there is no memory to free (nothing was allocated the first place). Fix it by keying the HP notifier on the presence of at least one sucessfully probed ITS. Fixes: d23bc2bc1d63 ("irqchip/gic-v3-its: Postpone LPI pending table freeing and memreserve") Reported-by: Steev Klimaszewski <steev@kali.org> Tested-by: Steev Klimaszewski <steev@kali.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20220202103454.2480465-1-maz@kernel.org
2022-02-02KVM: s390: Return error on SIDA memop on normal guestJanis Schoetterl-Glausch
Refuse SIDA memops on guests which are not protected. For normal guests, the secure instruction data address designation, which determines the location we access, is not under control of KVM. Fixes: 19e122776886 (KVM: S390: protvirt: Introduce instruction data area bounce buffer) Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-02-02nvme-rdma: fix possible use-after-free in transport error_recovery workSagi Grimberg
While nvme_rdma_submit_async_event_work is checking the ctrl and queue state before preparing the AER command and scheduling io_work, in order to fully prevent a race where this check is not reliable the error recovery work must flush async_event_work before continuing to destroy the admin queue after setting the ctrl state to RESETTING such that there is no race .submit_async_event and the error recovery handler itself changing the ctrl state. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2022-02-02nvme-tcp: fix possible use-after-free in transport error_recovery workSagi Grimberg
While nvme_tcp_submit_async_event_work is checking the ctrl and queue state before preparing the AER command and scheduling io_work, in order to fully prevent a race where this check is not reliable the error recovery work must flush async_event_work before continuing to destroy the admin queue after setting the ctrl state to RESETTING such that there is no race .submit_async_event and the error recovery handler itself changing the ctrl state. Tested-by: Chris Leech <cleech@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2022-02-02nvme: fix a possible use-after-free in controller reset during loadSagi Grimberg
Unlike .queue_rq, in .submit_async_event drivers may not check the ctrl readiness for AER submission. This may lead to a use-after-free condition that was observed with nvme-tcp. The race condition may happen in the following scenario: 1. driver executes its reset_ctrl_work 2. -> nvme_stop_ctrl - flushes ctrl async_event_work 3. ctrl sends AEN which is received by the host, which in turn schedules AEN handling 4. teardown admin queue (which releases the queue socket) 5. AEN processed, submits another AER, calling the driver to submit 6. driver attempts to send the cmd ==> use-after-free In order to fix that, add ctrl state check to validate the ctrl is actually able to accept the AER submission. This addresses the above race in controller resets because the driver during teardown should: 1. change ctrl state to RESETTING 2. flush async_event_work (as well as other async work elements) So after 1,2, any other AER command will find the ctrl state to be RESETTING and bail out without submitting the AER. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2022-02-01Merge branch '1GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2022-02-01 This series contains updates to e1000e driver only. Sasha removes CSME handshake with TGL platform as this is not supported and is causing hardware unit hangs to be reported. * '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: e1000e: Handshake with CSME starts from ADL platforms e1000e: Separate ADP board type from TGP ==================== Link: https://lore.kernel.org/r/20220201173754.580305-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-02phy: dphy: Correct clk_pre parameterLiu Ying
The D-PHY specification (v1.2) explicitly mentions that the T-CLK-PRE parameter's unit is Unit Interval(UI) and the minimum value is 8. Also, kernel doc of the 'clk_pre' member of struct phy_configure_opts_mipi_dphy mentions that it should be in UI. However, the dphy core driver wrongly sets 'clk_pre' to 8000, which seems to hint that it's in picoseconds. So, let's fix the dphy core driver to correctly reflect the T-CLK-PRE parameter's minimum value according to the D-PHY specification. I'm assuming that all impacted custom drivers shall program values in TxByteClkHS cycles into hardware for the T-CLK-PRE parameter. The D-PHY specification mentions that the frequency of TxByteClkHS is exactly 1/8 the High-Speed(HS) bit rate(each HS bit consumes one UI). So, relevant custom driver code is changed to program those values as DIV_ROUND_UP(cfg->clk_pre, BITS_PER_BYTE), then. Note that I've only tested the patch with RM67191 DSI panel on i.MX8mq EVK. Help is needed to test with other i.MX8mq, Meson and Rockchip platforms, as I don't have the hardwares. Fixes: 2ed869990e14 ("phy: Add MIPI D-PHY configuration options") Tested-by: Liu Ying <victor.liu@nxp.com> # RM67191 DSI panel on i.MX8mq EVK Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> Reviewed-by: Neil Armstrong <narmstrong@baylibre.com> # for phy-meson-axg-mipi-dphy.c Tested-by: Neil Armstrong <narmstrong@baylibre.com> # for phy-meson-axg-mipi-dphy.c Tested-by: Guido Günther <agx@sigxcpu.org> # Librem 5 (imx8mq) with it's rather picky panel Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Liu Ying <victor.liu@nxp.com> Link: https://lore.kernel.org/r/20220124024007.1465018-1-victor.liu@nxp.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2022-02-01net/mlx5e: Avoid field-overflowing memcpy()Kees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use flexible arrays instead of zero-element arrays (which look like they are always overflowing) and split the cross-field memcpy() into two halves that can be appropriately bounds-checked by the compiler. We were doing: #define ETH_HLEN 14 #define VLAN_HLEN 4 ... #define MLX5E_XDP_MIN_INLINE (ETH_HLEN + VLAN_HLEN) ... struct mlx5e_tx_wqe *wqe = mlx5_wq_cyc_get_wqe(wq, pi); ... struct mlx5_wqe_eth_seg *eseg = &wqe->eth; struct mlx5_wqe_data_seg *dseg = wqe->data; ... memcpy(eseg->inline_hdr.start, xdptxd->data, MLX5E_XDP_MIN_INLINE); target is wqe->eth.inline_hdr.start (which the compiler sees as being 2 bytes in size), but copying 18, intending to write across start (really vlan_tci, 2 bytes). The remaining 16 bytes get written into wqe->data[0], covering byte_count (4 bytes), lkey (4 bytes), and addr (8 bytes). struct mlx5e_tx_wqe { struct mlx5_wqe_ctrl_seg ctrl; /* 0 16 */ struct mlx5_wqe_eth_seg eth; /* 16 16 */ struct mlx5_wqe_data_seg data[]; /* 32 0 */ /* size: 32, cachelines: 1, members: 3 */ /* last cacheline: 32 bytes */ }; struct mlx5_wqe_eth_seg { u8 swp_outer_l4_offset; /* 0 1 */ u8 swp_outer_l3_offset; /* 1 1 */ u8 swp_inner_l4_offset; /* 2 1 */ u8 swp_inner_l3_offset; /* 3 1 */ u8 cs_flags; /* 4 1 */ u8 swp_flags; /* 5 1 */ __be16 mss; /* 6 2 */ __be32 flow_table_metadata; /* 8 4 */ union { struct { __be16 sz; /* 12 2 */ u8 start[2]; /* 14 2 */ } inline_hdr; /* 12 4 */ struct { __be16 type; /* 12 2 */ __be16 vlan_tci; /* 14 2 */ } insert; /* 12 4 */ __be32 trailer; /* 12 4 */ }; /* 12 4 */ /* size: 16, cachelines: 1, members: 9 */ /* last cacheline: 16 bytes */ }; struct mlx5_wqe_data_seg { __be32 byte_count; /* 0 4 */ __be32 lkey; /* 4 4 */ __be64 addr; /* 8 8 */ /* size: 16, cachelines: 1, members: 3 */ /* last cacheline: 16 bytes */ }; So, split the memcpy() so the compiler can reason about the buffer sizes. "pahole" shows no size nor member offset changes to struct mlx5e_tx_wqe nor struct mlx5e_umr_wqe. "objdump -d" shows no meaningful object code changes (i.e. only source line number induced differences and optimizations). Fixes: b5503b994ed5 ("net/mlx5e: XDP TX forwarding support") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: Use struct_group() for memcpy() regionKees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use struct_group() in struct vlan_ethhdr around members h_dest and h_source, so they can be referenced together. This will allow memcpy() and sizeof() to more easily reason about sizes, improve readability, and avoid future warnings about writing beyond the end of h_dest. "pahole" shows no size nor member offset changes to struct vlan_ethhdr. "objdump -d" shows no object code changes. Fixes: 34802a42b352 ("net/mlx5e: Do not modify the TX SKB") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: Avoid implicit modify hdr for decap drop ruleRoi Dayan
Currently the driver adds implicit modify hdr action for decap rules on tunnel devices if the port is an ovs port. This is also done if the action is drop and makes the modify hdr redundant and also the FW doesn't support it and will generate a syndrome. kernel: mlx5_core 0000:08:00.0: mlx5_cmd_check:777:(pid 102063): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8708c3) Fix it by adding the implicit modify hdr only for fwd actions. Fixes: b16eb3c81fe2 ("net/mlx5: Support internal port as decap route device") Fixes: 077cdda764c7 ("net/mlx5e: TC, Fix memory leak with rules with internal port") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP trafficRaed Salem
IPsec Tunnel mode crypto offload software parser (SWP) setting in data path currently always set the inner L4 offset regardless of the encapsulated L4 header type and whether it exists in the first place, this breaks non TCP/UDP traffic as such. Set the SWP inner L4 offset only when the IPsec tunnel encapsulated L4 header protocol is TCP/UDP. While at it fix inner ip protocol read for setting MLX5_ETH_WQE_SWP_INNER_L4_UDP flag to address the case where the ip header protocol is IPv6. Fixes: f1267798c980 ("net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload") Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated trafficRaed Salem
IPsec crypto offload always set the ethernet segment checksum flags with the inner L4 header checksum flag enabled for encapsulated IPsec offloaded packet regardless of the encapsulated L4 header type, and even if it doesn't exists in the first place, this breaks non TCP/UDP traffic as such. Set the inner L4 checksum flag only when the encapsulated L4 header protocol is TCP/UDP using software parser swp_inner_l4_offset field as indication. Fixes: 5cfb540ef27b ("net/mlx5e: Set IPsec WAs only in IP's non checksum partial case.") Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: Don't treat small ceil values as unlimited in HTB offloadMaxim Mikityanskiy
The hardware spec defines max_average_bw == 0 as "unlimited bandwidth". max_average_bw is calculated as `ceil / BYTES_IN_MBIT`, which can become 0 when ceil is small, leading to an undesired effect of having no bandwidth limit. This commit fixes it by rounding up small values of ceil to 1 Mbit/s. Fixes: 214baf22870c ("net/mlx5e: Support HTB offload") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5: E-Switch, Fix uninitialized variable modactMaor Dickman
The variable modact is not initialized before used in command modify header allocation which can cause command to fail. Fix by initializing modact with zeros. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: 8f1e0b97cc70 ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping") Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: Fix handling of wrong devices during bond neteventMaor Dickman
Current implementation of bond netevent handler only check if the handled netdev is VF representor and it missing a check if the VF representor is on the same phys device of the bond handling the netevent. Fix by adding the missing check and optimizing the check if the netdev is VF representor so it will not access uninitialized private data and crashes. BUG: kernel NULL pointer dereference, address: 000000000000036c PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI Workqueue: eth3bond0 bond_mii_monitor [bonding] RIP: 0010:mlx5e_is_uplink_rep+0xc/0x50 [mlx5_core] RSP: 0018:ffff88812d69fd60 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff8881cf800000 RCX: 0000000000000000 RDX: ffff88812d69fe10 RSI: 000000000000001b RDI: ffff8881cf800880 RBP: ffff8881cf800000 R08: 00000445cabccf2b R09: 0000000000000008 R10: 0000000000000004 R11: 0000000000000008 R12: ffff88812d69fe10 R13: 00000000fffffffe R14: ffff88820c0f9000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88846fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000036c CR3: 0000000103d80006 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: mlx5e_eswitch_uplink_rep+0x31/0x40 [mlx5_core] mlx5e_rep_is_lag_netdev+0x94/0xc0 [mlx5_core] mlx5e_rep_esw_bond_netevent+0xeb/0x3d0 [mlx5_core] raw_notifier_call_chain+0x41/0x60 call_netdevice_notifiers_info+0x34/0x80 netdev_lower_state_changed+0x4e/0xa0 bond_mii_monitor+0x56b/0x640 [bonding] process_one_work+0x1b9/0x390 worker_thread+0x4d/0x3d0 ? rescuer_thread+0x350/0x350 kthread+0x124/0x150 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 Fixes: 7e51891a237f ("net/mlx5e: Use netdev events to set/del egress acl forward-to-vport rule") Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: Fix broken SKB allocation in HW-GROKhalid Manaa
In case the HW doesn't perform header-data split, it will write the whole packet into the data buffer in the WQ, in this case the SHAMPO CQE handler couldn't use the header entry to build the SKB, instead it should allocate a new memory to build the SKB using the function: mlx5e_skb_from_cqe_mpwrq_nonlinear. Fixes: f97d5c2a453e ("net/mlx5e: Add handle SHAMPO cqe support") Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: Fix wrong calculation of header index in HW_GROKhalid Manaa
The HW doesn't wrap the CQE.shampo.header_index field according to the headers buffer size, instead it always increases it until reaching overflow of u16 size. Thus the mlx5e_handle_rx_cqe_mpwrq_shampo handler should mask the CQE header_index field to find the actual header index in the headers buffer. Fixes: f97d5c2a453e ("net/mlx5e: Add handle SHAMPO cqe support") Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5: Bridge, Fix devlink deadlock on net namespace deletionRoi Dayan
When changing mode to switchdev, rep bridge init registered to netdevice notifier holds the devlink lock and then takes pernet_ops_rwsem. At that time deleting a netns holds pernet_ops_rwsem and then takes the devlink lock. Example sequence is: $ ip netns add foo $ devlink dev eswitch set pci/0000:00:08.0 mode switchdev & $ ip netns del foo deleting netns trace: [ 1185.365555] ? devlink_pernet_pre_exit+0x74/0x1c0 [ 1185.368331] ? mutex_lock_io_nested+0x13f0/0x13f0 [ 1185.370984] ? xt_find_table+0x40/0x100 [ 1185.373244] ? __mutex_lock+0x24a/0x15a0 [ 1185.375494] ? net_generic+0xa0/0x1c0 [ 1185.376844] ? wait_for_completion_io+0x280/0x280 [ 1185.377767] ? devlink_pernet_pre_exit+0x74/0x1c0 [ 1185.378686] devlink_pernet_pre_exit+0x74/0x1c0 [ 1185.379579] ? devlink_nl_cmd_get_dumpit+0x3a0/0x3a0 [ 1185.380557] ? xt_find_table+0xda/0x100 [ 1185.381367] cleanup_net+0x372/0x8e0 changing mode to switchdev trace: [ 1185.411267] down_write+0x13a/0x150 [ 1185.412029] ? down_write_killable+0x180/0x180 [ 1185.413005] register_netdevice_notifier+0x1e/0x210 [ 1185.414000] mlx5e_rep_bridge_init+0x181/0x360 [mlx5_core] [ 1185.415243] mlx5e_uplink_rep_enable+0x269/0x480 [mlx5_core] [ 1185.416464] ? mlx5e_uplink_rep_disable+0x210/0x210 [mlx5_core] [ 1185.417749] mlx5e_attach_netdev+0x232/0x400 [mlx5_core] [ 1185.418906] mlx5e_netdev_attach_profile+0x15b/0x1e0 [mlx5_core] [ 1185.420172] mlx5e_netdev_change_profile+0x15a/0x1d0 [mlx5_core] [ 1185.421459] mlx5e_vport_rep_load+0x557/0x780 [mlx5_core] [ 1185.422624] ? mlx5e_stats_grp_vport_rep_num_stats+0x10/0x10 [mlx5_core] [ 1185.424006] mlx5_esw_offloads_rep_load+0xdb/0x190 [mlx5_core] [ 1185.425277] esw_offloads_enable+0xd74/0x14a0 [mlx5_core] Fix this by registering rep bridges for per net netdev notifier instead of global one, which operats on the net namespace without holding the pernet_ops_rwsem. Fixes: 19e9bfa044f3 ("net/mlx5: Bridge, add offload infrastructure") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLEDima Chumak
Only prio 1 is supported for nic mode when there is no ignore flow level support in firmware. But for switchdev mode, which supports fixed number of statically pre-allocated prios, this restriction is not relevant so it can be relaxed. Fixes: d671e109bd85 ("net/mlx5: Fix tc max supported prio for nic mode") Signed-off-by: Dima Chumak <dchumak@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: TC, Reject rules with forward and drop actionsRoi Dayan
Such rules are redundant but allowed and passed to the driver. The driver does not support offloading such rules so return an error. Fixes: 03a9d11e6eeb ("net/mlx5e: Add TC drop and mirred/redirect action parsing for SRIOV offloads") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5: Use del_timer_sync in fw reset flow of halting pollMaher Sanalla
Substitute del_timer() with del_timer_sync() in fw reset polling deactivation flow, in order to prevent a race condition which occurs when del_timer() is called and timer is deactivated while another process is handling the timer interrupt. A situation that led to the following call trace: RIP: 0010:run_timer_softirq+0x137/0x420 <IRQ> recalibrate_cpu_khz+0x10/0x10 ktime_get+0x3e/0xa0 ? sched_clock_cpu+0xb/0xc0 __do_softirq+0xf5/0x2ea irq_exit_rcu+0xc1/0xf0 sysvec_apic_timer_interrupt+0x9e/0xc0 asm_sysvec_apic_timer_interrupt+0x12/0x20 </IRQ> Fixes: 38b9f903f22b ("net/mlx5: Handle sync reset request event") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: Fix module EEPROM queryGal Pressman
When querying the module EEPROM, there was a misusage of the 'offset' variable vs the 'query.offset' field. Fix that by always using 'offset' and assigning its value to 'query.offset' right before the mcia register read call. While at it, the cross-pages read size adjustment was changed to be more intuitive. Fixes: e19b0a3474ab ("net/mlx5: Refactor module EEPROM query") Reported-by: Wang Yugui <wangyugui@e16-tech.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: TC, Reject rules with drop and modify hdr actionRoi Dayan
This kind of action is not supported by firmware and generates a syndrome. kernel: mlx5_core 0000:08:00.0: mlx5_cmd_check:777:(pid 102063): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8708c3) Fixes: d7e75a325cb2 ("net/mlx5e: Add offloading of E-Switch TC pedit (header re-write) actions") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>