Age | Commit message (Collapse) | Author |
|
This patch adds ability to switch beetween two PHY SGMII modes.
Some hardware, for example, FPGA IP designs may use 6-wire mode
which enables differential SGMII clock to MAC.
Signed-off-by: Vitaly Gaiduk <vitaly.gaiduk@cloudbear.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add documentation of ti,sgmii-ref-clock-output-enable
which can be used to select SGMII mode type (4 or 6-wire).
Signed-off-by: Vitaly Gaiduk <vitaly.gaiduk@cloudbear.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A negative number of devices is nonsensical, so change the type to
unsigned. If the number of devices is 0, it is impossible for userspace
to interact with the module, so refuse loading the driver for that case.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The name of the module is "null_blk", not "null". Make `pr_info()` follow
the pattern of `pr_err()` log messages.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use proper ReST syntax for chapters. Add more information to enhance
standardization in the file and to make the rendering more homogeneous.
Add a SPDX identifier. Mark single-queue mode as deprecated.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into arm/late
Samsung mach/soc changes for v5.4, part 2
1. Fix system restart on S3C6410 due to missing match of watchdog,
2. Enable suppor for ARM architected timers on Exynos.
* tag 'samsung-soc-5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
ARM: exynos: Enable support for ARM architected timers
ARM: samsung: Fix system restart on S3C6410
MAINTAINERS: Extend patterns for Samsung SoC, Security Subsystem and clock drivers
ARM: s3c64xx: squash samsung_usb_phy.h into setup-usb-phy.c
ARM: exynos: Enable exynos-chipid driver
ARM: samsung: Include GPIO driver header
Link: https://lore.kernel.org/r/20190911183632.4317-2-krzk@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Unit of 'chunk_size' is byte, instead of sector, so fix it by setting
the queue_limits' max_discard_sectors to rs->md.chunk_sectors. Also,
rename chunk_size to chunk_size_bytes.
Without this fix, too big max_discard_sectors is applied on the request
queue of dm-raid, finally raid code has to split the bio again.
This re-split done by raid causes the following nested clone_endio:
1) one big bio 'A' is submitted to dm queue, and served as the original
bio
2) one new bio 'B' is cloned from the original bio 'A', and .map()
is run on this bio of 'B', and B's original bio points to 'A'
3) raid code sees that 'B' is too big, and split 'B' and re-submit
the remainded part of 'B' to dm-raid queue via generic_make_request().
4) now dm will handle 'B' as new original bio, then allocate a new
clone bio of 'C' and run .map() on 'C'. Meantime C's original bio
points to 'B'.
5) suppose now 'C' is completed by raid directly, then the following
clone_endio() is called recursively:
clone_endio(C)
->clone_endio(B) #B is original bio of 'C'
->bio_endio(A)
'A' can be big enough to make hundreds of nested clone_endio(), then
stack can be corrupted easily.
Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
You can pass opaque pointers directly.
I also renamed 'va' and 'vb' into more meaningful arguments.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
|
|
When CONFIG_MODULES is disabled, CONFIG_UNUSED_SYMBOLS is pointless,
thus it should be invisible.
Instead of adding "depends on MODULES", I moved it to the sub-menu
"Enable loadable module support", which is a better fit. I put it
close to TRIM_UNUSED_KSYMS because it depends on !UNUSED_SYMBOLS.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
|
|
These are located in the 'if MODULES' ... 'endif' block.
Remove the redundant dependencies.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
|
|
The code assumes log_num < in_num everywhere, and that is true as long as
in_num is incremented by descriptor iov count, and log_num by 1. However
this breaks if there's a zero sized descriptor.
As a result, if a malicious guest creates a vring desc with desc.len = 0,
it may cause the host kernel to crash by overflowing the log array. This
bug can be triggered during the VM migration.
There's no need to log when desc.len = 0, so just don't increment log_num
in this case.
Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
Cc: stable@vger.kernel.org
Reviewed-by: Lidong Chen <lidongchen@tencent.com>
Signed-off-by: ruippan <ruippan@tencent.com>
Signed-off-by: yongduan <yongduan@tencent.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
iovec addresses coming from vhost are assumed to be
pre-validated, but in fact can be speculated to a value
out of range.
Userspace address are later validated with array_index_nospec so we can
be sure kernel info does not leak through these addresses, but vhost
must also not leak userspace info outside the allowed memory table to
guests.
Following the defence in depth principle, make sure
the address is not validated out of node range.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Jason Wang <jasowang@redhat.com>
Tested-by: Jason Wang <jasowang@redhat.com>
|
|
The software node is searched from a list that may be empty
when the function is called. This makes sure that the
function returns NULL if the list is empty.
Fixes: 1666faedb567 ("software node: Add software_node_find_by_name()")
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Commit 8651ec01daed ("module: add support for symbol namespaces.")
broke linking for arm64 defconfig:
| lib/crypto/arc4.o: In function `__ksymtab_arc4_setkey':
| arc4.c:(___ksymtab+arc4_setkey+0x8): undefined reference to `no symbol'
| lib/crypto/arc4.o: In function `__ksymtab_arc4_crypt':
| arc4.c:(___ksymtab+arc4_crypt+0x8): undefined reference to `no symbol'
This is because the dummy initialisation of the 'namespace_offset' field
in 'struct kernel_symbol' when using EXPORT_SYMBOL on architectures with
support for PREL32 locations uses an offset from an absolute address (0)
in an effort to trick 'offset_to_pointer' into behaving as a NOP,
allowing non-namespaced symbols to be treated in the same way as those
belonging to a namespace.
Unfortunately, place-relative relocations require a symbol reference
rather than an absolute value and, although x86 appears to get away with
this due to placing the kernel text at the top of the address space, it
almost certainly results in a runtime failure if the kernel is relocated
dynamically as a result of KASLR.
Rework 'namespace_offset' so that a value of 0, which cannot occur for a
valid namespaced symbol, indicates that the corresponding symbol does
not belong to a namespace.
Cc: Matthias Maennich <maennich@google.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Fixes: 8651ec01daed ("module: add support for symbol namespaces.")
Reported-by: kbuild test robot <lkp@intel.com>
Tested-by: Matthias Maennich <maennich@google.com>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Matthias Maennich <maennich@google.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
|
|
Tx code doesn't clear the descriptors' status after cleaning.
So, if the budget is larger than number of used elems in a ring, some
descriptors will be accounted twice and xsk_umem_complete_tx will move
prod_tail far beyond the prod_head breaking the completion queue ring.
Fix that by limiting the number of descriptors to clean by the number
of used descriptors in the Tx ring.
'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
'next_to_clean' and 'next_to_use' indexes.
CC: stable@vger.kernel.org
Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Tested-by: William Tu <u9012063@gmail.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
There were a couple cases where the ITR value generated via the adaptive
ITR scheme could exceed 126. This resulted in the value becoming either 0
or something less than 10. Switching back and forth between a value less
than 10 and a value greater than 10 can cause issues as certain hardware
features such as RSC to not function well when the ITR value has dropped
that low.
CC: stable@vger.kernel.org
Fixes: b4ded8327fea ("ixgbe: Update adaptive ITR algorithm")
Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Commit cd7764fe9f73 ("KVM: x86: latch INITs while in system management mode")
changed code to latch INIT while vCPU is in SMM and process latched INIT
when leaving SMM. It left a subtle remark in commit message that similar
treatment should also be done while vCPU is in VMX non-root-mode.
However, INIT signals should actually be latched in various vCPU states:
(*) For both Intel and AMD, INIT signals should be latched while vCPU
is in SMM.
(*) For Intel, INIT should also be latched while vCPU is in VMX
operation and later processed when vCPU leaves VMX operation by
executing VMXOFF.
(*) For AMD, INIT should also be latched while vCPU runs with GIF=0
or in guest-mode with intercept defined on INIT signal.
To fix this:
1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
can define the various CPU states in which INIT signals should be
blocked and modify kvm_apic_accept_events() to use it.
2) Modify vmx_check_nested_events() to check for pending INIT signal
while vCPU in guest-mode. If so, emualte vmexit on
EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
but is currently left as a TODO comment to implement in the future
because nSVM don't yet implement svm_check_nested_events().
Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
guest (See nested_vmx_setup_ctls_msrs()).
If and when support for this activity state will be implemented,
kvm_check_nested_events() would need to avoid emulating vmexit on
INIT signal in case activity-state is wait-for-SIPI. In addition,
kvm_apic_accept_events() would need to be modified to avoid discarding
SIPI in case VMX activity-state is wait-for-SIPI but instead delay
SIPI processing to vmx_check_nested_events() that would clear
pending APIC events and emulate vmexit on SIPI.
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When the RX rings are created they are also populated with buffers
so that packets can be received. Usually these are kernel buffers,
but for AF_XDP in zero-copy mode, these are user-space buffers and
in this case the application might not have sent down any buffers
to the driver at this point. And if no buffers are allocated at ring
creation time, no packets can be received and no interrupts will be
generated so the NAPI poll function that allocates buffers to the
rings will never get executed.
To rectify this, we kick the NAPI context of any queue with an
attached AF_XDP zero-copy socket in two places in the code. Once
after an XDP program has loaded and once after the umem is registered.
This take care of both cases: XDP program gets loaded first then AF_XDP
socket is created, and the reverse, AF_XDP socket is created first,
then XDP program is loaded.
Fixes: 0a714186d3c0 ("i40e: add AF_XDP zero-copy Rx support")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Don't populate the array API on the stack but instead make it
static const. Makes the object code smaller by 58 bytes.
Before:
text data bss dec hex filename
82969 9763 256 92988 16b3c ixgbevf/ixgbevf_main.o
After:
text data bss dec hex filename
82815 9859 256 92930 16b02 ixgbevf/ixgbevf_main.o
(gcc version 9.2.1, amd64)
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Currently iavf unconditionally applies MAC address change requests. This
brings the VF in a state where it is no longer able to pass traffic if
the PF rejects a MAC filter change for the VF.
A typical scenario for a rejected MAC filter is for an untrusted VF to
request to change the MAC address when an administratively set MAC is
present.
To keep iavf working in this scenario the MAC filter handling in iavf
needs to act on the PF reply regarding the MAC filter change. In the
case of an ack the new MAC address gets set, whereas in the case of a
nack the previous MAC address needs to stay in place.
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
In the case of an invalid min Tx rate being requested
i40e_ndo_set_vf_bw() immediately returns -EINVAL instead of releasing
__I40E_VIRTCHNL_OP_PENDING first.
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
The macros used to specify the cloud filter fields are intended to be
individual bits. Declare them using the BIT() macro to make their
intention a little more clear.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
When variable "req_fec, fec, an" are empty,
dmesg shows log with "Requested FEC: , Negotiated FEC: , Autoneg:".
Add link dmesg log for cards without FEC.
Signed-off-by: Czeslaw Zagorski <czeslawx.zagorski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
The "Negotiated" string in i40e_print_link_message() function was missed.
This string has been added to the dmesg and small refactoring done removing
common substrings and unifying link status message format.
Without this patch it was not clear that FEC is related to negotiated FEC.
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Mark bits 0xD through 0xF for the command flags of a cloud filter as
reserved. These bits are not yet defined and are considered as reserved
in the data sheet.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
The bit 0x0001 used in the cloud filters adminq command is reserved, and
is not actually a valid type.
The Linux driver has never used this type, and it's not clear if any
driver ever has.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Remove a call to ktime_to_timespec64 by calling ktime_get_real_ts64
directly.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Use the common api, and don't access queue_mapping directly.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
While testing VF spawn/destroy the following panic occurred.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000029
[...]
Workqueue: i40e i40e_service_task [i40e]
RIP: 0010:i40e_sync_vsi_filters+0x6fd/0xc60 [i40e]
[...]
Call Trace:
? __switch_to_asm+0x35/0x70
? __switch_to_asm+0x41/0x70
? __switch_to_asm+0x35/0x70
? _cond_resched+0x15/0x30
i40e_sync_filters_subtask+0x56/0x70 [i40e]
i40e_service_task+0x382/0x11b0 [i40e]
? __switch_to_asm+0x41/0x70
? __switch_to_asm+0x41/0x70
process_one_work+0x1a7/0x3b0
worker_thread+0x30/0x390
? create_worker+0x1a0/0x1a0
kthread+0x112/0x130
? kthread_bind+0x30/0x30
ret_from_fork+0x35/0x40
Investigation revealed a race where pf->vf[vsi->vf_id].trusted may get
accessed by the watchdog via i40e_sync_filters_subtask() although
i40e_free_vfs() already free'd pf->vf.
To avoid this the call to i40e_sync_vsi_filters() in
i40e_sync_filters_subtask() needs to be guarded by __I40E_VF_DISABLE,
which is also used by i40e_free_vfs().
Note: put the __I40E_VF_DISABLE check after the
__I40E_MACVLAN_SYNC_PENDING check as the latter is more likely to
trigger.
CC: stable@vger.kernel.org
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
In ixgbe_configure_clsu32(), 'jump', 'input', and 'mask' are allocated
through kzalloc() respectively in a for loop body. Then,
ixgbe_clsu32_build_input() is invoked to build the input. If this process
fails, next iteration of the for loop will be executed. However, the
allocated 'jump', 'input', and 'mask' are not deallocated on this execution
path, leading to memory leaks.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
According to Intel SDM section 25.2 "Other Causes of VM Exits",
When INIT signal is received on a CPU that is running in VMX
non-root mode it should cause an exit with exit-reason of 3.
(See Intel SDM Appendix C "VMX BASIC EXIT REASONS")
This patch introduce the exit-reason definition.
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
* More selftests
* Improved KVM_S390_MEM_OP ioctl input checking
* Add kvm_valid_regs and kvm_dirty_regs invalid bit checking
|
|
The hrtimer which is used to emulate lapic timer is stopped during
vcpu reset, preemption timer should do the same.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This patch optimizes the virtual IPI emulation sequence:
write ICR2 write ICR2
write ICR read ICR2
read ICR ==> send virtual IPI
read ICR2 write ICR
send virtual IPI
It can reduce kvm-unit-tests/vmexit.flat IPI testing latency(from sender
send IPI to sender receive the ACK) from 3319 cycles to 3203 cycles on
SKylake server.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
On AMD processors, in PAE 32bit mode, nested KVM instances don't
work. The L0 host get a kernel OOPS, which is related to
arch.mmu->pae_root being NULL.
The reason for this is that when setting up nested KVM instance,
arch.mmu is set to &arch.guest_mmu (while normally, it would be
&arch.root_mmu). However, the initialization and allocation of
pae_root only creates it in root_mmu. KVM code (ie. in
mmu_alloc_shadow_roots) then accesses arch.mmu->pae_root, which is the
unallocated arch.guest_mmu->pae_root.
This fix just allocates (and frees) pae_root in both guest_mmu and
root_mmu (and also lm_root if it was allocated). The allocation is
subject to previous restrictions ie. it won't allocate anything on
64-bit and AFAIK not on Intel.
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=203923
Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu")
Signed-off-by: Jiri Palecek <jpalecek@web.de>
Tested-by: Jiri Palecek <jpalecek@web.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
x86_emulate_instruction() takes into account ctxt->have_exception flag
during instruction decoding, but in practice this flag is never set in
x86_decode_insn().
Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn")
Cc: stable@vger.kernel.org
Cc: Denis Lunev <den@virtuozzo.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
inject_emulated_exception() returns true if and only if nested page
fault happens. However, page fault can come from guest page tables
walk, either nested or not nested. In both cases we should stop an
attempt to read under RIP and give guest to step over its own page
fault handler.
This is also visible when an emulated instruction causes a #GP fault
and the VMware backdoor is enabled. To handle the VMware backdoor,
KVM intercepts #GP faults; with only the next patch applied,
x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL
instead of EMULATE_DONE. EMULATE_FAIL causes handle_exception_nmi()
(or gp_interception() for SVM) to re-inject the original #GP because it
thinks emulation failed due to a non-VMware opcode. This patch prevents
the issue as x86_emulate_instruction() will return EMULATE_DONE after
injecting the #GP.
Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn")
Cc: stable@vger.kernel.org
Cc: Denis Lunev <den@virtuozzo.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
available
The downside of guest side polling is that polling is performed even
with other runnable tasks in the host. However, even if poll in kvm
can aware whether or not other runnable tasks in the same pCPU, it
can still incur extra overhead in over-subscribe scenario. Now we can
just enable guest polling when dedicated pCPUs are available.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
cpuidle-haltpoll can be built as a module to allow optional late load.
Given we are setting @owner to THIS_MODULE, cpuidle will attempt to grab a
module reference every time a cpuidle_device is registered -- so
essentially all online cpus get a reference.
This prevents for the module to be unloaded later, which makes the
module_exit callback entirely unused. Thus remove the @owner and allow
module to be unloaded.
Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
When a user loads cpuidle-haltpoll on a non KVM guest the module will
successfully load, even though idle driver registration didn't take
place.
We should instead return -ENODEV signaling the user that the driver can't
be loaded, like other error paths in haltpoll_init(). An example of such
error paths is when we return -EBUSY when attempting to register an idle
driver when it had one already (e.g. intel_idle loads at boot and then we
attempt to insert module cpuidle-haltpoll).
Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Right now, guest current governors have the following ratings:
* ladder -> 10
* teo -> 19
* menu -> 20
* haltpoll -> 21
* ladder + nohz=off -> 25
haltpoll governor got introduced and it is now the default governor given
its highest rating -- with ladder+nohz being the exception -- regardless of
idle driver in the guest. An example of an undesirable case is x86 KVM
guests with MWAIT which have intel_idle registered first, and consequently
will have haltpoll be used as governor which would get limited to a poll
state and state 1 and the other states wouldn't get used.
To keep the previous defaults we decrease rating of governor to 9 (below
current lowest rating) and thus rely on @governor switch on
cpuidle_register_driver() to tie in haltpoll idle driver and governor
together.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The recently introduced haltpoll driver is largely only useful with
haltpoll governor. To allow drivers to associate with a particular idle
behaviour, add a @governor property to 'struct cpuidle_driver' and thus
allow a cpuidle driver to switch to a *preferred* governor on idle driver
registration. We save the previous governor, and when an idle driver is
unregistered we switch back to that.
The @governor can be overridden by cpuidle.governor= boot param or
alternatively be ignored if the governor doesn't exist.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Use the recently added tracepoint for logging nested VM-Enter failures
instead of spamming the kernel log when hardware detects a consistency
check failure. Take the opportunity to print the name of the error code
instead of dumping the raw hex number, but limit the symbol table to
error codes that can reasonably be encountered by KVM.
Add an equivalent tracepoint in nested_vmx_check_vmentry_hw(), e.g. so
that tracing of "invalid control field" errors isn't suppressed when
nested early checks are enabled.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Debugging a failed VM-Enter is often like searching for a needle in a
haystack, e.g. there are over 80 consistency checks that funnel into
the "invalid control field" error code. One way to expedite debug is
to run the buggy code as an L1 guest under KVM (and pray that the
failing check is detected by KVM). However, extracting useful debug
information out of L0 KVM requires attaching a debugger to KVM and/or
modifying the source, e.g. to log which check is failing.
Make life a little less painful for VMM developers and add a tracepoint
for failed VM-Enter consistency checks. Ideally the tracepoint would
capture both what check failed and precisely why it failed, but logging
why a checked failed is difficult to do in a generic tracepoint without
resorting to invasive techniques, e.g. generating a custom string on
failure. That being said, for the vast majority of VM-Enter failures
the most difficult step is figuring out exactly what to look at, e.g.
figuring out which bit was incorrectly set in a control field is usually
not too painful once the guilty field as been identified.
To reach a happy medium between precision and ease of use, simply log
the code that detected a failed check, using a macro to execute the
check and log the trace event on failure. This approach enables tracing
arbitrary code, e.g. it's not limited to function calls or specific
formats of checks, and the changes to the existing code are minimally
invasive. A macro with a two-character name is desirable as usage of
the macro doesn't result in overly long lines or confusing alignment,
while still retaining some amount of readability. I.e. a one-character
name is a little too terse, and a three-character name results in the
contents being passed to the macro aligning with an indented line when
the macro is used an in if-statement, e.g.:
if (VCC(nested_vmx_check_long_line_one(...) &&
nested_vmx_check_long_line_two(...)))
return -EINVAL;
And that is the story of how the CC(), a.k.a. Consistency Check, macro
got its name.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
We refactored this code a bit and accidentally deleted the "-" character
from "-EINVAL". The kvm_vcpu_map() function never returns positive
EINVAL.
Fixes: c8e16b78c614 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()")
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
|
|
The BCM2835 SPI driver currently sets the SPI_CONTROLLER_MUST_TX flag.
When performing an RX-only transfer, this flag causes the SPI core to
allocate and DMA-map a dummy buffer which is copied to the TX FIFO.
The dummy buffer is necessary because the chip is not capable of
automatically clocking out null bytes.
Avoid the overhead induced by the dummy buffer by preallocating a
reusable DMA transaction which fills the TX FIFO by cyclically copying
from the zero page. The transaction requires very little CPU time to
submit and generates no interrupts while running. Specifics are
provided in kerneldoc comments.
[Nathan Chancellor contributed a DMA mapping fixup for an early version
of this commit, hence his Signed-off-by.]
Tested-by: Nuno Sá <nuno.sa@analog.com>
Tested-by: Noralf Trønnes <noralf@tronnes.org>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Stefan Wahren <wahrenst@gmx.net>
Acked-by: Martin Sperl <kernel@martin.sperl.org>
Cc: Robert Jarzmik <robert.jarzmik@free.fr>
Link: https://lore.kernel.org/r/f45920af18dbf06e34129bbc406f53dc9c5d1075.1568187525.git.lukas@wunner.de
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
The BCM2835 SPI driver currently sets the SPI_CONTROLLER_MUST_RX flag.
When performing a TX-only transfer, this flag causes the SPI core to
allocate and DMA-map a dummy buffer into which the RX FIFO contents are
copied. The dummy buffer is necessary because the chip is not capable
of disabling the receiver or automatically throwing away received data.
Not reading the RX FIFO isn't an option either since transmission is
halted once it's full.
Avoid the overhead induced by the dummy buffer by preallocating a
reusable DMA transaction which cyclically clears the RX FIFO. The
transaction requires very little CPU time to submit and generates no
interrupts while running. Specifics are provided in kerneldoc comments.
With a ks8851 Ethernet chip attached to the SPI controller, I am seeing
a 30 us reduction in ping time with this commit (1.819 ms vs. 1.849 ms,
average of 100,000 packets) as well as a 2% reduction in CPU time
(75:08 vs. 76:39 for transmission of 5 GByte over the SPI bus).
The commit uses the TX DMA interrupt to signal completion of a transfer.
This interrupt is raised once all bytes have been written to the
TX FIFO and it is then necessary to busy-wait for the TX FIFO to become
empty before the transfer can be finalized. As an alternative approach,
I have explored using the SPI controller's DONE interrupt to detect
completion. This interrupt is signaled when the TX FIFO becomes empty,
avoiding the need to busy-wait. However latency deteriorates compared
to the present commit and surprisingly, CPU time is slightly higher as
well:
It turns out that in 45% of the cases, no busy-waiting is needed at all
and in 76% of the cases, less than 10 busy-wait iterations are
sufficient for the TX FIFO to drain. This was measured on an RT kernel.
On a vanilla kernel, wakeup latency is worse and thus fewer iterations
are needed. The measurements were made with an SPI clock of 20 MHz,
they may differ slightly for slower or faster clock speeds.
Previously we always used the RX DMA interrupt to signal completion of a
transfer. Using the TX DMA interrupt now introduces a race condition:
TX DMA is always started before RX DMA so that bytes are already clocked
out while RX DMA is still being set up. But if a TX-only transfer is
very short, then the TX DMA interrupt may occur before RX DMA is set up.
If the interrupt happens to occur on the same CPU, setup of RX DMA may
even be delayed until after the interrupt was handled.
I've solved this by having the TX DMA callback clear the RX FIFO while
busy-waiting for the TX FIFO to drain, thus avoiding a dependency on
setup of RX DMA. Additionally, I am using a lock-free mechanism with
two flags, tx_dma_active and rx_dma_active plus memory barriers to
terminate RX DMA either by the TX DMA callback or immediately after
setting it up, whichever wins the race. I've explored an alternative
approach which temporarily disables the TX DMA callback until RX DMA
has been set up (using tasklet_disable(), local_bh_disable() or
local_irq_save()), but the performance was minimally worse.
[Nathan Chancellor contributed a DMA mapping fixup for an early version
of this commit, hence his Signed-off-by.]
Tested-by: Nuno Sá <nuno.sa@analog.com>
Tested-by: Noralf Trønnes <noralf@tronnes.org>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Stefan Wahren <wahrenst@gmx.net>
Acked-by: Martin Sperl <kernel@martin.sperl.org>
Cc: Robert Jarzmik <robert.jarzmik@free.fr>
Link: https://lore.kernel.org/r/874949385f28251e2dcaa9494e39a27b50e9f9e4.1568187525.git.lukas@wunner.de
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
The BCM2835 DMA controller is capable of synthesizing zeroes instead of
copying them from a source address. The feature is enabled by setting
the SRC_IGNORE bit in the Transfer Information field of a Control Block:
"Do not perform source reads.
In addition, destination writes will zero all the write strobes.
This is used for fast cache fill operations."
https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf
The feature is only available on 8 of the 16 channels. The others are
so-called "lite" channels with a limited feature set and performance.
Enable the feature if a cyclic transaction copies from the zero page.
This reduces traffic on the memory bus.
A forthcoming use case is the BCM2835 SPI driver, which will cyclically
copy from the zero page to the TX FIFO. The idea to use SRC_IGNORE was
taken from an ancient GitHub conversation between Martin and Noralf:
https://github.com/msperl/spi-bcm2835/issues/13#issuecomment-98180451
Tested-by: Nuno Sá <nuno.sa@analog.com>
Tested-by: Noralf Trønnes <noralf@tronnes.org>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Vinod Koul <vkoul@kernel.org>
Acked-by: Stefan Wahren <wahrenst@gmx.net>
Acked-by: Martin Sperl <kernel@martin.sperl.org>
Cc: Florian Kauer <florian.kauer@koalo.de>
Link: https://lore.kernel.org/r/b2286c904408745192e4beb3de3c88f73e4a7210.1568187525.git.lukas@wunner.de
Signed-off-by: Mark Brown <broonie@kernel.org>
|