Age | Commit message (Collapse) | Author |
|
MIPS defines two kvm types:
#define KVM_VM_MIPS_TE 0
#define KVM_VM_MIPS_VZ 1
In Documentation/virt/kvm/api.rst it is said that "You probably want to
use 0 as machine type", which implies that type 0 be the "automatic" or
"default" type. And, in user-space libvirt use the null-machine (with
type 0) to detect the kvm capability, which returns "KVM not supported"
on a VZ platform.
I try to fix it in QEMU but it is ugly:
https://lists.nongnu.org/archive/html/qemu-devel/2020-08/msg05629.html
And Thomas Huth suggests me to change the definition of kvm type:
https://lists.nongnu.org/archive/html/qemu-devel/2020-09/msg03281.html
So I define like this:
#define KVM_VM_MIPS_AUTO 0
#define KVM_VM_MIPS_VZ 1
#define KVM_VM_MIPS_TE 2
Since VZ and TE cannot co-exists, using type 0 on a TE platform will
still return success (so old user-space tools have no problems on new
kernels); the advantage is that using type 0 on a VZ platform will not
return failure. So, the only problem is "new user-space tools use type
2 on old kernels", but if we treat this as a kernel bug, we can backport
this patch to old stable kernels.
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Message-Id: <1599734031-28746-1-git-send-email-chenhc@lemote.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- sdio: Restore ~20% performance drop for SDHCI drivers, by using
mmc_pre_req() and mmc_post_req() for SDIO requests.
MMC host:
- sdhci-of-esdhc: Fix support for erratum eSDHC7
- mmc_spi: Allow the driver to be built when CONFIG_HAS_DMA is unset
- sdhci-msm: Use retries to fix tuning
- sdhci-acpi: Fix resume for eMMC HS400 mode"
* tag 'mmc-v5.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdio: Use mmc_pre_req() / mmc_post_req()
mmc: sdhci-of-esdhc: Don't walk device-tree on every interrupt
mmc: mmc_spi: Allow the driver to be built when CONFIG_HAS_DMA is unset
mmc: sdhci-msm: Add retries when all tuning phases are found valid
mmc: sdhci-acpi: Clear amd_sdhci_host on reset
|
|
When kvm_mmu_get_page() gets a page with unsynced children, the spt
pagetable is unsynchronized with the guest pagetable. But the
guest might not issue a "flush" operation on it when the pagetable
entry is changed from zero or other cases. The hypervisor has the
responsibility to synchronize the pagetables.
KVM behaved as above for many years, But commit 8c8560b83390
("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
inadvertently included a line of code to change it without giving any
reason in the changelog. It is clear that the commit's intention was to
change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't
needlessly flush other contexts; however, one of the hunks changed
a nearby KVM_REQ_MMU_SYNC instead. This patch changes it back.
Link: https://lore.kernel.org/lkml/20200320212833.3507-26-sean.j.christopherson@intel.com/
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20200902135421.31158-1-jiangshanlai@gmail.com>
fixes: 8c8560b83390 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
in exit_ctls_high.
Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
VM-{Entry,Exit} control")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
when kmalloc() fails in kvm_io_bus_unregister_dev(), before removing
the bus, we should iterate over all other devices linked to it and call
kvm_iodevice_destructor() for them
Fixes: 90db10434b16 ("KVM: kvm_io_bus_unregister_dev() should never fail")
Cc: stable@vger.kernel.org
Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200907185535.233114-1-rkovhaev@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
check the allocation of per-cpu __pv_cpu_mask. Initialize ops only when
successful.
Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <d59f05df-e6d3-3d31-a036-cc25a2b2f33f@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
load_pdptrs to read the possibly updated PDPTEs from the guest
physical address referenced by CR3. It loads them into
vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
vcpu->arch.regs_dirty.
At the subsequent assumed reentry into L2, the mmu will call
vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
if the L2 CRn write intercept always resumes L2.
The resume path calls vmx_check_nested_events which checks for
exceptions, MTF, and expired VMX preemption timers. If
vmx_check_nested_events finds any of these conditions pending it will
reflect the corresponding exit into L1. Live migration at this point
would also cause a missed immediate reentry into L2.
After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. When L2 next
resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
values stored at last L2 exit. A repro of this bug showed L2 entering
triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.
When L2 is in PAE paging mode add a call to ept_load_pdptrs before
leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
vcpu->arch.walk_mmu->pdptrs[].
Tested:
kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
Verified that test fails without the fix.
Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
failures.
Signed-off-by: Peter Shier <pshier@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200820230545.2411347-1-pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for Linux 5.9, take #1
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced
(dirty logging, for example)
- Fix tracing output of 64bit values
|
|
Pull drm fixes from Dave Airlie:
"Regular fixes, not much a major amount. One thing though is Laurent
fixed some Kconfig issues, and I'm carrying the rapidio kconfig change
so the drm one for xlnx driver works. He hadn't got a response from
rapidio maintainers.
Otherwise, virtio, sun4i, tve200, ingenic have some fixes, one audio
fix for i915 and a core docs fix.
kconfig:
- rapidio/xlnx kconfig fix
core:
- Documentation fix
i915:
- audio regression fix
virtio:
- Fix double free in virtio
- Fix virtio unblank
- Remove output->enabled from virtio, as it should use crtc_state
sun4i:
- Add missing put_device in sun4i, and other fixes
- Handle sun4i alpha on lowest plane correctly
tv200:
- Fix tve200 enable/disable
ingenic
- Small ingenic fixes"
* tag 'drm-fixes-2020-09-11' of git://anongit.freedesktop.org/drm/drm:
drm/i915: fix regression leading to display audio probe failure on GLK
drm: xlnx: dpsub: Fix DMADEVICES Kconfig dependency
rapidio: Replace 'select' DMAENGINES 'with depends on'
drm/virtio: drop virtio_gpu_output->enabled
drm/sun4i: backend: Disable alpha on the lowest plane on the A20
drm/sun4i: backend: Support alpha property on lowest plane
drm/sun4i: Fix DE2 YVU handling
drm/tve200: Stabilize enable/disable
dma-buf: fence-chain: Document missing dma_fence_chain_init() parameter in kerneldoc
dma-buf: Fix kerneldoc of dma_buf_set_name()
drm/virtio: fix unblank
Documentation: fix dma-buf.rst underline length warning
drm/sun4i: Fix dsi dcs long write function
drm/ingenic: Fix driver not probing when IPU port is missing
drm/ingenic: Fix leak of device_node pointer
drm/sun4i: add missing put_device() call in sun8i_r40_tcon_tv_set_mux()
drm/virtio: Revert "drm/virtio: Call the right shmem helpers"
|
|
Pull rdma fixes from Jason Gunthorpe:
"A number of driver bug fixes and a few recent regressions:
- Several bug fixes for bnxt_re. Crashing, incorrect data reported,
and corruption on new HW
- Memory leak and crash in rxe
- Fix sysfs corruption in rxe if the netdev name is too long
- Fix a crash on error unwind in the new cq_pool code
- Fix kobject panics in rtrs by working device lifetime properly
- Fix a data corruption bug in iser target related to misaligned
buffers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/isert: Fix unaligned immediate-data handling
RDMA/rtrs-srv: Set .release function for rtrs srv device during device init
RDMA/bnxt_re: Remove set but not used variable 'qplib_ctx'
RDMA/core: Fix reported speed and width
RDMA/core: Fix unsafe linked list traversal after failing to allocate CQ
RDMA/bnxt_re: Remove the qp from list only if the qp destroy succeeds
RDMA/bnxt_re: Fix driver crash on unaligned PSN entry address
RDMA/bnxt_re: Restrict the max_gids to 256
RDMA/bnxt_re: Static NQ depth allocation
RDMA/bnxt_re: Fix the qp table indexing
RDMA/bnxt_re: Do not report transparent vlan from QP1
RDMA/mlx4: Read pkey table length instead of hardcoded value
RDMA/rxe: Fix panic when calling kmem_cache_create()
RDMA/rxe: Fix memleak in rxe_mem_init_user
RDMA/rxe: Fix the parent sysfs read when the interface has 15 chars
RDMA/rtrs-srv: Replace device_register with device_initialize and device_add
|
|
Using gcov to collect coverage data for kernels compiled with GCC 10.1
causes random malfunctions and kernel crashes. This is the result of a
changed GCOV_COUNTERS value in GCC 10.1 that causes a mismatch between
the layout of the gcov_info structure created by GCC profiling code and
the related structure used by the kernel.
Fix this by updating the in-kernel GCOV_COUNTERS value. Also re-enable
config GCOV_KERNEL for use with GCC 10.
Reported-by: Colin Ian King <colin.king@canonical.com>
Reported-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Tested-and-Acked-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v5.9
Most of this is various driver specific fixes, none of which are
terribly exciting in themselves, plus one core fix adding and using a
new DAI lookup function to deal with a lockdep warning.
|
|
ath.git patches for v5.10. Major changes:
ath10k
* support SDIO firmware codedumps
* support station specific TID configurations
ath11k
* add support for IPQ6018
|
|
* powercap:
powercap: make documentation reflect code
powercap/intel_rapl: add support for AlderLake
powercap/intel_rapl: add support for RocketLake
powercap/intel_rapl: add support for TigerLake Desktop
|
|
There is no caller in tree, so can remove it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20200909135834.38448-1-yuehaibing@huawei.com
|
|
There is no caller in tree, so can remove it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20200909134533.19604-1-yuehaibing@huawei.com
|
|
If CONFIG_REMOTEPROC was disabled the linking failed with:
ERROR: modpost: "rproc_get_by_phandle" [drivers/net/wireless/ath/ath11k/ath11k.ko] undefined!
Compile tested only.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: 1ff8ed786d5d ("ath11k: use remoteproc only with AHB devices")
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/0101017476e38f40-c4168ac4-c00a-4220-a032-fe17e4a157cb-000000@us-west-2.amazonses.com
|
|
During probe ath11k_init_hw_params() is called from ath11k_core_pre_init()
and is not needed agian in ath11k_core_init().
Tested on: IPQ8074 hw2.0 AHB WLAN.HK.2.4.0.1-00009-QCAHKSWPL_SILICONZ-1
Fixes: 1ff8ed786d5d (ath11k: use remoteproc only with AHB devices)
Signed-off-by: Anilkumar Kolli <akolli@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/010101746d2a40d3-25cd7dbe-c0dd-4fdf-8735-366d7fb40207-000000@us-west-2.amazonses.com
|
|
Adding raw mode tx/rx support. Also, adding support
for software crypto which depends on raw mode.
To enable raw mode tx/rx:
insmod ath11k.ko frame_mode=0
To enable software crypto:
insmod ath11k.ko crypto_mode=1
These modes could be helpful in debugging crypto related issues.
Tested-on: IPQ8074 WLAN.HK.2.1.0.1-01228-QCAHKSWPL_SILICONZ-1
Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org>
Signed-off-by: Venkateswara Naralasetty <vnaralas@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/010101746c6a52d9-18302a2c-0d6d-4057-aa4b-95960c809646-000000@us-west-2.amazonses.com
|
|
IPQ6018 has one 5G and one 2G radio with 2x2,
shares ipq8074 configurations.
Tested on: IPQ6018 hw1.0 AHB WLAN.HK.2.2-02134-QCAHKSWPL_SILICONZ-1
Tested on: IPQ8074 hw2.0 AHB WLAN.HK.2.4.0.1-00009-QCAHKSWPL_SILICONZ-1
Signed-off-by: Anilkumar Kolli <akolli@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/010101746cb68b63-c2bc31ec-a31e-442e-a572-26f4c045c06b-000000@us-west-2.amazonses.com
|
|
Move target CE config and target CE service config to hw_params.
No functional changes.
Tested on: IPQ8074 hw2.0 AHB WLAN.HK.2.4.0.1-00009-QCAHKSWPL_SILICONZ-1
Signed-off-by: Anilkumar Kolli <akolli@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/010101746cb685d9-6bedeccb-29a1-4d32-8664-fcfe7d105f4a-000000@us-west-2.amazonses.com
|
|
Add IPQ6018 wireless driver support,
its based on ath11k driver.
Signed-off-by: Anilkumar Kolli <akolli@codeaurora.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/010101746cb6751a-ca300933-1174-4534-a01b-b1dbf1c1f305-000000@us-west-2.amazonses.com
|
|
For new advertising features, it will be important for userspace to
know the capabilities of the controller and kernel. If the controller
and kernel support extended advertising, we include flags indicating
hardware offloading support and support for setting tx power of adv
instances.
In the future, vendor-specific commands may allow the setting of tx
power in advertising instances, but for now this feature is only
marked available if extended advertising is supported.
This change is manually verified in userspace by ensuring the
advertising manager's supported_flags field is updated with new flags on
hatch chromebook (ext advertising supported).
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Here we make sure we properly report the number of supported
advertising slots when we are using hardware offloading. If no
hardware offloading is available, we default this value to
HCI_MAX_ADV_INSTANCES for use in software rotation as before.
This change has been tested on kukui (no ext adv) and hatch (ext adv)
chromebooks by verifying "SupportedInstances" shows 5 (the default) and
6 (slots supported by controller), respectively.
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
During serdev unregister, hdev->shutdown is called before
proto close. Removing duplicates power OFF call.
Signed-off-by: Venkata Lakshmi Narayana Gubba <gubbaven@codeaurora.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
When HCI_QUIRK_NON_PERSISTENT_SETUP is set by drivers,
it indicates that BT SoC will be completely powered OFF
during BT OFF. On next BT ON firmware must be downloaded
again. Holding UART port open during BT OFF is draining
the battery. Now during BT OFF, UART port is closed if
qurik HCI_QUIRK_NON_PERSISTENT_SETUP is set by clearing
HCI_UART_PROTO_READY proto flag. On next BT ON, UART
port is opened if HCI_UART_PROTO_READY proto flag is cleared.
Signed-off-by: Venkata Lakshmi Narayana Gubba <gubbaven@codeaurora.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
This patch defines new getsockopt options BT_SNDMTU/BT_RCVMTU
for SCO socket to be compatible with other bluetooth sockets.
These new options return the same value as option SCO_OPTIONS
which is already present on existing kernels.
Signed-off-by: Joseph Hwang <josephsih@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Unregister_pm_notifier is a blocking call so suspend tasks should be
cleared beforehand. Otherwise, the notifier will wait for completion
before returning (and we encounter a 2s timeout on resume).
Fixes: 0e9952804ec9c8 (Bluetooth: Clear suspend tasks on unregister)
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
On new Intel platform the device is provided with INT33E3 ID.
Append it to the list.
This will require ACPI_GPIO_QUIRK_ONLY_GPIOIO to be enabled because
the relevant ASL looks like:
UartSerialBusV2 ( ... )
GpioInt ( ... ) { ... }
GpioIo ( ... ) { ... }
which means that first GPIO resource is an interrupt, while we are expecting it
to be reset one (output). Do the same for host-wake because in case of
GpioInt() the platform_get_irq() will do the job and should return correct
Linux IRQ number. That said, host-wake GPIO can only be GpioIo() resource.
While here, drop commas in terminator lines.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
read_adv_mon_features() is leaking memory. Free `rp` before returning.
Fixes: e5e1e7fd470c ("Bluetooth: Add handler of MGMT_OP_READ_ADV_MONITOR_FEATURES")
Reported-and-tested-by: syzbot+f7f6e564f4202d8601c6@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=f7f6e564f4202d8601c6
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Fix sparse warnings:
drivers/bluetooth/btmtksdio.c:499:57: warning: Using plain integer as NULL pointer
drivers/bluetooth/btmtksdio.c:533:57: warning: Using plain integer as NULL pointer
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
There is no need to have list_for_each() followed by list_entry()
when we simply may use list_for_each_entry() directly.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Sparse rightfully complains:
hci_intel.c:696:26: warning: cast to restricted __le16
hci_intel.c:701:26: warning: cast to restricted __le16
hci_intel.c:702:26: warning: cast to restricted __le16
hci_intel.c:703:26: warning: cast to restricted __le16
hci_intel.c:725:26: warning: cast to restricted __le16
hci_intel.c:730:26: warning: cast to restricted __le16
hci_intel.c:731:26: warning: cast to restricted __le16
hci_intel.c:732:26: warning: cast to restricted __le16
because we access non-restricted types with le16_to_cpu().
More confusion is added by using above against u8. On big-endian
architecture we will get all zeroes. I bet it's not what should be
in such case.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Because clk_disable_unprepare already checked
NULL clock parameter, so the additional check is
unnecessary, just remove it.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Neal Cardwell says:
====================
This patch series reorganizes TCP congestion control initialization so that if
EBPF code called by tcp_init_transfer() sets the congestion control algorithm
by calling setsockopt(TCP_CONGESTION) then the TCP stack initializes the
congestion control module immediately, instead of having tcp_init_transfer()
later initialize the congestion control module.
This increases flexibility for the EBPF code that runs at connection
establishment time, and simplifies the code.
This has the following benefits:
(1) This allows CC module customizations made by the EBPF called in
tcp_init_transfer() to persist, and not be wiped out by a later
call to tcp_init_congestion_control() in tcp_init_transfer().
(2) Does not flip the order of EBPF and CC init, to avoid causing bugs
for existing code upstream that depends on the current order.
(3) Does not cause 2 initializations for for CC in the case where the
EBPF called in tcp_init_transfer() wants to set the CC to a new CC
algorithm.
(4) Allows follow-on simplifications to the code in net/core/filter.c
and net/ipv4/tcp_cong.c, which currently both have some complexity
to special-case CC initialization to avoid double CC
initialization if EBPF sets the CC.
changes in v2:
o rebase onto bpf-next
o add another follow-on simplification suggested by Martin KaFai Lau:
"tcp: simplify tcp_set_congestion_control() load=false case"
changes in v3:
o no change in commits
o resent patch series from @gmail.com, since mail from ncardwell@google.com
stopped being accepted at netdev@vger.kernel.org mid-way through processing
the v2 patch series (between patches 2 and 3), confusing patchwork about
which patches belonged to the v2 patch series
====================
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Simplify tcp_set_congestion_control() by removing the initialization
code path for the !load case.
There are only two call sites for tcp_set_congestion_control(). The
EBPF call site is the only one that passes load=false; it also passes
cap_net_admin=true. Because of that, the exact same behavior can be
achieved by removing the special if (!load) branch of the logic. Both
before and after this commit, the EBPF case will call
bpf_try_module_get(), and if that succeeds then call
tcp_reinit_congestion_control() or if that fails then return EBUSY.
Note that this returns the logic to a structure very similar to the
structure before:
commit 91b5b21c7c16 ("bpf: Add support for changing congestion control")
except that the CAP_NET_ADMIN status is passed in as a function
argument.
This clean-up was suggested by Martin KaFai Lau.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Kevin Yang <yyd@google.com>
|
|
Now that the previous patches have removed the code that uses the
flags argument to _bpf_setsockopt(), we can remove that argument.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
|
|
Now that the previous patches ensure that all call sites for
tcp_set_congestion_control() want to initialize congestion control, we
can simplify tcp_set_congestion_control() by removing the reinit
argument and the code to support it.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
|
|
Now that the previous patch ensures we don't initialize the congestion
control twice, when EBPF sets the congestion control algorithm at
connection establishment we can simplify the code by simply
initializing the congestion control module at that time.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
|
|
Change tcp_init_transfer() to only initialize congestion control if it
has not been initialized already.
With this new approach, we can arrange things so that if the EBPF code
sets the congestion control by calling setsockopt(TCP_CONGESTION) then
tcp_init_transfer() will not re-initialize the CC module.
This is an approach that has the following beneficial properties:
(1) This allows CC module customizations made by the EBPF called in
tcp_init_transfer() to persist, and not be wiped out by a later
call to tcp_init_congestion_control() in tcp_init_transfer().
(2) Does not flip the order of EBPF and CC init, to avoid causing bugs
for existing code upstream that depends on the current order.
(3) Does not cause 2 initializations for for CC in the case where the
EBPF called in tcp_init_transfer() wants to set the CC to a new CC
algorithm.
(4) Allows follow-on simplifications to the code in net/core/filter.c
and net/ipv4/tcp_cong.c, which currently both have some complexity
to special-case CC initialization to avoid double CC
initialization if EBPF sets the CC.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
|
|
Remove link to litmus tests that didn't make it to upstream. Fix ringbuf
benchmark link.
I wasn't able to test this with `make htmldocs`, unfortunately, because of
Sphinx dependencies. But bench_ringbufs.c path is certainly correct now.
Fixes: 97abb2b39682 ("docs/bpf: Add BPF ring buffer design notes")
Reported-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200910225245.2896991-1-andriin@fb.com
|
|
The "SEE ALSO" sections of bpftool's manual pages refer to bpf(2),
bpf-helpers(7), then all existing bpftool man pages (save the current
one).
This leads to nearly-identical lists being duplicated in all manual
pages. Ideally, when a new page is created, all lists should be updated
accordingly, but this has led to omissions and inconsistencies multiple
times in the past.
Let's take it out of the RST files and generate the "SEE ALSO" sections
automatically in the Makefile when generating the man pages. The lists
are not really useful in the RST anyway because all other pages are
available in the same directory.
v3:
- Fix conflict with a previous patchset that introduced RST2MAN_OPTS
variable passed to rst2man.
v2:
- Use "echo -n" instead of "printf" in Makefile, to avoid any risk of
passing a format string directly to the command.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200910203935.25304-1-quentin@isovalent.com
|
|
This should be "current" not "skb".
Fixes: c6b5fb8690fa ("bpf: add documentation for eBPF helpers (42-50)")
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/bpf/20200910203314.70018-1-songliubraving@fb.com
|
|
When tweaking llvm optimizations, I found that selftest build failed
with the following error:
libbpf: elf: skipping unrecognized data section(6) .rodata.str1.1
libbpf: prog 'sysctl_tcp_mem': bad map relo against '.L__const.is_tcp_mem.tcp_mem_name'
in section '.rodata.str1.1'
Error: failed to open BPF object file: Relocation failed
make: *** [/work/net-next/tools/testing/selftests/bpf/test_sysctl_prog.skel.h] Error 255
make: *** Deleting file `/work/net-next/tools/testing/selftests/bpf/test_sysctl_prog.skel.h'
The local string constant "tcp_mem_name" is put into '.rodata.str1.1' section
which libbpf cannot handle. Using untweaked upstream llvm, "tcp_mem_name"
is completely inlined after loop unrolling.
Commit 7fb5eefd7639 ("selftests/bpf: Fix test_sysctl_loop{1, 2}
failure due to clang change") solved a similar problem by defining
the string const as a global. Let us do the same here
for test_sysctl_prog.c so it can weather future potential llvm changes.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200910202718.956042-1-yhs@fb.com
|
|
On non-SMP kernels __per_cpu_start is not 0, so look it up in kallsyms.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200910171336.3161995-1-iii@linux.ibm.com
|
|
Fixes the following warning when using W=1 to build kernel:
drivers/net/ethernet/smsc/smc91x.c: In function ‘smc_phy_configure’:
drivers/net/ethernet/smsc/smc91x.c:1039:6: warning: variable ‘status’ set but not used [-Wunused-but-set-variable]
int status;
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As Alexei points out, struct bpf_sk_lookup_kern has two 4-byte holes.
This leads to suboptimal instructions being generated (IPv4, x86):
1372 struct bpf_sk_lookup_kern ctx = {
0xffffffff81b87f30 <+624>: xor %eax,%eax
0xffffffff81b87f32 <+626>: mov $0x6,%ecx
0xffffffff81b87f37 <+631>: lea 0x90(%rsp),%rdi
0xffffffff81b87f3f <+639>: movl $0x110002,0x88(%rsp)
0xffffffff81b87f4a <+650>: rep stos %rax,%es:(%rdi)
0xffffffff81b87f4d <+653>: mov 0x8(%rsp),%eax
0xffffffff81b87f51 <+657>: mov %r13d,0x90(%rsp)
0xffffffff81b87f59 <+665>: incl %gs:0x7e4970a0(%rip)
0xffffffff81b87f60 <+672>: mov %eax,0x8c(%rsp)
0xffffffff81b87f67 <+679>: movzwl 0x10(%rsp),%eax
0xffffffff81b87f6c <+684>: mov %ax,0xa8(%rsp)
0xffffffff81b87f74 <+692>: movzwl 0x38(%rsp),%eax
0xffffffff81b87f79 <+697>: mov %ax,0xaa(%rsp)
Fix this by moving around sport and dport. pahole confirms there
are no more holes:
struct bpf_sk_lookup_kern {
u16 family; /* 0 2 */
u16 protocol; /* 2 2 */
__be16 sport; /* 4 2 */
u16 dport; /* 6 2 */
struct {
__be32 saddr; /* 8 4 */
__be32 daddr; /* 12 4 */
} v4; /* 8 8 */
struct {
const struct in6_addr * saddr; /* 16 8 */
const struct in6_addr * daddr; /* 24 8 */
} v6; /* 16 16 */
struct sock * selected_sk; /* 32 8 */
bool no_reuseport; /* 40 1 */
/* size: 48, cachelines: 1, members: 8 */
/* padding: 7 */
/* last cacheline: 48 bytes */
};
The assembly also doesn't contain the pesky rep stos anymore:
1372 struct bpf_sk_lookup_kern ctx = {
0xffffffff81b87f60 <+624>: movzwl 0x10(%rsp),%eax
0xffffffff81b87f65 <+629>: movq $0x0,0xa8(%rsp)
0xffffffff81b87f71 <+641>: movq $0x0,0xb0(%rsp)
0xffffffff81b87f7d <+653>: mov %ax,0x9c(%rsp)
0xffffffff81b87f85 <+661>: movzwl 0x38(%rsp),%eax
0xffffffff81b87f8a <+666>: movq $0x0,0xb8(%rsp)
0xffffffff81b87f96 <+678>: mov %ax,0x9e(%rsp)
0xffffffff81b87f9e <+686>: mov 0x8(%rsp),%eax
0xffffffff81b87fa2 <+690>: movq $0x0,0xc0(%rsp)
0xffffffff81b87fae <+702>: movl $0x110002,0x98(%rsp)
0xffffffff81b87fb9 <+713>: mov %eax,0xa0(%rsp)
0xffffffff81b87fc0 <+720>: mov %r13d,0xa4(%rsp)
1: https://lore.kernel.org/bpf/CAADnVQKE6y9h2fwX6OS837v-Uf+aBXnT_JXiN_bbo2gitZQ3tA@mail.gmail.com/
Fixes: e9ddbb7707ff ("bpf: Introduce SK_LOOKUP program type with a dedicated attach point")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20200910110248.198326-1-lmb@cloudflare.com
|
|
There is no support for creating maps of types array-of-map or
hash-of-map in bpftool. This is because the kernel needs an inner_map_fd
to collect metadata on the inner maps to be supported by the new map,
but bpftool does not provide a way to pass this file descriptor.
Add a new optional "inner_map" keyword that can be used to pass a
reference to a map, retrieve a fd to that map, and pass it as the
inner_map_fd.
Add related documentation and bash completion. Note that we can
reference the inner map by its name, meaning we can have several times
the keyword "name" with different meanings (mandatory outer map name,
and possibly a name to use to find the inner_map_fd). The bash
completion will offer it just once, and will not suggest "name" on the
following command:
# bpftool map create /sys/fs/bpf/my_outer_map type hash_of_maps \
inner_map name my_inner_map [TAB]
Fixing that specific case seems too convoluted. Completion will work as
expected, however, if the outer map name comes first and the "inner_map
name ..." is passed second.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200910102652.10509-4-quentin@isovalent.com
|
|
When dumping outer maps or prog_array maps, and on lookup failure,
bpftool simply skips the entry with no error message. This is because
the kernel returns non-zero when no value is found for the provided key,
which frequently happen for those maps if they have not been filled.
When such a case occurs, errno is set to ENOENT. It seems unlikely we
could receive other error codes at this stage (we successfully retrieved
map info just before), but to be on the safe side, let's skip the entry
only if errno was ENOENT, and not for the other errors.
v3: New patch
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200910102652.10509-3-quentin@isovalent.com
|
|
The function used to dump a map entry in bpftool is a bit difficult to
follow, as a consequence to earlier refactorings. There is a variable
("num_elems") which does not appear to be necessary, and the error
handling would look cleaner if moved to its own function. Let's clean it
up. No functional change.
v2:
- v1 was erroneously removing the check on fd maps in an attempt to get
support for outer map dumps. This is already working. Instead, v2
focuses on cleaning up the dump_map_elem() function, to avoid
similar confusion in the future.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200910102652.10509-2-quentin@isovalent.com
|