Age | Commit message (Collapse) | Author |
|
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.
A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.
Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.
This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.
This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.
Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free assymetry,
when BH handler and user thread do not run on same cpu or
NUMA node.
One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)
Tested:
100Gbit NIC
Max throughput for one TCP_STREAM flow, over 10 runs
MTU : 1500
Before: 55 Gbit
After: 66 Gbit
MTU : 4096+(headers)
Before: 82 Gbit
After: 95 Gbit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCP uses sk_eat_skb() when skbs can be removed from receive queue.
However, the call to skb_orphan() from __kfree_skb() incurs
an indirect call so sock_rfee(), which is more expensive than
a direct call, especially for CONFIG_RETPOLINE=y.
Add tcp_eat_recv_skb() function to make the call before
__kfree_skb().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use some unlikely() hints in the fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp_poll() and tcp_ioctl() are reading tp->urg_data without socket lock
owned.
Also, it is faster to first check tp->urg_data in tcp_poll(),
then tp->urg_seq == tp->copied_seq, because tp->urg_seq is
located in a different/cold cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp_segs_in() can be called from BH, while socket spinlock
is held but socket owned by user, eventually reading these
fields from tcp_get_info()
Found by code inspection, no need to backport this patch
to older kernels.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use INDIRECT_CALL_INET() to avoid an indirect call
when/if CONFIG_RETPOLINE=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When reading large chunks of data, incoming packets might
be added to the backlog from BH.
tcp recvmsg() detects the backlog queue is not empty, and uses
a release_sock()/lock_sock() pair to process this backlog.
We now have __sk_flush_backlog() to perform this
a bit faster.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp_memory_allocated and tcp_sockets_allocated often share
a common cache line, source of false sharing.
Also take care of udp_memory_allocated and mptcp_sockets_allocated.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
(struct proto)->sk_forward_alloc is currently only used by MPTCP.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move sk_bind_phc next to sk_peer_lock to fill a hole.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
gso_size can be moved after tclass, to use an existing hole.
(8 bytes saved on 64bit arches)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of using a full netdev_features_t, we can use a single bit,
as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
sk->sk_route_cap.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We were only using one bit, and we can replace it by sk_is_tcp()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move sk_is_tcp() to include/net/sock.h and use it where we can.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For TCP flows, inet6_sk(sk)->saddr has the same value
than sk->sk_v6_rcv_saddr.
Using sk->sk_v6_rcv_saddr increases data locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For some reason, I forgot to change __tcp_v6_send_check() at
the same time I removed (ip_summed == CHECKSUM_PARTIAL) check
in __tcp_v4_send_check()
Fixes: 98be9b12096f ("tcp: remove dead code after CHECKSUM_PARTIAL adoption")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values
are not used. Defer their access to the point we need them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A race condition is triggered when usermode control is given to
userspace before the kernel's MSFT query responds, resulting in an
unexpected response to userspace's reset command.
Issue can be observed in btmon:
< HCI Command: Vendor (0x3f|0x001e) plen 2 #3 [hci0]
05 01 ..
@ USER Open: bt_stack_manage (privileged) version 2.22 {0x0002} [hci0]
< HCI Command: Reset (0x03|0x0003) plen 0 #4 [hci0]
> HCI Event: Command Complete (0x0e) plen 5 #5 [hci0]
Vendor (0x3f|0x001e) ncmd 1
Status: Command Disallowed (0x0c)
05 .
> HCI Event: Command Complete (0x0e) plen 4 #6 [hci0]
Reset (0x03|0x0003) ncmd 2
Status: Success (0x00)
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Jesse Melhuish <melhuishj@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Coverity Scan report:
[...]
*** CID 1493985: Uninitialized variables (UNINIT)
/net/bluetooth/hci_event.c: 4535 in hci_sync_conn_complete_evt()
4529
4530 /* Notify only in case of SCO over HCI transport data path which
4531 * is zero and non-zero value shall be non-HCI transport data path
4532 */
4533 if (conn->codec.data_path == 0) {
4534 if (hdev->notify)
>>> CID 1493985: Uninitialized variables (UNINIT)
>>> Using uninitialized value "notify_evt" when calling "*hdev->notify".
4535 hdev->notify(hdev, notify_evt);
4536 }
4537
4538 hci_connect_cfm(conn, ev->status);
4539 if (ev->status)
4540 hci_conn_del(conn);
[...]
Although only btusb uses air_mode, and he only handles HCI_NOTIFY_ENABLE_SCO_CVSD
and HCI_NOTIFY_ENABLE_SCO_TRANSP, there is still a very small chance that
ev->air_mode is not equal to 0x2 and 0x3, but notify_evt is initialized to
HCI_NOTIFY_ENABLE_SCO_CVSD or HCI_NOTIFY_ENABLE_SCO_TRANSP. the context is
maybe not correct.
Let us directly use the required function instead of re-initializing it,
so as to restore the original logic and make the code more correct.
Addresses-Coverity: ("Uninitialized variables")
Fixes: f4f9fa0c07bb ("Bluetooth: Allow usb to auto-suspend when SCO use non-HCI transport")
Suggested-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Syzbot reported slab-out-of-bounds read in hci_le_adv_report_evt(). The
problem was in missing validaion check.
We should check if data is not malicious and we can read next data block.
If we won't check ptr validness, code can read a way beyond skb->end and
it can cause problems, of course.
Fixes: e95beb414168 ("Bluetooth: hci_le_adv_report_evt code refactoring")
Reported-and-tested-by: syzbot+e3fcb9c4f3c2a931dc40@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
it is erroneous to calculate padding by subtracting length of type
indication from skb->len, it will cause data analysis error for
alignment which is greater than 1, so fixed by adding length of type
indication with skb->len.
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
In vcpu_load_eoi_exitmap(), currently the eoi_exit_bitmap[4] array is
initialized only when Hyper-V context is available, in other path it is
just passed to kvm_x86_ops.load_eoi_exitmap() directly from on the stack,
which would cause unexpected interrupt delivery/handling issues, e.g. an
*old* linux kernel that relies on PIT to do clock calibration on KVM might
randomly fail to boot.
Fix it by passing ioapic_handled_vectors to load_eoi_exitmap() when Hyper-V
context is not available.
Fixes: f2bc14b69c38 ("KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context")
Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Huang Le <huangle1@jd.com>
Message-Id: <62115b277dab49ea97da5633f8522daf@jd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Change memslot_modification_stress_test to use perf_test_destroy_vm
instead of manually calling ucall_uninit and kvm_vm_free.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211111001257.1446428-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Thread creation requires taking the mmap_sem in write mode, which causes
vCPU threads running in guest mode to block while they are populating
memory. Fix this by waiting for all vCPU threads to be created and start
running before entering guest mode on any one vCPU thread.
This substantially improves the "Populate memory time" when using 1GiB
pages since it allows all vCPUs to zero pages in parallel rather than
blocking because a writer is waiting (which is waiting for another vCPU
that is busy zeroing a 1GiB page).
Before:
$ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
...
Populate memory time: 52.811184013s
After:
$ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb
...
Populate memory time: 10.204573342s
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111001257.1446428-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move vCPU thread creation and joining to common helper functions. This
is in preparation for the next commit which ensures that all vCPU
threads are fully created before entering guest mode on any one
vCPU.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211111001257.1446428-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Start at iteration 0 instead of -1 to avoid having to initialize
vcpu_last_completed_iteration when setting up vCPU threads. This
simplifies the next commit where we move vCPU thread initialization
out to a common helper.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111001257.1446428-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Copy perf_test_args to the guest during VM creation instead of relying on
the caller to do so at their leisure. Ideally, tests wouldn't even be
able to modify perf_test_args, i.e. they would have no motivation to do
the sync, but enforcing that is arguably a net negative for readability.
No functional change intended.
[Set wr_fract=1 by default and add helper to override it since the new
access_tracking_perf_test needs to set it dynamically.]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211111000310.1435032-13-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fill the per-vCPU args when creating the perf_test VM instead of having
the caller do so. This helps ensure that any adjustments to the number
of pages (and thus vcpu_memory_bytes) are reflected in the per-VM args.
Automatically filling the per-vCPU args will also allow a future patch
to do the sync to the guest during creation.
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Updated access_tracking_perf_test as well.]
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211111000310.1435032-12-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use the already computed guest_num_pages when creating the so called
extra VM pages for a perf test, and add a comment explaining why the
pages are allocated as extra pages.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-11-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Remove perf_test_args.host_page_size and instead use getpagesize() so
that it's somewhat obvious that, for tests that care about the host page
size, they care about the system page size, not the hardware page size,
e.g. that the logic is unchanged if hugepages are in play.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-10-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move the per-VM GPA into perf_test_args instead of storing it as a
separate global variable. It's not obvious that guest_test_phys_mem
holds a GPA, nor that it's connected/coupled with per_vcpu->gpa.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-9-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Grab the per-vCPU GPA and number of pages from perf_util in the demand
paging test instead of duplicating perf_util's calculations.
Note, this may or may not result in a functional change. It's not clear
that the test's calculations are guaranteed to yield the same value as
perf_util, e.g. if guest_percpu_mem_size != vcpu_args->pages.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-8-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Capture the per-vCPU GPA in perf_test_vcpu_args so that tests can get
the GPA without having to calculate the GPA on their own.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use 'pta' as a local pointer to the global perf_tests_args in order to
shorten line lengths and make the code borderline readable.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Assert that the GPA for a memslot backed by a hugepage is aligned to
the hugepage size and fix perf_test_util accordingly. Lack of GPA
alignment prevents KVM from backing the guest with hugepages, e.g. x86's
write-protection of hugepages when dirty logging is activated is
otherwise not exercised.
Add a comment explaining that guest_page_size is for non-huge pages to
try and avoid confusion about what it actually tracks.
Cc: Ben Gardon <bgardon@google.com>
Cc: Yanan Wang <wangyanan55@huawei.com>
Cc: Andrew Jones <drjones@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Used get_backing_src_pagesz() to determine alignment dynamically.]
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Manually padding and aligning the mmap region is only needed when using
THP. When using HugeTLB, mmap will always return an address aligned to
the HugeTLB page size. Add a comment to clarify this and assert the mmap
behavior for HugeTLB.
[Removed requirement that HugeTLB mmaps must be padded per Yanan's
feedback and added assertion that mmap returns aligned addresses
when using HugeTLB.]
Cc: Ben Gardon <bgardon@google.com>
Cc: Yanan Wang <wangyanan55@huawei.com>
Cc: Andrew Jones <drjones@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Refactor align() to work with non-pointers and split into separate
helpers for aligning up vs. down. Add align_ptr_up() for use with
pointers. Expose all helpers so that they can be used by tests and/or
other utilities. The align_down() helper in particular will be used to
ensure gpa alignment for hugepages.
No functional change intended.
[Added sepearate up/down helpers and replaced open-coded alignment
bit math throughout the KVM selftests.]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211111000310.1435032-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Explicitly state the indices when populating vm_guest_mode_params to
make it marginally easier to visualize what's going on.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
[Added indices for new guest modes.]
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211111000310.1435032-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When I first looked at this, there was no support for guest exception
handling in the KVM selftests. In fact it was merged into 5.10 before
the Xen support got merged in 5.11, and I could have used it from the
start.
Hook it up now, to exercise the Xen upcall delivery. I'm about to make
things a bit more interesting by handling the full 2level event channel
stuff in-kernel on top of the basic vector injection that we already
have, and I'll want to build more tests on top.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
It turns out the skb's in sock receive queue could have bad checksums, as
both ->poll() and ->recvmsg() validate checksums. We have to do the same
for ->read_sock() path too before they are redirected in sockmap.
Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211115044006.26068-1-xiyou.wangcong@gmail.com
|
|
Provide a simple implementation of clk_get_parent() in the
lantiq subarch so that callers of it will build without errors.
Fixes this build error:
ERROR: modpost: "clk_get_parent" [drivers/iio/adc/ingenic-adc.ko] undefined!
Fixes: 171bb2f19ed6 ("MIPS: Lantiq: Add initial support for Lantiq SoCs")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: linux-mips@vger.kernel.org
Cc: John Crispin <john@phrozen.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: linux-iio@vger.kernel.org
Cc: Russell King <linux@armlinux.org.uk>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: John Crispin <john@phrozen.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
BCM63XX selects HAVE_LEGACY_CLK but does not provide/support
clk_get_parent(), so add a simple implementation of that
function so that callers of it will build without errors.
Fixes these build errors:
mips-linux-ld: drivers/iio/adc/ingenic-adc.o: in function `jz4770_adc_init_clk_div':
ingenic-adc.c:(.text+0xe4): undefined reference to `clk_get_parent'
mips-linux-ld: drivers/iio/adc/ingenic-adc.o: in function `jz4725b_adc_init_clk_div':
ingenic-adc.c:(.text+0x1b8): undefined reference to `clk_get_parent'
Fixes: e7300d04bd08 ("MIPS: BCM63xx: Add support for the Broadcom BCM63xx family of SOCs." )
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Artur Rojek <contact@artur-rojek.eu>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: linux-mips@vger.kernel.org
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: linux-iio@vger.kernel.org
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: bcm-kernel-feedback-list@broadcom.com
Cc: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
In the case where fw_getenv returns an error when fetching values
for ememsizea and memsize then variable phys_memsize is not assigned
a variable and will be uninitialized on a zero check of phys_memsize.
Fix this by initializing phys_memsize to zero.
Cleans up cppcheck error:
arch/mips/generic/yamon-dt.c:100:7: error: Uninitialized variable: phys_memsize [uninitvar]
Fixes: f41d2430bbd6 ("MIPS: generic/yamon-dt: Support > 256MB of RAM")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
Wire up the futex_waitv syscall.
Fix Build warning: #warning syscall futex_waitv not implemented [-Wcpp]
Signed-off-by: Wang Haojun <wanghaojun@loongson.cn>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
Commit a23740ec43ba ("bpf: Track contents of read-only maps as scalars") is
checking whether maps are read-only both from BPF program side and user space
side, and then, given their content is constant, reading out their data via
map->ops->map_direct_value_addr() which is then subsequently used as known
scalar value for the register, that is, it is marked as __mark_reg_known()
with the read value at verification time. Before a23740ec43ba, the register
content was marked as an unknown scalar so the verifier could not make any
assumptions about the map content.
The current implementation however is prone to a TOCTOU race, meaning, the
value read as known scalar for the register is not guaranteed to be exactly
the same at a later point when the program is executed, and as such, the
prior made assumptions of the verifier with regards to the program will be
invalid which can cause issues such as OOB access, etc.
While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
specified at map creation time, the map->frozen property is initially set to
false for the map given the map value needs to be populated, e.g. for global
data sections. Once complete, the loader "freezes" the map from user space
such that no subsequent updates/deletes are possible anymore. For the rest
of the lifetime of the map, this freeze one-time trigger cannot be undone
anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
cmd calls which would update/delete map entries will be rejected with -EPERM
since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
means that pending update/delete map entries must still complete before this
guarantee is given. This corner case is not an issue for loaders since they
create and prepare such program private map in successive steps.
However, a malicious user is able to trigger this TOCTOU race in two different
ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
used to expand the competition interval, so that map_update_elem() can modify
the contents of the map after map_freeze() and bpf_prog_load() were executed.
This works, because userfaultfd halts the parallel thread which triggered a
map_update_elem() at the time where we copy key/value from the user buffer and
this already passed the FMODE_CAN_WRITE capability test given at that time the
map was not "frozen". Then, the main thread performs the map_freeze() and
bpf_prog_load(), and once that had completed successfully, the other thread
is woken up to complete the pending map_update_elem() which then changes the
map content. For ii) the idea of the batched update is similar, meaning, when
there are a large number of updates to be processed, it can increase the
competition interval between the two. It is therefore possible in practice to
modify the contents of the map after executing map_freeze() and bpf_prog_load().
One way to fix both i) and ii) at the same time is to expand the use of the
map's map->writecnt. The latter was introduced in fc9702273e2e ("bpf: Add mmap()
support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2e2 ("bpf:
Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
the rationale to make a writable mmap()'ing of a map mutually exclusive with
read-only freezing. The counter indicates writable mmap() mappings and then
prevents/fails the freeze operation. Its semantics can be expanded beyond
just mmap() by generally indicating ongoing write phases. This would essentially
span any parallel regular and batched flavor of update/delete operation and
then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
last pending writes have completed via bpf_map_write_active() test. Once the
map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
only then we are really guaranteed to use the map's data as known constants.
For map->frozen being set and pending writes in process of still being completed
we fall back to marking that register as unknown scalar so we don't end up
making assumptions about it. With this, both TOCTOU reproducers from i) and
ii) are fixed.
Note that the map->writecnt has been converted into a atomic64 in the fix in
order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
the freeze_mutex over entire map update/delete operations in syscall side
would not be possible due to then causing everything to be serialized.
Similarly, something like synchronize_rcu() after setting map->frozen to wait
for update/deletes to complete is not possible either since it would also
have to span the user copy which can sleep. On the libbpf side, this won't
break d66562fba1ce ("libbpf: Add BPF object skeleton support") as the
anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
mmap()-ed memory where for .rodata it's non-writable.
Fixes: a23740ec43ba ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Since recent Kbuild updates we no longer include files from compiler
directories. However, samples/bpf/hbm_kern.h hasn't been tuned for
this (LLVM 13):
CLANG-bpf samples/bpf/hbm_out_kern.o
In file included from samples/bpf/hbm_out_kern.c:55:
samples/bpf/hbm_kern.h:12:10: fatal error: 'stddef.h' file not found
^~~~~~~~~~
1 error generated.
CLANG-bpf samples/bpf/hbm_edt_kern.o
In file included from samples/bpf/hbm_edt_kern.c:53:
samples/bpf/hbm_kern.h:12:10: fatal error: 'stddef.h' file not found
^~~~~~~~~~
1 error generated.
It is enough to just drop both stdbool.h and stddef.h from includes
to fix those.
Fixes: 04e85bbf71c9 ("isystem: delete global -isystem compile option")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://lore.kernel.org/bpf/20211115130741.3584-1-alexandr.lobakin@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Dmitrii Banshchikov says:
====================
Various locking issues are possible with bpf_ktime_get_coarse_ns() and
bpf_timer_* set of helpers.
syzbot found a locking issue with bpf_ktime_get_coarse_ns() helper executed in
BPF_PROG_TYPE_PERF_EVENT prog type - [1]. The issue is possible because the
helper uses non fast version of time accessor that isn't safe for any context.
The helper was added because it provided performance benefits in comparison to
bpf_ktime_get_ns() helper.
A similar locking issue is possible with bpf_timer_* set of helpers when used
in tracing progs.
The solution is to restrict use of the helpers in tracing progs.
In the [1] discussion it was stated that bpf_spin_lock related helpers shall
also be excluded for tracing progs. The verifier has a compatibility check
between a map and a program. If a tracing program tries to use a map which
value has struct bpf_spin_lock the verifier fails that is why bpf_spin_lock is
already restricted.
Patch 1 restricts helpers
Patch 2 adds tests
v1 -> v2:
* Limit the helpers via func proto getters instead of allowed callback
* Add note about helpers' restrictions to linux/bpf.h
* Add Fixes tag
* Remove extra \0 from btf_str_sec
* Beside asm tests add prog tests
* Trim CC
1. https://lore.kernel.org/all/00000000000013aebd05cff8e064@google.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch adds tests that bpf_ktime_get_coarse_ns(), bpf_timer_* and
bpf_spin_lock()/bpf_spin_unlock() helpers are forbidden in tracing progs
as their use there may result in various locking issues.
Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211113142227.566439-3-me@ubique.spb.ru
|
|
Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing
progs may result in locking issues.
bpf_ktime_get_coarse_ns() uses ktime_get_coarse_ns() time accessor that
isn't safe for any context:
======================================================
WARNING: possible circular locking dependency detected
5.15.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.4/14877 is trying to acquire lock:
ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
but task is already holding lock:
ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&obj_hash[i].lock){-.-.}-{2:2}:
lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162
__debug_object_init+0xd9/0x1860 lib/debugobjects.c:569
debug_hrtimer_init kernel/time/hrtimer.c:414 [inline]
debug_init kernel/time/hrtimer.c:468 [inline]
hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592
ntp_init_cmos_sync kernel/time/ntp.c:676 [inline]
ntp_init+0xa1/0xad kernel/time/ntp.c:1095
timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639
start_kernel+0x267/0x56e init/main.c:1030
secondary_startup_64_no_verify+0xb1/0xbb
-> #0 (tk_core.seq.seqcount){----}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3051 [inline]
check_prevs_add kernel/locking/lockdep.c:3174 [inline]
validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789
__lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015
lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103
ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
ktime_get_coarse include/linux/timekeeping.h:120 [inline]
ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline]
____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline]
bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171
bpf_prog_a99735ebafdda2f1+0x10/0xb50
bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline]
__bpf_prog_run include/linux/filter.h:626 [inline]
bpf_prog_run include/linux/filter.h:633 [inline]
BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline]
trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127
perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708
perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39
trace_lock_release+0x128/0x150 include/trace/events/lock.h:58
lock_release+0x82/0x810 kernel/locking/lockdep.c:5636
__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline]
_raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194
debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline]
debug_deactivate kernel/time/hrtimer.c:481 [inline]
__run_hrtimer kernel/time/hrtimer.c:1653 [inline]
__hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749
hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
__sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103
sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097
asm_sysvec_apic_timer_interrupt+0x12/0x20
__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
_raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194
try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118
wake_up_process kernel/sched/core.c:4200 [inline]
wake_up_q+0x9a/0xf0 kernel/sched/core.c:953
futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184
do_futex+0x367/0x560 kernel/futex/syscalls.c:127
__do_sys_futex kernel/futex/syscalls.c:199 [inline]
__se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
There is a possible deadlock with bpf_timer_* set of helpers:
hrtimer_start()
lock_base();
trace_hrtimer...()
perf_event()
bpf_run()
bpf_timer_start()
hrtimer_start()
lock_base() <- DEADLOCK
Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in
BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT
and BPF_PROG_TYPE_RAW_TRACEPOINT prog types.
Fixes: d05512618056 ("bpf: Add bpf_ktime_get_coarse_ns helper")
Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Reported-by: syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com
Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru
|
|
There were several cases where validate() would return bogus supported
modes with unusual combinations of interfaces and capabilities. For
example, if state->interface was 10GBASER and the macb had HIGH_SPEED
and PCS but not GIGABIT MODE, then 10/100 modes would be set anyway. In
another case, SGMII could be enabled even if the mac was not a GEM
(despite this being checked for later on in mac_config()). These
inconsistencies make it difficult to refactor this function cleanly.
There is still the open question of what exactly the requirements for
SGMII and 10GBASER are, and what SGMII actually supports. If someone
from Cadence (or anyone else with access to the GEM/MACB datasheet)
could comment on this, it would be greatly appreciated. In particular,
what is supported by Cadence vs. vendor extension/limitation?
To address this, the current logic is split into three parts. First, we
determine what we support, then we eliminate unsupported interfaces, and
finally we set the appropriate link modes. There is still some cruft
related to NA, but this can be removed in a future patch.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Reviewed-by: Parshuram Thombare <pthombar@cadence.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20211112190400.1937855-1-sean.anderson@seco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|