linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2019-07-31	xen: avoid link error on ARM	Arnd Bergmann
	Building the privcmd code as a loadable module on ARM, we get a link error due to the private cache management functions: ERROR: "__sync_icache_dcache" [drivers/xen/xen-privcmd.ko] undefined! Move the code into a new that is always built in when Xen is enabled, as suggested by Juergen Gross and Boris Ostrovsky. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-31	xen/gntdev.c: Replace vm_map_pages() with vm_map_pages_zero()	Souptick Joarder
	'commit df9bde015a72 ("xen/gntdev.c: convert to use vm_map_pages()")' breaks gntdev driver. If vma->vm_pgoff > 0, vm_map_pages() will: - use map->pages starting at vma->vm_pgoff instead of 0 - verify map->count against vma_pages()+vma->vm_pgoff instead of just vma_pages(). In practice, this breaks using a single gntdev FD for mapping multiple grants. relevant strace output: [pid 857] ioctl(7, IOCTL_GNTDEV_MAP_GRANT_REF, 0x7ffd3407b6d0) = 0 [pid 857] mmap(NULL, 4096, PROT_READ\|PROT_WRITE, MAP_SHARED, 7, 0) = 0x777f1211b000 [pid 857] ioctl(7, IOCTL_GNTDEV_SET_UNMAP_NOTIFY, 0x7ffd3407b710) = 0 [pid 857] ioctl(7, IOCTL_GNTDEV_MAP_GRANT_REF, 0x7ffd3407b6d0) = 0 [pid 857] mmap(NULL, 4096, PROT_READ\|PROT_WRITE, MAP_SHARED, 7, 0x1000) = -1 ENXIO (No such device or address) details here: https://github.com/QubesOS/qubes-issues/issues/5199 The reason is -> ( copying Marek's word from discussion) vma->vm_pgoff is used as index passed to gntdev_find_map_index. It's basically using this parameter for "which grant reference to map". map struct returned by gntdev_find_map_index() describes just the pages to be mapped. Specifically map->pages[0] should be mapped at vma->vm_start, not vma->vm_start+vma->vm_pgoff*PAGE_SIZE. When trying to map grant with index (aka vma->vm_pgoff) > 1, __vm_map_pages() will refuse to map it because it will expect map->count to be at least vma_pages(vma)+vma->vm_pgoff, while it is exactly vma_pages(vma). Converting vm_map_pages() to use vm_map_pages_zero() will fix the problem. Marek has tested and confirmed the same. Cc: stable@vger.kernel.org # v5.2+ Fixes: df9bde015a72 ("xen/gntdev.c: convert to use vm_map_pages()") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-31	drm/amd/powerplay: enable SW SMU reset functionality	Evan Quan
	Move SMU irq handler register to sw_init as that's totally software related. Otherwise, it will prevent SMU reset working. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31	drm/amd/powerplay: fix null pointer dereference around dpm state relates	Evan Quan
	DPM state relates are not supported on the new SW SMU ASICs. But still it's not OK to trigger null pointer dereference on accessing them. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31	drm/amdgpu/powerplay: use proper revision id for navi	Alex Deucher
	The PCI revision id determines the sku. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31	drm/amd/powerplay: fix temperature granularity error in smu11	Kevin Wang
	in this patch, drm/amd/powerplay: add callback function of get_thermal_temperature_range the driver missed temperature granularity change on other temperature. Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31	drm/amd/powerplay: add callback function of get_thermal_temperature_range	Kevin Wang
	1. the thermal temperature is asic related data, move the code logic to xxx_ppt.c. 2. replace data structure PP_TemperatureRange with smu_temperature_range. 3. change temperature uint from temp*1000 to temp (temperature uint). Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-30	drm/amdkfd: Fix byte align on VegaM	Kent Russell
	This was missed during the addition of VegaM support Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-30	tools: bpftool: add support for reporting the effective cgroup progs	Jakub Kicinski
	Takshak said in the original submission: With different bpf attach_flags available to attach bpf programs specially with BPF_F_ALLOW_OVERRIDE and BPF_F_ALLOW_MULTI, the list of effective bpf-programs available to any sub-cgroups really needs to be available for easy debugging. Using BPF_F_QUERY_EFFECTIVE flag, one can get the list of not only attached bpf-programs to a cgroup but also the inherited ones from parent cgroup. So a new option is introduced to use BPF_F_QUERY_EFFECTIVE query flag here to list all the effective bpf-programs available for execution at a specified cgroup. Reused modified test program test_cgroup_attach from tools/testing/selftests/bpf: # ./test_cgroup_attach With old bpftool: # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/ ID AttachType AttachFlags Name 271 egress multi pkt_cntr_1 272 egress multi pkt_cntr_2 Attached new program pkt_cntr_4 in cg2 gives following: # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2 ID AttachType AttachFlags Name 273 egress override pkt_cntr_4 And with new "effective" option it shows all effective programs for cg2: # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2 effective ID AttachType AttachFlags Name 273 egress override pkt_cntr_4 271 egress override pkt_cntr_1 272 egress override pkt_cntr_2 Compared to original submission use a local flag instead of global option. We need to clear query_flags on every command, in case batch mode wants to use varying settings. v2: (Takshak) - forbid duplicated flags; - fix cgroup path freeing. Signed-off-by: Takshak Chahande <ctakshak@fb.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Takshak Chahande <ctakshak@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	selftests/bpf: fix clearing buffered output between tests/subtests	Andrii Nakryiko
	Clear buffered output once test or subtests finishes even if test was successful. Not doing this leads to accumulation of output from previous tests and on first failed tests lots of irrelevant output will be dumped, greatly confusing things. v1->v2: fix Fixes tag, add more context to patch Fixes: 3a516a0a3a7b ("selftests/bpf: add sub-tests support for test_progs") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	Merge branch 'gen-syn-cookie'	Alexei Starovoitov
	Petar Penkov says: ==================== This patch series introduces a BPF helper function that allows generating SYN cookies from BPF. Currently, this helper is enabled at both the TC hook and the XDP hook. The first two patches in the series add/modify several TCP helper functions to allow for SKB-less operation, as is the case at the XDP hook. The third patch introduces the bpf_tcp_gen_syncookie helper function which generates a SYN cookie for either XDP or TC programs. The return value of this function contains both the MSS value, encoded in the cookie, and the cookie itself. The last three patches sync tools/ and add a test. Performance evaluation: I sent 10Mpps to a fixed port on a host with 2 10G bonded Mellanox 4 NICs from random IPv6 source addresses. Without XDP I observed 7.2Mpps (syn-acks) being sent out if the IPv6 packets carry 20 bytes of TCP options or 7.6Mpps if they carry no options. If I attached a simple program that checks if a packet is IPv6/TCP/SYN, looks up the socket, issues a cookie, and sends it back out after swapping src/dest, recomputing the checksum, and setting the ACK flag, I observed 10Mpps being sent back out. Changes since v1: 1/ Added performance numbers to the cover letter 2/ Patch 2: Refactored a bit to fix compilation issues 3/ Patch 3: Changed ENOTSUPP to EOPNOTSUPP at Toke's suggestion Changes since RFC: 1/ Cookie is returned in host order at Alexei's suggestion 2/ If cookies are not enabled via a sysctl, the helper function returns -ENOENT instead of -EINVAL at Lorenz's suggestion 3/ Fixed documentation to properly reflect that MSS is 16 bits at Lorenz's suggestion 4/ BPF helper requires TCP length to match ->doff field, rather than to simply be no more than 20 bytes at Eric and Alexei's suggestion 5/ Packet type is looked up from the packet version field, rather than from the socket. v4 packets are rejected on v6-only sockets but should work with dual stack listeners at Eric's suggestion 6/ Removed unnecessary `net` argument from helper function in patch 2 at Lorenz's suggestion 7/ Changed test to only pass MSS option so we can convince the verifier that the memory access is not out of bounds Note that 7/ below illustrates the verifier might need to be extended to allow passing a variable tcph->doff to the helper function like below: __u32 thlen = tcph->doff * 4; if (thlen < sizeof(*tcph)) return; __s64 cookie = bpf_tcp_gen_syncookie(sk, ipv4h, 20, tcph, thlen); ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	selftests/bpf: add test for bpf_tcp_gen_syncookie	Petar Penkov
	Modify the existing bpf_tcp_check_syncookie test to also generate a SYN cookie, pass the packet to the kernel, and verify that the two cookies are the same (and both valid). Since cloned SKBs are skipped during generic XDP, this test does not issue a SYN cookie when run in XDP mode. We therefore only check that a valid SYN cookie was issued at the TC hook. Additionally, verify that the MSS for that SYN cookie is within expected range. Signed-off-by: Petar Penkov <ppenkov@google.com> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	selftests/bpf: bpf_tcp_gen_syncookie->bpf_helpers	Petar Penkov
	Expose bpf_tcp_gen_syncookie to selftests. Signed-off-by: Petar Penkov <ppenkov@google.com> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	bpf: sync bpf.h to tools/	Petar Penkov
	Sync updated documentation for bpf_redirect_map. Sync the bpf_tcp_gen_syncookie helper function definition with the one in tools/uapi. Signed-off-by: Petar Penkov <ppenkov@google.com> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	bpf: add bpf_tcp_gen_syncookie helper	Petar Penkov
	This helper function allows BPF programs to try to generate SYN cookies, given a reference to a listener socket. The function works from XDP and with an skb context since bpf_skc_lookup_tcp can lookup a socket in both cases. Signed-off-by: Petar Penkov <ppenkov@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	tcp: add skb-less helpers to retrieve SYN cookie	Petar Penkov
	This patch allows generation of a SYN cookie before an SKB has been allocated, as is the case at XDP. Signed-off-by: Petar Penkov <ppenkov@google.com> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	tcp: tcp_syn_flood_action read port from socket	Petar Penkov
	This allows us to call this function before an SKB has been allocated. Signed-off-by: Petar Penkov <ppenkov@google.com> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30	fgraph: Remove redundant ftrace_graph_notrace_addr() test	Changbin Du
	We already have tested it before. The second one should be removed. With this change, the performance should have little improvement. Link: http://lkml.kernel.org/r/20190730140850.7927-1-changbin.du@gmail.com Cc: stable@vger.kernel.org Fixes: 9cd2992f2d6c ("fgraph: Have set_graph_notrace only affect function_graph tracer") Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-30	tracing: Fix header include guards in trace event headers	Masahiro Yamada
	These include guards are broken. Match the #if !define() and #define lines so that they work correctly. Link: http://lkml.kernel.org/r/20190720103943.16982-1-yamada.masahiro@socionext.com Fixes: f54d1867005c3 ("dma-buf: Rename struct fence to dma_fence") Fixes: 2e26ca7150a4f ("tracing: Fix tracepoint.h DECLARE_TRACE() to allow more than one header") Fixes: e543002f77f46 ("qdisc: add tracepoint qdisc:qdisc_dequeue for dequeued SKBs") Fixes: 95f295f9fe081 ("dmaengine: tegra: add tracepoints to driver") Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-30	Merge branch 'dax-fix-5.3-rc3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax fix from Dan Williams: "Fix a botched manual patch update that got dropped between testing and application" * 'dax-fix-5.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Fix missed wakeup in put_unlocked_entry()
2019-07-30	dm table: fix various whitespace issues with recent DAX code	Mike Snitzer
	Also, rename device_synchronous to device_dax_synchronous. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-30	dm table: fix dax_dev NULL dereference in device_synchronous()	Pankaj Gupta
	If a device doesn't support DAX its 'dax_dev' is NULL. Fix device_synchronous() to first check if dax_dev is NULL before dereferencing it. Fixes: 2e9ee0955d3c ("dm: enable synchronous dax") Reported-by: jencce.kernel@gmail.com Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-30	Merge branch 'net-dsa-ksz-Add-Microchip-KSZ87xx-support'	David S. Miller
	Marek Vasut says: ==================== net: dsa: ksz: Add Microchip KSZ87xx support This series adds support for Microchip KSZ87xx switches, which are slightly simpler compared to KSZ9xxx . ==================== Signed-off-by: Marek Vasut <marex@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	net: dsa: ksz: Add Microchip KSZ8795 DSA driver	Tristram Ha
	Add Microchip KSZ8795 DSA driver. Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com> Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	net: dsa: ksz: Add KSZ8795 tag code	Tristram Ha
	Add DSA tag code for Microchip KSZ8795 switch. The switch is simpler and the tag is only 1 byte, instead of 2 as is the case with KSZ9477. Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com> Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	dt-bindings: net: dsa: ksz: document Microchip KSZ87xx family switches	Marek Vasut
	Document Microchip KSZ87xx family switches. These include KSZ8765 - 5 port switch KSZ8794 - 4 port switch KSZ8795 - 5 port switch Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Woojung Huh <woojung.huh@microchip.com> Cc: devicetree@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-31	arm64: compat: vdso: Use legacy syscalls as fallback	Thomas Gleixner
	The generic VDSO implementation uses the Y2038 safe clock_gettime64() and clock_getres_time64() syscalls as fallback for 32bit VDSO. This breaks seccomp setups because these syscalls might be not (yet) allowed. Implement the 32bit variants which use the legacy syscalls and select the variant in the core library. The 64bit time variants are not removed because they are required for the time64 based vdso accessors. Fixes: 00b26474c2f1 ("lib/vdso: Provide generic VDSO implementation") Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Reported-by: Paul Bolle <pebolle@tiscali.nl> Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lkml.kernel.org/r/20190728131648.971361611@linutronix.de
2019-07-31	x86/vdso/32: Use 32bit syscall fallback	Thomas Gleixner
	The generic VDSO implementation uses the Y2038 safe clock_gettime64() and clock_getres_time64() syscalls as fallback for 32bit VDSO. This breaks seccomp setups because these syscalls might be not (yet) allowed. Implement the 32bit variants which use the legacy syscalls and select the variant in the core library. The 64bit time variants are not removed because they are required for the time64 based vdso accessors. Fixes: 7ac870747988 ("x86/vdso: Switch to generic vDSO implementation") Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Reported-by: Paul Bolle <pebolle@tiscali.nl> Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20190728131648.879156507@linutronix.de
2019-07-31	lib/vdso/32: Provide legacy syscall fallbacks	Thomas Gleixner
	To address the regression which causes seccomp to deny applications the access to clock_gettime64() and clock_getres64() syscalls because they are not enabled in the existing filters. That trips over the fact that 32bit VDSOs use the new clock_gettime64() and clock_getres64() syscalls in the fallback path. Add a conditional to invoke the 32bit legacy fallback syscalls instead of the new 64bit variants. The conditional can go away once all architectures are converted. Fixes: 00b26474c2f1 ("lib/vdso: Provide generic VDSO implementation") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907301134470.1738@nanos.tec.linutronix.de
2019-07-31	lib/vdso: Move fallback invocation to the callers	Thomas Gleixner
	To allow syscall fallbacks using the legacy 32bit syscall for 32bit VDSO builds, move the fallback invocation out into the callers. Split the common code out of __cvdso_clock_gettime/getres() and invoke the syscall fallback in the 64bit and 32bit variants. Preparatory work for using legacy syscalls in 32bit VDSO. No functional change. Fixes: 00b26474c2f1 ("lib/vdso: Provide generic VDSO implementation") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lkml.kernel.org/r/20190728131648.695579736@linutronix.de
2019-07-31	lib/vdso/32: Remove inconsistent NULL pointer checks	Thomas Gleixner
	The 32bit variants of vdso_clock_gettime()/getres() have a NULL pointer check for the timespec pointer. That's inconsistent vs. 64bit. But the vdso implementation will never be consistent versus the syscall because the only case which it can handle is NULL. Any other invalid pointer will cause a segfault. So special casing NULL is not really useful. Remove it along with the superflouos syscall fallback invocation as that will return -EFAULT anyway. That also gets rid of the dubious typecast which only works because the pointer is NULL. Fixes: 00b26474c2f1 ("lib/vdso: Provide generic VDSO implementation") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20190728131648.587523358@linutronix.de
2019-07-30	net: dsa: qca8k: enable port flow control	xiaofeis
	Set phy device advertising to enable MAC flow control. Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	Merge branch 'vsock-virtio-optimizations-to-increase-the-throughput'	David S. Miller
	Stefano Garzarella says: ==================== vsock/virtio: optimizations to increase the throughput This series tries to increase the throughput of virtio-vsock with slight changes. While I was testing the v2 of this series I discovered an huge use of memory, so I added patch 1 to mitigate this issue. I put it in this series in order to better track the performance trends. v5: - rebased all patches on net-next - added Stefan's R-b and Michael's A-b v4: https://patchwork.kernel.org/cover/11047717 v3: https://patchwork.kernel.org/cover/10970145 v2: https://patchwork.kernel.org/cover/10938743 v1: https://patchwork.kernel.org/cover/10885431 Below are the benchmarks step by step. I used iperf3 [1] modified with VSOCK support. As Michael suggested in the v1, I booted host and guest with 'nosmap'. A brief description of patches: - Patches 1: limit the memory usage with an extra copy for small packets - Patches 2+3: reduce the number of credit update messages sent to the transmitter - Patches 4+5: allow the host to split packets on multiple buffers and use VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size allowed host -> guest [Gbps] pkt_size before opt p 1 p 2+3 p 4+5 32 0.032 0.030 0.048 0.051 64 0.061 0.059 0.108 0.117 128 0.122 0.112 0.227 0.234 256 0.244 0.241 0.418 0.415 512 0.459 0.466 0.847 0.865 1K 0.927 0.919 1.657 1.641 2K 1.884 1.813 3.262 3.269 4K 3.378 3.326 6.044 6.195 8K 5.637 5.676 10.141 11.287 16K 8.250 8.402 15.976 16.736 32K 13.327 13.204 19.013 20.515 64K 21.241 21.341 20.973 21.879 128K 21.851 22.354 21.816 23.203 256K 21.408 21.693 21.846 24.088 512K 21.600 21.899 21.921 24.106 guest -> host [Gbps] pkt_size before opt p 1 p 2+3 p 4+5 32 0.045 0.046 0.057 0.057 64 0.089 0.091 0.103 0.104 128 0.170 0.179 0.192 0.200 256 0.364 0.351 0.361 0.379 512 0.709 0.699 0.731 0.790 1K 1.399 1.407 1.395 1.427 2K 2.670 2.684 2.745 2.835 4K 5.171 5.199 5.305 5.451 8K 8.442 8.500 10.083 9.941 16K 12.305 12.259 13.519 15.385 32K 11.418 11.150 11.988 24.680 64K 10.778 10.659 11.589 35.273 128K 10.421 10.339 10.939 40.338 256K 10.300 9.719 10.508 36.562 512K 9.833 9.808 10.612 35.979 As Stefan suggested in the v1, I measured also the efficiency in this way: efficiency = Mbps / (%CPU_Host + %CPU_Guest) The '%CPU_Guest' is taken inside the VM. I know that it is not the best way, but it's provided for free from iperf3 and could be an indication. host -> guest efficiency [Mbps / (%CPU_Host + %CPU_Guest)] pkt_size before opt p 1 p 2+3 p 4+5 32 0.35 0.45 0.79 1.02 64 0.56 0.80 1.41 1.54 128 1.11 1.52 3.03 3.12 256 2.20 2.16 5.44 5.58 512 4.17 4.18 10.96 11.46 1K 8.30 8.26 20.99 20.89 2K 16.82 16.31 39.76 39.73 4K 30.89 30.79 74.07 75.73 8K 53.74 54.49 124.24 148.91 16K 80.68 83.63 200.21 232.79 32K 132.27 132.52 260.81 357.07 64K 229.82 230.40 300.19 444.18 128K 332.60 329.78 331.51 492.28 256K 331.06 337.22 339.59 511.59 512K 335.58 328.50 331.56 504.56 guest -> host efficiency [Mbps / (%CPU_Host + %CPU_Guest)] pkt_size before opt p 1 p 2+3 p 4+5 32 0.43 0.43 0.53 0.56 64 0.85 0.86 1.04 1.10 128 1.63 1.71 2.07 2.13 256 3.48 3.35 4.02 4.22 512 6.80 6.67 7.97 8.63 1K 13.32 13.31 15.72 15.94 2K 25.79 25.92 30.84 30.98 4K 50.37 50.48 58.79 59.69 8K 95.90 96.15 107.04 110.33 16K 145.80 145.43 143.97 174.70 32K 147.06 144.74 146.02 282.48 64K 145.25 143.99 141.62 406.40 128K 149.34 146.96 147.49 489.34 256K 156.35 149.81 152.21 536.37 512K 151.65 150.74 151.52 519.93 [1] https://github.com/stefano-garzarella/iperf/ ==================== Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	vsock/virtio: change the maximum packet size allowed	Stefano Garzarella
	Since now we are able to split packets, we can avoid limiting their sizes to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE. Instead, we can use VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	vhost/vsock: split packets to send using multiple buffers	Stefano Garzarella
	If the packets to sent to the guest are bigger than the buffer available, we can split them, using multiple buffers and fixing the length in the packet header. This is safe since virtio-vsock supports only stream sockets. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	vsock/virtio: fix locking in virtio_transport_inc_tx_pkt()	Stefano Garzarella
	fwd_cnt and last_fwd_cnt are protected by rx_lock, so we should use the same spinlock also if we are in the TX path. Move also buf_alloc under the same lock. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	vsock/virtio: reduce credit update messages	Stefano Garzarella
	In order to reduce the number of credit update messages, we send them only when the space available seen by the transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	vsock/virtio: limit the memory used per-socket	Stefano Garzarella
	Since virtio-vsock was introduced, the buffers filled by the host and pushed to the guest using the vring, are directly queued in a per-socket list. These buffers are preallocated by the guest with a fixed size (4 KB). The maximum amount of memory used by each socket should be controlled by the credit mechanism. The default credit available per-socket is 256 KB, but if we use only 1 byte per packet, the guest can queue up to 262144 of 4 KB buffers, using up to 1 GB of memory per-socket. In addition, the guest will continue to fill the vring with new 4 KB free buffers to avoid starvation of other sockets. This patch mitigates this issue copying the payload of small packets (< 128 bytes) into the buffer of last packet queued, in order to avoid wasting memory. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	selftests/livepatch: push and pop dynamic debug config	Joe Lawrence
	The livepatching self-tests tweak the dynamic debug config to verify the kernel log during the tests. Enhance set_dynamic_debug() so that the config changes are restored when the script exits. Note this functionality needs to keep in sync with: - dynamic_debug input/output formatting - functions affected by set_dynamic_debug() For example, push_dynamic_debug() transforms: kernel/livepatch/transition.c:530 [livepatch]klp_init_transition =_ "'%s': initializing %s transition\012" to the following: file kernel/livepatch/transition.c line 530 =_ Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Tested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2019-07-30	kselftest: save-and-restore errno to allow for %m formatting	Aleksa Sarai
	Previously, using "%m" in a ksft_* format string can result in strange output because the errno value wasn't saved before calling other libc functions. The solution is to simply save and restore the errno before we format the user-supplied format string. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2019-07-30	compat_ioctl: pppoe: fix PPPOEIOCSFWD handling	Arnd Bergmann
	Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in linux-2.5.69 along with hundreds of other commands, but was always broken sincen only the structure is compatible, but the command number is not, due to the size being sizeof(size_t), or at first sizeof(sizeof((struct sockaddr_pppox)), which is different on 64-bit architectures. Guillaume Nault adds: And the implementation was broken until 2016 (see 29e73269aa4d ("pppoe: fix reference counting in PPPoE proxy")), and nobody ever noticed. I should probably have removed this ioctl entirely instead of fixing it. Clearly, it has never been used. Fix it by adding a compat_ioctl handler for all pppoe variants that translates the command number and then calls the regular ioctl function. All other ioctl commands handled by pppoe are compatible between 32-bit and 64-bit, and require compat_ptr() conversion. This should apply to all stable kernels. Acked-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	tipc: fix unitilized skb list crash	Jon Maloy
	Our test suite somtimes provokes the following crash: Description of problem: [ 1092.597234] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e8 [ 1092.605072] PGD 0 P4D 0 [ 1092.607620] Oops: 0000 [#1] SMP PTI [ 1092.611118] CPU: 37 PID: 0 Comm: swapper/37 Kdump: loaded Not tainted 4.18.0-122.el8.x86_64 #1 [ 1092.619724] Hardware name: Dell Inc. PowerEdge R740/08D89F, BIOS 1.3.7 02/08/2018 [ 1092.627215] RIP: 0010:tipc_mcast_filter_msg+0x93/0x2d0 [tipc] [ 1092.632955] Code: 0f 84 aa 01 00 00 89 cf 4d 01 ca 4c 8b 26 c1 ef 19 83 e7 0f 83 ff 0c 4d 0f 45 d1 41 8b 6a 10 0f cd 4c 39 e6 0f 84 81 01 00 00 <4d> 8b 9c 24 e8 00 00 00 45 8b 13 41 0f ca 44 89 d7 c1 ef 13 83 e7 [ 1092.651703] RSP: 0018:ffff929e5fa83a18 EFLAGS: 00010282 [ 1092.656927] RAX: ffff929e3fb38100 RBX: 00000000069f29ee RCX: 00000000416c0045 [ 1092.664058] RDX: ffff929e5fa83a88 RSI: ffff929e31a28420 RDI: 0000000000000000 [ 1092.671209] RBP: 0000000029b11821 R08: 0000000000000000 R09: ffff929e39b4407a [ 1092.678343] R10: ffff929e39b4407a R11: 0000000000000007 R12: 0000000000000000 [ 1092.685475] R13: 0000000000000001 R14: ffff929e3fb38100 R15: ffff929e39b4407a [ 1092.692614] FS: 0000000000000000(0000) GS:ffff929e5fa80000(0000) knlGS:0000000000000000 [ 1092.700702] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1092.706447] CR2: 00000000000000e8 CR3: 000000031300a004 CR4: 00000000007606e0 [ 1092.713579] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1092.720712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1092.727843] PKRU: 55555554 [ 1092.730556] Call Trace: [ 1092.733010] <IRQ> [ 1092.735034] tipc_sk_filter_rcv+0x7ca/0xb80 [tipc] [ 1092.739828] ? __kmalloc_node_track_caller+0x1cb/0x290 [ 1092.744974] ? dev_hard_start_xmit+0xa5/0x210 [ 1092.749332] tipc_sk_rcv+0x389/0x640 [tipc] [ 1092.753519] tipc_sk_mcast_rcv+0x23c/0x3a0 [tipc] [ 1092.758224] tipc_rcv+0x57a/0xf20 [tipc] [ 1092.762154] ? ktime_get_real_ts64+0x40/0xe0 [ 1092.766432] ? tpacket_rcv+0x50/0x9f0 [ 1092.770098] tipc_l2_rcv_msg+0x4a/0x70 [tipc] [ 1092.774452] __netif_receive_skb_core+0xb62/0xbd0 [ 1092.779164] ? enqueue_entity+0xf6/0x630 [ 1092.783084] ? kmem_cache_alloc+0x158/0x1c0 [ 1092.787272] ? __build_skb+0x25/0xd0 [ 1092.790849] netif_receive_skb_internal+0x42/0xf0 [ 1092.795557] napi_gro_receive+0xba/0xe0 [ 1092.799417] mlx5e_handle_rx_cqe+0x83/0xd0 [mlx5_core] [ 1092.804564] mlx5e_poll_rx_cq+0xd5/0x920 [mlx5_core] [ 1092.809536] mlx5e_napi_poll+0xb2/0xce0 [mlx5_core] [ 1092.814415] ? __wake_up_common_lock+0x89/0xc0 [ 1092.818861] net_rx_action+0x149/0x3b0 [ 1092.822616] __do_softirq+0xe3/0x30a [ 1092.826193] irq_exit+0x100/0x110 [ 1092.829512] do_IRQ+0x85/0xd0 [ 1092.832483] common_interrupt+0xf/0xf [ 1092.836147] </IRQ> [ 1092.838255] RIP: 0010:cpuidle_enter_state+0xb7/0x2a0 [ 1092.843221] Code: e8 3e 79 a5 ff 80 7c 24 03 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 d7 01 00 00 31 ff e8 a0 6b ab ff fb 66 0f 1f 44 00 00 <48> b8 ff ff ff ff f3 01 00 00 4c 29 f3 ba ff ff ff 7f 48 39 c3 7f [ 1092.861967] RSP: 0018:ffffaa5ec6533e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd [ 1092.869530] RAX: ffff929e5faa3100 RBX: 000000fe63dd2092 RCX: 000000000000001f [ 1092.876665] RDX: 000000fe63dd2092 RSI: 000000003a518aaa RDI: 0000000000000000 [ 1092.883795] RBP: 0000000000000003 R08: 0000000000000004 R09: 0000000000022940 [ 1092.890929] R10: 0000040cb0666b56 R11: ffff929e5faa20a8 R12: ffff929e5faade78 [ 1092.898060] R13: ffffffffb59258f8 R14: 000000fe60f3228d R15: 0000000000000000 [ 1092.905196] ? cpuidle_enter_state+0x92/0x2a0 [ 1092.909555] do_idle+0x236/0x280 [ 1092.912785] cpu_startup_entry+0x6f/0x80 [ 1092.916715] start_secondary+0x1a7/0x200 [ 1092.920642] secondary_startup_64+0xb7/0xc0 [...] The reason is that the skb list tipc_socket::mc_method.deferredq only is initialized for connectionless sockets, while nothing stops arriving multicast messages from being filtered by connection oriented sockets, with subsequent access to the said list. We fix this by initializing the list unconditionally at socket creation. This eliminates the crash, while the message still is dropped further down in tipc_sk_filter_rcv() as it should be. Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	net: Remove dev_err() usage after platform_get_irq()	Stephen Boyd
	We don't need dev_err() messages when platform_get_irq() fails now that platform_get_irq() prints an error message itself when something goes wrong. Let's remove these prints with a simple semantic patch. // <smpl> @@ expression ret; struct platform_device *E; @@ ret = ( platform_get_irq(E, ...) \| platform_get_irq_byname(E, ...) ); if ( \( ret < 0 \\| ret <= 0 \) ) { ( -if (ret != -EPROBE_DEFER) -{ ... -dev_err(...); -... } \| ... -dev_err(...); ) ... } // </smpl> While we're here, remove braces on if statements that only have one statement (manually). Cc: "David S. Miller" <davem@davemloft.net> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Felix Fietkau <nbd@nbd.name> Cc: Lorenzo Bianconi <lorenzo@kernel.org> Cc: netdev@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	Merge branch 'Finish-conversion-of-skb_frag_t-to-bio_vec'	David S. Miller
	Jonathan Lemon says: ==================== Finish conversion of skb_frag_t to bio_vec The recent conversion of skb_frag_t to bio_vec did not include skb_frag's page_offset. Add accessor functions for this field, utilize them, and remove the union, restoring the original structure. v2: - rename accessors - follow kdoc conventions ==================== Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	linux: Remove bvec page_offset, use bv_offset	Jonathan Lemon
	Now that page_offset is referenced through accessors, remove the union, and use bv_offset. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	net: Use skb_frag_off accessors	Jonathan Lemon
	Use accessor functions for skb fragment's page_offset instead of direct references, in preparation for bvec conversion. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	linux: Add skb_frag_t page_offset accessors	Jonathan Lemon
	Add skb_frag_off(), skb_frag_off_add(), skb_frag_off_set(), and skb_frag_off_copy() accessors for page_offset. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	Merge branch 'sctp-clean-up-sctp_connect-function'	David S. Miller
	Xin Long says: ==================== sctp: clean up __sctp_connect function This patchset is to factor out some common code for sctp_sendmsg_new_asoc() and __sctp_connect() into 2 new functioins. v1->v2: - add the patch 1/5 to avoid a slab-out-of-bounds warning. - add some code comment for the check change in patch 2/5. - remove unused 'addrcnt' as Marcelo noticed in patch 3/5. ==================== Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	sctp: factor out sctp_connect_add_peer	Xin Long
	In this function factored out from sctp_sendmsg_new_asoc() and __sctp_connect(), it adds a peer with the other addr into the asoc after this asoc is created with the 1st addr. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30	sctp: factor out sctp_connect_new_asoc	Xin Long
	In this function factored out from sctp_sendmsg_new_asoc() and __sctp_connect(), it creates the asoc and adds a peer with the 1st addr. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>