summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-05-25Merge tag 'net-next-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Support TCPv6 segmentation offload with super-segments larger than 64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP). - Generalize skb freeing deferral to per-cpu lists, instead of per-socket lists. - Add a netdev statistic for packets dropped due to L2 address mismatch (rx_otherhost_dropped). - Continue work annotating skb drop reasons. - Accept alternative netdev names (ALT_IFNAME) in more netlink requests. - Add VLAN support for AF_PACKET SOCK_RAW GSO. - Allow receiving skb mark from the socket as a cmsg. - Enable memcg accounting for veth queues, sysctl tables and IPv6. BPF --- - Add libbpf support for User Statically-Defined Tracing (USDTs). - Speed up symbol resolution for kprobes multi-link attachments. - Support storing typed pointers to referenced and unreferenced objects in BPF maps. - Add support for BPF link iterator. - Introduce access to remote CPU map elements in BPF per-cpu map. - Allow middle-of-the-road settings for the kernel.unprivileged_bpf_disabled sysctl. - Implement basic types of dynamic pointers e.g. to allow for dynamically sized ringbuf reservations without extra memory copies. Protocols --------- - Retire port only listening_hash table, add a second bind table hashed by port and address. Avoid linear list walk when binding to very popular ports (e.g. 443). - Add bridge FDB bulk flush filtering support allowing user space to remove all FDB entries matching a condition. - Introduce accept_unsolicited_na sysctl for IPv6 to implement router-side changes for RFC9131. - Support for MPTCP path manager in user space. - Add MPTCP support for fallback to regular TCP for connections that have never connected additional subflows or transmitted out-of-sequence data (partial support for RFC8684 fallback). - Avoid races in MPTCP-level window tracking, stabilize and improve throughput. - Support lockless operation of GRE tunnels with seq numbers enabled. - WiFi support for host based BSS color collision detection. - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets. - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2). - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile). - Allow matching on the number of VLAN tags via tc-flower. - Add tracepoint for tcp_set_ca_state(). Driver API ---------- - Improve error reporting from classifier and action offload. - Add support for listing line cards in switches (devlink). - Add helpers for reporting page pool statistics with ethtool -S. - Add support for reading clock cycles when using PTP virtual clocks, instead of having the driver convert to time before reporting. This makes it possible to report time from different vclocks. - Support configuring low-latency Tx descriptor push via ethtool. - Separate Clause 22 and Clause 45 MDIO accesses more explicitly. New hardware / drivers ---------------------- - Ethernet: - Marvell's Octeon NIC PCI Endpoint support (octeon_ep) - Sunplus SP7021 SoC (sp7021_emac) - Add support for Renesas RZ/V2M (in ravb) - Add support for MediaTek mt7986 switches (in mtk_eth_soc) - Ethernet PHYs: - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting) - TI DP83TD510 PHY - Microchip LAN8742/LAN88xx PHYs - WiFi: - Driver for pureLiFi X, XL, XC devices (plfxlc) - Driver for Silicon Labs devices (wfx) - Support for WCN6750 (in ath11k) - Support Realtek 8852ce devices (in rtw89) - Mobile: - MediaTek T700 modems (Intel 5G 5000 M.2 cards) - CAN: - ctucanfd: add support for CTU CAN FD open-source IP core from Czech Technical University in Prague Drivers ------- - Delete a number of old drivers still using virt_to_bus(). - Ethernet NICs: - intel: support TSO on tunnels MPLS - broadcom: support multi-buffer XDP - nfp: support VF rate limiting - sfc: use hardware tx timestamps for more than PTP - mlx5: multi-port eswitch support - hyper-v: add support for XDP_REDIRECT - atlantic: XDP support (including multi-buffer) - macb: improve real-time perf by deferring Tx processing to NAPI - High-speed Ethernet switches: - mlxsw: implement basic line card information querying - prestera: add support for traffic policing on ingress and egress - Embedded Ethernet switches: - lan966x: add support for packet DMA (FDMA) - lan966x: add support for PTP programmable pins - ti: cpsw_new: enable bc/mc storm prevention - Qualcomm 802.11ax WiFi (ath11k): - Wake-on-WLAN support for QCA6390 and WCN6855 - device recovery (firmware restart) support - support setting Specific Absorption Rate (SAR) for WCN6855 - read country code from SMBIOS for WCN6855/QCA6390 - enable keep-alive during WoWLAN suspend - implement remain-on-channel support - MediaTek WiFi (mt76): - support Wireless Ethernet Dispatch offloading packet movement between the Ethernet switch and WiFi interfaces - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 IPv6 NS offload support - Ethernet PHYs: - micrel: ksz9031/ksz9131: cabletest support - lan87xx: SQI support for T1 PHYs - lan937x: add interrupt support for link detection" * tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits) ptp: ocp: Add firmware header checks ptp: ocp: fix PPS source selector debugfs reporting ptp: ocp: add .init function for sma_op vector ptp: ocp: vectorize the sma accessor functions ptp: ocp: constify selectors ptp: ocp: parameterize input/output sma selectors ptp: ocp: revise firmware display ptp: ocp: add Celestica timecard PCI ids ptp: ocp: Remove #ifdefs around PCI IDs ptp: ocp: 32-bit fixups for pci start address Revert "net/smc: fix listen processing for SMC-Rv2" ath6kl: Use cc-disable-warning to disable -Wdangling-pointer selftests/bpf: Dynptr tests bpf: Add dynptr data slices bpf: Add bpf_dynptr_read and bpf_dynptr_write bpf: Dynptr support for ring buffers bpf: Add bpf_dynptr_from_mem for local dynptrs bpf: Add verifier support for dynptrs bpf: Suppress 'passing zero to PTR_ERR' warning bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack ...
2022-05-25Merge tag 'linux-kselftest-kunit-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit updates from Shuah Khan: "Several fixes, cleanups, and enhancements to tests and framework: - introduce _NULL and _NOT_NULL macros to pointer error checks - rework kunit_resource allocation policy to fix memory leaks when caller doesn't specify free() function to be used when allocating memory using kunit_add_resource() and kunit_alloc_resource() funcs. - add ability to specify suite-level init and exit functions" * tag 'linux-kselftest-kunit-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (41 commits) kunit: tool: Use qemu-system-i386 for i386 runs kunit: fix executor OOM error handling logic on non-UML kunit: tool: update riscv QEMU config with new serial dependency kcsan: test: use new suite_{init,exit} support kunit: tool: Add list of all valid test configs on UML kunit: take `kunit_assert` as `const` kunit: tool: misc cleanups kunit: tool: minor cosmetic cleanups in kunit_parser.py kunit: tool: make parser stop overwriting status of suites w/ no_tests kunit: tool: remove dead parse_crash_in_log() logic kunit: tool: print clearer error message when there's no TAP output kunit: tool: stop using a shell to run kernel under QEMU kunit: tool: update test counts summary line format kunit: bail out of test filtering logic quicker if OOM lib/Kconfig.debug: change KUnit tests to default to KUNIT_ALL_TESTS kunit: Rework kunit_resource allocation policy kunit: fix debugfs code to use enum kunit_status, not bool kfence: test: use new suite_{init/exit} support, add .kunitconfig kunit: add ability to specify suite-level init and exit functions kunit: rename print_subtest_{start,end} for clarity (s/subtest/suite) ...
2022-05-24Merge tag 'kernel-hardening-v5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: - usercopy hardening expanded to check other allocation types (Matthew Wilcox, Yuanzheng Song) - arm64 stackleak behavioral improvements (Mark Rutland) - arm64 CFI code gen improvement (Sami Tolvanen) - LoadPin LSM block dev API adjustment (Christoph Hellwig) - Clang randstruct support (Bill Wendling, Kees Cook) * tag 'kernel-hardening-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (34 commits) loadpin: stop using bdevname mm: usercopy: move the virt_addr_valid() below the is_vmalloc_addr() gcc-plugins: randstruct: Remove cast exception handling af_unix: Silence randstruct GCC plugin warning niu: Silence randstruct warnings big_keys: Use struct for internal payload gcc-plugins: Change all version strings match kernel randomize_kstack: Improve docs on requirements/rationale lkdtm/stackleak: fix CONFIG_GCC_PLUGIN_STACKLEAK=n arm64: entry: use stackleak_erase_on_task_stack() stackleak: add on/off stack variants lkdtm/stackleak: check stack boundaries lkdtm/stackleak: prevent unexpected stack usage lkdtm/stackleak: rework boundary management lkdtm/stackleak: avoid spurious failure stackleak: rework poison scanning stackleak: rework stack high bound handling stackleak: clarify variable names stackleak: rework stack low bound handling stackleak: remove redundant check ...
2022-05-24Merge tag 'random-5.19-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "These updates continue to refine the work began in 5.17 and 5.18 of modernizing the RNG's crypto and streamlining and documenting its code. New for 5.19, the updates aim to improve entropy collection methods and make some initial decisions regarding the "premature next" problem and our threat model. The cloc utility now reports that random.c is 931 lines of code and 466 lines of comments, not that basic metrics like that mean all that much, but at the very least it tells you that this is very much a manageable driver now. Here's a summary of the various updates: - The random_get_entropy() function now always returns something at least minimally useful. This is the primary entropy source in most collectors, which in the best case expands to something like RDTSC, but prior to this change, in the worst case it would just return 0, contributing nothing. For 5.19, additional architectures are wired up, and architectures that are entirely missing a cycle counter now have a generic fallback path, which uses the highest resolution clock available from the timekeeping subsystem. Some of those clocks can actually be quite good, despite the CPU not having a cycle counter of its own, and going off-core for a stamp is generally thought to increase jitter, something positive from the perspective of entropy gathering. Done very early on in the development cycle, this has been sitting in next getting some testing for a while now and has relevant acks from the archs, so it should be pretty well tested and fine, but is nonetheless the thing I'll be keeping my eye on most closely. - Of particular note with the random_get_entropy() improvements is MIPS, which, on CPUs that lack the c0 count register, will now combine the high-speed but short-cycle c0 random register with the lower-speed but long-cycle generic fallback path. - With random_get_entropy() now always returning something useful, the interrupt handler now collects entropy in a consistent construction. - Rather than comparing two samples of random_get_entropy() for the jitter dance, the algorithm now tests many samples, and uses the amount of differing ones to determine whether or not jitter entropy is usable and how laborious it must be. The problem with comparing only two samples was that if the cycle counter was extremely slow, but just so happened to be on the cusp of a change, the slowness wouldn't be detected. Taking many samples fixes that to some degree. This, combined with the other improvements to random_get_entropy(), should make future unification of /dev/random and /dev/urandom maybe more possible. At the very least, were we to attempt it again today (we're not), it wouldn't break any of Guenter's test rigs that broke when we tried it with 5.18. So, not today, but perhaps down the road, that's something we can revisit. - We attempt to reseed the RNG immediately upon waking up from system suspend or hibernation, making use of the various timestamps about suspend time and such available, as well as the usual inputs such as RDRAND when available. - Batched randomness now falls back to ordinary randomness before the RNG is initialized. This provides more consistent guarantees to the types of random numbers being returned by the various accessors. - The "pre-init injection" code is now gone for good. I suspect you in particular will be happy to read that, as I recall you expressing your distaste for it a few months ago. Instead, to avoid a "premature first" issue, while still allowing for maximal amount of entropy availability during system boot, the first 128 bits of estimated entropy are used immediately as it arrives, with the next 128 bits being buffered. And, as before, after the RNG has been fully initialized, it winds up reseeding anyway a few seconds later in most cases. This resulted in a pretty big simplification of the initialization code and let us remove various ad-hoc mechanisms like the ugly crng_pre_init_inject(). - The RNG no longer pretends to handle the "premature next" security model, something that various academics and other RNG designs have tried to care about in the past. After an interesting mailing list thread, these issues are thought to be a) mainly academic and not practical at all, and b) actively harming the real security of the RNG by delaying new entropy additions after a potential compromise, making a potentially bad situation even worse. As well, in the first place, our RNG never even properly handled the premature next issue, so removing an incomplete solution to a fake problem was particularly nice. This allowed for numerous other simplifications in the code, which is a lot cleaner as a consequence. If you didn't see it before, https://lore.kernel.org/lkml/YmlMGx6+uigkGiZ0@zx2c4.com/ may be a thread worth skimming through. - While the interrupt handler received a separate code path years ago that avoids locks by using per-cpu data structures and a faster mixing algorithm, in order to reduce interrupt latency, input and disk events that are triggered in hardirq handlers were still hitting locks and more expensive algorithms. Those are now redirected to use the faster per-cpu data structures. - Rather than having the fake-crypto almost-siphash-based random32 implementation be used right and left, and in many places where cryptographically secure randomness is desirable, the batched entropy code is now fast enough to replace that. - As usual, numerous code quality and documentation cleanups. For example, the initialization state machine now uses enum symbolic constants instead of just hard coding numbers everywhere. - Since the RNG initializes once, and then is always initialized thereafter, a pretty heavy amount of code used during that initialization is never used again. It is now completely cordoned off using static branches and it winds up in the .text.unlikely section so that it doesn't reduce cache compactness after the RNG is ready. - A variety of functions meant for waiting on the RNG to be initialized were only used by vsprintf, and in not a particularly optimal way. Replacing that usage with a more ordinary setup made it possible to remove those functions. - A cleanup of how we warn userspace about the use of uninitialized /dev/urandom and uninitialized get_random_bytes() usage. Interestingly, with the change you merged for 5.18 that attempts to use jitter (but does not block if it can't), the majority of users should never see those warnings for /dev/urandom at all now, and the one for in-kernel usage is mainly a debug thing. - The file_operations struct for /dev/[u]random now implements .read_iter and .write_iter instead of .read and .write, allowing it to also implement .splice_read and .splice_write, which makes splice(2) work again after it was broken here (and in many other places in the tree) during the set_fs() removal. This was a bit of a last minute arrival from Jens that hasn't had as much time to bake, so I'll be keeping my eye on this as well, but it seems fairly ordinary. Unfortunately, read_iter() is around 3% slower than read() in my tests, which I'm not thrilled about. But Jens and Al, spurred by this observation, seem to be making progress in removing the bottlenecks on the iter paths in the VFS layer in general, which should remove the performance gap for all drivers. - Assorted other bug fixes, cleanups, and optimizations. - A small SipHash cleanup" * tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (49 commits) random: check for signals after page of pool writes random: wire up fops->splice_{read,write}_iter() random: convert to using fops->write_iter() random: convert to using fops->read_iter() random: unify batched entropy implementations random: move randomize_page() into mm where it belongs random: remove mostly unused async readiness notifier random: remove get_random_bytes_arch() and add rng_has_arch_random() random: move initialization functions out of hot pages random: make consistent use of buf and len random: use proper return types on get_random_{int,long}_wait() random: remove extern from functions in header random: use static branch for crng_ready() random: credit architectural init the exact amount random: handle latent entropy and command line from random_init() random: use proper jiffies comparison macro random: remove ratelimiting for in-kernel unseeded randomness random: move initialization out of reseeding hot path random: avoid initializing twice in credit race random: use symbolic constants for crng_init states ...
2022-05-24Revert "net/smc: fix listen processing for SMC-Rv2"liuyacan
This reverts commit 8c3b8dc5cc9bf6d273ebe18b16e2d6882bcfb36d. Some rollback issue will be fixed in other patches in the future. Link: https://lore.kernel.org/all/20220523055056.2078994-1-liuyacan@corp.netease.com/ Fixes: 8c3b8dc5cc9b ("net/smc: fix listen processing for SMC-Rv2") Signed-off-by: liuyacan <liuyacan@corp.netease.com> Link: https://lore.kernel.org/r/20220524090230.2140302-1-liuyacan@corp.netease.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/ethernet/cadence/macb_main.c 5cebb40bc955 ("net: macb: Fix PTP one step sync support") 138badbc21a0 ("net: macb: use NAPI for TX completion path") https://lore.kernel.org/all/20220523111021.31489367@canb.auug.org.au/ net/smc/af_smc.c 75c1edf23b95 ("net/smc: postpone sk_refcnt increment in connect()") 3aba103006bc ("net/smc: align the connect behaviour with TCP") https://lore.kernel.org/all/20220524114408.4bf1af38@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-23Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-05-23 We've added 113 non-merge commits during the last 26 day(s) which contain a total of 121 files changed, 7425 insertions(+), 1586 deletions(-). The main changes are: 1) Speed up symbol resolution for kprobes multi-link attachments, from Jiri Olsa. 2) Add BPF dynamic pointer infrastructure e.g. to allow for dynamically sized ringbuf reservations without extra memory copies, from Joanne Koong. 3) Big batch of libbpf improvements towards libbpf 1.0 release, from Andrii Nakryiko. 4) Add BPF link iterator to traverse links via seq_file ops, from Dmitrii Dolgov. 5) Add source IP address to BPF tunnel key infrastructure, from Kaixi Fan. 6) Refine unprivileged BPF to disable only object-creating commands, from Alan Maguire. 7) Fix JIT blinding of ld_imm64 when they point to subprogs, from Alexei Starovoitov. 8) Add BPF access to mptcp_sock structures and their meta data, from Geliang Tang. 9) Add new BPF helper for access to remote CPU's BPF map elements, from Feng Zhou. 10) Allow attaching 64-bit cookie to BPF link of fentry/fexit/fmod_ret, from Kui-Feng Lee. 11) Follow-ups to typed pointer support in BPF maps, from Kumar Kartikeya Dwivedi. 12) Add busy-poll test cases to the XSK selftest suite, from Magnus Karlsson. 13) Improvements in BPF selftest test_progs subtest output, from Mykola Lysenko. 14) Fill bpf_prog_pack allocator areas with illegal instructions, from Song Liu. 15) Add generic batch operations for BPF map-in-map cases, from Takshak Chahande. 16) Make bpf_jit_enable more user friendly when permanently on 1, from Tiezhu Yang. 17) Fix an array overflow in bpf_trampoline_get_progs(), from Yuntao Wang. ==================== Link: https://lore.kernel.org/r/20220523223805.27931-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-23Merge tag 'for-net-next-2022-05-23' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for Realtek 8761BUV - Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk - Add support for RTL8852C - Add a new PID/VID 0489/e0c8 for MT7921 - Add support for Qualcomm WCN785x * tag 'for-net-next-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (26 commits) Bluetooth: hci_sync: use hci_skb_event() helper Bluetooth: eir: Add helpers for managing service data Bluetooth: hci_sync: Fix attempting to suspend with unfiltered passive scan Bluetooth: MGMT: Add conditions for setting HCI_CONN_FLAG_REMOTE_WAKEUP Bluetooth: btmtksdio: fix the reset takes too long Bluetooth: btmtksdio: fix possible FW initialization failure Bluetooth: btmtksdio: fix use-after-free at btmtksdio_recv_event Bluetooth: btbcm: Add entry for BCM4373A0 UART Bluetooth Bluetooth: btusb: Add a new PID/VID 0489/e0c8 for MT7921 Bluetooth: btusb: Add 0x0bda:0x8771 Realtek 8761BUV devices Bluetooth: btusb: Set HCI_QUIRK_BROKEN_ERR_DATA_REPORTING for QCA Bluetooth: core: Fix missing power_on work cancel on HCI close Bluetooth: btusb: add support for Qualcomm WCN785x Bluetooth: protect le accept and resolv lists with hdev->lock Bluetooth: use hdev lock for accept_list and reject_list in conn req Bluetooth: use hdev lock in activate_scan for hci_is_adv_monitoring Bluetooth: btrtl: Add support for RTL8852C Bluetooth: btusb: Set HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN for QCA Bluetooth: Print broken quirks Bluetooth: HCI: Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk ... ==================== Link: https://lore.kernel.org/r/20220523204151.3327345-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-23Bluetooth: hci_conn: Fix hci_connect_le_syncLuiz Augusto von Dentz
The handling of connection failures shall be handled by the request completion callback as already done by hci_cs_le_create_conn, also make sure to use hci_conn_failed instead of hci_le_conn_failed as the later don't actually call hci_conn_del to cleanup. Link: https://github.com/bluez/bluez/issues/340 Fixes: 8e8b92ee60de5 ("Bluetooth: hci_sync: Add hci_le_create_conn_sync") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-05-23Merge tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring 'more data in socket' support from Jens Axboe: "To be able to fully utilize the 'poll first' support in the core io_uring branch, it's advantageous knowing if the socket was empty after a receive. This adds support for that" * tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: return hint on whether more data is available after receive tcp: pass back data left in socket after receive
2022-05-23Merge tag 'for-5.19/io_uring-socket-2022-05-22' of ↵Linus Torvalds
git://git.kernel.dk/linux-block Pull io_uring socket() support from Jens Axboe: "This adds support for socket(2) for io_uring. This is handy when using direct / registered file descriptors with io_uring. Outside of those two patches, a small series from Dylan on top that improves the tracing by providing a text representation of the opcode rather than needing to decode this by reading the header file every time. That sits in this branch as it was the last opcode added (until it wasn't...)" * tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: use the text representation of ops in trace io_uring: rename op -> opcode io_uring: add io_uring_get_opcode io_uring: add type to op enum io_uring: add socket(2) support net: add __sys_socket_file()
2022-05-23Bluetooth: hci_sync: use hci_skb_event() helperAhmad Fatoum
This instance is the only opencoded version of the macro, so have it follow suit. Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-05-23net: dsa: OF-ware slave_mii_busLuiz Angelo Daros de Luca
If found, register the DSA internally allocated slave_mii_bus with an OF "mdio" child object. It can save some drivers from creating their custom internal MDIO bus. Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-23net/smc: fix listen processing for SMC-Rv2liuyacan
In the process of checking whether RDMAv2 is available, the current implementation first sets ini->smcrv2.ib_dev_v2, and then allocates smc buf desc, but the latter may fail. Unfortunately, the caller will only check the former. In this case, a NULL pointer reference will occur in smc_clc_send_confirm_accept() when accessing conn->rmb_desc. This patch does two things: 1. Use the return code to determine whether V2 is available. 2. If the return code is NODEV, continue to check whether V1 is available. Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2") Signed-off-by: liuyacan <liuyacan@corp.netease.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-23net/smc: postpone sk_refcnt increment in connect()liuyacan
Same trigger condition as commit 86434744. When setsockopt runs in parallel to a connect(), and switch the socket into fallback mode. Then the sk_refcnt is incremented in smc_connect(), but its state stay in SMC_INIT (NOT SMC_ACTIVE). This cause the corresponding sk_refcnt decrement in __smc_release() will not be performed. Fixes: 86434744fedf ("net/smc: add fallback check to connect()") Signed-off-by: liuyacan <liuyacan@corp.netease.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22net: wrap the wireless pointers in struct net_device in an ifdefJakub Kicinski
Most protocol-specific pointers in struct net_device are under a respective ifdef. Wireless is the notable exception. Since there's a sizable number of custom-built kernels for datacenter workloads which don't build wireless it seems reasonable to ifdefy those pointers as well. While at it move IPv4 and IPv6 pointers up, those are special for obvious reasons. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154 Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix decision on when to generate an IDLE ACKDavid Howells
Fix the decision on when to generate an IDLE ACK by keeping a count of the number of packets we've received, but not yet soft-ACK'd, and the number of packets we've processed, but not yet hard-ACK'd, rather than trying to keep track of which DATA sequence numbers correspond to those points. We then generate an ACK when either counter exceeds 2. The counters are both cleared when we transcribe the information into any sort of ACK packet for transmission. IDLE and DELAY ACKs are skipped if both counters are 0 (ie. no change). Fixes: 805b21b929e2 ("rxrpc: Send an ACK after every few DATA packets we receive") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Don't let ack.previousPacket regressDavid Howells
The previousPacket field in the rx ACK packet should never go backwards - it's now the highest DATA sequence number received, not the last on received (it used to be used for out of sequence detection). Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix overlapping ACK accountingDavid Howells
Fix accidental overlapping of Rx-phase ACK accounting with Tx-phase ACK accounting through variables shared between the two. call->acks_* members refer to ACKs received in the Tx phase and call->ackr_* members to ACKs sent/to be sent during the Rx phase. Fixes: 1a2391c30c0b ("rxrpc: Fix detection of out of order acks") Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeffrey Altman <jaltman@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Don't try to resend the request if we're receiving the replyDavid Howells
rxrpc has a timer to trigger resending of unacked data packets in a call. This is not cancelled when a client call switches to the receive phase on the basis that most calls don't last long enough for it to ever expire. However, if it *does* expire after we've started to receive the reply, we shouldn't then go into trying to retransmit or pinging the server to find out if an ack got lost. Fix this by skipping the resend code if we're into receiving the reply to a client call. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix listen() setting the bar too high for the prealloc ringsDavid Howells
AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump up the sysctl), but whilst the preallocation circular buffers have 32 slots in them, one of them has to be a dead slot because we're using CIRC_CNT(). This means that listen(rxrpc_sock, 32) will cause an oops when the socket is closed because rxrpc_service_prealloc_one() allocated one too many calls and rxrpc_discard_prealloc() won't then be able to get rid of them because it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries to abort them, but oopses because call->peer isn't yet set. Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match the ring capacity. BUG: kernel NULL pointer dereference, address: 0000000000000086 ... RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc] Call Trace: <TASK> ? __wake_up_common_lock+0x7a/0x90 ? rxrpc_notify_socket+0x8e/0x140 [rxrpc] ? rxrpc_abort_call+0x4c/0x60 [rxrpc] rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc] rxrpc_release+0xc9/0x1c0 [rxrpc] __sock_release+0x37/0xa0 sock_close+0x11/0x20 __fput+0x89/0x240 task_work_run+0x59/0x90 do_exit+0x319/0xaa0 Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22afs: Adjust ACK interpretation to try and cope with NATDavid Howells
If a client's address changes, say if it is NAT'd, this can disrupt an in progress operation. For most operations, this is not much of a problem, but StoreData can be different as some servers modify the target file as the data comes in, so if a store request is disrupted, the file can get corrupted on the server. The problem is that the server doesn't recognise packets that come after the change of address as belonging to the original client and will bounce them, either by sending an OUT_OF_SEQUENCE ACK to the apparent new call if the packet number falls within the initial sequence number window of a call or by sending an EXCEEDS_WINDOW ACK if it falls outside and then aborting it. In both cases, firstPacket will be 1 and previousPacket will be 0 in the ACK information. Fix this by the following means: (1) If a client call receives an EXCEEDS_WINDOW ACK with firstPacket as 1 and previousPacket as 0, assume this indicates that the server saw the incoming packets from a different peer and thus as a different call. Fail the call with error -ENETRESET. (2) Also fail the call if a similar OUT_OF_SEQUENCE ACK occurs if the first packet has been hard-ACK'd. If it hasn't been hard-ACK'd, the ACK packet will cause it to get retransmitted, so the call will just be repeated. (3) Make afs_select_fileserver() treat -ENETRESET as a straight fail of the operation. (4) Prioritise the error code over things like -ECONNRESET as the server did actually respond. (5) Make writeback treat -ENETRESET as a retryable error and make it redirty all the pages involved in a write so that the VM will retry. Note that there is still a circumstance that I can't easily deal with: if the operation is fully received and processed by the server, but the reply is lost due to address change. There's no way to know if the op happened. We can examine the server, but a conflicting change could have been made by a third party - and we can't tell the difference. In such a case, a message like: kAFS: vnode modified {100058:146266} b7->b8 YFS.StoreData64 (op=2646a) will be logged to dmesg on the next op to touch the file and the client will reset the inode state, including invalidating clean parts of the pagecache. Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004811.html # v1 Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc, afs: Fix selection of abort codesDavid Howells
The RX_USER_ABORT code should really only be used to indicate that the user of the rxrpc service (ie. userspace) implicitly caused a call to be aborted - for instance if the AF_RXRPC socket is closed whilst the call was in progress. (The user may also explicitly abort a call and specify the abort code to use). Change some of the points of generation to use other abort codes instead: (1) Abort the call with RXGEN_SS_UNMARSHAL or RXGEN_CC_UNMARSHAL if we see ENOMEM and EFAULT during received data delivery and abort with RX_CALL_DEAD in the default case. (2) Abort with RXGEN_SS_MARSHAL if we get ENOMEM whilst trying to send a reply. (3) Abort with RX_CALL_DEAD if we stop hearing from the peer if we had heard from the peer and abort with RX_CALL_TIMEOUT if we hadn't. (4) Abort with RX_CALL_DEAD if we try to disconnect a call that's not completed successfully or been aborted. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Return an error to sendmsg if call failedDavid Howells
If at the end of rxrpc sendmsg() or rxrpc_kernel_send_data() the call that was being given data was aborted remotely or otherwise failed, return an error rather than returning the amount of data buffered for transmission. The call (presumably) did not complete, so there's not much point continuing with it. AF_RXRPC considers it "complete" and so will be unwilling to do anything else with it - and won't send a notification for it, deeming the return from sendmsg sufficient. Not returning an error causes afs to incorrectly handle a StoreData operation that gets interrupted by a change of address due to NAT reconfiguration. This doesn't normally affect most operations since their request parameters tend to fit into a single UDP packet and afs_make_call() returns before the server responds; StoreData is different as it involves transmission of a lot of data. This can be triggered on a client by doing something like: dd if=/dev/zero of=/afs/example.com/foo bs=1M count=512 at one prompt, and then changing the network address at another prompt, e.g.: ifconfig enp6s0 inet 192.168.6.2 && route add 192.168.6.1 dev enp6s0 Tracing packets on an Auristor fileserver looks something like: 192.168.6.1 -> 192.168.6.3 RX 107 ACK Idle Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001 192.168.6.3 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(64538) (64538) 192.168.6.3 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(64538) (64538) 192.168.6.1 -> 192.168.6.3 RX 107 ACK Idle Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001 <ARP exchange for 192.168.6.2> 192.168.6.2 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(0) (0) 192.168.6.2 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(0) (0) 192.168.6.1 -> 192.168.6.2 RX 107 ACK Exceeds Window Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001 192.168.6.1 -> 192.168.6.2 RX 74 ABORT Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001 192.168.6.1 -> 192.168.6.2 RX 74 ABORT Seq: 29321 Call: 4 Source Port: 7000 Destination Port: 7001 The Auristor fileserver logs code -453 (RXGEN_SS_UNMARSHAL), but the abort code received by kafs is -5 (RX_PROTOCOL_ERROR) as the rx layer sees the condition and generates an abort first and the unmarshal error is a consequence of that at the application layer. Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004810.html # v1 Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix locking issueDavid Howells
There's a locking issue with the per-netns list of calls in rxrpc. The pieces of code that add and remove a call from the list use write_lock() and the calls procfile uses read_lock() to access it. However, the timer callback function may trigger a removal by trying to queue a call for processing and finding that it's already queued - at which point it has a spare refcount that it has to do something with. Unfortunately, if it puts the call and this reduces the refcount to 0, the call will be removed from the list. Unfortunately, since the _bh variants of the locking functions aren't used, this can deadlock. ================================ WARNING: inconsistent lock state 5.18.0-rc3-build4+ #10 Not tainted -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. ksoftirqd/2/25 [HC0[0]:SC1[1]:HE1:SE0] takes: ffff888107ac4038 (&rxnet->call_lock){+.?.}-{2:2}, at: rxrpc_put_call+0x103/0x14b {SOFTIRQ-ON-W} state was registered at: ... Possible unsafe locking scenario: CPU0 ---- lock(&rxnet->call_lock); <Interrupt> lock(&rxnet->call_lock); *** DEADLOCK *** 1 lock held by ksoftirqd/2/25: #0: ffff8881008ffdb0 ((&call->timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x23d Changes ======= ver #2) - Changed to using list_next_rcu() rather than rcu_dereference() directly. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Use refcount_t rather than atomic_tDavid Howells
Move to using refcount_t rather than atomic_t for refcounts in rxrpc. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Allow list of in-use local UDP endpoints to be viewed in /procDavid Howells
Allow the list of in-use local UDP endpoints in the current network namespace to be viewed in /proc. To aid with this, the endpoint list is converted to an hlist and RCU-safe manipulation is used so that the list can be read with only the RCU read lock held. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-20net: Add a second bind table hashed by port and addressJoanne Koong
We currently have one tcp bind table (bhash) which hashes by port number only. In the socket bind path, we check for bind conflicts by traversing the specified port's inet_bind2_bucket while holding the bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where there are tons of sockets hashed to the same port at different addresses, checking for a bind conflict is time-intensive and can cause softirq cpu lockups, as well as stops new tcp connections since __inet_inherit_port() also contests for the spinlock. This patch proposes adding a second bind table, bhash2, that hashes by port and ip address. Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the spinlock. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-20stcp: Use memset_after() to zero sctp_stream_out_extXiu Jianfeng
Use memset_after() helper to simplify the code, there is no functional change in this patch. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Link: https://lore.kernel.org/r/20220519062932.249926-1-xiujianfeng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-20net: avoid strange behavior with skb_defer_max == 1Jakub Kicinski
When user sets skb_defer_max to 1 the kick threshold is 0 (half of 1). If we increment queue length before the check the kick will never happen, and the skb may get stranded. This is likely harmless but can be avoided by moving the increment after the check. This way skb_defer_max == 1 will always kick. Still a silly config to have, but somehow that feels more correct. While at it drop a comment which seems to be outdated or confusing, and wrap the defer_count write with a WRITE_ONCE() since it's read on the fast path that avoids taking the lock. Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220518185522.2038683-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-20bpf: Add bpf_skc_to_mptcp_sock_protoGeliang Tang
This patch implements a new struct bpf_func_proto, named bpf_skc_to_mptcp_sock_proto. Define a new bpf_id BTF_SOCK_TYPE_MPTCP, and a new helper bpf_skc_to_mptcp_sock(), which invokes another new helper bpf_mptcp_sock_from_subflow() in net/mptcp/bpf.c to get struct mptcp_sock from a given subflow socket. v2: Emit BTF type, add func_id checks in verifier.c and bpf_trace.c, remove build check for CONFIG_BPF_JIT v5: Drop EXPORT_SYMBOL (Martin) Co-developed-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220519233016.105670-2-mathew.j.martineau@linux.intel.com
2022-05-20Merge tag 'ceph-for-5.18-rc8' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fix from Ilya Dryomov: "A fix for a nasty use-after-free, marked for stable" * tag 'ceph-for-5.18-rc8' of https://github.com/ceph/ceph-client: libceph: fix misleading ceph_osdc_cancel_request() comment libceph: fix potential use-after-free on linger ping and resends
2022-05-20tcp_ipv6: set the drop_reason in the right placeJakub Kicinski
Looks like the IPv6 version of the patch under Fixes was a copy/paste of the IPv4 but hit the wrong spot. It is tcp_v6_rcv() which uses drop_reason as a boolean, and needs to be protected against reason == 0 before calling free. tcp_v6_do_rcv() has a pretty straightforward flow. The resulting warning looks like this: WARNING: CPU: 1 PID: 0 at net/core/skbuff.c:775 Call Trace: tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1767) ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:438) ip6_input_finish (include/linux/rcupdate.h:726) ip6_input (include/linux/netfilter.h:307) Fixes: f8319dfd1b3b ("net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()") Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20220520021347.2270207-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, misc updates and fallout fixes from recent Florian's code rewritting (from last pull request): 1) Use new flowi4_l3mdev field in ip_route_me_harder(), from Martin Willi. 2) Avoid unnecessary GC with a timestamp in conncount, from William Tu and Yifeng Sun. 3) Remove TCP conntrack debugging, from Florian Westphal. 4) Fix compilation warning in ctnetlink, from Florian. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list" netfilter: conntrack: remove pr_debug callsites from tcp tracker netfilter: nf_conncount: reduce unnecessary GC netfilter: Use l3mdev flow key when re-routing mangled packets ==================== Link: https://lore.kernel.org/r/20220519220206.722153-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19mptcp: Do not traverse the subflow connection list without lockMat Martineau
The MPTCP socket's conn_list (list of subflows) requires the socket lock to access. The MP_FAIL timeout code added such an access, where it would check the list of subflows both in timer context and (later) in workqueue context where the socket lock is held. Rather than check the list twice, remove the check in the timeout handler and only depend on the check in the workqueue. Also remove the MPTCP_FAIL_NO_RESPONSE flag, since mptcp_mp_fail_no_response() has insignificant overhead and can be checked on each worker run. Fixes: 49fa1919d6bc ("mptcp: reset subflow when MP_FAIL doesn't respond") Reported-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19mptcp: Check for orphaned subflow before handling MP_FAIL timerMat Martineau
MP_FAIL timeout (waiting for a peer to respond to an MP_FAIL with another MP_FAIL) is implemented using the MPTCP socket's sk_timer. That timer is also used at MPTCP socket close, so it's important to not have the two timer users interfere with each other. At MPTCP socket close, all subflows are orphaned before sk_timer is manipulated. By checking the SOCK_DEAD flag on the subflows, each subflow can determine if the timer is safe to alter without acquiring any MPTCP-level lock. This replaces code that was using the mptcp_data_lock and MPTCP-level socket state checks that did not correctly protect the timer. Fixes: 49fa1919d6bc ("mptcp: reset subflow when MP_FAIL doesn't respond") Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19mptcp: stop using the mptcp_has_another_subflow() helperPaolo Abeni
The mentioned helper requires the msk socket lock, and the current callers don't own it nor can't acquire it, so the access is racy. All the current callers are really checking for infinite mapping fallback, and the latter condition is explicitly tracked by the relevant msk variable: we can safely remove the caller usage - and the caller itself. The issue is present since MP_FAIL implementation, but the fix only applies since the infinite fallback support, ence the somewhat unexpected fixes tag. Fixes: 0530020a7c8f ("mptcp: track and update contiguous data status") Acked-and-tested-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19tcp: improve PRR loss recoveryYuchung Cheng
This patch improves TCP PRR loss recovery behavior for a corner case. Previously during PRR conservation-bound mode, it strictly sends the amount equals to the amount newly acked or s/acked. The patch changes s.t. PRR may send additional amount that was banked previously (e.g. application-limited) in the conservation-bound mode, similar to the slow-start mode. This unifies and simplifies the algorithm further and may improve the recovery latency. This change still follow the general packet conservation design principle and always keep inflight/cwnd below the slow start threshold set by the congestion control module. PRR is described in RFC 6937. We'll include this change in the latest revision rfc6937-bis as well. Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220519003410.2531936-1-ycheng@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19net: tls: fix messing up lists when bpf enabledJakub Kicinski
Artem points out that skb may try to take over the skb and queue it to its own list. Unlink the skb before calling out. Fixes: b1a2c1786330 ("tls: rx: clear ctx->recv_pkt earlier") Reported-by: Artem Savkov <asavkov@redhat.com> Tested-by: Artem Savkov <asavkov@redhat.com> Link: https://lore.kernel.org/r/20220518205644.2059468-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19Merge tag 'linux-can-next-for-5.19-20220519' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2022-05-19 Oliver Hartkopp contributes a patch for the ISO-TP CAN protocol to update the validation of address information during bind. The next patch is by Jakub Kicinski and converts the CAN network drivers from netif_napi_add() to the netif_napi_add_weight() function. Another patch by Oliver Hartkopp removes obsolete CAN specific LED support. Vincent Mailhol's patch for the mcp251xfd driver fixes a -Wunaligned-access warning by clang-14. * tag 'linux-can-next-for-5.19-20220519' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: mcp251xfd: silence clang's -Wunaligned-access warning can: can-dev: remove obsolete CAN LED support can: can-dev: move to netif_napi_add_weight() can: isotp: isotp_bind(): do not validate unused address information ==================== Link: https://lore.kernel.org/r/20220519202308.1435903-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19can: isotp: isotp_bind(): do not validate unused address informationOliver Hartkopp
With commit 2aa39889c463 ("can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting") the bind() syscall returns -EINVAL when the given CAN ID needed to be sanitized. But in the case of an unconfirmed broadcast mode the rx CAN ID is not needed and may be uninitialized from the caller - which is ok. This patch makes sure the result of an inproper CAN ID format is only provided when the address information is needed. Link: https://lore.kernel.org/all/20220517145653.2556-1-socketcan@hartkopp.net Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-19Merge tag 'wireless-next-2022-05-19' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v5.19 Second set of patches for v5.19 and most likely the last one. rtw89 got support for 8852ce devices and mt76 now supports Wireless Ethernet Dispatch. Major changes: cfg80211/mac80211 - support disabling EHT mode rtw89 - add support for Realtek 8852ce devices mt76 - Wireless Ethernet Dispatch support for flow offload - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 ipv6 NS offload support ath11k - enable keepalive during WoWLAN suspend - implement remain-on-channel support * tag 'wireless-next-2022-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (135 commits) iwlwifi: mei: fix potential NULL-ptr deref iwlwifi: mei: clear the sap data header before sending iwlwifi: mvm: remove vif_count iwlwifi: mvm: always tell the firmware to accept MCAST frames in BSS iwlwifi: mvm: add OTP info in case of init failure iwlwifi: mvm: fix assert 1F04 upon reconfig iwlwifi: fw: init SAR GEO table only if data is present iwlwifi: mvm: clean up authorized condition iwlwifi: mvm: use NULL instead of ERR_PTR when parsing wowlan status iwlwifi: pcie: simplify MSI-X cause mapping rtw89: pci: only mask out INT indicator register for disable interrupt v1 rtw89: convert rtw89_band to nl80211_band precisely rtw89: 8852c: update txpwr tables to HALRF_027_00_052 rtw89: cfo: check mac_id to avoid out-of-bounds rtw89: 8852c: set TX antenna path rtw89: add ieee80211::sta_rc_update ops wireless: Fix Makefile to be in alphabetical order mac80211: refactor freeing the next_beacon cfg80211: fix kernel-doc for cfg80211_beacon_data mac80211: minstrel_ht: support ieee80211_rate_status ... ==================== Link: https://lore.kernel.org/r/20220519153334.8D051C385AA@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/ethernet/mellanox/mlx5/core/main.c b33886971dbc ("net/mlx5: Initialize flow steering during driver probe") 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support") f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table") https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/ 16d42d313350 ("net/mlx5: Drain fw_reset when removing device") 8324a02c342a ("net/mlx5: Add exit route when waiting for FW") https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/ tools/testing/selftests/net/mptcp/mptcp_join.sh e274f7154008 ("selftests: mptcp: add subflow limits test-cases") b6e074e171bc ("selftests: mptcp: add infinite map testcase") 5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type") https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/ net/mptcp/options.c ba2c89e0ea74 ("mptcp: fix checksum byte order") 1e39e5a32ad7 ("mptcp: infinite mapping sending") ea66758c1795 ("tcp: allow MPTCP to update the announced window") https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/ net/mptcp/pm.c 95d686517884 ("mptcp: fix subflow accounting on close") 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs") https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/ net/mptcp/subflow.c ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") 0348c690ed37 ("mptcp: add the fallback check") f8d4bcacff3b ("mptcp: infinite mapping receiving") https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19Bluetooth: eir: Add helpers for managing service dataLuiz Augusto von Dentz
This adds helpers for accessing and appending service data (0x16) ad type. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-05-19Merge tag 'net-5.18-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from can, xfrm and netfilter subtrees. Notably this reverts a recent TCP/DCCP netns-related change to address a possible UaF. Current release - regressions: - tcp: revert "tcp/dccp: get rid of inet_twsk_purge()" - xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown Previous releases - regressions: - netfilter: flowtable: fix TCP flow teardown - can: revert "can: m_can: pci: use custom bit timings for Elkhart Lake" - xfrm: check encryption module availability consistency - eth: vmxnet3: fix possible use-after-free bugs in vmxnet3_rq_alloc_rx_buf() - eth: mlx5: initialize flow steering during driver probe - eth: ice: fix crash when writing timestamp on RX rings Previous releases - always broken: - mptcp: fix checksum byte order - eth: lan966x: fix assignment of the MAC address - eth: mlx5: remove HW-GRO from reported features - eth: ftgmac100: disable hardware checksum on AST2600" * tag 'net-5.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits) net: bridge: Clear offload_fwd_mark when passing frame up bridge interface. ptp: ocp: change sysfs attr group handling selftests: forwarding: fix missing backslash netfilter: nf_tables: disable expression reduction infra netfilter: flowtable: move dst_check to packet path netfilter: flowtable: fix TCP flow teardown net: ftgmac100: Disable hardware checksum on AST2600 igb: skip phy status check where unavailable nfc: pn533: Fix buggy cleanup order mptcp: Do TCP fallback on early DSS checksum failure mptcp: fix checksum byte order net: af_key: check encryption module availability consistency net: af_key: add check for pfkey_broadcast in function pfkey_process net/mlx5: Drain fw_reset when removing device net/mlx5e: CT: Fix setting flow_source for smfs ct tuples net/mlx5e: CT: Fix support for GRE tuples net/mlx5e: Remove HW-GRO from reported features net/mlx5e: Properly block HW GRO when XDP is enabled net/mlx5e: Properly block LRO when XDP is enabled net/mlx5e: Block rx-gro-hw feature in switchdev mode ...
2022-05-19tls: Add opt-in zerocopy mode of sendfile()Boris Pismenny
TLS device offload copies sendfile data to a bounce buffer before transmitting. It allows to maintain the valid MAC on TLS records when the file contents change and a part of TLS record has to be retransmitted on TCP level. In many common use cases (like serving static files over HTTPS) the file contents are not changed on the fly. In many use cases breaking the connection is totally acceptable if the file is changed during transmission, because it would be received corrupted in any case. This commit allows to optimize performance for such use cases to providing a new optional mode of TLS sendfile(), in which the extra copy is skipped. Removing this copy improves performance significantly, as TLS and TCP sendfile perform the same operations, and the only overhead is TLS header/trailer insertion. The new mode can only be enabled with the new socket option named TLS_TX_ZEROCOPY_SENDFILE on per-socket basis. It preserves backwards compatibility with existing applications that rely on the copying behavior. The new mode is safe, meaning that unsolicited modifications of the file being sent can't break integrity of the kernel. The worst thing that can happen is sending a corrupted TLS record, which is in any case not forbidden when using regular TCP sockets. Sockets other than TLS device offload are not affected by the new socket option. The actual status of zerocopy sendfile can be queried with sock_diag. Performance numbers in a single-core test with 24 HTTPS streams on nginx, under 100% CPU load: * non-zerocopy: 33.6 Gbit/s * zerocopy: 79.92 Gbit/s CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz Signed-off-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-19net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.Andrew Lunn
It is possible to stack bridges on top of each other. Consider the following which makes use of an Ethernet switch: br1 / \ / \ / \ br0.11 wlan0 | br0 / | \ p1 p2 p3 br0 is offloaded to the switch. Above br0 is a vlan interface, for vlan 11. This vlan interface is then a slave of br1. br1 also has a wireless interface as a slave. This setup trunks wireless lan traffic over the copper network inside a VLAN. A frame received on p1 which is passed up to the bridge has the skb->offload_fwd_mark flag set to true, indicating that the switch has dealt with forwarding the frame out ports p2 and p3 as needed. This flag instructs the software bridge it does not need to pass the frame back down again. However, the flag is not getting reset when the frame is passed upwards. As a result br1 sees the flag, wrongly interprets it, and fails to forward the frame to wlan0. When passing a frame upwards, clear the flag. This is the Rx equivalent of br_switchdev_frame_unmark() in br_dev_xmit(). Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220518005840.771575-1-andrew@lunn.ch Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Reduce number of hardware offload retries from flowtable datapath which might hog system with retries, from Felix Fietkau. 2) Skip neighbour lookup for PPPoE device, fill_forward_path() already provides this and set on destination address from fill_forward_path for PPPoE device, also from Felix. 4) When combining PPPoE on top of a VLAN device, set info->outdev to the PPPoE device so software offload works, from Felix. 5) Fix TCP teardown flowtable state, races with conntrack gc might result in resetting the state to ESTABLISHED and the time to one day. Joint work with Oz Shlomo and Sven Auhagen. 6) Call dst_check() from flowtable datapath to check if dst is stale instead of doing it from garbage collector path. 7) Disable register tracking infrastructure, either user-space or kernel need to pre-fetch keys inconditionally, otherwise register tracking assumes data is already available in register that might not well be there, leading to incorrect reductions. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: disable expression reduction infra netfilter: flowtable: move dst_check to packet path netfilter: flowtable: fix TCP flow teardown netfilter: nft_flow_offload: fix offload with pppoe + vlan net: fix dev_fill_forward_path with pppoe + bridge netfilter: nft_flow_offload: skip dst neigh lookup for ppp devices netfilter: flowtable: fix excessive hw offload attempts after failure ==================== Link: https://lore.kernel.org/r/20220518213841.359653-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-18libceph: fix misleading ceph_osdc_cancel_request() commentIlya Dryomov
cancel_request() never guaranteed that after its return the OSD client would be completely done with the OSD request. The callback (if specified) can still be invoked and a ref can still be held. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com>
2022-05-18libceph: fix potential use-after-free on linger ping and resendsIlya Dryomov
request_reinit() is not only ugly as the comment rightfully suggests, but also unsafe. Even though it is called with osdc->lock held for write in all cases, resetting the OSD request refcount can still race with handle_reply() and result in use-after-free. Taking linger ping as an example: handle_timeout thread handle_reply thread down_read(&osdc->lock) req = lookup_request(...) ... finish_request(req) # unregisters up_read(&osdc->lock) __complete_request(req) linger_ping_cb(req) # req->r_kref == 2 because handle_reply still holds its ref down_write(&osdc->lock) send_linger_ping(lreq) req = lreq->ping_req # same req # cancel_linger_request is NOT # called - handle_reply already # unregistered request_reinit(req) WARN_ON(req->r_kref != 1) # fires request_init(req) kref_init(req->r_kref) # req->r_kref == 1 after kref_init ceph_osdc_put_request(req) kref_put(req->r_kref) # req->r_kref == 0 after kref_put, req is freed <further req initialization/use> !!! This happens because send_linger_ping() always (re)uses the same OSD request for watch ping requests, relying on cancel_linger_request() to unregister it from the OSD client and rip its messages out from the messenger. send_linger() does the same for watch/notify registration and watch reconnect requests. Unfortunately cancel_request() doesn't guarantee that after it returns the OSD client would be completely done with the OSD request -- a ref could still be held and the callback (if specified) could still be invoked too. The original motivation for request_reinit() was inability to deal with allocation failures in send_linger() and send_linger_ping(). Switching to using osdc->req_mempool (currently only used by CephFS) respects that and allows us to get rid of request_reinit(). Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org>