summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-05-17batman-adv: bcast: avoid skb-copy for (re)queued broadcastsLinus Lüssing
Broadcast packets send via batadv_send_outstanding_bcast_packet() were originally copied in batadv_forw_bcast_packet_to_list() before being queued. And after that only the ethernet header will be pushed through batadv_send_broadcast_skb()->batadv_send_skb_packet() which works safely on skb clones as it uses batadv_skb_head_push()->skb_cow_head(). Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2021-05-17batman-adv: bcast: queue per interface, if neededLinus Lüssing
Currently we schedule a broadcast packet like: 3x: [ [(re-)queue] --> for(hard-if): maybe-transmit ] The intention of queueing a broadcast packet multiple times is to increase robustness for wireless interfaces. However on interfaces which we only broadcast on once the queueing induces an unnecessary penalty. This patch restructures the queueing to be performed on a per interface basis: for(hard-if): - transmit - if wireless: [queue] --> transmit --> [requeue] --> transmit Next to the performance benefits on non-wireless interfaces this should also make it easier to apply alternative strategies for transmissions on wireless interfaces in the future (for instance sending via unicast transmissions on wireless interfaces, without queueing in batman-adv, if appropriate). Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2021-05-17batman-adv: Always send iface index+name in genlmsgSven Eckelmann
The batman-adv netlink messages often contain the interface index and interface name in the same message. This makes it easy for the receiver to operate on the incoming data when it either needs to print something or needs to operate on the interface index. But one of the attributes was missing for: * neighbor table dumps * originator table dumps * gateway list dumps * query of hardif information * query of vid information The userspace therefore had to implement special workarounds using SIOCGIFNAME or SIOCGIFINDEX depending on what was actually provided. Providing both information simplifies the userspace code massively without adding a lot of extra overhead in the kernel portion. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2021-05-17batman-adv: Start new development cycleSimon Wunderlich
This version will contain all the (major or even only minor) changes for Linux 5.14. The version number isn't a semantic version number with major and minor information. It is just encoding the year of the expected publishing as Linux -rc1 and the number of published versions this year (starting at 0). Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2021-05-14mm: fix struct page layout on 32-bit systemsMatthew Wilcox (Oracle)
32-bit architectures which expect 8-byte alignment for 8-byte integers and need 64-bit DMA addresses (arm, mips, ppc) had their struct page inadvertently expanded in 2019. When the dma_addr_t was added, it forced the alignment of the union to 8 bytes, which inserted a 4 byte gap between 'flags' and the union. Fix this by storing the dma_addr_t in one or two adjacent unsigned longs. This restores the alignment to that of an unsigned long. We always store the low bits in the first word to prevent the PageTail bit from being inadvertently set on a big endian platform. If that happened, get_user_pages_fast() racing against a page which was freed and reallocated to the page_pool could dereference a bogus compound_head(), which would be hard to trace back to this cause. Link: https://lkml.kernel.org/r/20210510153211.1504886-1-willy@infradead.org Fixes: c25fff7171be ("mm: add dma_addr_t to struct page") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Matteo Croce <mcroce@linux.microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-14tcp: add tracepoint for checksum errorsJakub Kicinski
Add a tracepoint for capturing TCP segments with a bad checksum. This makes it easy to identify sources of bad frames in the fleet (e.g. machines with faulty NICs). It should also help tools like IOvisor's tcpdrop.py which are used today to get detailed information about such packets. We don't have a socket in many cases so we must open code the address extraction based just on the skb. v2: add missing export for ipv6=m Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14net: sched: fix tx action reschedule issue with stopped queueYunsheng Lin
The netdev qeueue might be stopped when byte queue limit has reached or tx hw ring is full, net_tx_action() may still be rescheduled if STATE_MISSED is set, which consumes unnecessary cpu without dequeuing and transmiting any skb because the netdev queue is stopped, see qdisc_run_end(). This patch fixes it by checking the netdev queue state before calling qdisc_run() and clearing STATE_MISSED if netdev queue is stopped during qdisc_run(), the net_tx_action() is rescheduled again when netdev qeueue is restarted, see netif_tx_wake_queue(). As there is time window between netif_xmit_frozen_or_stopped() checking and STATE_MISSED clearing, between which STATE_MISSED may set by net_tx_action() scheduled by netif_tx_wake_queue(), so set the STATE_MISSED again if netdev queue is restarted. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Reported-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14net: sched: fix tx action rescheduling issue during deactivationYunsheng Lin
Currently qdisc_run() checks the STATE_DEACTIVATED of lockless qdisc before calling __qdisc_run(), which ultimately clear the STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED is set before clearing STATE_MISSED, there may be rescheduling of net_tx_action() at the end of qdisc_run_end(), see below: CPU0(net_tx_atcion) CPU1(__dev_xmit_skb) CPU2(dev_deactivate) . . . . set STATE_MISSED . . __netif_schedule() . . . set STATE_DEACTIVATED . . qdisc_reset() . . . .<--------------- . synchronize_net() clear __QDISC_STATE_SCHED | . . . | . . . | . some_qdisc_is_busy() . | . return *false* . | . . test STATE_DEACTIVATED | . . __qdisc_run() *not* called | . . . | . . test STATE_MISS | . . __netif_schedule()--------| . . . . . . . . __qdisc_run() is not called by net_tx_atcion() in CPU0 because CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule is called at the end of qdisc_run_end(), causing tx action rescheduling problem. qdisc_run() called by net_tx_action() runs in the softirq context, which should has the same semantic as the qdisc_run() called by __dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a synchronize_net() between STATE_DEACTIVATED flag being set and qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely bail out for the deactived lockless qdisc in net_tx_action(), and qdisc_reset() will reset all skb not dequeued yet. So add the rcu_read_lock() explicitly to protect the qdisc_run() and do the STATE_DEACTIVATED checking in net_tx_action() before calling qdisc_run_begin(). Another option is to do the checking in the qdisc_run_end(), but it will add unnecessary overhead for non-tx_action case, because __dev_queue_xmit() will not see qdisc with STATE_DEACTIVATED after synchronize_net(), the qdisc with STATE_DEACTIVATED can only be seen by net_tx_action() because of __netif_schedule(). The STATE_DEACTIVATED checking in qdisc_run() is to avoid race between net_tx_action() and qdisc_reset(), see: commit d518d2ed8640 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc"). As the bailout added above for deactived lockless qdisc in net_tx_action() provides better protection for the race without calling qdisc_run() at all, so remove the STATE_DEACTIVATED checking in qdisc_run(). After qdisc_reset(), there is no skb in qdisc to be dequeued, so clear the STATE_MISSED in dev_reset_queue() too. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> V8: Clearing STATE_MISSED before calling __netif_schedule() has avoid the endless rescheduling problem, but there may still be a unnecessary rescheduling, so adjust the commit log. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14net: sched: fix packet stuck problem for lockless qdiscYunsheng Lin
Lockless qdisc has below concurrent problem: cpu0 cpu1 . . q->enqueue . . . qdisc_run_begin() . . . dequeue_skb() . . . sch_direct_xmit() . . . . q->enqueue . qdisc_run_begin() . return and do nothing . . qdisc_run_end() . cpu1 enqueue a skb without calling __qdisc_run() because cpu0 has not released the lock yet and spin_trylock() return false for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb enqueued by cpu1 when calling dequeue_skb() because cpu1 may enqueue the skb after cpu0 calling dequeue_skb() and before cpu0 calling qdisc_run_end(). Lockless qdisc has below another concurrent problem when tx_action is involved: cpu0(serving tx_action) cpu1 cpu2 . . . . q->enqueue . . qdisc_run_begin() . . dequeue_skb() . . . q->enqueue . . . . sch_direct_xmit() . . . qdisc_run_begin() . . return and do nothing . . . clear __QDISC_STATE_SCHED . . qdisc_run_begin() . . return and do nothing . . . . . . qdisc_run_end() . This patch fixes the above data race by: 1. If the first spin_trylock() return false and STATE_MISSED is not set, set STATE_MISSED and retry another spin_trylock() in case other CPU may not see STATE_MISSED after it releases the lock. 2. reschedule if STATE_MISSED is set after the lock is released at the end of qdisc_run_end(). For tx_action case, STATE_MISSED is also set when cpu1 is at the end if qdisc_run_end(), so tx_action will be rescheduled again to dequeue the skb enqueued by cpu2. Clear STATE_MISSED before retrying a dequeuing when dequeuing returns NULL in order to reduce the overhead of the second spin_trylock() and __netif_schedule() calling. Also clear the STATE_MISSED before calling __netif_schedule() at the end of qdisc_run_end() to avoid doing another round of dequeuing in the pfifo_fast_dequeue(). The performance impact of this patch, tested using pktgen and dummy netdev with pfifo_fast qdisc attached: threads without+this_patch with+this_patch delta 1 2.61Mpps 2.60Mpps -0.3% 2 3.97Mpps 3.82Mpps -3.7% 4 5.62Mpps 5.59Mpps -0.5% 8 2.78Mpps 2.77Mpps -0.3% 16 2.22Mpps 2.22Mpps -0.0% Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Acked-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Juergen Gross <jgross@suse.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAITJim Ma
In tls_sw_splice_read, checkout MSG_* is inappropriate, should use SPLICE_*, update tls_wait_data to accept nonblock arguments instead of flags for recvmsg and splice. Fixes: c46234ebb4d1 ("tls: RX path for ktls") Signed-off-by: Jim Ma <majinjing3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14Revert "net:tipc: Fix a double free in tipc_sk_mcast_rcv"Hoang Le
This reverts commit 6bf24dc0cc0cc43b29ba344b66d78590e687e046. Above fix is not correct and caused memory leak issue. Fixes: 6bf24dc0cc0c ("net:tipc: Fix a double free in tipc_sk_mcast_rcv") Acked-by: Jon Maloy <jmaloy@redhat.com> Acked-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14net: bridge: fix build when IPv6 is disabledMatteo Croce
The br_ip6_multicast_add_router() prototype is defined only when CONFIG_IPV6 is enabled, but the function is always referenced, so there is this build error with CONFIG_IPV6 not defined: net/bridge/br_multicast.c: In function ‘__br_multicast_enable_port’: net/bridge/br_multicast.c:1743:3: error: implicit declaration of function ‘br_ip6_multicast_add_router’; did you mean ‘br_ip4_multicast_add_router’? [-Werror=implicit-function-declaration] 1743 | br_ip6_multicast_add_router(br, port); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ | br_ip4_multicast_add_router net/bridge/br_multicast.c: At top level: net/bridge/br_multicast.c:2804:13: warning: conflicting types for ‘br_ip6_multicast_add_router’ 2804 | static void br_ip6_multicast_add_router(struct net_bridge *br, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ net/bridge/br_multicast.c:2804:13: error: static declaration of ‘br_ip6_multicast_add_router’ follows non-static declaration net/bridge/br_multicast.c:1743:3: note: previous implicit declaration of ‘br_ip6_multicast_add_router’ was here 1743 | br_ip6_multicast_add_router(br, port); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ Fix this build error by moving the definition out of the #ifdef. Fixes: a3c02e769efe ("net: bridge: mcast: split multicast router state for IPv4 and IPv6") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring bufferAndres Beltran
Pointers to ring-buffer packets sent by Hyper-V are used within the guest VM. Hyper-V can send packets with erroneous values or modify packet fields after they are processed by the guest. To defend against these scenarios, return a copy of the incoming VMBus packet after validating its length and offset fields in hv_pkt_iter_first(). In this way, the packet can no longer be modified by the host. Signed-off-by: Andres Beltran <lkmlabelt@gmail.com> Co-developed-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20210408161439.341988-1-parri.andrea@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-05-14net: bridge: fix br_multicast_is_router stub when igmp is disabledNikolay Aleksandrov
br_multicast_is_router takes two arguments when bridge IGMP is enabled and just one when it's disabled, fix the stub to take two as well. Fixes: 1a3065a26807 ("net: bridge: mcast: prepare is-router function for mcast router split") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14xfrm: add state hashtable keyed by seqSabrina Dubroca
When creating new states with seq set in xfrm_usersa_info, we walk through all the states already installed in that netns to find a matching ACQUIRE state (__xfrm_find_acq_byseq, called from xfrm_state_add). This causes severe slowdowns on systems with a large number of states. This patch introduces a hashtable using x->km.seq as key, so that the corresponding state can be found in a reasonable time. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-05-14esp: drop unneeded assignment in esp4_gro_receive()Yang Li
Making '!=' operation with 0 directly after calling the function xfrm_parse_spi() is more efficient, assignment to err is redundant. Eliminate the following clang_analyzer warning: net/ipv4/esp4_offload.c:41:7: warning: Although the value stored to 'err' is used in the enclosing expression, the value is never actually read from 'err' No functional change, only more efficient. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-05-14netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to ↵Stefano Brivio
non-AVX2 version Arturo reported this backtrace: [709732.358791] WARNING: CPU: 3 PID: 456 at arch/x86/kernel/fpu/core.c:128 kernel_fpu_begin_mask+0xae/0xe0 [709732.358793] Modules linked in: binfmt_misc nft_nat nft_chain_nat nf_nat nft_counter nft_ct nf_tables nf_conntrack_netlink nfnetlink 8021q garp stp mrp llc vrf intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul mgag200 ghash_clmulni_intel drm_kms_helper cec aesni_intel drm libaes crypto_simd cryptd glue_helper mei_me dell_smbios iTCO_wdt evdev intel_pmc_bxt iTCO_vendor_support dcdbas pcspkr rapl dell_wmi_descriptor wmi_bmof sg i2c_algo_bit watchdog mei acpi_ipmi ipmi_si button nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipmi_devintf ipmi_msghandler ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor sd_mod t10_pi crc_t10dif crct10dif_generic raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ahci libahci tg3 libata xhci_pci libphy xhci_hcd ptp usbcore crct10dif_pclmul crct10dif_common bnxt_en crc32c_intel scsi_mod [709732.358941] pps_core i2c_i801 lpc_ich i2c_smbus wmi usb_common [709732.358957] CPU: 3 PID: 456 Comm: jbd2/dm-0-8 Not tainted 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1 [709732.358959] Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.9.3 09/23/2020 [709732.358964] RIP: 0010:kernel_fpu_begin_mask+0xae/0xe0 [709732.358969] Code: ae 54 24 04 83 e3 01 75 38 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 33 48 83 c4 10 5b c3 65 8a 05 5e 21 5e 76 84 c0 74 92 <0f> 0b eb 8e f0 80 4f 01 40 48 81 c7 00 14 00 00 e8 dd fb ff ff eb [709732.358972] RSP: 0018:ffffbb9700304740 EFLAGS: 00010202 [709732.358976] RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000001 [709732.358979] RDX: ffffbb9700304970 RSI: ffff922fe1952e00 RDI: 0000000000000003 [709732.358981] RBP: ffffbb9700304970 R08: ffff922fc868a600 R09: ffff922fc711e462 [709732.358984] R10: 000000000000005f R11: ffff922ff0b27180 R12: ffffbb9700304960 [709732.358987] R13: ffffbb9700304b08 R14: ffff922fc664b6c8 R15: ffff922fc664b660 [709732.358990] FS: 0000000000000000(0000) GS:ffff92371fec0000(0000) knlGS:0000000000000000 [709732.358993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [709732.358996] CR2: 0000557a6655bdd0 CR3: 000000026020a001 CR4: 00000000007706e0 [709732.358999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [709732.359001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [709732.359003] PKRU: 55555554 [709732.359005] Call Trace: [709732.359009] <IRQ> [709732.359035] nft_pipapo_avx2_lookup+0x4c/0x1cba [nf_tables] [709732.359046] ? sched_clock+0x5/0x10 [709732.359054] ? sched_clock_cpu+0xc/0xb0 [709732.359061] ? record_times+0x16/0x80 [709732.359068] ? plist_add+0xc1/0x100 [709732.359073] ? psi_group_change+0x47/0x230 [709732.359079] ? skb_clone+0x4d/0xb0 [709732.359085] ? enqueue_task_rt+0x22b/0x310 [709732.359098] ? bnxt_start_xmit+0x1e8/0xaf0 [bnxt_en] [709732.359102] ? packet_rcv+0x40/0x4a0 [709732.359121] nft_lookup_eval+0x59/0x160 [nf_tables] [709732.359133] nft_do_chain+0x350/0x500 [nf_tables] [709732.359152] ? nft_lookup_eval+0x59/0x160 [nf_tables] [709732.359163] ? nft_do_chain+0x364/0x500 [nf_tables] [709732.359172] ? fib4_rule_action+0x6d/0x80 [709732.359178] ? fib_rules_lookup+0x107/0x250 [709732.359184] nft_nat_do_chain+0x8a/0xf2 [nft_chain_nat] [709732.359193] nf_nat_inet_fn+0xea/0x210 [nf_nat] [709732.359202] nf_nat_ipv4_out+0x14/0xa0 [nf_nat] [709732.359207] nf_hook_slow+0x44/0xc0 [709732.359214] ip_output+0xd2/0x100 [709732.359221] ? __ip_finish_output+0x210/0x210 [709732.359226] ip_forward+0x37d/0x4a0 [709732.359232] ? ip4_key_hashfn+0xb0/0xb0 [709732.359238] ip_sublist_rcv_finish+0x4f/0x60 [709732.359243] ip_sublist_rcv+0x196/0x220 [709732.359250] ? ip_rcv_finish_core.isra.22+0x400/0x400 [709732.359255] ip_list_rcv+0x137/0x160 [709732.359264] __netif_receive_skb_list_core+0x29b/0x2c0 [709732.359272] netif_receive_skb_list_internal+0x1a6/0x2d0 [709732.359280] gro_normal_list.part.156+0x19/0x40 [709732.359286] napi_complete_done+0x67/0x170 [709732.359298] bnxt_poll+0x105/0x190 [bnxt_en] [709732.359304] ? irqentry_exit+0x29/0x30 [709732.359309] ? asm_common_interrupt+0x1e/0x40 [709732.359315] net_rx_action+0x144/0x3c0 [709732.359322] __do_softirq+0xd5/0x29c [709732.359329] asm_call_irq_on_stack+0xf/0x20 [709732.359332] </IRQ> [709732.359339] do_softirq_own_stack+0x37/0x40 [709732.359346] irq_exit_rcu+0x9d/0xa0 [709732.359353] common_interrupt+0x78/0x130 [709732.359358] asm_common_interrupt+0x1e/0x40 [709732.359366] RIP: 0010:crc_41+0x0/0x1e [crc32c_intel] [709732.359370] Code: ff ff f2 4d 0f 38 f1 93 a8 fe ff ff f2 4c 0f 38 f1 81 b0 fe ff ff f2 4c 0f 38 f1 8a b0 fe ff ff f2 4d 0f 38 f1 93 b0 fe ff ff <f2> 4c 0f 38 f1 81 b8 fe ff ff f2 4c 0f 38 f1 8a b8 fe ff ff f2 4d [709732.359373] RSP: 0018:ffffbb97008dfcd0 EFLAGS: 00000246 [709732.359377] RAX: 000000000000002a RBX: 0000000000000400 RCX: ffff922fc591dd50 [709732.359379] RDX: ffff922fc591dea0 RSI: 0000000000000a14 RDI: ffffffffc00dddc0 [709732.359382] RBP: 0000000000001000 R08: 000000000342d8c3 R09: 0000000000000000 [709732.359384] R10: 0000000000000000 R11: ffff922fc591dff0 R12: ffffbb97008dfe58 [709732.359386] R13: 000000000000000a R14: ffff922fd2b91e80 R15: ffff922fef83fe38 [709732.359395] ? crc_43+0x1e/0x1e [crc32c_intel] [709732.359403] ? crc32c_pcl_intel_update+0x97/0xb0 [crc32c_intel] [709732.359419] ? jbd2_journal_commit_transaction+0xaec/0x1a30 [jbd2] [709732.359425] ? irq_exit_rcu+0x3e/0xa0 [709732.359447] ? kjournald2+0xbd/0x270 [jbd2] [709732.359454] ? finish_wait+0x80/0x80 [709732.359470] ? commit_timeout+0x10/0x10 [jbd2] [709732.359476] ? kthread+0x116/0x130 [709732.359481] ? kthread_park+0x80/0x80 [709732.359488] ? ret_from_fork+0x1f/0x30 [709732.359494] ---[ end trace 081a19978e5f09f5 ]--- that is, nft_pipapo_avx2_lookup() uses the FPU running from a softirq that interrupted a kthread, also using the FPU. That's exactly the reason why irq_fpu_usable() is there: use it, and if we can't use the FPU, fall back to the non-AVX2 version of the lookup operation, i.e. nft_pipapo_lookup(). Reported-by: Arturo Borrero Gonzalez <arturo@netfilter.org> Cc: <stable@vger.kernel.org> # 5.6.x Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-14netfilter: flowtable: Remove redundant hw refresh bitRoi Dayan
Offloading conns could fail for multiple reasons and a hw refresh bit is set to try to reoffload it in next sw packet. But it could be in some cases and future points that the hw refresh bit is not set but a refresh could succeed. Remove the hw refresh bit and do offload refresh if requested. There won't be a new work entry if a work is already pending anyway as there is the hw pending bit. Fixes: 8b3646d6e0c4 ("net/sched: act_ct: Support refreshing the flow table entries") Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-13bpf: Use struct_size() in kzalloc()Gustavo A. R. Silva
Make use of the struct_size() helper instead of an open-coded version, in order to avoid any potential type mistakes or integer overflows that, in the worst scenario, could lead to heap overflows. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: caif: Drop unnecessary NULL check after container_ofGuenter Roeck
The first parameter passed to chnl_recv_cb() can never be NULL since all callers dereferenced it. Consequently, container_of() on it is also never NULL, even though the reference into the structure points to the first element of the structure. The NULL check is therefore unnecessary. On top of that, it is misleading to perform a NULL check on the result of container_of() because the position of the contained element could change, which would make the test invalid. Remove the unnecessary NULL check. This change was made automatically with the following Coccinelle script. @@ type t; identifier v; statement s; @@ <+... ( t v = container_of(...); | v = container_of(...); ) ... when != v - if (\( !v \| v == NULL \) ) s ...+> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13openvswitch: meter: fix race when getting now_ms.Tao Liu
We have observed meters working unexpected if traffic is 3+Gbit/s with multiple connections. now_ms is not pretected by meter->lock, we may get a negative long_delta_ms when another cpu updated meter->used, then: delta_ms = (u32)long_delta_ms; which will be a large value. band->bucket += delta_ms * band->rate; then we get a wrong band->bucket. OpenVswitch userspace datapath has fixed the same issue[1] some time ago, and we port the implementation to kernel datapath. [1] https://patchwork.ozlabs.org/project/openvswitch/patch/20191025114436.9746-1-i.maximets@ovn.org/ Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure") Signed-off-by: Tao Liu <thomas.liu@ucloud.cn> Suggested-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: export multicast router presence adjacent to a portLinus Lüssing
To properly support routable multicast addresses in batman-adv in a group-aware way, a batman-adv node needs to know if it serves multicast routers. This adds a function to the bridge to export this so that batman-adv can then make full use of the Multicast Router Discovery capability of the bridge. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: add ip4+ip6 mcast router timers to mdb netlinkLinus Lüssing
Now that we have split the multicast router state into two, one for IPv4 and one for IPv6, also add individual timers to the mdb netlink router port dump. Leaving the old timer attribute for backwards compatibility. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: split multicast router state for IPv4 and IPv6Linus Lüssing
A multicast router for IPv4 does not imply that the same host also is a multicast router for IPv6 and vice versa. To reduce multicast traffic when a host is only a multicast router for one of these two protocol families, keep router state for IPv4 and IPv6 separately. Similar to how querier state is kept separately. For backwards compatibility for netlink and switchdev notifications these two will still only notify if a port switched from either no IPv4/IPv6 multicast router to any IPv4/IPv6 multicast router or the other way round. However a full netlink MDB router dump will now also include a multicast router timeout for both IPv4 and IPv6. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: split router port del+notify for mcast router splitLinus Lüssing
In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants split router port deletion and notification into two functions. When we disable a port for instance later we want to only send one notification to switchdev and netlink for compatibility and want to avoid sending one for IPv4 and one for IPv6. For that the split is needed. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: prepare add-router function for mcast router splitLinus Lüssing
In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants move the protocol specific router list and timer access to ip4 wrapper functions. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: prepare expiry functions for mcast router splitLinus Lüssing
In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants move the protocol specific timer access to an ip4 wrapper function. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: prepare is-router function for mcast router splitLinus Lüssing
In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants make br_multicast_is_router() protocol family aware. Note that for now br_ip6_multicast_is_router() uses the currently still common ip4_mc_router_timer for now. It will be renamed to ip6_mc_router_timer later when the split is performed. While at it also renames the "1" and "2" constants in br_multicast_is_router() to the MDB_RTR_TYPE_TEMP_QUERY and MDB_RTR_TYPE_PERM enums. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: prepare query reception for mcast router splitLinus Lüssing
In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants and as the br_multicast_mark_router() will be split for that remove the select querier wrapper and instead add ip4 and ip6 variants for br_multicast_query_received(). Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: prepare mdb netlink for mcast router splitLinus Lüssing
In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants and to avoid IPv6 #ifdef clutter later add some inline functions for the protocol specific parts in the mdb router netlink code. Also the we need iterate over the port instead of router list to be able put one router port entry with both the IPv4 and IPv6 multicast router info later. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: add wrappers for router node retrievalLinus Lüssing
In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants and to avoid IPv6 #ifdef clutter later add two wrapper functions for router node retrieval in the payload forwarding code. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: bridge: mcast: rename multicast router lists and timersLinus Lüssing
In preparation for the upcoming split of multicast router state into their IPv4 and IPv6 variants, rename the affected variable to the IPv4 version first to avoid some renames in later commits. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RTSebastian Andrzej Siewior
__napi_schedule_irqoff() is an optimized version of __napi_schedule() which can be used where it is known that interrupts are disabled, e.g. in interrupt-handlers, spin_lock_irq() sections or hrtimer callbacks. On PREEMPT_RT enabled kernels this assumptions is not true. Force- threaded interrupt handlers and spinlocks are not disabling interrupts and the NAPI hrtimer callback is forced into softirq context which runs with interrupts enabled as well. Chasing all usage sites of __napi_schedule_irqoff() is a whack-a-mole game so make __napi_schedule_irqoff() invoke __napi_schedule() for PREEMPT_RT kernels. The callers of ____napi_schedule() in the networking core have been audited and are correct on PREEMPT_RT kernels as well. Reported-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13net: taprio offload: enforce qdisc to netdev queue mappingYannick Vignon
Even though the taprio qdisc is designed for multiqueue devices, all the queues still point to the same top-level taprio qdisc. This works and is probably required for software taprio, but at least with offload taprio, it has an undesirable side effect: because the whole qdisc is run when a packet has to be sent, it allows packets in a best-effort class to be processed in the context of a task sending higher priority traffic. If there are packets left in the qdisc after that first run, the NET_TX softirq is raised and gets executed immediately in the same process context. As with any other softirq, it runs up to 10 times and for up to 2ms, during which the calling process is waiting for the sendmsg call (or similar) to return. In my use case, that calling process is a real-time task scheduled to send a packet every 2ms, so the long sendmsg calls are leading to missed timeslots. By attaching each netdev queue to its own qdisc, as it is done with the "classic" mq qdisc, each traffic class can be processed independently without touching the other classes. A high-priority process can then send packets without getting stuck in the sendmsg call anymore. Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13tty: make tty_operations::chars_in_buffer return uintJiri Slaby
tty_operations::chars_in_buffer is another hook which is expected to return values >= 0. So make it explicit by the return type too -- use unsigned int. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Samuel Iglesias Gonsalvez <siglesias@igalia.com> Cc: Jens Taprogge <jens.taprogge@taprogge.org> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David Lin <dtwlin@gmail.com> Cc: Johan Hovold <johan@kernel.org> Cc: Alex Elder <elder@kernel.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Oliver Neukum <oneukum@suse.com> Cc: Felipe Balbi <balbi@kernel.org> Cc: Mathias Nyman <mathias.nyman@intel.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Link: https://lore.kernel.org/r/20210505091928.22010-27-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-13net/smc: properly handle workqueue allocation failureAnirudh Rayabharam
In smcd_alloc_dev(), if alloc_ordered_workqueue() fails, properly catch it, clean up and return NULL to let the caller know there was a failure. Move the call to alloc_ordered_workqueue higher in the function in order to abort earlier without needing to unwind the call to device_initialize(). Cc: Ursula Braun <ubraun@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Anirudh Rayabharam <mail@anirudhrb.com> Link: https://lore.kernel.org/r/20210503115736.2104747-18-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-13Revert "net/smc: fix a NULL pointer dereference"Greg Kroah-Hartman
This reverts commit e183d4e414b64711baf7a04e214b61969ca08dfa. Because of recent interactions with developers from @umn.edu, all commits from them have been recently re-reviewed to ensure if they were correct or not. Upon review, this commit was found to be incorrect for the reasons below, so it must be reverted. It will be fixed up "correctly" in a later kernel change. The original commit causes a memory leak and does not properly fix the issue it claims to fix. I will send a follow-on patch to resolve this properly. Cc: Kangjie Lu <kjlu@umn.edu> Cc: Ursula Braun <ubraun@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Link: https://lore.kernel.org/r/20210503115736.2104747-17-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-13tty: make tty_operations::write_room return uintJiri Slaby
Line disciplines expect a positive value or zero returned from tty->ops->write_room (invoked by tty_write_room). So make this assumption explicit by using unsigned int as a return value. Both of tty->ops->write_room and tty_write_room. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Laurentiu Tudor <laurentiu.tudor@nxp.com> Acked-by: Alex Elder <elder@linaro.org> Acked-by: Max Filippov <jcmvbkbc@gmail.com> # xtensa Acked-by: David Sterba <dsterba@suse.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Chris Zankel <chris@zankel.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Samuel Iglesias Gonsalvez <siglesias@igalia.com> Cc: Jens Taprogge <jens.taprogge@taprogge.org> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Scott Branden <scott.branden@broadcom.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David Lin <dtwlin@gmail.com> Cc: Johan Hovold <johan@kernel.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Oliver Neukum <oneukum@suse.com> Cc: Felipe Balbi <balbi@kernel.org> Cc: Mathias Nyman <mathias.nyman@intel.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Link: https://lore.kernel.org/r/20210505091928.22010-23-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-13tty: make tty_ldisc_ops a param in tty_unregister_ldiscJiri Slaby
Make tty_unregister_ldisc symmetric to tty_register_ldisc by accepting struct tty_ldisc_ops as a parameter instead of ldisc number. This avoids checking of the ldisc number bounds in tty_unregister_ldisc. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: William Hubbs <w.d.hubbs@gmail.com> Cc: Chris Brannon <chris@the-brannons.com> Cc: Kirk Reiser <kirk@reisers.ca> Cc: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Wolfgang Grandegger <wg@grandegger.com> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Rodolfo Giometti <giometti@enneenne.com> Cc: Peter Ujfalusi <peter.ujfalusi@gmail.com> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Link: https://lore.kernel.org/r/20210505091928.22010-17-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-13tty: set tty_ldisc_ops::num staticallyJiri Slaby
There is no reason to pass the ldisc number to tty_register_ldisc separately. Just set it in the already defined tty_ldisc_ops in all the ldiscs. This simplifies tty_register_ldisc a bit too (no need to set the num member there). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: William Hubbs <w.d.hubbs@gmail.com> Cc: Chris Brannon <chris@the-brannons.com> Cc: Kirk Reiser <kirk@reisers.ca> Cc: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Wolfgang Grandegger <wg@grandegger.com> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Rodolfo Giometti <giometti@enneenne.com> Cc: Peter Ujfalusi <peter.ujfalusi@gmail.com> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Link: https://lore.kernel.org/r/20210505091928.22010-15-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-13tty: make fp of tty_ldisc_ops::receive_buf{,2} constJiri Slaby
Char pointer (cp) passed to tty_ldisc_ops::receive_buf{,2} is const. There is no reason for flag pointer (fp) not to be too. So switch it in the definition and all uses. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: William Hubbs <w.d.hubbs@gmail.com> Cc: Chris Brannon <chris@the-brannons.com> Cc: Kirk Reiser <kirk@reisers.ca> Cc: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Wolfgang Grandegger <wg@grandegger.com> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Peter Ujfalusi <peter.ujfalusi@gmail.com> Link: https://lore.kernel.org/r/20210505091928.22010-12-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-12tls splice: remove inappropriate flags checking for MSG_PEEKJim Ma
In function tls_sw_splice_read, before call tls_sw_advance_skb it checks likely(!(flags & MSG_PEEK)), while MSG_PEEK is used for recvmsg, splice supports SPLICE_F_NONBLOCK, SPLICE_F_MOVE, SPLICE_F_MORE, should remove this checking. Signed-off-by: Jim Ma <majinjing3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-12Merge tag 'linux-can-fixes-for-5.13-20210512' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2021-05-12 this is a pull request of a single patch for net/master. The patch is by Norbert Slusarek and it fixes a race condition in the CAN ISO-TP socket between isotp_bind() and isotp_setsockopt(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-12net: packetmmap: fix only tx timestamp on requestRichard Sanger
The packetmmap tx ring should only return timestamps if requested via setsockopt PACKET_TIMESTAMP, as documented. This allows compatibility with non-timestamp aware user-space code which checks tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp flags to be set in tp_status. Fixes: b9c32fb27170 ("packet: if hw/sw ts enabled in rx/tx ring, report which ts we got") Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Richard Sanger <rsanger@wand.net.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-12net: really orphan skbs tied to closing skPaolo Abeni
If the owing socket is shutting down - e.g. the sock reference count already dropped to 0 and only sk_wmem_alloc is keeping the sock alive, skb_orphan_partial() becomes a no-op. When forwarding packets over veth with GRO enabled, the above causes refcount errors. This change addresses the issue with a plain skb_orphan() call in the critical scenario. Fixes: 9adc89af724f ("net: let skb_orphan_partial wake-up waiters.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-12can: isotp: prevent race between isotp_bind() and isotp_setsockopt()Norbert Slusarek
A race condition was found in isotp_setsockopt() which allows to change socket options after the socket was bound. For the specific case of SF_BROADCAST support, this might lead to possible use-after-free because can_rx_unregister() is not called. Checking for the flag under the socket lock in isotp_bind() and taking the lock in isotp_setsockopt() fixes the issue. Fixes: 921ca574cd38 ("can: isotp: add SF_BROADCAST support for functional addressing") Link: https://lore.kernel.org/r/trinity-e6ae9efa-9afb-4326-84c0-f3609b9b8168-1620773528307@3c-app-gmx-bs06 Reported-by: Norbert Slusarek <nslusarek@gmx.net> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Norbert Slusarek <nslusarek@gmx.net> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-05-11net/sched: taprio: Drop unnecessary NULL check after container_ofGuenter Roeck
The rcu_head pointer passed to taprio_free_sched_cb is never NULL. That means that the result of container_of() operations on it is also never NULL, even though rcu_head is the first element of the structure embedding it. On top of that, it is misleading to perform a NULL check on the result of container_of() because the position of the contained element could change, which would make the check invalid. Remove the unnecessary NULL check. This change was made automatically with the following Coccinelle script. @@ type t; identifier v; statement s; @@ <+... ( t v = container_of(...); | v = container_of(...); ) ... when != v - if (\( !v \| v == NULL \) ) s ...+> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-11mptcp: fix data stream corruptionPaolo Abeni
Maxim reported several issues when forcing a TCP transparent proxy to use the MPTCP protocol for the inbound connections. He also provided a clean reproducer. The problem boils down to 'mptcp_frag_can_collapse_to()' assuming that only MPTCP will use the given page_frag. If others - e.g. the plain TCP protocol - allocate page fragments, we can end-up re-using already allocated memory for mptcp_data_frag. Fix the issue ensuring that the to-be-expanded data fragment is located at the current page frag end. v1 -> v2: - added missing fixes tag (Mat) Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/178 Reported-and-tested-by: Maxim Galaganov <max@internet.ru> Fixes: 18b683bff89d ("mptcp: queue data for mptcp level retransmission") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-05-11 The following pull-request contains BPF updates for your *net* tree. We've added 13 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 817 insertions(+), 382 deletions(-). The main changes are: 1) Fix multiple ringbuf bugs in particular to prevent writable mmap of read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo. 2) Fix verifier alu32 known-const subregister bound tracking for bitwise operations and/or/xor, from Daniel Borkmann. 3) Reject trampoline attachment for functions with variable arguments, and also add a deny list of other forbidden functions, from Jiri Olsa. 4) Fix nested bpf_bprintf_prepare() calls used by various helpers by switching to per-CPU buffers, from Florent Revest. 5) Fix kernel compilation with BTF debug info on ppc64 due to pahole missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau. 6) Add a kconfig entry to provide an option to disallow unprivileged BPF by default, from Daniel Borkmann. 7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY() macro is not available, from Arnaldo Carvalho de Melo. 8) Migrate test_tc_redirect to test_progs framework as prep work for upcoming skb_change_head() fix & selftest, from Jussi Maki. 9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not present, from Ian Rogers. 10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame size, from Magnus Karlsson. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-11bpf: Limit static tcp-cc functions in the .BTF_ids list to x86Martin KaFai Lau
During the discussion in [0]. It was pointed out that static functions in ppc64 is prefixed with ".". For example, the 'readelf -s vmlinux.ppc': 89326: c000000001383280 24 NOTYPE LOCAL DEFAULT 31 cubictcp_init 89327: c000000000c97c50 168 FUNC LOCAL DEFAULT 2 .cubictcp_init The one with FUNC type is ".cubictcp_init" instead of "cubictcp_init". The "." seems to be done by arch/powerpc/include/asm/ppc_asm.h. This caused that pahole cannot generate the BTF for these tcp-cc kernel functions because pahole only captures the FUNC type and "cubictcp_init" is not. It then failed the kernel compilation in ppc64. This behavior is only reported in ppc64 so far. I tried arm64, s390, and sparc64 and did not observe this "." prefix and NOTYPE behavior. Since the kfunc call is only supported in the x86_64 and x86_32 JIT, this patch limits those tcp-cc functions to x86 only to avoid unnecessary compilation issue in other ARCHs. In the future, we can examine if it is better to change all those functions from static to extern. [0] https://lore.kernel.org/bpf/4e051459-8532-7b61-c815-f3435767f8a0@kernel.org/ Fixes: e78aea8b2170 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Michal Suchánek <msuchanek@suse.de> Cc: Jiri Slaby <jslaby@suse.com> Cc: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20210508005011.3863757-1-kafai@fb.com