summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-09-10tcp: Simplify EBPF TCP_CONGESTION to always init CCNeal Cardwell
Now that the previous patch ensures we don't initialize the congestion control twice, when EBPF sets the congestion control algorithm at connection establishment we can simplify the code by simply initializing the congestion control module at that time. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Kevin Yang <yyd@google.com> Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10tcp: Only init congestion control if not initialized alreadyNeal Cardwell
Change tcp_init_transfer() to only initialize congestion control if it has not been initialized already. With this new approach, we can arrange things so that if the EBPF code sets the congestion control by calling setsockopt(TCP_CONGESTION) then tcp_init_transfer() will not re-initialize the CC module. This is an approach that has the following beneficial properties: (1) This allows CC module customizations made by the EBPF called in tcp_init_transfer() to persist, and not be wiped out by a later call to tcp_init_congestion_control() in tcp_init_transfer(). (2) Does not flip the order of EBPF and CC init, to avoid causing bugs for existing code upstream that depends on the current order. (3) Does not cause 2 initializations for for CC in the case where the EBPF called in tcp_init_transfer() wants to set the CC to a new CC algorithm. (4) Allows follow-on simplifications to the code in net/core/filter.c and net/ipv4/tcp_cong.c, which currently both have some complexity to special-case CC initialization to avoid double CC initialization if EBPF sets the CC. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Kevin Yang <yyd@google.com> Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10net/smc: use separate work queues for different worker typesKarsten Graul
There are 6 types of workers which exist per smc connection. 3 of them are used for listen and handshake processing, another 2 are used for close and abort processing and 1 is the tx worker that moves calls to sleeping functions into a worker. To prevent flooding of the system work queue when many connections are opened or closed at the same time (some pattern uperf implements), move those workers to one of 3 smc-specific work queues. Two work queues are module-global and used for handshake and close workers. The third work queue is defined per link group and used by the tx workers that may sleep waiting for resources of this link group. And in smc_llc_enqueue() queue the llc_event_work work to the system prio work queue because its critical that this work is started fast. Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: use the retry mechanism for netlink messagesGuvenc Gulce
When the netlink messages to be sent to the userspace are too big for a single netlink message, send them in chunks using the netlink_dump infrastructure. Modify the smc diag dump code so that it can signal to the netlink_dump infrastructure that it needs to send more data. Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: immediate freeing in smc_lgr_cleanup_early()Ursula Braun
smc_lgr_cleanup_early() schedules the free worker with delay. DMB unregistering occurs in this delayed worker increasing the risk to reach the SMCD SBA limit without need. Terminate the linkgroup immediately, since termination means early DMB unregistering. For SMCD the global smc_server_lgr_pending lock is given up early. A linkgroup to be given up with smc_lgr_cleanup_early() may already contain more than one connection. Using __smc_lgr_terminate() in smc_lgr_cleanup_early() covers this. And consolidate smc_ism_put_vlan() and smc_put_device() into smc_lgr_free() only. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: reduce smc_listen_decline() callsUrsula Braun
smc_listen_work() contains already an smc_listen_decline() exit. Use this exit for smc_listen_rdma_finish() problems as well. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: improve server ISM device determinationUrsula Braun
Move check whether peer can be reached into smc_pnet_find_ism_by_pnetid(). Thus searching continues for another ism device, if check fails. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: common routine for CLC accept and confirmUrsula Braun
smc_clc_send_accept() and smc_clc_send_confirm() are quite similar. Move common code into a separate function smc_clc_send_confirm_accept(). And introduce separate SMCD and SMCR struct definitions for CLC accept resp. confirm. No functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: dynamic allocation of CLC proposal bufferUrsula Braun
Reduce stack size for smc_listen_work() and smc_clc_send_proposal() by dynamic allocation of the CLC buffer to be received or sent. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: introduce better field namesUrsula Braun
Field names "srv_first_contact" and "cln_first_contact" are misleading, since they apply to both, server and client. Rename them to "first_contact_peer" and "first_contact_local". Rename "ism_gid" by the more precise name "ism_peer_gid". Rename version constant "SMC_CLC_V1" into "SMC_V1". No functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net/smc: reduce active tcp_listen workersUrsula Braun
SMC starts a separate tcp_listen worker for every SMC socket in state SMC_LISTEN, and can accept an incoming connection request only, if this worker is really running and waiting in kernel_accept(). But the number of running workers is limited. This patch reworks the listening SMC code and starts a tcp_listen worker after the SYN-ACK handshake on the internal clc-socket only. Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: DCB: Validate DCB_ATTR_DCB_BUFFER argumentPetr Machata
The parameter passed via DCB_ATTR_DCB_BUFFER is a struct dcbnl_buffer. The field prio2buffer is an array of IEEE_8021Q_MAX_PRIORITIES bytes, where each value is a number of a buffer to direct that priority's traffic to. That value is however never validated to lie within the bounds set by DCBX_MAX_BUFFERS. The only driver that currently implements the callback is mlx5 (maintainers CCd), and that does not do any validation either, in particual allowing incorrect configuration if the prio2buffer value does not fit into 4 bits. Instead of offloading the need to validate the buffer index to drivers, do it right there in core, and bounce the request if the value is too large. CC: Parav Pandit <parav@nvidia.com> CC: Saeed Mahameed <saeedm@nvidia.com> Fixes: e549f6f9c098 ("net/dcb: Add dcbnl buffer attribute") Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: Fix bridge enslavement failureIdo Schimmel
When a netdev is enslaved to a bridge, its parent identifier is queried. This is done so that packets that were already forwarded in hardware will not be forwarded again by the bridge device between netdevs belonging to the same hardware instance. The operation fails when the netdev is an upper of netdevs with different parent identifiers. Instead of failing the enslavement, have dev_get_port_parent_id() return '-EOPNOTSUPP' which will signal the bridge to skip the query operation. Other callers of the function are not affected by this change. Fixes: 7e1146e8c10c ("net: devlink: introduce devlink_compat_switch_id_get() helper") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reported-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: Correct the comment of dst_dev_put()Miaohe Lin
Since commit 8d7017fd621d ("blackhole_netdev: use blackhole_netdev to invalidate dst entries"), we use blackhole_netdev to invalidate dst entries instead of loopback device anymore. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10tcp: reflect tos value received in SYN to the socketWei Wang
This commit adds a new TCP feature to reflect the tos value received in SYN, and send it out on the SYN-ACK, and eventually set the tos value of the established socket with this reflected tos value. This provides a way to set the traffic class/QoS level for all traffic in the same connection to be the same as the incoming SYN request. It could be useful in data centers to provide equivalent QoS according to the incoming request. This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by default turned off. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10ip: pass tos into ip_build_and_send_pkt()Wei Wang
This commit adds tos as a new passed in parameter to ip_build_and_send_pkt() which will be used in the later commit. This is a pure restructure and does not have any functional change. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10tcp: record received TOS value in the request socketWei Wang
A new field is added to the request sock to record the TOS value received on the listening socket during 3WHS: When not under syn flood, it is recording the TOS value sent in SYN. When under syn flood, it is recording the TOS value sent in the ACK. This is a preparation patch in order to do TOS reflection in the later commit. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: make sure napi_list is safe for RCU traversalJakub Kicinski
netpoll needs to traverse dev->napi_list under RCU, make sure it uses the right iterator and that removal from this list is handled safely. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: manage napi add/del idempotence explicitlyJakub Kicinski
To RCUify napi->dev_list we need to replace list_del_init() with list_del_rcu(). There is no _init() version for RCU for obvious reasons. Up until now netif_napi_del() was idempotent so to make sure it remains such add a bit which is set when NAPI is listed, and cleared when it removed. Since we don't expect multiple calls to netif_napi_add() to be correct, add a warning on that side. Now that napi_hash_add / napi_hash_del are only called by napi_add / del we can actually steal its bit. We just need to make sure hash node is initialized correctly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: remove napi_hash_del() from driver-facing APIJakub Kicinski
We allow drivers to call napi_hash_del() before calling netif_napi_del() to batch RCU grace periods. This makes the API asymmetric and leaks internal implementation details. Soon we will want the grace period to protect more than just the NAPI hash table. Restructure the API and have drivers call a new function - __netif_napi_del() if they want to take care of RCU waits. Note that only core was checking the return status from napi_hash_del() so the new helper does not report if the NAPI was actually deleted. Some notes on driver oddness: - veth observed the grace period before calling netif_napi_del() but that should not matter - myri10ge observed normal RCU flavor - bnx2x and enic did not actually observe the grace period (unless they did so implicitly) - virtio_net and enic only unhashed Rx NAPIs The last two points seem to indicate that the calls to napi_hash_del() were a left over rather than an optimization. Regardless, it's easy enough to correct them. This patch may introduce extra synchronize_net() calls for interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on free_netdev() to call netif_napi_del(). This seems inevitable since we want to use RCU for netpoll dev->napi_list traversal, and almost no drivers set IFF_DISABLE_NETPOLL. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10devlink: don't crash if netdev is NULLJakub Kicinski
Following change will add support for a corner case where we may not have a netdev to pass to devlink_port_type_eth_set() but we still want to set port type. This is definitely a corner case, and drivers should not normally pass NULL netdev - print a warning message when this happens. Sadly for other port types (ib) switches don't have a device reference, the way we always do for Ethernet, so we can't put the warning in __devlink_port_type_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: sch_generic: aviod concurrent reset and enqueue op for lockless qdiscYunsheng Lin
Currently there is concurrent reset and enqueue operation for the same lockless qdisc when there is no lock to synchronize the q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in qdisc_deactivate() called by dev_deactivate_queue(), which may cause out-of-bounds access for priv->ring[] in hns3 driver if user has requested a smaller queue num when __dev_xmit_skb() still enqueue a skb with a larger queue_mapping after the corresponding qdisc is reset, and call hns3_nic_net_xmit() with that skb later. Reused the existing synchronize_net() in dev_deactivate_many() to make sure skb with larger queue_mapping enqueued to old qdisc(which is saved in dev_queue->qdisc_sleeping) will always be reset when dev_reset_queue() is called. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10net: Allow iterating sockmap and sockhashLorenz Bauer
Add bpf_iter support for sockmap / sockhash, based on the bpf_sk_storage and hashtable implementation. sockmap and sockhash share the same iteration context: a pointer to an arbitrary key and a pointer to a socket. Both pointers may be NULL, and so BPF has to perform a NULL check before accessing them. Technically it's not possible for sockhash iteration to yield a NULL socket, but we ignore this to be able to use a single iteration point. Iteration will visit all keys that remain unmodified during the lifetime of the iterator. It may or may not visit newly added ones. Switch from using rcu_dereference_raw to plain rcu_dereference, so we gain another guard rail if CONFIG_PROVE_RCU is enabled. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200909162712.221874-3-lmb@cloudflare.com
2020-09-10net: sockmap: Remove unnecessary sk_fullsock checksLorenz Bauer
The lookup paths for sockmap and sockhash currently include a check that returns NULL if the socket we just found is not a full socket. However, this check is not necessary. On insertion we ensure that we have a full socket (caveat around sock_ops), so request sockets are not a problem. Time-wait sockets are allocated separate from the original socket and then fed into the hashdance. They don't affect the sockets already stored in the sockmap. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200909162712.221874-2-lmb@cloudflare.com
2020-09-10mptcp: fix kmalloc flag in mptcp_pm_nl_get_local_idGeliang Tang
mptcp_pm_nl_get_local_id may be called in interrupt context, so we need to use GFP_ATOMIC flag to allocate memory to avoid sleeping in atomic context. [ 280.209809] BUG: sleeping function called from invalid context at mm/slab.h:498 [ 280.209812] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1680, name: kworker/1:3 [ 280.209814] INFO: lockdep is turned off. [ 280.209816] CPU: 1 PID: 1680 Comm: kworker/1:3 Tainted: G W 5.9.0-rc3-mptcp+ #146 [ 280.209818] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 280.209820] Workqueue: events mptcp_worker [ 280.209822] Call Trace: [ 280.209824] <IRQ> [ 280.209826] dump_stack+0x77/0xa0 [ 280.209829] ___might_sleep.cold+0xa6/0xb6 [ 280.209832] kmem_cache_alloc_trace+0x1d1/0x290 [ 280.209835] mptcp_pm_nl_get_local_id+0x23c/0x410 [ 280.209840] subflow_init_req+0x1e9/0x2ea [ 280.209843] ? inet_reqsk_alloc+0x1c/0x120 [ 280.209845] ? kmem_cache_alloc+0x264/0x290 [ 280.209849] tcp_conn_request+0x303/0xae0 [ 280.209854] ? printk+0x53/0x6a [ 280.209857] ? tcp_rcv_state_process+0x28f/0x1374 [ 280.209859] tcp_rcv_state_process+0x28f/0x1374 [ 280.209864] ? tcp_v4_do_rcv+0xb3/0x1f0 [ 280.209866] tcp_v4_do_rcv+0xb3/0x1f0 [ 280.209869] tcp_v4_rcv+0xed6/0xfa0 [ 280.209873] ip_protocol_deliver_rcu+0x28/0x270 [ 280.209875] ip_local_deliver_finish+0x89/0x120 [ 280.209877] ip_local_deliver+0x180/0x220 [ 280.209881] ip_rcv+0x166/0x210 [ 280.209885] __netif_receive_skb_one_core+0x82/0x90 [ 280.209888] process_backlog+0xd6/0x230 [ 280.209891] net_rx_action+0x13a/0x410 [ 280.209895] __do_softirq+0xcf/0x468 [ 280.209899] asm_call_on_stack+0x12/0x20 [ 280.209901] </IRQ> [ 280.209903] ? ip_finish_output2+0x240/0x9a0 [ 280.209906] do_softirq_own_stack+0x4d/0x60 [ 280.209908] do_softirq.part.0+0x2b/0x60 [ 280.209911] __local_bh_enable_ip+0x9a/0xa0 [ 280.209913] ip_finish_output2+0x264/0x9a0 [ 280.209916] ? rcu_read_lock_held+0x4d/0x60 [ 280.209920] ? ip_output+0x7a/0x250 [ 280.209922] ip_output+0x7a/0x250 [ 280.209925] ? __ip_finish_output+0x330/0x330 [ 280.209928] __ip_queue_xmit+0x1dc/0x5a0 [ 280.209931] __tcp_transmit_skb+0xa0f/0xc70 [ 280.209937] tcp_connect+0xb03/0xff0 [ 280.209939] ? lockdep_hardirqs_on_prepare+0xe7/0x190 [ 280.209942] ? ktime_get_with_offset+0x125/0x150 [ 280.209944] ? trace_hardirqs_on+0x1c/0xe0 [ 280.209948] tcp_v4_connect+0x449/0x550 [ 280.209953] __inet_stream_connect+0xbb/0x320 [ 280.209955] ? mark_held_locks+0x49/0x70 [ 280.209958] ? lockdep_hardirqs_on_prepare+0xe7/0x190 [ 280.209960] ? __local_bh_enable_ip+0x6b/0xa0 [ 280.209963] inet_stream_connect+0x32/0x50 [ 280.209966] __mptcp_subflow_connect+0x1fd/0x242 [ 280.209972] mptcp_pm_create_subflow_or_signal_addr+0x2db/0x600 [ 280.209975] mptcp_worker+0x543/0x7a0 [ 280.209980] process_one_work+0x26d/0x5b0 [ 280.209984] ? process_one_work+0x5b0/0x5b0 [ 280.209987] worker_thread+0x48/0x3d0 [ 280.209990] ? process_one_work+0x5b0/0x5b0 [ 280.209993] kthread+0x117/0x150 [ 280.209996] ? kthread_park+0x80/0x80 [ 280.209998] ret_from_fork+0x22/0x30 Fixes: 01cacb00b35cb ("mptcp: add netlink-based PM") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10mptcp: fix subflow's remote_id issuesGeliang Tang
This patch set the init remote_id to zero, otherwise it will be a random number. Then it added the missing subflow's remote_id setting code both in __mptcp_subflow_connect and in subflow_ulp_clone. Fixes: 01cacb00b35cb ("mptcp: add netlink-based PM") Fixes: ec3edaa7ca6ce ("mptcp: Add handling of outgoing MP_JOIN requests") Fixes: f296234c98a8f ("mptcp: Add handling of incoming MP_JOIN requests") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10mptcp: fix subflow's local_id issuesGeliang Tang
In mptcp_pm_nl_get_local_id, skc_local is the same as msk_local, so it always return 0. Thus every subflow's local_id is 0. It's incorrect. This patch fixed this issue. Also, we need to ignore the zero address here, like 0.0.0.0 in IPv4. When we use the zero address as a local address, it means that we can use any one of the local addresses. The zero address is not a new address, we don't need to add it to PM, so this patch added a new function address_zero to check whether an address is the zero address, if it is, we ignore this address. Fixes: 01cacb00b35cb ("mptcp: add netlink-based PM") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10ipmr: Use full VIF ID in netlink cache reportsPaul Davey
Insert the full 16 bit VIF ID into ipmr Netlink cache reports. The VIF_ID attribute has 32 bits of space so can store the full VIF ID extracted from the high and low byte fields in the igmpmsg. Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10ipmr: Add high byte of VIF ID to igmpmsgPaul Davey
Use the unused3 byte in struct igmpmsg to hold the high 8 bits of the VIF ID. If using more than 255 IPv4 multicast interfaces it is necessary to have access to a VIF ID for cache reports that is wider than 8 bits, the VIF ID present in the igmpmsg reports sent to mroute_sk was only 8 bits wide in the igmpmsg header. Adding the high 8 bits of the 16 bit VIF ID in the unused byte allows use of more than 255 IPv4 multicast interfaces. Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10ipmr: Add route table ID to netlink cache reportsPaul Davey
Insert the multicast route table ID as a Netlink attribute to Netlink cache report notifications. When multiple route tables are in use it is necessary to have a way to determine which route table a given cache report belongs to when receiving the cache report. Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10tipc: fix shutdown() of connection oriented socketTetsuo Handa
I confirmed that the problem fixed by commit 2a63866c8b51a3f7 ("tipc: fix shutdown() of connectionless socket") also applies to stream socket. ---------- #include <sys/socket.h> #include <unistd.h> #include <sys/wait.h> int main(int argc, char *argv[]) { int fds[2] = { -1, -1 }; socketpair(PF_TIPC, SOCK_STREAM /* or SOCK_DGRAM */, 0, fds); if (fork() == 0) _exit(read(fds[0], NULL, 1)); shutdown(fds[0], SHUT_RDWR); /* This must make read() return. */ wait(NULL); /* To be woken up by _exit(). */ return 0; } ---------- Since shutdown(SHUT_RDWR) should affect all processes sharing that socket, unconditionally setting sk->sk_shutdown to SHUTDOWN_MASK will be the right behavior. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10bpf: Remove duplicate headersChen Zhou
Remove duplicate headers which are included twice. Signed-off-by: Chen Zhou <chenzhou10@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200908132201.184005-1-chenzhou10@huawei.com
2020-09-09devlink: Use controller while building phys_port_nameParav Pandit
Now that controller number attribute is available, use it when building phsy_port_name for external controller ports. An example devlink port and representor netdev name consist of controller annotation for external controller with controller number = 1, for a VF 1 of PF 0: $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev ens2f0c1pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port show pci/0000:06:00.0/2 -jp { "port": { "pci/0000:06:00.0/2": { "type": "eth", "netdev": "ens2f0c1pf0vf1", "flavour": "pcivf", "controller": 1, "pfnum": 0, "vfnum": 1, "external": true, "splittable": false, "function": { "hw_addr": "00:00:00:00:00:00" } } } } Controller number annotation is skipped for non external controllers to maintain backward compatibility. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09devlink: Introduce controller numberParav Pandit
A devlink port may be for a controller consist of PCI device. A devlink instance holds ports of two types of controllers. (1) controller discovered on same system where eswitch resides This is the case where PCI PF/VF of a controller and devlink eswitch instance both are located on a single system. (2) controller located on external host system. This is the case where a controller is located in one system and its devlink eswitch ports are located in a different system. When a devlink eswitch instance serves the devlink ports of both controllers together, PCI PF/VF numbers may overlap. Due to this a unique phys_port_name cannot be constructed. For example in below such system controller-0 and controller-1, each has PCI PF pf0 whose eswitch ports can be present in controller-0. These results in phys_port_name as "pf0" for both. Similar problem exists for VFs and upcoming Sub functions. An example view of two controller systems: --------------------------------------------------------- | | | --------- --------- ------- ------- | ----------- | | vf(s) | | sf(s) | |vf(s)| |sf(s)| | | server | | ------- ----/---- ---/----- ------- ---/--- ---/--- | | pci rc |=== | pf0 |______/________/ | pf1 |___/_______/ | | connect | | ------- ------- | ----------- | | controller_num=1 (no eswitch) | ------|-------------------------------------------------- (internal wire) | --------------------------------------------------------- | devlink eswitch ports and reps | | ----------------------------------------------------- | | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | | | |pf0 | pf0vfN | pf0sfN | pf1 | pf1vfN |pf1sfN | | | ----------------------------------------------------- | | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | | | |pf1 | pf1vfN | pf1sfN | pf1 | pf1vfN |pf0sfN | | | ----------------------------------------------------- | | | | | | --------- --------- ------- ------- | | | vf(s) | | sf(s) | |vf(s)| |sf(s)| | | ------- ----/---- ---/----- ------- ---/--- ---/--- | | | pf0 |______/________/ | pf1 |___/_______/ | | ------- ------- | | | | local controller_num=0 (eswitch) | --------------------------------------------------------- An example devlink port for external controller with controller number = 1 for a VF 1 of PF 0: $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port show pci/0000:06:00.0/2 -jp { "port": { "pci/0000:06:00.0/2": { "type": "eth", "netdev": "ens2f0pf0vf1", "flavour": "pcivf", "controller": 1, "pfnum": 0, "vfnum": 1, "external": true, "splittable": false, "function": { "hw_addr": "00:00:00:00:00:00" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09devlink: Introduce external controller flagParav Pandit
A devlink eswitch port may represent PCI PF/VF ports of a controller. A controller either located on same system or it can be an external controller located in host where such NIC is plugged in. Add the ability for driver to specify if a port is for external controller. Use such flag in the mlx5_core driver. An example of an external controller having VF1 of PF0 belong to controller 1. $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf pfnum 0 vfnum 1 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port show pci/0000:06:00.0/2 -jp { "port": { "pci/0000:06:00.0/2": { "type": "eth", "netdev": "ens2f0pf0vf1", "flavour": "pcivf", "pfnum": 0, "vfnum": 1, "external": true, "splittable": false, "function": { "hw_addr": "00:00:00:00:00:00" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Rewrite inner header IPv6 in ICMPv6 messages in ip6t_NPT, from Michael Zhou. 2) do_ip_vs_set_ctl() dereferences uninitialized value, from Peilin Ye. 3) Support for userdata in tables, from Jose M. Guisado. 4) Do not increment ct error and invalid stats at the same time, from Florian Westphal. 5) Remove ct ignore stats, also from Florian. 6) Add ct stats for clash resolution, from Florian Westphal. 7) Bump reference counter bump on ct clash resolution only, this is safe because bucket lock is held, again from Florian. 8) Use ip_is_fragment() in xt_HMARK, from YueHaibing. 9) Add wildcard support for nft_socket, from Balazs Scheidler. 10) Remove superfluous IPVS dependency on iptables, from Yaroslav Bolyukin. 11) Remove unused definition in ebt_stp, from Wang Hai. 12) Replace CONFIG_NFT_CHAIN_NAT_{IPV4,IPV6} by CONFIG_NFT_NAT in selftests/net, from Fabian Frederick. 13) Add userdata support for nft_object, from Jose M. Guisado. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09hsr: avoid newline at end of message in NL_SET_ERR_MSG_MODYe Bin
clean follow coccicheck warning: net//hsr/hsr_netlink.c:94:8-42: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD net//hsr/hsr_netlink.c:87:30-57: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD net//hsr/hsr_netlink.c:79:29-53: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09Merge tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: - Fix an NFS/RDMA resource leak - Fix the error handling during delegation recall - NFSv4.0 needs to return the delegation on a zero-stateid SETATTR - Stop printk reading past end of string * tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: stop printk reading past end of string NFS: Zero-stateid SETATTR should first return delegation NFSv4.1 handle ERR_DELAY error reclaiming locking state on delegation recall xprtrdma: Release in-flight MRs on disconnect
2020-09-09net: qrtr: check skb_put_padto() return valueEric Dumazet
If skb_put_padto() returns an error, skb has been freed. Better not touch it anymore, as reported by syzbot [1] Note to qrtr maintainers : this suggests qrtr_sendmsg() should adjust sock_alloc_send_skb() second parameter to account for the potential added alignment to avoid reallocation. [1] BUG: KASAN: use-after-free in __skb_insert include/linux/skbuff.h:1907 [inline] BUG: KASAN: use-after-free in __skb_queue_before include/linux/skbuff.h:2016 [inline] BUG: KASAN: use-after-free in __skb_queue_tail include/linux/skbuff.h:2049 [inline] BUG: KASAN: use-after-free in skb_queue_tail+0x6b/0x120 net/core/skbuff.c:3146 Write of size 8 at addr ffff88804d8ab3c0 by task syz-executor.4/4316 CPU: 1 PID: 4316 Comm: syz-executor.4 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1d6/0x29e lib/dump_stack.c:118 print_address_description+0x66/0x620 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report+0x132/0x1d0 mm/kasan/report.c:530 __skb_insert include/linux/skbuff.h:1907 [inline] __skb_queue_before include/linux/skbuff.h:2016 [inline] __skb_queue_tail include/linux/skbuff.h:2049 [inline] skb_queue_tail+0x6b/0x120 net/core/skbuff.c:3146 qrtr_tun_send+0x1a/0x40 net/qrtr/tun.c:23 qrtr_node_enqueue+0x44f/0xc00 net/qrtr/qrtr.c:364 qrtr_bcast_enqueue+0xbe/0x140 net/qrtr/qrtr.c:861 qrtr_sendmsg+0x680/0x9c0 net/qrtr/qrtr.c:960 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg net/socket.c:671 [inline] sock_write_iter+0x317/0x470 net/socket.c:998 call_write_iter include/linux/fs.h:1882 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0xa96/0xd10 fs/read_write.c:578 ksys_write+0x11b/0x220 fs/read_write.c:631 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45d5b9 Code: 5d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f84b5b81c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000038b40 RCX: 000000000045d5b9 RDX: 0000000000000055 RSI: 0000000020001240 RDI: 0000000000000003 RBP: 00007f84b5b81ca0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000f R13: 00007ffcbbf86daf R14: 00007f84b5b829c0 R15: 000000000118cf4c Allocated by task 4316: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461 slab_post_alloc_hook+0x3e/0x290 mm/slab.h:518 slab_alloc mm/slab.c:3312 [inline] kmem_cache_alloc+0x1c1/0x2d0 mm/slab.c:3482 skb_clone+0x1b2/0x370 net/core/skbuff.c:1449 qrtr_bcast_enqueue+0x6d/0x140 net/qrtr/qrtr.c:857 qrtr_sendmsg+0x680/0x9c0 net/qrtr/qrtr.c:960 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg net/socket.c:671 [inline] sock_write_iter+0x317/0x470 net/socket.c:998 call_write_iter include/linux/fs.h:1882 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0xa96/0xd10 fs/read_write.c:578 ksys_write+0x11b/0x220 fs/read_write.c:631 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 4316: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track+0x3d/0x70 mm/kasan/common.c:56 kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422 __cache_free mm/slab.c:3418 [inline] kmem_cache_free+0x82/0xf0 mm/slab.c:3693 __skb_pad+0x3f5/0x5a0 net/core/skbuff.c:1823 __skb_put_padto include/linux/skbuff.h:3233 [inline] skb_put_padto include/linux/skbuff.h:3252 [inline] qrtr_node_enqueue+0x62f/0xc00 net/qrtr/qrtr.c:360 qrtr_bcast_enqueue+0xbe/0x140 net/qrtr/qrtr.c:861 qrtr_sendmsg+0x680/0x9c0 net/qrtr/qrtr.c:960 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg net/socket.c:671 [inline] sock_write_iter+0x317/0x470 net/socket.c:998 call_write_iter include/linux/fs.h:1882 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0xa96/0xd10 fs/read_write.c:578 ksys_write+0x11b/0x220 fs/read_write.c:631 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff88804d8ab3c0 which belongs to the cache skbuff_head_cache of size 224 The buggy address is located 0 bytes inside of 224-byte region [ffff88804d8ab3c0, ffff88804d8ab4a0) The buggy address belongs to the page: page:00000000ea8cccfb refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804d8abb40 pfn:0x4d8ab flags: 0xfffe0000000200(slab) raw: 00fffe0000000200 ffffea0002237ec8 ffffea00029b3388 ffff88821bb66800 raw: ffff88804d8abb40 ffff88804d8ab000 000000010000000b 0000000000000000 page dumped because: kasan: bad access detected Fixes: ce57785bf91b ("net: qrtr: fix len of skb_put_padto in qrtr_node_enqueue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Carl Huang <cjhuang@codeaurora.org> Cc: Wen Gong <wgong@codeaurora.org> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Acked-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08ipv6: add tos reflection in TCP reset and ackWei Wang
Currently, ipv6 stack does not do any TOS reflection. To make the behavior consistent with v4 stack, this commit adds TOS reflection in tcp_v6_reqsk_send_ack() and tcp_v6_send_reset(). We clear the lower 2-bit ECN value of the received TOS in compliance with RFC 3168 6.1.5 robustness principles. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08ip: fix tos reflection in ack and reset packetsWei Wang
Currently, in tcp_v4_reqsk_send_ack() and tcp_v4_send_reset(), we echo the TOS value of the received packets in the response. However, we do not want to echo the lower 2 ECN bits in accordance with RFC 3168 6.1.5 robustness principles. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08Merge tag 'rxrpc-next-20200908' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Allow more calls to same peer Here are some development patches for AF_RXRPC that allow more simultaneous calls to be made to the same peer with the same security parameters. The current code allows a maximum of 4 simultaneous calls, which limits the afs filesystem to that many simultaneous threads. This increases the limit to 16. To make this work, the way client connections are limited has to be changed (incoming call/connection limits are unaffected) as the current code depends on queuing calls on a connection and then pushing the connection through a queue. The limit is on the number of available connections. This is changed such that there's a limit[*] on the total number of calls systemwide across all namespaces, but the limit on the number of client connections is removed. Once a call is allowed to proceed, it finds a bundle of connections and tries to grab a call slot. If there's a spare call slot, fine, otherwise it will wait. If there's already a waiter, it will try to create another connection in the bundle, unless the limit of 4 is reached (4 calls per connection, giving 16). A number of things throttle someone trying to set up endless connections: - Calls that fail immediately have their conns deleted immediately, - Calls that don't fail immediately have to wait for a timeout, - Connections normally get automatically reaped if they haven't been used for 2m, but this is sped up to 2s if the number of connections rises over 900. This number is tunable by sysctl. [*] Technically two limits - kernel sockets and userspace rxrpc sockets are accounted separately. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08Merge tag 'ieee802154-for-davem-2020-09-08' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2020-09-08 An update from ieee802154 for your *net* tree. A potential memory leak fix for ca8210 from Liu Jian, a check on the return for a register read in adf7242 and finally a user after free fix in the softmac tx function from Eric found by syzkaller. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08net: bridge: mcast: fix unused br var when lockdep isn't definedNikolay Aleksandrov
Stephen reported the following warning: net/bridge/br_multicast.c: In function 'br_multicast_find_port': net/bridge/br_multicast.c:1818:21: warning: unused variable 'br' [-Wunused-variable] 1818 | struct net_bridge *br = mp->br; | ^~ It happens due to bridge's mlock_dereference() when lockdep isn't defined. Silence the warning by annotating the variable as __maybe_unused. Fixes: 0436862e417e ("net: bridge: mcast: support for IGMPv3/MLDv2 ALLOW_NEW_SOURCES report") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08fib: fix fib_rule_ops indirect call wrappers when CONFIG_IPV6=mBrian Vazquez
If CONFIG_IPV6=m, the IPV6 functions won't be found by the linker: ld: net/core/fib_rules.o: in function `fib_rules_lookup': fib_rules.c:(.text+0x606): undefined reference to `fib6_rule_match' ld: fib_rules.c:(.text+0x611): undefined reference to `fib6_rule_match' ld: fib_rules.c:(.text+0x68c): undefined reference to `fib6_rule_action' ld: fib_rules.c:(.text+0x693): undefined reference to `fib6_rule_action' ld: fib_rules.c:(.text+0x6aa): undefined reference to `fib6_rule_suppress' ld: fib_rules.c:(.text+0x6bc): undefined reference to `fib6_rule_suppress' make: *** [Makefile:1166: vmlinux] Error 1 Reported-by: Sven Joachim <svenjoac@gmx.de> Fixes: b9aaec8f0be5 ("fib: use indirect call wrappers in the most common fib_rules_ops") Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: =================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Allow conntrack entries with l3num == NFPROTO_IPV4 or == NFPROTO_IPV6 only via ctnetlink, from Will McVicker. 2) Batch notifications to userspace to improve netlink socket receive utilization. 3) Restore mark based dump filtering via ctnetlink, from Martin Willi. 4) nf_conncount_init() fails with -EPROTO with CONFIG_IPV6, from Eelco Chaudron. 5) Containers fail to match on meta skuid and skgid, use socket user_ns to retrieve meta skuid and skgid. =================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08netlabel: Fix some kernel-doc warningsWang Hai
Fixes the following W=1 kernel build warning(s): net/netlabel/netlabel_calipso.c:438: warning: Excess function parameter 'audit_secid' description in 'calipso_doi_remove' net/netlabel/netlabel_calipso.c:605: warning: Excess function parameter 'reg' description in 'calipso_req_delattr' Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08cipso: fix 'audit_secid' kernel-doc warning in cipso_ipv4.cWang Hai
Fixes the following W=1 kernel build warning(s): net/ipv4/cipso_ipv4.c:510: warning: Excess function parameter 'audit_secid' description in 'cipso_v4_doi_remove' Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08ipv6: avoid lockdep issue in fib6_del()Eric Dumazet
syzbot reported twice a lockdep issue in fib6_del() [1] which I think is caused by net->ipv6.fib6_null_entry having a NULL fib6_table pointer. fib6_del() already checks for fib6_null_entry special case, we only need to return earlier. Bug seems to occur very rarely, I have thus chosen a 'bug origin' that makes backports not too complex. [1] WARNING: suspicious RCU usage 5.9.0-rc4-syzkaller #0 Not tainted ----------------------------- net/ipv6/ip6_fib.c:1996 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 4 locks held by syz-executor.5/8095: #0: ffffffff8a7ea708 (rtnl_mutex){+.+.}-{3:3}, at: ppp_release+0x178/0x240 drivers/net/ppp/ppp_generic.c:401 #1: ffff88804c422dd8 (&net->ipv6.fib6_gc_lock){+.-.}-{2:2}, at: spin_trylock_bh include/linux/spinlock.h:414 [inline] #1: ffff88804c422dd8 (&net->ipv6.fib6_gc_lock){+.-.}-{2:2}, at: fib6_run_gc+0x21b/0x2d0 net/ipv6/ip6_fib.c:2312 #2: ffffffff89bd6a40 (rcu_read_lock){....}-{1:2}, at: __fib6_clean_all+0x0/0x290 net/ipv6/ip6_fib.c:2613 #3: ffff8880a82e6430 (&tb->tb6_lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:359 [inline] #3: ffff8880a82e6430 (&tb->tb6_lock){+.-.}-{2:2}, at: __fib6_clean_all+0x107/0x290 net/ipv6/ip6_fib.c:2245 stack backtrace: CPU: 1 PID: 8095 Comm: syz-executor.5 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 fib6_del+0x12b4/0x1630 net/ipv6/ip6_fib.c:1996 fib6_clean_node+0x39b/0x570 net/ipv6/ip6_fib.c:2180 fib6_walk_continue+0x4aa/0x8e0 net/ipv6/ip6_fib.c:2102 fib6_walk+0x182/0x370 net/ipv6/ip6_fib.c:2150 fib6_clean_tree+0xdb/0x120 net/ipv6/ip6_fib.c:2230 __fib6_clean_all+0x120/0x290 net/ipv6/ip6_fib.c:2246 fib6_clean_all net/ipv6/ip6_fib.c:2257 [inline] fib6_run_gc+0x113/0x2d0 net/ipv6/ip6_fib.c:2320 ndisc_netdev_event+0x217/0x350 net/ipv6/ndisc.c:1805 notifier_call_chain+0xb5/0x200 kernel/notifier.c:83 call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:2033 call_netdevice_notifiers_extack net/core/dev.c:2045 [inline] call_netdevice_notifiers net/core/dev.c:2059 [inline] dev_close_many+0x30b/0x650 net/core/dev.c:1634 rollback_registered_many+0x3a8/0x1210 net/core/dev.c:9261 rollback_registered net/core/dev.c:9329 [inline] unregister_netdevice_queue+0x2dd/0x570 net/core/dev.c:10410 unregister_netdevice include/linux/netdevice.h:2774 [inline] ppp_release+0x216/0x240 drivers/net/ppp/ppp_generic.c:403 __fput+0x285/0x920 fs/file_table.c:281 task_work_run+0xdd/0x190 kernel/task_work.c:141 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_user_mode_loop kernel/entry/common.c:163 [inline] exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 421842edeaf6 ("net/ipv6: Add fib6_null_entry") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08net: dsa: link interfaces with the DSA master to get rid of lockdep warningsVladimir Oltean
Since commit 845e0ebb4408 ("net: change addr_list_lock back to static key"), cascaded DSA setups (DSA switch port as DSA master for another DSA switch port) are emitting this lockdep warning: ============================================ WARNING: possible recursive locking detected 5.8.0-rc1-00133-g923e4b5032dd-dirty #208 Not tainted -------------------------------------------- dhcpcd/323 is trying to acquire lock: ffff000066dd4268 (&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync+0x44/0x90 but task is already holding lock: ffff00006608c268 (&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync+0x44/0x90 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&dsa_master_addr_list_lock_key/1); lock(&dsa_master_addr_list_lock_key/1); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by dhcpcd/323: #0: ffffdbd1381dda18 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x24/0x30 #1: ffff00006614b268 (_xmit_ETHER){+...}-{2:2}, at: dev_set_rx_mode+0x28/0x48 #2: ffff00006608c268 (&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync+0x44/0x90 stack backtrace: Call trace: dump_backtrace+0x0/0x1e0 show_stack+0x20/0x30 dump_stack+0xec/0x158 __lock_acquire+0xca0/0x2398 lock_acquire+0xe8/0x440 _raw_spin_lock_nested+0x64/0x90 dev_mc_sync+0x44/0x90 dsa_slave_set_rx_mode+0x34/0x50 __dev_set_rx_mode+0x60/0xa0 dev_mc_sync+0x84/0x90 dsa_slave_set_rx_mode+0x34/0x50 __dev_set_rx_mode+0x60/0xa0 dev_set_rx_mode+0x30/0x48 __dev_open+0x10c/0x180 __dev_change_flags+0x170/0x1c8 dev_change_flags+0x2c/0x70 devinet_ioctl+0x774/0x878 inet_ioctl+0x348/0x3b0 sock_do_ioctl+0x50/0x310 sock_ioctl+0x1f8/0x580 ksys_ioctl+0xb0/0xf0 __arm64_sys_ioctl+0x28/0x38 el0_svc_common.constprop.0+0x7c/0x180 do_el0_svc+0x2c/0x98 el0_sync_handler+0x9c/0x1b8 el0_sync+0x158/0x180 Since DSA never made use of the netdev API for describing links between upper devices and lower devices, the dev->lower_level value of a DSA switch interface would be 1, which would warn when it is a DSA master. We can use netdev_upper_dev_link() to describe the relationship between a DSA slave and a DSA master. To be precise, a DSA "slave" (switch port) is an "upper" to a DSA "master" (host port). The relationship is "many uppers to one lower", like in the case of VLAN. So, for that reason, we use the same function as VLAN uses. There might be a chance that somebody will try to take hold of this interface and use it immediately after register_netdev() and before netdev_upper_dev_link(). To avoid that, we do the registration and linkage while holding the RTNL, and we use the RTNL-locked cousin of register_netdev(), which is register_netdevice(). Since this warning was not there when lockdep was using dynamic keys for addr_list_lock, we are blaming the lockdep patch itself. The network stack _has_ been using static lockdep keys before, and it _is_ likely that stacked DSA setups have been triggering these lockdep warnings since forever, however I can't test very old kernels on this particular stacked DSA setup, to ensure I'm not in fact introducing regressions. Fixes: 845e0ebb4408 ("net: change addr_list_lock back to static key") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>