path: root/net
2021-07-20  memcg: enable accounting for inet_bind_bucket cache  (Vasily Averin)

A net namespace can create up to 64K TCP and DCCP ports, forcing the kernel to allocate up to several megabytes of memory per netns for inet_bind_bucket objects. It makes sense to account for them to restrict the host's memory consumption from inside a memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
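An accounting change like this typically boils down to adding SLAB_ACCOUNT when the cache is created; a minimal sketch, assuming the bind-bucket cache is set up as in net/ipv4/tcp.c (illustrative, not the exact diff):

    tcp_hashinfo.bind_bucket_cachep =
            kmem_cache_create("tcp_bind_bucket",
                              sizeof(struct inet_bind_bucket), 0,
                              SLAB_HWCACHE_ALIGN | SLAB_PANIC |
                              SLAB_ACCOUNT,   /* charge objects to the memcg */
                              NULL);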
2021-07-20memcg: enable accounting for IP address and routing-related objectsVasily Averin
An netadmin inside container can use 'ip a a' and 'ip r a' to assign a large number of ipv4/ipv6 addresses and routing entries and force kernel to allocate megabytes of unaccounted memory for long-lived per-netdevice related kernel objects: 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node', 'struct rt6_info', 'struct fib_rules' and ip_fib caches. These objects can be manually removed, though usually they lives in memory till destroy of its net namespace. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. One of such objects is the 'struct fib6_node' mostly allocated in net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section: write_lock_bh(&table->tb6_lock); err = fib6_add(&table->tb6_root, rt, info, mxc); write_unlock_bh(&table->tb6_lock); In this case it is not enough to simply add SLAB_ACCOUNT to corresponding kmem cache. The proper memory cgroup still cannot be found due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass(). Obsoleted in_interrupt() does not describe real execution context properly. >From include/linux/preempt.h: The following macros are deprecated and should not be used in new code: in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled To verify the current execution context new macro should be used instead: in_task() - We're in task context Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
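A sketch of the corrected bypass check, assuming memcg_kmem_bypass() in mm/memcontrol.c has roughly this shape (illustrative, not the exact kernel code):

    static inline bool memcg_kmem_bypass(void)
    {
            /* Allow remote memcg charging from any context. */
            if (unlikely(active_memcg()))
                    return false;

            /* Memcg to charge can't be determined. */
            if (!in_task() ||                    /* was: in_interrupt() */
                !current->mm || (current->flags & PF_KTHREAD))
                    return true;

            return false;
    }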
2021-07-20memcg: enable accounting for net_device and Tx/Rx queuesVasily Averin
Container netadmin can create a lot of fake net devices, then create a new net namespace and repeat it again and again. Net device can request the creation of up to 4096 tx and rx queues, and force kernel to allocate up to several tens of megabytes memory per net device. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
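For allocations that don't go through a kmem cache, accounting is enabled by switching the gfp mask; a hedged sketch of what the queue allocation in net/core/dev.c might look like after the change (surrounding code elided):

    static int netif_alloc_netdev_queues(struct net_device *dev)
    {
            unsigned int count = dev->num_tx_queues;
            size_t sz = count * sizeof(struct netdev_queue);
            struct netdev_queue *tx;

            /* GFP_KERNEL_ACCOUNT charges the allocation to the current memcg */
            tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
            if (!tx)
                    return -ENOMEM;

            dev->_tx = tx;
            /* ... per-queue initialization elided ... */
            return 0;
    }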
2021-07-20  net: bridge: vlan: add mcast snooping control  (Nikolay Aleksandrov)

Add a new global vlan option which controls whether multicast snooping is enabled or disabled for a single vlan. It controls the vlan private flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: vlan: notify when global options change  (Nikolay Aleksandrov)

Add support for global options notifications. They use only RTM_NEWVLAN, since global options can only be set, and are contained in a separate vlan global options attribute. Notifications are compressed into ranges where possible, i.e. when sequential vlans' options are equal.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: vlan: add support for dumping global vlan options  (Nikolay Aleksandrov)

Add a new vlan options dump flag which causes only global vlan options to be dumped. Dumps are done only for bridge devices; ports are ignored. They support vlan compression if the options of sequential vlans are equal (currently always true).

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: vlan: add support for global options  (Nikolay Aleksandrov)

We can have two types of vlan options depending on context:
 - per-device vlan options (split into per-bridge and per-port)
 - global vlan options

The second type wasn't supported in the bridge until now, but we need it for per-vlan multicast support, per-vlan STP support and other options which require global vlan context. Global options are contained in the global bridge vlan context even if the vlan is not configured on the bridge device itself. This patch adds the initial netlink attributes and support for setting these global vlan options; they can only be set (RTM_NEWVLAN) and the operation must use the bridge device. Since there are no such options yet, it shouldn't have any functional effect.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: include router port vlan id in notifications  (Nikolay Aleksandrov)

Use the port multicast context to check if the router port is a vlan and, in case it is, include its vlan id in the notification.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: add vlan querier and query support  (Nikolay Aleksandrov)

Add basic vlan context querier support: if the contexts passed to multicast_alloc_query are vlan contexts, the query will be tagged. Also handle querier start/stop for vlan contexts.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: check if should use vlan mcast ctx  (Nikolay Aleksandrov)

Add helpers which check if the current bridge/port multicast context should be used (i.e. it is not disabled) and use them for Rx IGMP/MLD processing, timers and new group addition. It is important for vlans to disable timer/packet processing after the multicast_lock is obtained if the vlan context doesn't have BR_VLFLAG_MCAST_ENABLED. There are two cases when that flag is missing:
 - the vlan is being destroyed: it will be removed and its timers stopped
 - vlan mcast snooping is being disabled

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: use the port group to port context helper  (Nikolay Aleksandrov)

We need to use the new port-group-to-port-context helper in places where we cannot pass down the proper context, i.e. functions that can be called by timers or outside the packet snooping paths.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: add helper to get port mcast context from port group  (Nikolay Aleksandrov)

Add br_multicast_pg_to_port_ctx(), which returns the proper port multicast context, from either the port or a vlan, based on the bridge option and vlan flags. As the comment inside explains, the locking is a bit tricky: we rely on the fact that changing BR_VLFLAG_MCAST_ENABLED requires multicast_lock, and we also require multicast_lock to be held when calling this helper. If we find the vlan under rcu and it still has the flag set, then we can be sure it will stay alive until we unlock multicast_lock, which should be enough. Note that the context might change from vlan to bridge between different calls to this helper, as the mcast vlan knob requires only rtnl, so it should be used carefully and only for read-only/check purposes.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
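Based on that description, the helper's shape is roughly the following (a sketch; the field and helper names around the vlan lookup are illustrative, not necessarily the exact kernel identifiers):

    static struct net_bridge_mcast_port *
    br_multicast_pg_to_port_ctx(const struct net_bridge_port_group *pg)
    {
            struct net_bridge_mcast_port *pmctx = &pg->key.port->multicast_ctx;
            struct net_bridge_vlan *vlan;

            lockdep_assert_held(&pg->key.port->br->multicast_lock);

            if (!br_opt_get(pg->key.port->br, BROPT_MCAST_VLAN_SNOOPING_ENABLED))
                    return pmctx;

            /* BR_VLFLAG_MCAST_ENABLED can only change under multicast_lock,
             * which we hold, so the vlan context cannot go away under us */
            vlan = br_vlan_find(nbp_vlan_group_rcu(pg->key.port),
                                pg->key.addr.vid);
            if (vlan && !br_multicast_port_ctx_vlan_disabled(&vlan->port_mcast_ctx))
                    pmctx = &vlan->port_mcast_ctx;

            return pmctx;
    }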
2021-07-20  net: bridge: add vlan mcast snooping knob  (Nikolay Aleksandrov)

Add a global knob that controls whether vlan multicast snooping is enabled. The proper context (vlan or bridge-wide) will be chosen based on the knob when processing packets and changing bridge device state. Note that vlans have their individual mcast snooping enabled by default, but this knob is needed to turn on bridge vlan snooping. It is disabled by default. To enable the knob, vlan filtering must also be enabled; it doesn't make sense to have vlan mcast snooping without vlan filtering, since that would lead to inconsistencies. Disabling vlan filtering will also automatically disable vlan mcast snooping.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: add vlan state initialization and control  (Nikolay Aleksandrov)

Add helpers to enable/disable vlan multicast based on its flags. We need two flags because we need to know if the vlan has multicast enabled globally (user-controlled) and if it has it enabled on the specific device (bridge or port). The new private vlan flags are:
 - BR_VLFLAG_MCAST_ENABLED: multicast locally enabled on the device, used when removing a vlan, toggling vlan mcast snooping and controlling a single vlan (kernel-controlled, valid under RTNL and multicast_lock)
 - BR_VLFLAG_GLOBAL_MCAST_ENABLED: multicast globally enabled for the vlan, used to control the bridge-wide vlan mcast snooping for a single vlan (user-controlled, can be checked under any context)

Bridge vlan contexts are created with multicast snooping enabled by default, to be in line with the current bridge snooping defaults. In order to actually activate per-vlan snooping and context usage, a bridge-wide knob will be added later, which will default to disabled. If that knob is enabled then all vlan snooping will be enabled automatically. All vlan contexts are initialized with the current bridge multicast context defaults.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: vlan: add global and per-port multicast context  (Nikolay Aleksandrov)

Add global and per-port vlan multicast contexts, only initialized but not yet used. No functional changes intended.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: use multicast contexts instead of bridge or port  (Nikolay Aleksandrov)

Pass multicast context pointers to the multicast functions instead of bridge/port pointers. This makes it easier to later switch these contexts to their per-vlan versions. The patch is basically search and replace; no functional changes.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: factor out bridge multicast context  (Nikolay Aleksandrov)

Factor out the bridge's global multicast context into a separate structure which will later be used for the per-vlan global context. No functional changes intended.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: multicast: factor out port multicast context  (Nikolay Aleksandrov)

Factor out the port's multicast context into a separate structure which will later be shared for the per-{port,vlan} context. No functional changes intended. We need the structure even when bridge multicast is not compiled in, to pass down as a pointer to the forwarding functions.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20  net: bridge: do not replay fdb entries pointing towards the bridge twice  (Vladimir Oltean)

This simple script:

    ip link add br0 type bridge
    ip link set swp2 master br0
    ip link set br0 address 00:01:02:03:04:05
    ip link del br0

produces this result on a DSA switch:

    [  421.306399] br0: port 1(swp2) entered blocking state
    [  421.311445] br0: port 1(swp2) entered disabled state
    [  421.472553] device swp2 entered promiscuous mode
    [  421.488986] device swp2 left promiscuous mode
    [  421.493508] br0: port 1(swp2) entered disabled state
    [  421.886107] sja1105 spi0.1: port 1 failed to delete 00:01:02:03:04:05 vid 1 from fdb: -ENOENT
    [  421.894374] sja1105 spi0.1: port 1 failed to delete 00:01:02:03:04:05 vid 0 from fdb: -ENOENT
    [  421.943982] br0: port 1(swp2) entered blocking state
    [  421.949030] br0: port 1(swp2) entered disabled state
    [  422.112504] device swp2 entered promiscuous mode

A very simplified view of what happens is:

(1) the bridge port is created, and the bridge device inherits its MAC address

(2) when joining, the bridge port (DSA) requests a replay of the addition of all FDB entries towards this bridge port and towards the bridge device itself. In fact, DSA calls br_fdb_replay() twice:

    br_fdb_replay(br, brport_dev);
    br_fdb_replay(br, br);

DSA uses reference counting for the FDB entries, so the MAC address of the bridge is simply kept with refcount 2. When the bridge port leaves under normal circumstances, everything cancels out, since the replay of the FDB entry deletion is also done twice per VLAN.

(3) when the bridge MAC address changes, switchdev is notified of the deletion of the old address and of the insertion of the new one. But the old address does not really go away, since it had refcount 2, and the new address is added "only" with refcount 1.

(4) when the bridge port now leaves, it replays a deletion of the FDB entries pointing towards the bridge twice. DSA then complains that it can't delete something that no longer exists.

It is clear that the problem is that the FDB entries towards the bridge are replayed too many times, so let's fix that.

Fixes: 63c51453c82c ("net: dsa: replay the local bridge FDB entries pointing to the bridge dev too")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20210719093916.4099032-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20  net/tcp_fastopen: remove tcp_fastopen_ctx_lock  (Eric Dumazet)

Remove the (per netns) spinlock in favor of xchg() atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Link: https://lore.kernel.org/r/20210719101107.3203943-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
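The lockless swap pattern the change describes looks roughly like this (a sketch, assuming the context is an RCU-managed per-netns pointer; helper names are illustrative):

    static void tcp_fastopen_swap_ctx(struct net *net,
                                      struct tcp_fastopen_context *new)
    {
            struct tcp_fastopen_context *old;

            /* xchg() atomically publishes the new context and returns
             * the old one; no spinlock is needed around the swap */
            old = unrcu_pointer(xchg(&net->ipv4.tcp_fastopen_ctx,
                                     RCU_INITIALIZER(new)));
            if (old)
                    call_rcu(&old->rcu, tcp_fastopen_ctx_free);
    }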
2021-07-20  ipv6: ip6_finish_output2: set sk into newly allocated nskb  (Vasily Averin)

skb_set_owner_w() should set sk on the new nskb, not on the old skb.

Fixes: 5796015fa968 ("ipv6: allocate enough headroom in ip6_finish_output2()")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Link: https://lore.kernel.org/r/70c0744f-89ae-1869-7e3e-4fa292158f4b@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
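The shape of the fix, sketched (variable names assumed; only the owner assignment is the point):

    nskb = skb_realloc_headroom(skb, hh_len);
    if (unlikely(!nskb))
            goto err;
    if (skb->sk)
            skb_set_owner_w(nskb, skb->sk);  /* attach sk to the NEW skb */
    consume_skb(skb);
    skb = nskb;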
2021-07-20  netlink: Deal with ESRCH error in nlmsg_notify()  (Yajun Deng)

Yonghong Song reported: The bpf selftest tc_bpf failed with latest bpf-next. The following is the command to run and the result:

    $ ./test_progs -n 132
    [   40.947571] bpf_testmod: loading out-of-tree module taints kernel.
    test_tc_bpf:PASS:test_tc_bpf__open_and_load 0 nsec
    test_tc_bpf:PASS:bpf_tc_hook_create(BPF_TC_INGRESS) 0 nsec
    test_tc_bpf:PASS:bpf_tc_hook_create invalid hook.attach_point 0 nsec
    test_tc_bpf_basic:PASS:bpf_obj_get_info_by_fd 0 nsec
    test_tc_bpf_basic:PASS:bpf_tc_attach 0 nsec
    test_tc_bpf_basic:PASS:handle set 0 nsec
    test_tc_bpf_basic:PASS:priority set 0 nsec
    test_tc_bpf_basic:PASS:prog_id set 0 nsec
    test_tc_bpf_basic:PASS:bpf_tc_attach replace mode 0 nsec
    test_tc_bpf_basic:PASS:bpf_tc_query 0 nsec
    test_tc_bpf_basic:PASS:handle set 0 nsec
    test_tc_bpf_basic:PASS:priority set 0 nsec
    test_tc_bpf_basic:PASS:prog_id set 0 nsec
    libbpf: Kernel error message: Failed to send filter delete notification
    test_tc_bpf_basic:FAIL:bpf_tc_detach unexpected error: -3 (errno 3)
    test_tc_bpf:FAIL:test_tc_internal ingress unexpected error: -3 (errno 3)

The failure seems due to commit cfdf0d9ae75b ("rtnetlink: use nlmsg_notify() in rtnetlink_send()"). Deal with the ESRCH error in nlmsg_notify() even when the report variable is zero.

Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Link: https://lore.kernel.org/r/20210719051816.11762-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
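A hedged sketch of what the fix amounts to inside nlmsg_notify(): -ESRCH from the multicast send just means "no listeners" and should not be propagated as an error (surrounding code abbreviated):

    if (group) {
            /* ... compute exclude_portid, take extra skb reference ... */
            err = nlmsg_multicast(sk, skb, exclude_portid, group, flags);
            /* -ESRCH only means "no listeners", not a real failure */
            if (err == -ESRCH)
                    err = 0;
    }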
2021-07-19  net/tcp_fastopen: fix data races around tfo_active_disable_stamp  (Eric Dumazet)

tfo_active_disable_stamp is read and written locklessly. We need to annotate these accesses appropriately. Then, we need to perform the atomic_inc(tfo_active_disable_times) after the timestamp has been updated, and thus add barriers to make sure tcp_fastopen_active_should_disable() won't read a stale timestamp.

Fixes: cf1ef3f0719b ("net/tcp_fastopen: Disable active side TFO in certain scenarios")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
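The resulting ordering, sketched (the writer publishes the stamp before bumping the counter; the reader pairs a read barrier with it; names follow the description above, the exact kernel code may differ):

    /* writer side (sketch) */
    WRITE_ONCE(net->ipv4.tfo_active_disable_stamp, jiffies);
    smp_mb__before_atomic();   /* order stamp write before the inc */
    atomic_inc(&net->ipv4.tfo_active_disable_times);

    /* reader side (sketch), in tcp_fastopen_active_should_disable() */
    tfo_da_times = atomic_read(&net->ipv4.tfo_active_disable_times);
    if (!tfo_da_times)
            return false;
    smp_rmb();                 /* pairs with the barrier above */
    stamp = READ_ONCE(net->ipv4.tfo_active_disable_stamp);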
2021-07-18  netrom: Decrease sock refcount when sock timers expire  (Nguyen Dinh Phi)

Commit 63346650c1a9 ("netrom: switch to sock timer API") switched to the sock timer API. It replaced mod_timer() with sk_reset_timer(), and del_timer() with sk_stop_timer(). sk_reset_timer() increases the refcount of the sock if it is called on an inactive timer; hence, when the timer expires, we need to decrease the refcount ourselves in the handler, otherwise the sock refcount is unbalanced and the sock will never be freed.

Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Reported-by: syzbot+10f1194569953b72f1ae@syzkaller.appspotmail.com
Fixes: 63346650c1a9 ("netrom: switch to sock timer API")
Signed-off-by: David S. Miller <davem@davemloft.net>
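In a timer handler this pairing looks roughly as follows (a sketch using the netrom heartbeat timer as an example; the protocol handling body is abbreviated):

    static void nr_heartbeat_expiry(struct timer_list *t)
    {
            struct sock *sk = from_timer(sk, t, sk_timer);

            bh_lock_sock(sk);
            /* ... protocol timeout handling ... */
            bh_unlock_sock(sk);

            /* drop the reference taken by sk_reset_timer() */
            sock_put(sk);
    }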
2021-07-18  sctp: trim optlen when it's a huge value in sctp_setsockopt  (Xin Long)

After commit ca84bd058dae ("sctp: copy the optval from user space in sctp_setsockopt"), memory is allocated in sctp_setsockopt() based on optlen, and the allocation fails and returns an error if the optlen from user space is a huge value. This breaks some sockopts, like SCTP_HMAC_IDENT, SCTP_RESET_STREAMS and SCTP_AUTH_KEY: when processing these sockopts before, a huge optlen would be trimmed to the biggest value needed, instead of failing the allocation and returning an error.

This patch fixes the allocation failure for a huge optlen from user space by trimming it, when necessary, to the biggest size any sctp sockopt may need. That biggest size comes from sctp_setsockopt_reset_streams() for SCTP_RESET_STREAMS, which is bigger than those for SCTP_HMAC_IDENT and SCTP_AUTH_KEY.

Fixes: ca84bd058dae ("sctp: copy the optval from user space in sctp_setsockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
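The trim described above can be expressed as a single clamp at the top of sctp_setsockopt(); a sketch (the exact bound in the kernel may be aligned differently):

    /* Trim optlen to the largest size any sctp sockopt may need; the
     * SCTP_RESET_STREAMS bound is the biggest of the affected options */
    optlen = min_t(unsigned int, optlen,
                   PAGE_ALIGN(USHRT_MAX +
                              sizeof(__u16) * sizeof(struct sctp_reset_streams)));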
2021-07-18  net: sched: fix memory leak in tcindex_partial_destroy_work  (Pavel Skripkin)

Syzbot reported a memory leak in tcindex_set_parms(). The problem was a non-freed perfect hash in tcindex_partial_destroy_work(). In tcindex_set_parms() a new tcindex_data is allocated and some fields from the old one are copied over, but not the perfect hash. Since tcindex_partial_destroy_work() is the destroy function for the old tcindex_data, we need to free the perfect hash there to avoid the memory leak.

Reported-and-tested-by: syzbot+f0bbb2287b8993d4fa74@syzkaller.appspotmail.com
Fixes: 331b72922c5f ("net: sched: RCU cls_tcindex")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-18  net: Fix zero-copy head len calculation.  (Pravin B Shelar)

In some cases the skb head can be locked and the entire header data pulled from the skb. When skb_zerocopy() is called in such cases, the following BUG is triggered. This patch fixes it by copying the entire skb in such cases. This could be optimized in case it becomes a performance bottleneck.

---8<---
    kernel BUG at net/core/skbuff.c:2961!
    invalid opcode: 0000 [#1] SMP PTI
    CPU: 2 PID: 0 Comm: swapper/2 Tainted: G OE 5.4.0-77-generic #86-Ubuntu
    Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014
    RIP: 0010:skb_zerocopy+0x37a/0x3a0
    RSP: 0018:ffffbcc70013ca38 EFLAGS: 00010246
    Call Trace:
     <IRQ>
     queue_userspace_packet+0x2af/0x5e0 [openvswitch]
     ovs_dp_upcall+0x3d/0x60 [openvswitch]
     ovs_dp_process_packet+0x125/0x150 [openvswitch]
     ovs_vport_receive+0x77/0xd0 [openvswitch]
     netdev_port_receive+0x87/0x130 [openvswitch]
     netdev_frame_hook+0x4b/0x60 [openvswitch]
     __netif_receive_skb_core+0x2b4/0xc90
     __netif_receive_skb_one_core+0x3f/0xa0
     __netif_receive_skb+0x18/0x60
     process_backlog+0xa9/0x160
     net_rx_action+0x142/0x390
     __do_softirq+0xe1/0x2d6
     irq_exit+0xae/0xb0
     do_IRQ+0x5a/0xf0
     common_interrupt+0xf/0xf

Code that triggered the BUG:

    int
    skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
    {
            int i, j = 0;
            int plen = 0; /* length of skb->head fragment */
            int ret;
            struct page *page;
            unsigned int offset;

            BUG_ON(!from->head_frag && !hlen);

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-17  igmp: Add ip_mc_list lock in ip_check_mc_rcu  (Liu Jian)

I got the below panic when doing fuzz testing:

    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 0 PID: 4056 Comm: syz-executor.3 Tainted: G B 5.14.0-rc1-00195-gcff5c4254439-dirty #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    Call Trace:
     dump_stack_lvl+0x7a/0x9b
     panic+0x2cd/0x5af
     end_report.cold+0x5a/0x5a
     kasan_report+0xec/0x110
     ip_check_mc_rcu+0x556/0x5d0
     __mkroute_output+0x895/0x1740
     ip_route_output_key_hash_rcu+0x2d0/0x1050
     ip_route_output_key_hash+0x182/0x2e0
     ip_route_output_flow+0x28/0x130
     udp_sendmsg+0x165d/0x2280
     udpv6_sendmsg+0x121e/0x24f0
     inet6_sendmsg+0xf7/0x140
     sock_sendmsg+0xe9/0x180
     ____sys_sendmsg+0x2b8/0x7a0
     ___sys_sendmsg+0xf0/0x160
     __sys_sendmmsg+0x17e/0x3c0
     __x64_sys_sendmmsg+0x9e/0x100
     do_syscall_64+0x3b/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x462eb9
    Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f3df5af1c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
    RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462eb9
    RDX: 0000000000000312 RSI: 0000000020001700 RDI: 0000000000000007
    RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f3df5af26bc
    R13: 00000000004c372d R14: 0000000000700b10 R15: 00000000ffffffff

It is a use-after-free in ip_check_mc_rcu. In ip_mc_del_src(), the ip_sf_list of pmc is freed under pmc->lock protection, but access to ip_sf_list in ip_check_mc_rcu() is not protected by this lock.

Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
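The fix the title describes is taking the same per-entry lock around the source-filter walk; a sketch of the locked section in ip_check_mc_rcu() (logic abbreviated and reconstructed from the description, not the exact kernel code):

    rv = 0;
    if (im) {
            spin_lock(&im->lock);
            if (src_addr) {
                    for (psf = im->sources; psf; psf = psf->sf_next)
                            if (psf->sf_inaddr == src_addr)
                                    break;
                    if (psf)
                            rv = psf->sf_count[MCAST_INCLUDE] ||
                                 psf->sf_count[MCAST_EXCLUDE] !=
                                 im->sfcount[MCAST_EXCLUDE];
                    else
                            rv = im->sfcount[MCAST_EXCLUDE] != 0;
            } else {
                    rv = 1; /* unspecified source; tentatively allow */
            }
            spin_unlock(&im->lock);
    }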
2021-07-16  tipc: keep the skb in rcv queue until the whole data is read  (Xin Long)

Currently, when userspace reads a datagram with a buffer that is smaller than the datagram, the data is truncated and only part of it can be received by the user. It doesn't seem right that users don't know the datagram size and have to use a huge buffer to read it to avoid truncation.

This patch fixes it by keeping the skb in the rcv queue until the whole datagram has been read by the user. Only the last msg of the datagram is marked with MSG_EOR, just as TCP/SCTP does. Note that this works as described only when MSG_EOR is set in the flags parameter of recvmsg(), so it won't break any old user applications.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
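From the application side the new behaviour is consumed like any MSG_EOR-aware protocol; a sketch of a reader that drains one datagram across several recvmsg() calls (illustrative userspace code, each chunk overwrites buf):

    #include <sys/socket.h>

    /* read one whole TIPC datagram, chunk by chunk (illustrative) */
    static ssize_t read_datagram(int sd, char *buf, size_t len)
    {
            struct iovec iov = { .iov_base = buf, .iov_len = len };
            struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
            ssize_t n, total = 0;

            do {
                    n = recvmsg(sd, &msg, 0);
                    if (n < 0)
                            return n;
                    total += n;
            } while (n > 0 && !(msg.msg_flags & MSG_EOR));  /* EOR = last chunk */

            return total;
    }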
2021-07-17  netfilter: nf_tables: fix audit memory leak in nf_tables_commit  (Dongliang Mu)

In nf_tables_commit, if nf_tables_commit_audit_alloc fails, it does not free the adp variable. Fix this by adding nf_tables_commit_audit_free, which frees the linked list with head node adl.

backtrace:
    kmalloc include/linux/slab.h:591 [inline]
    kzalloc include/linux/slab.h:721 [inline]
    nf_tables_commit_audit_alloc net/netfilter/nf_tables_api.c:8439 [inline]
    nf_tables_commit+0x16e/0x1760 net/netfilter/nf_tables_api.c:8508
    nfnetlink_rcv_batch+0x512/0xa80 net/netfilter/nfnetlink.c:562
    nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:634 [inline]
    nfnetlink_rcv+0x1fa/0x220 net/netfilter/nfnetlink.c:652
    netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
    netlink_unicast+0x2c7/0x3e0 net/netlink/af_netlink.c:1340
    netlink_sendmsg+0x36b/0x6b0 net/netlink/af_netlink.c:1929
    sock_sendmsg_nosec net/socket.c:702 [inline]
    sock_sendmsg+0x56/0x80 net/socket.c:722

Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: kernel test robot <lkp@intel.com>
Fixes: c520292f29b8 ("audit: log nftables configuration change events once per table")
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
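A sketch of what nf_tables_commit_audit_free() plausibly looks like, given that adl is the list head and the entries were kzalloc'd (names taken from the message above; the exact struct layout is assumed):

    static void nf_tables_commit_audit_free(struct list_head *adl)
    {
            struct nft_audit_data *adp, *adn;

            list_for_each_entry_safe(adp, adn, adl, list) {
                    list_del(&adp->list);
                    kfree(adp);
            }
    }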
2021-07-16  net: decnet: Fix sleeping inside in af_decnet  (Yajun Deng)

release_sock() is a blocking function; it can change the socket state after sleeping. Use wait_woken() instead.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
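The wait_woken() pattern replacing the sleep looks roughly like this (a sketch of a dn_wait_*-style loop; condition_met() is a hypothetical placeholder for the protocol state check, and the timeout handling is abbreviated):

    DEFINE_WAIT_FUNC(wait, woken_wake_function);

    add_wait_queue(sk_sleep(sk), &wait);
    while (!sock_flag(sk, SOCK_DEAD) && !condition_met(scp)) {
            /* sleeps without the release_sock()/lock_sock() dance */
            timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, timeo);
            if (signal_pending(current) || !timeo)
                    break;
    }
    remove_wait_queue(sk_sleep(sk), &wait);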
2021-07-16  bpf: Add ambient BPF runtime context stored in current  (Andrii Nakryiko)

b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") fixed the problem with cgroup-local storage use in BPF by pre-allocating a per-CPU array of 8 cgroup storage pointers to accommodate possible BPF program preemptions and nested executions.

While this seems to work well in practice, it introduces a new and unnecessary failure mode in which not all BPF programs might be executed if we fail to find an unused slot for cgroup storage, however unlikely that is. It might also not be so unlikely when/if we allow sleepable cgroup BPF programs in the future.

Further, the way that cgroup storage is implemented as an ambiently-available property during the entire BPF program execution is a convenient way to pass extra information to BPF programs and helpers without requiring user code to pass around extra arguments explicitly. So it would be good to have a generic solution that allows implementing this without arbitrary restrictions. Ideally, such a solution would work for both preemptable and sleepable BPF programs in exactly the same way.

This patch introduces such a solution, bpf_run_ctx. It adds one pointer field (bpf_ctx) to task_struct. This field is maintained by the BPF_PROG_RUN family of macros in such a way that it always stays valid throughout BPF program execution. BPF program preemption is handled by remembering the previous current->bpf_ctx value locally while executing a nested BPF program and restoring the old value after the nested BPF program finishes. This is handled by two helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are supposed to be used before and after a BPF program run, respectively.

Restoring the old value of the pointer handles preemption, while bpf_run_ctx being a property of the current task_struct naturally solves the problem for sleepable BPF programs by "following" BPF program execution as it is scheduled in and out of a CPU. It would even allow CPU migration of BPF programs, even though that is not currently allowed by the BPF infra.

This patch cleans up cgroup-local storage handling as a first application. The design itself is generic, though, with bpf_run_ctx being an empty struct that is supposed to be embedded into a specific struct for a given BPF program type (bpf_cg_run_ctx in this case). Follow-up patches are planned that will expand this mechanism to other uses within tracing BPF programs.

To verify that this change doesn't revert the fix to the original cgroup storage issue, I ran the same repro as in the original report ([0]) and didn't get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so the repro does work).

[0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/

Fixes: b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
2021-07-16  skbuff: Fix a potential race while recycling page_pool packets  (Ilias Apalodimas)

As Alexander points out, when we are trying to recycle a cloned/expanded SKB we might trigger a race. The recycling code relies on the pp_recycle bit to trigger, which we carry over to cloned SKBs. If that cloned SKB gets expanded, or if we get references to the frags, call skb_release_data() and overwrite skb->head, we are creating separate instances accessing the same page frags. Since skb_release_data() will first try to recycle the frags, there's a potential race between the original and the cloned SKB, since both will have the pp_recycle bit set.

Fix this by explicitly marking those SKBs as not recyclable. The atomic_sub_return() effectively limits us to a single release case: when we are calling skb_release_data() we are also releasing the option to perform the recycling, or releasing the pages from the page pool.

Fixes: 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16  openvswitch: Introduce per-cpu upcall dispatch  (Mark Gray)

The Open vSwitch kernel module uses the upcall mechanism to send packets from kernel space to user space when it misses in the kernel space flow table. The upcall sends packets via a Netlink socket. Currently, a Netlink socket is created for every vport, so there is a 1:1 mapping between a vport and a Netlink socket. When a packet is received by a vport, if it needs to be sent to user space, it is sent via the corresponding Netlink socket.

This mechanism, with various iterations of the corresponding user space code, has seen some limitations and issues:

 * On systems with a large number of vports, there is a correspondingly large number of Netlink sockets which can limit scaling. (https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
 * Packet reordering on upcalls. (https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
 * A thundering herd issue. (https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

This patch introduces an alternative, feature-negotiated, upcall mode using a per-cpu dispatch rather than a per-vport dispatch. In this mode, the Netlink socket to be used for the upcall is selected based on the CPU of the thread that is executing the upcall. In this way, it resolves the issues above:

 a) The number of Netlink sockets scales with the number of CPUs rather than the number of vports.
 b) Per-flow ordering is maintained as packets are distributed to CPUs based on mechanisms such as RSS, and flows are distributed to a single user space thread.
 c) Packets from a flow can only wake up one user space thread.

The corresponding user space code can be found at:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html

Bugzilla: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16  net/sched: Remove unnecessary if statement  (Yajun Deng)

The 'if (err > 0)' check is already handled inside rtnetlink_send() and rtnl_unicast(), so remove the unnecessary if statement here.

v2: use the original name rtnetlink_send().

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16  rtnetlink: use nlmsg_notify() in rtnetlink_send()  (Yajun Deng)

netlink_{broadcast,unicast} don't deal with the 'if (err > 0)' convention, but nlmsg_{multicast,unicast} do, and nlmsg_notify() contains them both. Use nlmsg_notify() instead, so that the caller doesn't have to deal with the 'if (err > 0)' handling.

v2: nlmsg_notify() does the job on its own.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller)

Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-07-15

The following pull-request contains BPF updates for your *net-next* tree.

We've added 45 non-merge commits during the last 15 day(s) which contain a total of 52 files changed, 3122 insertions(+), 384 deletions(-).

The main changes are:

1) Introduce bpf timers, from Alexei.
2) Add sockmap support for unix datagram socket, from Cong.
3) Fix potential memleak and UAF in the verifier, from He.
4) Add bpf_get_func_ip helper, from Jiri.
5) Improvements to generic XDP mode, from Kumar.
6) Support for passing xdp_md to XDP programs in bpf_prog_run, from Zvi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15  af_unix: Implement unix_dgram_bpf_recvmsg()  (Cong Wang)

We have to implement unix_dgram_bpf_recvmsg() to replace the original ->recvmsg() in order to retrieve skmsg from ingress_msg. AF_UNIX is again special here because of the lack of sk_prot->recvmsg(). I simply add a special case inside unix_dgram_recvmsg() to call sk->sk_prot->recvmsg() directly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-8-xiyou.wangcong@gmail.com
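The dispatch described could look roughly like this (a sketch; the guard against unix_proto avoids recursing into ourselves when no BPF proto has been installed, and __unix_dgram_recvmsg stands in for the original receive path):

    static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
                                  size_t size, int flags)
    {
            struct sock *sk = sock->sk;

    #ifdef CONFIG_BPF_SYSCALL
            const struct proto *prot = READ_ONCE(sk->sk_prot);

            /* sockmap may have replaced sk_prot; defer to its recvmsg() */
            if (prot != &unix_proto)
                    return prot->recvmsg(sk, msg, size,
                                         flags & MSG_DONTWAIT,
                                         flags & ~MSG_DONTWAIT, NULL);
    #endif
            return __unix_dgram_recvmsg(sk, msg, size, flags);
    }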
2021-07-15  af_unix: Implement ->psock_update_sk_prot()  (Cong Wang)

Now we can implement unix_bpf_update_proto() to update sk_prot, especially prot->close().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-7-xiyou.wangcong@gmail.com
2021-07-15  af_unix: Add a dummy ->close() for sockmap  (Cong Wang)

Unlike af_inet, unix_proto is very different: it does not even have a ->close(). We have to add a dummy implementation to satisfy sockmap. Normally it is just a nop; it is introduced only for sockmap to replace.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-6-xiyou.wangcong@gmail.com
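The dummy is literally empty; roughly:

    static void unix_close(struct sock *sk, long timeout)
    {
            /* Nothing to do here; AF_UNIX does not need a ->close().
             * It exists only so sockmap has something to override. */
    }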
2021-07-15  af_unix: Set TCP_ESTABLISHED for datagram sockets too  (Cong Wang)

Currently only the unix stream socket sets TCP_ESTABLISHED; datagram sockets can set this too when they connect to a peer socket. At least __ip4_datagram_connect() does the same. This will be used to determine whether an AF_UNIX datagram socket can be a redirect target in sockmap.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-5-xiyou.wangcong@gmail.com
2021-07-15  af_unix: Implement ->read_sock() for sockmap  (Cong Wang)

Implement ->read_sock() for AF_UNIX datagram sockets; it is pretty much the same as udp_read_sock().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-4-xiyou.wangcong@gmail.com
2021-07-15  sock_map: Lift socket state restriction for datagram sockets  (Cong Wang)

TCP and other connection-oriented sockets have accept() for each incoming connection on the server side; hence they can just insert those fd's from accept() into sockmap, and those are of course established. Now that datagram sockets begin to support sockmap and redirection, the restriction no longer applies to them, as they have no accept(). So we have to lift this restriction for them. This is fine, because inside bpf_sk_redirect_map() we still have another socket status check, sock_map_redirect_allowed(), as a guard. This also means they do not have to be removed from sockmap when disconnecting.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com
2021-07-15  sock_map: Relax config dependency to CONFIG_NET  (Cong Wang)

Currently sock_map still has a Kconfig dependency on CONFIG_INET, but there is no actual functional dependency on it after we introduced ->psock_update_sk_prot(). We have to extend it to CONFIG_NET now, as we are going to support AF_UNIX.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
2021-07-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller)

Andrii Nakryiko says:

====================
pull-request: bpf 2021-07-15

The following pull-request contains BPF updates for your *net* tree.

We've added 9 non-merge commits during the last 5 day(s) which contain a total of 9 files changed, 37 insertions(+), 15 deletions(-).

The main changes are:

1) Fix NULL pointer dereference in BPF_TEST_RUN for BPF_XDP_DEVMAP and BPF_XDP_CPUMAP programs, from Xuan Zhuo.
2) Fix use-after-free of net_device in XDP bpf_link, from Xuan Zhuo.
3) Follow-up fix to subprog poke descriptor use-after-free problem, from Daniel Borkmann and John Fastabend.
4) Fix out-of-range array access in s390 BPF JIT backend, from Colin Ian King.
5) Fix memory leak in BPF sockmap, from John Fastabend.
6) Fix for sockmap to prevent proc stats reporting bug, from John Fastabend and Jakub Sitnicki.
7) Fix NULL pointer dereference in bpftool, from Tobias Klauser.
8) AF_XDP documentation fixes, from Baruch Siach.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15  net: fix uninit-value in caif_seqpkt_sendmsg  (Ziyang Xuan)

When nr_segs is zero in iovec_from_user(), the object msg->msg_iter.iov is uninitialized stack memory in caif_seqpkt_sendmsg(), which is defined in ___sys_sendmsg(). So we can't just test msg->msg_iter.iov->base directly. We can use nr_segs to determine whether msg in caif_seqpkt_sendmsg() has any data buffers.

=====================================================
    BUG: KMSAN: uninit-value in caif_seqpkt_sendmsg+0x693/0xf60 net/caif/caif_socket.c:542
    Call Trace:
     __dump_stack lib/dump_stack.c:77 [inline]
     dump_stack+0x1c9/0x220 lib/dump_stack.c:118
     kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
     __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
     caif_seqpkt_sendmsg+0x693/0xf60 net/caif/caif_socket.c:542
     sock_sendmsg_nosec net/socket.c:652 [inline]
     sock_sendmsg net/socket.c:672 [inline]
     ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2343
     ___sys_sendmsg net/socket.c:2397 [inline]
     __sys_sendmmsg+0x808/0xc90 net/socket.c:2480
     __compat_sys_sendmmsg net/compat.c:656 [inline]

Reported-by: syzbot+09a5d591c1f98cf5efcb@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=1ace85e8fc9b0d5a45c08c2656c3e91762daa9b8
Fixes: bece7b2398d0 ("caif: Rewritten socket implementation")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
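A sketch of the guard this implies at the top of caif_seqpkt_sendmsg() (shape assumed from the description; the actual error path in the function may differ):

    /* msg_iter.iov is uninitialized when there are no segments,
     * so check nr_segs before dereferencing iov */
    if (unlikely(msg->msg_iter.nr_segs == 0) ||
        unlikely(msg->msg_iter.iov->iov_base == NULL))
            return -EINVAL;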
2021-07-15  bpf, sockmap, udp: sk_prot needs inuse_idx set for proc stats  (Jakub Sitnicki)

The proc socket stats use the sk_prot->inuse_idx value to record in-use sock stats. We currently do not set this correctly from the sockmap side, so reading sock stats from '/proc/net/sockstat' gives incorrect values. The socket counter is incremented correctly, but because we don't set the counter correctly when we replace sk_prot, we may omit the decrement.

To get the correct inuse_idx value, move the core_initcall that initializes the UDP proto handlers to late_initcall. This way it is initialized after UDP has had the chance to assign the inuse_idx value from the protocol registration.

Fixes: edc6741cc660 ("bpf: Add sockmap hooks for UDP sockets")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210714154750.528206-1-jakub@cloudflare.com
2021-07-15  bpf, sockmap, tcp: sk_prot needs inuse_idx set for proc stats  (John Fastabend)

The proc socket stats use the sk_prot->inuse_idx value to record in-use sock stats. We currently do not set this correctly from the sockmap side, so reading sock stats from '/proc/net/sockstat' gives incorrect values. The socket counter is incremented correctly, but because we don't set the counter correctly when we replace sk_prot, we may omit the decrement.

To get the correct inuse_idx value, move the core_initcall that initializes the TCP proto handlers to late_initcall. This way it is initialized after TCP has had the chance to assign the inuse_idx value from the protocol registration.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/bpf/20210712195546.423990-3-john.fastabend@gmail.com
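For both the UDP and TCP variants the mechanical change is just the initcall level; sketched (proto-building body elided):

    static int __init tcp_bpf_v4_build_proto(void)
    {
            /* ... build the sockmap proto variants ... */
            return 0;
    }
    /* was: core_initcall(tcp_bpf_v4_build_proto);
     * late_initcall() runs after inet_init() has registered the
     * protocol and assigned its inuse_idx */
    late_initcall(tcp_bpf_v4_build_proto);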
2021-07-15  bpf, sockmap: Fix potential memory leak on unlikely error case  (John Fastabend)

If skb_linearize() is needed and fails, we could leak a msg in the error handling. To fix this, ensure we kfree the msg block before returning the error. Found during code review.

Fixes: 4363023d2668e ("bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/bpf/20210712195546.423990-2-john.fastabend@gmail.com
2021-07-15  net_sched: introduce tracepoint trace_qdisc_enqueue()  (Qitao Xu)

Tracepoint trace_qdisc_enqueue() is introduced to trace skbs at the entrance of the TC layer on the TX side. It is similar to trace_qdisc_dequeue():

1. For both we only trace successful cases. The failure cases can be traced via trace_kfree_skb().

2. They are called at the entrance or exit of the TC layer, not for each ->enqueue() or ->dequeue(). This is intentional, because we want to make trace_qdisc_enqueue() symmetric to trace_qdisc_dequeue(), which is easier to use.

The return value of qdisc_enqueue() is not interesting here; Qdiscs can drop packets in ->dequeue(), so it is impossible to trace those drops even with the return value, and the only way to trace them is tracing kfree_skb().

We only add the information we need to the trace ring buffer. If any other information is needed, it is easy to extend it without breaking the ABI; see commit 3dd344ea84e1 ("net: tracepoint: exposing sk_family in all tcp:tracepoints").

Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Qitao Xu <qitao.xu@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
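A call-site sketch matching the "trace only on success" rule, assuming a small wrapper around the qdisc ->enqueue() in net/core/dev.c (the wrapper name and exact placement are assumptions):

    static int dev_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *q,
                                 struct sk_buff **to_free,
                                 struct netdev_queue *txq)
    {
            int rc;

            rc = q->enqueue(skb, q, to_free) & NET_XMIT_MASK;
            if (rc == NET_XMIT_SUCCESS)
                    trace_qdisc_enqueue(q, txq, skb);   /* success only */

            return rc;
    }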