summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-10-08vsock: Refactor vsock_*_getsockopt to resemble sock_getsockoptRichard Palethorpe
In preparation for sharing the implementation of sock_get_timeout. Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com> Cc: Richard Palethorpe <rpalethorpe@richiejp.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08net-sysfs: try not to restart the syscall if it will fail eventuallyAntoine Tenart
Due to deadlocks in the networking subsystem spotted 12 years ago[1], a workaround was put in place[2] to avoid taking the rtnl lock when it was not available and restarting the syscall (back to VFS, letting userspace spin). The following construction is found a lot in the net sysfs and sysctl code: if (!rtnl_trylock()) return restart_syscall(); This can be problematic when multiple userspace threads use such interfaces in a short period, making them to spin a lot. This happens for example when adding and moving virtual interfaces: userspace programs listening on events, such as systemd-udevd and NetworkManager, do trigger actions reading files in sysfs. It gets worse when a lot of virtual interfaces are created concurrently, say when creating containers at boot time. Returning early without hitting the above pattern when the syscall will fail eventually does make things better. While it is not a fix for the issue, it does ease things. [1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/ https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/ and https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/ [2] Rightfully, those deadlocks are *hard* to solve. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08net/sched: sch_ets: properly init all active DRR list handlesDavide Caratti
leaf classes of ETS qdiscs are served in strict priority or deficit round robin (DRR), depending on the value of 'nstrict'. Since this value can be changed while traffic is running, we need to be sure that the active list of DRR classes can be updated at any time, so: 1) call INIT_LIST_HEAD(&alist) on all leaf classes in .init(), before the first packet hits any of them. 2) ensure that 'alist' is not overwritten with zeros when a leaf class is no more strict priority nor DRR (i.e. array elements beyond 'nbands'). Link: https://lore.kernel.org/netdev/YS%2FoZ+f0Nr8eQkzH@dcaratti.users.ipa.redhat.com Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08mptcp: fix possible stall on recvmsg()Paolo Abeni
recvmsg() can enter an infinite loop if the caller provides the MSG_WAITALL, the data present in the receive queue is not sufficient to fulfill the request, and no more data is received by the peer. When the above happens, mptcp_wait_data() will always return with no wait, as the MPTCP_DATA_READY flag checked by such function is set and never cleared in such code path. Leveraging the above syzbot was able to trigger an RCU stall: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-...!: (10499 ticks this GP) idle=0af/1/0x4000000000000000 softirq=10678/10678 fqs=1 (t=10500 jiffies g=13089 q=109) rcu: rcu_preempt kthread starved for 10497 jiffies! g13089 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:R running task stack:28696 pid: 14 ppid: 2 flags:0x00004000 Call Trace: context_switch kernel/sched/core.c:4955 [inline] __schedule+0x940/0x26f0 kernel/sched/core.c:6236 schedule+0xd3/0x270 kernel/sched/core.c:6315 schedule_timeout+0x14a/0x2a0 kernel/time/timer.c:1881 rcu_gp_fqs_loop+0x186/0x810 kernel/rcu/tree.c:1955 rcu_gp_kthread+0x1de/0x320 kernel/rcu/tree.c:2128 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 rcu: Stack dump where RCU GP kthread last ran: Sending NMI from CPU 0 to CPUs 1: NMI backtrace for cpu 1 CPU: 1 PID: 8510 Comm: syz-executor827 Not tainted 5.15.0-rc2-next-20210920-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:bytes_is_nonzero mm/kasan/generic.c:84 [inline] RIP: 0010:memory_is_nonzero mm/kasan/generic.c:102 [inline] RIP: 0010:memory_is_poisoned_n mm/kasan/generic.c:128 [inline] RIP: 0010:memory_is_poisoned mm/kasan/generic.c:159 [inline] RIP: 0010:check_region_inline mm/kasan/generic.c:180 [inline] RIP: 0010:kasan_check_range+0xc8/0x180 mm/kasan/generic.c:189 Code: 38 00 74 ed 48 8d 50 08 eb 09 48 83 c0 01 48 39 d0 74 7a 80 38 00 74 f2 48 89 c2 b8 01 00 00 00 48 85 d2 75 56 5b 5d 41 5c c3 <48> 85 d2 74 5e 48 01 ea eb 09 48 83 c0 01 48 39 d0 74 50 80 38 00 RSP: 0018:ffffc9000cd676c8 EFLAGS: 00000283 RAX: ffffed100e9a110e RBX: ffffed100e9a110f RCX: ffffffff88ea062a RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff888074d08870 RBP: ffffed100e9a110e R08: 0000000000000001 R09: ffff888074d08877 R10: ffffed100e9a110e R11: 0000000000000000 R12: ffff888074d08000 R13: ffff888074d08000 R14: ffff888074d08088 R15: ffff888074d08000 FS: 0000555556d8e300(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000 S: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000180 CR3: 0000000068909000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: instrument_atomic_read_write include/linux/instrumented.h:101 [inline] test_and_clear_bit include/asm-generic/bitops/instrumented-atomic.h:83 [inline] mptcp_release_cb+0x14a/0x210 net/mptcp/protocol.c:3016 release_sock+0xb4/0x1b0 net/core/sock.c:3204 mptcp_wait_data net/mptcp/protocol.c:1770 [inline] mptcp_recvmsg+0xfd1/0x27b0 net/mptcp/protocol.c:2080 inet6_recvmsg+0x11b/0x5e0 net/ipv6/af_inet6.c:659 sock_recvmsg_nosec net/socket.c:944 [inline] ____sys_recvmsg+0x527/0x600 net/socket.c:2626 ___sys_recvmsg+0x127/0x200 net/socket.c:2670 do_recvmmsg+0x24d/0x6d0 net/socket.c:2764 __sys_recvmmsg net/socket.c:2843 [inline] __do_sys_recvmmsg net/socket.c:2866 [inline] __se_sys_recvmmsg net/socket.c:2859 [inline] __x64_sys_recvmmsg+0x20b/0x260 net/socket.c:2859 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc200d2dc39 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 41 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc5758e5a8 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc200d2dc39 RDX: 0000000000000002 RSI: 00000000200017c0 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000f0b5ff R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003 R13: 00007ffc5758e5d0 R14: 00007ffc5758e5c0 R15: 0000000000000003 Fix the issue by replacing the MPTCP_DATA_READY bit with direct inspection of the msk receive queue. Reported-and-tested-by: syzbot+3360da629681aa0d22fe@syzkaller.appspotmail.com Fixes: 7a6a6cbc3e59 ("mptcp: recvmsg() can drain data from multiple subflow") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08eth: platform: add a helper for loading netdev->dev_addrJakub Kicinski
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. There is a handful of drivers which pass netdev->dev_addr as the destination buffer to eth_platform_get_mac_address(). Add a helper which takes a dev pointer instead, so it can call an appropriate helper. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08ethernet: un-export nvmem_get_mac_address()Jakub Kicinski
nvmem_get_mac_address() is only called from of_net.c we don't need the export. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07Merge tag 'nfsd-5.15-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Bug fixes for NFSD error handling paths" * tag 'nfsd-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Keep existing listeners on portlist error SUNRPC: fix sign error causing rpcsec_gss drops nfsd: Fix a warning for nfsd_file_close_inode nfsd4: Handle the NFSv4 READDIR 'dircount' hint being zero nfsd: fix error handling of register_pernet_subsys() in init_nfsd()
2021-10-07netfilter: nft_dynset: relax superfluous check on set updatesPablo Neira Ayuso
Relax this condition to make add and update commands idempotent for sets with no timeout. The eval function already checks if the set element timeout is available and updates it if the update command is used. Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07ipvs: add sysctl_run_estimation to support disable estimationDust Li
estimation_timer will iterate the est_list to do estimation for each ipvs stats. When there are lots of services, the list can be very large. We found that estimation_timer() run for more then 200ms on a machine with 104 CPU and 50K services. yunhong-cgl jiang report the same phenomenon before: https://www.spinics.net/lists/lvs-devel/msg05426.html In some cases(for example a large K8S cluster with many ipvs services), ipvs estimation may not be needed. So adding a sysctl blob to allow users to disable this completely. Default is: 1 (enable) Cc: yunhong-cgl jiang <xintian1976@gmail.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: nf_tables: skip netdev events generated on netns removalFlorian Westphal
syzbot reported following (harmless) WARN: WARNING: CPU: 1 PID: 2648 at net/netfilter/core.c:468 nft_netdev_unregister_hooks net/netfilter/nf_tables_api.c:230 [inline] nf_tables_unregister_hook include/net/netfilter/nf_tables.h:1090 [inline] __nft_release_basechain+0x138/0x640 net/netfilter/nf_tables_api.c:9524 nft_netdev_event net/netfilter/nft_chain_filter.c:351 [inline] nf_tables_netdev_event+0x521/0x8a0 net/netfilter/nft_chain_filter.c:382 reproducer: unshare -n bash -c 'ip link add br0 type bridge; nft add table netdev t ; \ nft add chain netdev t ingress \{ type filter hook ingress device "br0" \ priority 0\; policy drop\; \}' Problem is that when netns device exit hooks create the UNREGISTER event, the .pre_exit hook for nf_tables core has already removed the base hook. Notifier attempts to do this again. The need to do base hook unregister unconditionally was needed in the past, because notifier was last stage where reg->dev dereference was safe. Now that nf_tables does the hook removal in .pre_exit, this isn't needed anymore. Reported-and-tested-by: syzbot+154bd5be532a63aa778b@syzkaller.appspotmail.com Fixes: 767d1216bff825 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: Kconfig: use 'default y' instead of 'm' for bool config optionVegard Nossum
This option, NF_CONNTRACK_SECMARK, is a bool, so it can never be 'm'. Fixes: 33b8e77605620 ("[NETFILTER]: Add CONFIG_NETFILTER_ADVANCED option") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: xt_IDLETIMER: fix panic that occurs when timer_type has garbage valueJuhee Kang
Currently, when the rule related to IDLETIMER is added, idletimer_tg timer structure is initialized by kmalloc on executing idletimer_tg_create function. However, in this process timer->timer_type is not defined to a specific value. Thus, timer->timer_type has garbage value and it occurs kernel panic. So, this commit fixes the panic by initializing timer->timer_type using kzalloc instead of kmalloc. Test commands: # iptables -A OUTPUT -j IDLETIMER --timeout 1 --label test $ cat /sys/class/xt_idletimer/timers/test Killed Splat looks like: BUG: KASAN: user-memory-access in alarm_expires_remaining+0x49/0x70 Read of size 8 at addr 0000002e8c7bc4c8 by task cat/917 CPU: 12 PID: 917 Comm: cat Not tainted 5.14.0+ #3 79940a339f71eb14fc81aee1757a20d5bf13eb0e Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: dump_stack_lvl+0x6e/0x9c kasan_report.cold+0x112/0x117 ? alarm_expires_remaining+0x49/0x70 __asan_load8+0x86/0xb0 alarm_expires_remaining+0x49/0x70 idletimer_tg_show+0xe5/0x19b [xt_IDLETIMER 11219304af9316a21bee5ba9d58f76a6b9bccc6d] dev_attr_show+0x3c/0x60 sysfs_kf_seq_show+0x11d/0x1f0 ? device_remove_bin_file+0x20/0x20 kernfs_seq_show+0xa4/0xb0 seq_read_iter+0x29c/0x750 kernfs_fop_read_iter+0x25a/0x2c0 ? __fsnotify_parent+0x3d1/0x570 ? iov_iter_init+0x70/0x90 new_sync_read+0x2a7/0x3d0 ? __x64_sys_llseek+0x230/0x230 ? rw_verify_area+0x81/0x150 vfs_read+0x17b/0x240 ksys_read+0xd9/0x180 ? vfs_write+0x460/0x460 ? do_syscall_64+0x16/0xc0 ? lockdep_hardirqs_on+0x79/0x120 __x64_sys_read+0x43/0x50 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f0cdc819142 Code: c0 e9 c2 fe ff ff 50 48 8d 3d 3a ca 0a 00 e8 f5 19 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 RSP: 002b:00007fff28eee5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f0cdc819142 RDX: 0000000000020000 RSI: 00007f0cdc032000 RDI: 0000000000000003 RBP: 00007f0cdc032000 R08: 00007f0cdc031010 R09: 0000000000000000 R10: 0000000000000022 R11: 0000000000000246 R12: 00005607e9ee31f0 R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 Fixes: 68983a354a65 ("netfilter: xtables: Add snapshot of hardidletimer target") Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07net: prefer socket bound to interface when not in VRFMike Manning
The commit 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF") modified compute_score() so that a device match is always made, not just in the case of an l3mdev skb, then increments the score also for unbound sockets. This ensures that sockets bound to an l3mdev are never selected when not in a VRF. But as unbound and bound sockets are now scored equally, this results in the last opened socket being selected if there are matches in the default VRF for an unbound socket and a socket bound to a dev that is not an l3mdev. However, handling prior to this commit was to always select the bound socket in this case. Reinstate this handling by incrementing the score only for bound sockets. The required isolation due to choosing between an unbound socket and a socket bound to an l3mdev remains in place due to the device match always being made. The same approach is taken for compute_score() for stream sockets. Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF") Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF") Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/cf0a8523-b362-1edf-ee78-eef63cbbb428@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-10-07 We've added 7 non-merge commits during the last 8 day(s) which contain a total of 8 files changed, 38 insertions(+), 21 deletions(-). The main changes are: 1) Fix ARM BPF JIT to preserve caller-saved regs for DIV/MOD JIT-internal helper call, from Johan Almbladh. 2) Fix integer overflow in BPF stack map element size calculation when used with preallocation, from Tatsuhiko Yasumatsu. 3) Fix an AF_UNIX regression due to added BPF sockmap support related to shutdown handling, from Jiang Wang. 4) Fix a segfault in libbpf when generating light skeletons from objects without BTF, from Kumar Kartikeya Dwivedi. 5) Fix a libbpf memory leak in strset to free the actual struct strset itself, from Andrii Nakryiko. 6) Dual-license bpf_insn.h similarly as we did for libbpf and bpftool, with ACKs from all contributors, from Luca Boccassi. ==================== Link: https://lore.kernel.org/r/20211007135010.21143-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07eth: fwnode: add a helper for loading netdev->dev_addrJakub Kicinski
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. There is a handful of drivers which pass netdev->dev_addr as the destination buffer to device_get_mac_address(). Add a helper which takes a dev pointer instead, so it can call an appropriate helper. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07eth: fwnode: remove the addr len from mac helpersJakub Kicinski
All callers pass in ETH_ALEN and the function itself will return -EINVAL for any other address length. Just assume it's ETH_ALEN like all other mac address helpers (nvm, of, platform). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07eth: fwnode: change the return type of mac address helpersJakub Kicinski
fwnode_get_mac_address() and device_get_mac_address() return a pointer to the buffer that was passed to them on success or NULL on failure. None of the callers care about the actual value, only if it's NULL or not. These semantics differ from of_get_mac_address() which returns an int so to avoid confusion make the device helpers return an errno. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07device property: move mac addr helpers to eth.cJakub Kicinski
Move the mac address helpers out, eth.c already contains a bunch of similar helpers. Suggested-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07of: net: add a helper for loading netdev->dev_addrJakub Kicinski
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. There are roughly 40 places where netdev->dev_addr is passed as the destination to a of_get_mac_address() call. Add a helper which takes a dev pointer instead, so it can call an appropriate helper. Note that of_get_mac_address() already assumes the address is 6 bytes long (ETH_ALEN) so use eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07of: net: move of_net under net/Jakub Kicinski
Rob suggests to move of_net.c from under drivers/of/ somewhere to the networking code. Suggested-by: Rob Herring <robh@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/David S. Miller
ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2021-10-07 1) Fix a sysbot reported shift-out-of-bounds in xfrm_get_default. From Pavel Skripkin. 2) Fix XFRM_MSG_MAPPING ABI breakage. The new XFRM_MSG_MAPPING messages were accidentally not paced at the end. Fix by Eugene Syromiatnikov. 3) Fix the uapi for the default policy, use explicit field and macros and make it accessible to userland. From Nicolas Dichtel. 4) Fix a missing rcu lock in xfrm_notify_userpolicy(). From Nicolas Dichtel. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-06ethtool: Add ability to control transceiver modules' power modeIdo Schimmel
Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver modules parameters and retrieve their status. The first parameter to control is the power mode of the module. It is only relevant for paged memory modules, as flat memory modules always operate in low power mode. When a paged memory module is in low power mode, its power consumption is reduced to the minimum, the management interface towards the host is available and the data path is deactivated. User space can choose to put modules that are not currently in use in low power mode and transition them to high power mode before putting the associated ports administratively up. This is useful for user space that favors reduced power consumption and lower temperatures over reduced link up times. In QSFP-DD modules the transition from low power mode to high power mode can take a few seconds and this transition is only expected to get longer with future / more complex modules. User space can control the power mode of the module via the power mode policy attribute ('ETHTOOL_A_MODULE_POWER_MODE_POLICY'). Possible values: * high: Module is always in high power mode. * auto: Module is transitioned by the host to high power mode when the first port using it is put administratively up and to low power mode when the last port using it is put administratively down. The operational power mode of the module is available to user space via the 'ETHTOOL_A_MODULE_POWER_MODE' attribute. The attribute is not reported to user space when a module is not plugged-in. The user API is designed to be generic enough so that it could be used for modules with different memory maps (e.g., SFF-8636, CMIS). The only implementation of the device driver API in this series is for a MAC driver (mlxsw) where the module is controlled by the device's firmware, but it is designed to be generic enough so that it could also be used by implementations where the module is controlled by the CPU. CMIS testing ============ # ethtool -m swp11 Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628)) ... Module State : 0x03 (ModuleReady) LowPwrAllowRequestHW : Off LowPwrRequestSW : Off The module is not in low power mode, as it is not forced by hardware (LowPwrAllowRequestHW is off) or by software (LowPwrRequestSW is off). The power mode can be queried from the kernel. In case LowPwrAllowRequestHW was on, the kernel would need to take into account the state of the LowPwrRequestHW signal, which is not visible to user space. $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high power-mode high Change the power mode policy to 'auto': # ethtool --set-module swp11 power-mode-policy auto Query the power mode again: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy auto power-mode low Verify with the data read from the EEPROM: # ethtool -m swp11 Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628)) ... Module State : 0x01 (ModuleLowPwr) LowPwrAllowRequestHW : Off LowPwrRequestSW : On Put the associated port administratively up which will instruct the host to transition the module to high power mode: # ip link set dev swp11 up Query the power mode again: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy auto power-mode high Verify with the data read from the EEPROM: # ethtool -m swp11 Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628)) ... Module State : 0x03 (ModuleReady) LowPwrAllowRequestHW : Off LowPwrRequestSW : Off Put the associated port administratively down which will instruct the host to transition the module to low power mode: # ip link set dev swp11 down Query the power mode again: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy auto power-mode low Verify with the data read from the EEPROM: # ethtool -m swp11 Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628)) ... Module State : 0x01 (ModuleLowPwr) LowPwrAllowRequestHW : Off LowPwrRequestSW : On SFF-8636 testing ================ # ethtool -m swp13 Identifier : 0x11 (QSFP28) ... Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) enabled Power set : Off Power override : On ... Transmit avg optical power (Channel 1) : 0.7733 mW / -1.12 dBm Transmit avg optical power (Channel 2) : 0.7649 mW / -1.16 dBm Transmit avg optical power (Channel 3) : 0.7790 mW / -1.08 dBm Transmit avg optical power (Channel 4) : 0.7837 mW / -1.06 dBm Rcvr signal avg optical power(Channel 1) : 0.9302 mW / -0.31 dBm Rcvr signal avg optical power(Channel 2) : 0.9079 mW / -0.42 dBm Rcvr signal avg optical power(Channel 3) : 0.8993 mW / -0.46 dBm Rcvr signal avg optical power(Channel 4) : 0.8778 mW / -0.57 dBm The module is not in low power mode, as it is not forced by hardware (Power override is on) or by software (Power set is off). The power mode can be queried from the kernel. In case Power override was off, the kernel would need to take into account the state of the LPMode signal, which is not visible to user space. $ ethtool --show-module swp13 Module parameters for swp13: power-mode-policy high power-mode high Change the power mode policy to 'auto': # ethtool --set-module swp13 power-mode-policy auto Query the power mode again: $ ethtool --show-module swp13 Module parameters for swp13: power-mode-policy auto power-mode low Verify with the data read from the EEPROM: # ethtool -m swp13 Identifier : 0x11 (QSFP28) Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) not enabled Power set : On Power override : On ... Transmit avg optical power (Channel 1) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 2) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 3) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 4) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 1) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 2) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 3) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 4) : 0.0000 mW / -inf dBm Put the associated port administratively up which will instruct the host to transition the module to high power mode: # ip link set dev swp13 up Query the power mode again: $ ethtool --show-module swp13 Module parameters for swp13: power-mode-policy auto power-mode high Verify with the data read from the EEPROM: # ethtool -m swp13 Identifier : 0x11 (QSFP28) ... Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) enabled Power set : Off Power override : On ... Transmit avg optical power (Channel 1) : 0.7934 mW / -1.01 dBm Transmit avg optical power (Channel 2) : 0.7859 mW / -1.05 dBm Transmit avg optical power (Channel 3) : 0.7885 mW / -1.03 dBm Transmit avg optical power (Channel 4) : 0.7985 mW / -0.98 dBm Rcvr signal avg optical power(Channel 1) : 0.9325 mW / -0.30 dBm Rcvr signal avg optical power(Channel 2) : 0.9034 mW / -0.44 dBm Rcvr signal avg optical power(Channel 3) : 0.9086 mW / -0.42 dBm Rcvr signal avg optical power(Channel 4) : 0.8885 mW / -0.51 dBm Put the associated port administratively down which will instruct the host to transition the module to low power mode: # ip link set dev swp13 down Query the power mode again: $ ethtool --show-module swp13 Module parameters for swp13: power-mode-policy auto power-mode low Verify with the data read from the EEPROM: # ethtool -m swp13 Identifier : 0x11 (QSFP28) ... Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) not enabled Power set : On Power override : On ... Transmit avg optical power (Channel 1) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 2) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 3) : 0.0000 mW / -inf dBm Transmit avg optical power (Channel 4) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 1) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 2) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 3) : 0.0000 mW / -inf dBm Rcvr signal avg optical power(Channel 4) : 0.0000 mW / -inf dBm Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-06rtnetlink: fix if_nlmsg_stats_size() under estimationEric Dumazet
rtnl_fill_statsinfo() is filling skb with one mandatory if_stats_msg structure. nlmsg_put(skb, pid, seq, type, sizeof(struct if_stats_msg), flags); But if_nlmsg_stats_size() never considered the needed storage. This bug did not show up because alloc_skb(X) allocates skb with extra tailroom, because of added alignments. This could very well be changed in the future to have deterministic behavior. Fixes: 10c9ead9f3c6 ("rtnetlink: add new RTM_GETSTATS message to dump link stats") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Roopa Prabhu <roopa@nvidia.com> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-06unix: Fix an issue in unix_shutdown causing the other end read/write failuresJiang Wang
Commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") sets unix domain socket peer state to TCP_CLOSE in unix_shutdown. This could happen when the local end is shutdown but the other end is not. Then, the other end will get read or write failures which is not expected. Fix the issue by setting the local state to shutdown. Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") Reported-by: Casey Schaufler <casey@schaufler-ca.com> Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211004232530.2377085-1-jiang.wang@bytedance.com
2021-10-05bpf: selftests: Add selftests for module kfunc supportKumar Kartikeya Dwivedi
This adds selftests that tests the success and failure path for modules kfuncs (in presence of invalid kfunc calls) for both libbpf and gen_loader. It also adds a prog_test kfunc_btf_id_list so that we can add module BTF ID set from bpf_testmod. This also introduces a couple of test cases to verifier selftests for validating whether we get an error or not depending on if invalid kfunc call remains after elimination of unreachable instructions. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211002011757.311265-10-memxor@gmail.com
2021-10-05bpf: Enable TCP congestion control kfunc from modulesKumar Kartikeya Dwivedi
This commit moves BTF ID lookup into the newly added registration helper, in a way that the bbr, cubic, and dctcp implementation set up their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not dependent on modules are looked up from the wrapper function. This lifts the restriction for them to be compiled as built in objects, and can be loaded as modules if required. Also modify Makefile.modfinal to call resolve_btfids for each module. Note that since kernel kfunc_ids never overlap with module kfunc_ids, we only match the owner for module btf id sets. See following commits for background on use of: CONFIG_X86 ifdef: 569c484f9995 (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86) CONFIG_DYNAMIC_FTRACE ifdef: 7aae231ac93b (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE) Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com
2021-10-05bpf: Introduce BPF support for kernel module function callsKumar Kartikeya Dwivedi
This change adds support on the kernel side to allow for BPF programs to call kernel module functions. Userspace will prepare an array of module BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter. In the kernel, the module BTFs are placed in the auxilliary struct for bpf_prog, and loaded as needed. The verifier then uses insn->off to index into the fd_array. insn->off 0 is reserved for vmlinux BTF (for backwards compat), so userspace must use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is sorted based on offset in an array, and each offset corresponds to one descriptor, with a max limit up to 256 such module BTFs. We also change existing kfunc_tab to distinguish each element based on imm, off pair as each such call will now be distinct. Another change is to check_kfunc_call callback, which now include a struct module * pointer, this is to be used in later patch such that the kfunc_id and module pointer are matched for dynamically registered BTF sets from loadable modules, so that same kfunc_id in two modules doesn't lead to check_kfunc_call succeeding. For the duration of the check_kfunc_call, the reference to struct module exists, as it returns the pointer stored in kfunc_btf_tab. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
2021-10-05Merge tag 'for-net-next-2021-10-01' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for MediaTek MT7922 and MT7921 - Enable support for AOSP extention in Qualcomm WCN399x and Realtek 8822C/8852A. - Add initial support for link quality and audio/codec offload. - Rework of sockets sendmsg to avoid locking issues. - Add vhci suspend/resume emulation. ==================== Link: https://lore.kernel.org/r/20211001230850.3635543-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-05netlink: annotate data races around nlk->boundEric Dumazet
While existing code is correct, KCSAN is reporting a data-race in netlink_insert / netlink_sendmsg [1] It is correct to read nlk->bound without a lock, as netlink_autobind() will acquire all needed locks. [1] BUG: KCSAN: data-race in netlink_insert / netlink_sendmsg write to 0xffff8881031c8b30 of 1 bytes by task 18752 on cpu 0: netlink_insert+0x5cc/0x7f0 net/netlink/af_netlink.c:597 netlink_autobind+0xa9/0x150 net/netlink/af_netlink.c:842 netlink_sendmsg+0x479/0x7c0 net/netlink/af_netlink.c:1892 sock_sendmsg_nosec net/socket.c:703 [inline] sock_sendmsg net/socket.c:723 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392 ___sys_sendmsg net/socket.c:2446 [inline] __sys_sendmsg+0x1ed/0x270 net/socket.c:2475 __do_sys_sendmsg net/socket.c:2484 [inline] __se_sys_sendmsg net/socket.c:2482 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2482 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff8881031c8b30 of 1 bytes by task 18751 on cpu 1: netlink_sendmsg+0x270/0x7c0 net/netlink/af_netlink.c:1891 sock_sendmsg_nosec net/socket.c:703 [inline] sock_sendmsg net/socket.c:723 [inline] __sys_sendto+0x2a8/0x370 net/socket.c:2019 __do_sys_sendto net/socket.c:2031 [inline] __se_sys_sendto net/socket.c:2027 [inline] __x64_sys_sendto+0x74/0x90 net/socket.c:2027 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00 -> 0x01 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 18751 Comm: syz-executor.0 Not tainted 5.14.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: da314c9923fe ("netlink: Replace rhash_portid with bound") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05netlink: remove netlink_broadcast_filteredFlorian Westphal
No users in tree since commit a3498436b3a0 ("netns: restrict uevents"), so remove this functionality. Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05net/sched: sch_taprio: properly cancel timer from taprio_destroy()Eric Dumazet
There is a comment in qdisc_create() about us not calling ops->reset() in some cases. err_out4: /* * Any broken qdiscs that would require a ops->reset() here? * The qdisc was never in action so it shouldn't be necessary. */ As taprio sets a timer before actually receiving a packet, we need to cancel it from ops->destroy, just in case ops->reset has not been called. syzbot reported: ODEBUG: free active (active state 0) object type: hrtimer hint: advance_sched+0x0/0x9a0 arch/x86/include/asm/atomic64_64.h:22 WARNING: CPU: 0 PID: 8441 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505 Modules linked in: CPU: 0 PID: 8441 Comm: syz-executor813 Not tainted 5.14.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505 Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 af 00 00 00 48 8b 14 dd e0 d3 e3 89 4c 89 ee 48 c7 c7 e0 c7 e3 89 e8 5b 86 11 05 <0f> 0b 83 05 85 03 92 09 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e c3 RSP: 0018:ffffc9000130f330 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000 RDX: ffff88802baeb880 RSI: ffffffff815d87b5 RDI: fffff52000261e58 RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff815d25ee R11: 0000000000000000 R12: ffffffff898dd020 R13: ffffffff89e3ce20 R14: ffffffff81653630 R15: dffffc0000000000 FS: 0000000000f0d300(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffb64b3e000 CR3: 0000000036557000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __debug_check_no_obj_freed lib/debugobjects.c:987 [inline] debug_check_no_obj_freed+0x301/0x420 lib/debugobjects.c:1018 slab_free_hook mm/slub.c:1603 [inline] slab_free_freelist_hook+0x171/0x240 mm/slub.c:1653 slab_free mm/slub.c:3213 [inline] kfree+0xe4/0x540 mm/slub.c:4267 qdisc_create+0xbcf/0x1320 net/sched/sch_api.c:1299 tc_modify_qdisc+0x4c8/0x1a60 net/sched/sch_api.c:1663 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5571 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:724 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2403 ___sys_sendmsg+0xf3/0x170 net/socket.c:2457 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2486 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 Fixes: 44d4775ca518 ("net/sched: sch_taprio: reset child qdiscs before freeing them") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Davide Caratti <dcaratti@redhat.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Acked-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05net: bridge: fix under estimation in br_get_linkxstats_size()Eric Dumazet
Commit de1799667b00 ("net: bridge: add STP xstats") added an additional nla_reserve_64bit() in br_fill_linkxstats(), but forgot to update br_get_linkxstats_size() accordingly. This can trigger the following in rtnl_stats_get() WARN_ON(err == -EMSGSIZE); Fixes: de1799667b00 ("net: bridge: add STP xstats") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05net: bridge: use nla_total_size_64bit() in br_get_linkxstats_size()Eric Dumazet
bridge_fill_linkxstats() is using nla_reserve_64bit(). We must use nla_total_size_64bit() instead of nla_total_size() for corresponding data structure. Fixes: 1080ab95e3c7 ("net: bridge: add support for IGMP/MLD stats and export them via netlink") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: Vivien Didelot <vivien.didelot@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04SUNRPC: Add trace event when alloc_pages_bulk() makes no progressChuck Lever
This is an operational low memory situation that needs to be flagged. The new tracepoint records a timestamp and the nfsd thread that failed to allocate pages. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04svcrdma: Split svcrmda_wc_{read,write} tracepointsChuck Lever
There are currently three separate purposes being served by single tracepoints. Split them up, as was done with wc_send. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04svcrdma: Split the svcrdma_wc_send() tracepointChuck Lever
There are currently three separate purposes being served by a single tracepoint here. They need to be split up. svcrdma_wc_send: - status is always zero, so there's no value in recording it. - vendor_err is meaningless unless status is not zero, so there's no value in recording it. - This tracepoint is needed only when developing modifications, so it should be left disabled most of the time. svcrdma_wc_send_flush: - As above, needed only rarely, and not an error. svcrdma_wc_send_err: - This tracepoint can be left persistently enabled because completion errors are run-time problems (except for FLUSHED_ERR). - Tracepoint name now ends in _err to reflect its purpose. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04svcrdma: Split the svcrdma_wc_receive() tracepointChuck Lever
There are currently three separate purposes being served by a single tracepoint here. They need to be split up. svcrdma_wc_recv: - status is always zero, so there's no value in recording it. - vendor_err is meaningless unless status is not zero, so there's no value in recording it. - This tracepoint is needed only when developing modifications, so it should be left disabled most of the time. svcrdma_wc_recv_flush: - As above, needed only rarely, and not an error. svcrdma_wc_recv_err: - received is always zero, so there's no value in recording it. - This tracepoint can be left enabled because completion errors are run-time problems (except for FLUSHED_ERR). - Tracepoint name now ends in _err to reflect its purpose. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04dsa: tag_dsa: Fix mask for trunked packetsAndrew Lunn
A packet received on a trunk will have bit 2 set in Forward DSA tagged frame. Bit 1 can be either 0 or 1 and is otherwise undefined and bit 0 indicates the frame CFI. Masking with 7 thus results in frames as being identified as being from a trunk when in fact they are not. Fix the mask to just look at bit 2. Fixes: 5b60dadb71db ("net: dsa: tag_dsa: Support reception of packets from LAG devices") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04net: ipv6: fix use after free of struct seg6_pernet_dataMichelleJin
sdata->tun_src should be freed before sdata is freed because sdata->tun_src is allocated after sdata allocation. So, kfree(sdata) and kfree(rcu_dereference_raw(sdata->tun_src)) are changed code order. Fixes: f04ed7d277e8 ("net: ipv6: check return value of rhashtable_init") Signed-off-by: MichelleJin <shjy180909@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04ipv6: ioam: Add support for the ip6ip6 encapsulationJustin Iurman
This patch adds support for the ip6ip6 encapsulation by providing three encap modes: inline, encap and auto. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04ipv6: ioam: Prerequisite patch for ioam6_iptunnelJustin Iurman
This prerequisite patch provides some minor edits (alignments, renames) and a minor modification inside a function to facilitate the next patch by using existing nla_* functions. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04ipv6: ioam: Distinguish input and output for hop-limitJustin Iurman
This patch anticipates the support for the IOAM insertion inside in-transit packets, by making a difference between input and output in order to determine the right value for its hop-limit (inherited from the IPv6 hop-limit). Input case: happens before ip6_forward, the IPv6 hop-limit is not decremented yet -> decrement the IOAM hop-limit to reflect the new hop inside the trace. Output case: happens after ip6_forward, the IPv6 hop-limit has already been decremented -> keep the same value for the IOAM hop-limit. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-03SUNRPC: xprt_clear_locked() only needs release memory semanticsTrond Myklebust
The clearing of the XPRT_LOCKED bit has to happen after we clear xprt->snd_task, but we don't require any extra memory barriers after that. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03SUNRPC: Remove WQ_HIGHPRI from xprtiodTrond Myklebust
Don't let xprtiod pre-empt softirq. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03SUNRPC: Add cond_resched() at the appropriate point in __rpc_execute()Trond Myklebust
Allow tasks that need to pre-empt rpciod/xprtiod to do so when it is safe. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03SUNRPC: Partial revert of commit 6f9f17287e78Trond Myklebust
The premise of commit 6f9f17287e78 ("SUNRPC: Mitigate cond_resched() in xprt_transmit()") was that cond_resched() is expensive and unnecessary when there has been just a single send. The point of cond_resched() is to ensure that tasks that should pre-empt this one get a chance to do so when it is safe to do so. The code prior to commit 6f9f17287e78 failed to take into account that it was keeping a rpc_task pinned for longer than it needed to, and so rather than doing a full revert, let's just move the cond_resched. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03mctp: Add input reassembly testsJeremy Kerr
Add multi-packet route input tests, for message reassembly. These will feed packets to be received by a bound socket, or dropped. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-03mctp: Add route input to socket testsJeremy Kerr
Add a few tests for single-packet route inputs, testing the mctp_route_input function. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-03mctp: Add packet rx testsJeremy Kerr
Add a few tests for the initial packet ingress through mctp_pkttype_receive function; mainly packet header sanity checks. Full input routing checks will be added as a separate change. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>