summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-01-06Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-01-06 We've added 41 non-merge commits during the last 2 day(s) which contain a total of 36 files changed, 1214 insertions(+), 368 deletions(-). The main changes are: 1) Various fixes in the verifier, from Kris and Daniel. 2) Fixes in sockmap, from John. 3) bpf_getsockopt fix, from Kuniyuki. 4) INET_POST_BIND fix, from Menglong. 5) arm64 JIT fix for bpf pseudo funcs, from Hou. 6) BPF ISA doc improvements, from Christoph. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits) bpf: selftests: Add bind retry for post_bind{4, 6} bpf: selftests: Use C99 initializers in test_sock.c net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() bpf/selftests: Test bpf_d_path on rdonly_mem. libbpf: Add documentation for bpf_map batch operations selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3 xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames xdp: Move conversion to xdp_frame out of map functions page_pool: Store the XDP mem id page_pool: Add callback to init pages when they are allocated xdp: Allow registering memory model without rxq reference samples/bpf: xdpsock: Add timestamp for Tx-only operation samples/bpf: xdpsock: Add time-out for cleaning Tx samples/bpf: xdpsock: Add sched policy and priority support samples/bpf: xdpsock: Add cyclic TX operation capability samples/bpf: xdpsock: Add clockid selection support samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation samples/bpf: xdpsock: Add VLAN support for Tx-only operation libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API libbpf 1.0: Deprecate bpf_map__is_offload_neutral() ... ==================== Link: https://lore.kernel.org/r/20220107013626.53943-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()Menglong Dong
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06ethtool: use phydev variableTom Rix
In ethtool_get_phy_stats(), the phydev varaible is set to dev->phydev but dev->phydev is still used. Replace dev->phydev uses with phydev. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06gro: add ability to control gro max packet sizeCoco Li
Eric Dumazet suggested to allow users to modify max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some work arounds in sch_cake, to split GRO/GSO packets. Instead of disabling GRO completely, one can chose to limit the maximum packet size of GRO packets, depending on their latency constraints. This patch adds a per device gro_max_size attribute that can be changed with ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: fix SOF_TIMESTAMPING_BIND_PHC to work with multiple socketsMiroslav Lichvar
When multiple sockets using the SOF_TIMESTAMPING_BIND_PHC flag received a packet with a hardware timestamp (e.g. multiple PTP instances in different PTP domains using the UDPv4/v6 multicast or L2 transport), the timestamps received on some sockets were corrupted due to repeated conversion of the same timestamp (by the same or different vclocks). Fix ptp_convert_timestamp() to not modify the shared skb timestamp and return the converted timestamp as a ktime_t instead. If the conversion fails, return 0 to not confuse the application with timestamps corresponding to an unexpected PHC. Fixes: d7c088265588 ("net: socket: support hardware timestamp conversion to PHC bound") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: setup master before portsVladimir Oltean
It is said that as soon as a network interface is registered, all its resources should have already been prepared, so that it is available for sending and receiving traffic. One of the resources needed by a DSA slave interface is the master. dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> register_netdevice -> dsa_tree_setup_master -> dsa_master_setup -> sets up master->dsa_ptr, which enables reception Therefore, there is a short period of time after register_netdevice() during which the master isn't prepared to pass traffic to the DSA layer (master->dsa_ptr is checked by eth_type_trans). Same thing during unregistration, there is a time frame in which packets might be missed. Note that this change opens us to another race: dsa_master_find_slave() will get invoked potentially earlier than the slave creation, and later than the slave deletion. Since dp->slave starts off as a NULL pointer, the earlier calls aren't a problem, but the later calls are. To avoid use-after-free, we should zeroize dp->slave before calling dsa_slave_destroy(). In practice I cannot really test real life improvements brought by this change, since in my systems, netdevice creation races with PHY autoneg which takes a few seconds to complete, and that masks quite a few races. Effects might be noticeable in a setup with fixed links all the way to an external system. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: first set up shared ports, then non-shared portsVladimir Oltean
After commit a57d8c217aad ("net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports"), the port setup and teardown procedure became asymmetric. The fact of the matter is that user ports need the shared ports to be up before they can be used for CPU-initiated termination. And since we register net devices for the user ports, those won't be functional until we also call the setup for the shared (CPU, DSA) ports. But we may do that later, depending on the port numbering scheme of the hardware we are dealing with. It just makes sense that all shared ports are brought up before any user port is. I can't pinpoint any issue due to the current behavior, but let's change it nonetheless, for consistency's sake. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}Vladimir Oltean
DSA needs to simulate master tracking events when a binding is first with a DSA master established and torn down, in order to give drivers the simplifying guarantee that ->master_state_change calls are made only when the master's readiness state to pass traffic changes. master_state_change() provide a operational bool that DSA driver can use to understand if DSA master is operational or not. To avoid races, we need to block the reception of NETDEV_UP/NETDEV_CHANGE/NETDEV_GOING_DOWN events in the netdev notifier chain while we are changing the master's dev->dsa_ptr (this changes what netdev_uses_dsa(dev) reports). The dsa_master_setup() and dsa_master_teardown() functions optionally require the rtnl_mutex to be held, if the tagger needs the master to be promiscuous, these functions call dev_set_promiscuity(). Move the rtnl_lock() from that function and make it top-level. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: stop updating master MTU from master.cVladimir Oltean
At present there are two paths for changing the MTU of the DSA master. The first is: dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> dsa_slave_change_mtu -> dev_set_mtu(master) The second is: dsa_tree_setup -> dsa_tree_setup_master -> dsa_master_setup -> dev_set_mtu(dev) So the dev_set_mtu() call from dsa_master_setup() has been effectively superseded by the dsa_slave_change_mtu(slave_dev, ETH_DATA_LEN) that is done from dsa_slave_create() for each user port. The later function also updates the master MTU according to the largest user port MTU from the tree. Therefore, updating the master MTU through a separate code path isn't needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: merge rtnl_lock sections in dsa_slave_createVladimir Oltean
Currently dsa_slave_create() has two sequences of rtnl_lock/rtnl_unlock in a row. Remove the rtnl_unlock() and rtnl_lock() in between, such that the operation can execute slighly faster. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: reorder PHY initialization with MTU setup in slave.cVladimir Oltean
In dsa_slave_create() there are 2 sections that take rtnl_lock(): MTU change and netdev registration. They are separated by PHY initialization. There isn't any strict ordering requirement except for the fact that netdev registration should be last. Therefore, we can perform the MTU change a bit later, after the PHY setup. A future change will then be able to merge the two rtnl_lock sections into one. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-01-06 1) Fix some clang_analyzer warnings about never read variables. From luo penghao. 2) Check for pols[0] only once in xfrm_expand_policies(). From Jean Sacren. 3) The SA curlft.use_time was updated only on SA cration time. Update whenever the SA is used. From Antony Antony 4) Add support for SM3 secure hash. From Xu Jia. 5) Add support for SM4 symmetric cipher algorithm. From Xu Jia. 6) Add a rate limit for SA mapping change messages. From Antony Antony. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05xdp: Add xdp_do_redirect_frame() for pre-computed xdp_framesToke Høiland-Jørgensen
Add an xdp_do_redirect_frame() variant which supports pre-computed xdp_frame structures. This will be used in bpf_prog_run() to avoid having to write to the xdp_frame structure when the XDP program doesn't modify the frame boundaries. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-6-toke@redhat.com
2022-01-05xdp: Move conversion to xdp_frame out of map functionsToke Høiland-Jørgensen
All map redirect functions except XSK maps convert xdp_buff to xdp_frame before enqueueing it. So move this conversion of out the map functions and into xdp_do_redirect(). This removes a bit of duplicated code, but more importantly it makes it possible to support caller-allocated xdp_frame structures, which will be added in a subsequent commit. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-5-toke@redhat.com
2022-01-05page_pool: Store the XDP mem idToke Høiland-Jørgensen
Store the XDP mem ID inside the page_pool struct so it can be retrieved later for use in bpf_prog_run(). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-4-toke@redhat.com
2022-01-05page_pool: Add callback to init pages when they are allocatedToke Høiland-Jørgensen
Add a new callback function to page_pool that, if set, will be called every time a new page is allocated. This will be used from bpf_test_run() to initialise the page data with the data provided by userspace when running XDP programs with redirect turned on. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220103150812.87914-3-toke@redhat.com
2022-01-05xdp: Allow registering memory model without rxq referenceToke Høiland-Jørgensen
The functions that register an XDP memory model take a struct xdp_rxq as parameter, but the RXQ is not actually used for anything other than pulling out the struct xdp_mem_info that it embeds. So refactor the register functions and export variants that just take a pointer to the xdp_mem_info. This is in preparation for enabling XDP_REDIRECT in bpf_prog_run(), using a page_pool instance that is not connected to any network device. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-2-toke@redhat.com
2022-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05bpf: Add SO_RCVBUF/SO_SNDBUF in _bpf_getsockopt().Kuniyuki Iwashima
This patch exposes SO_RCVBUF/SO_SNDBUF through bpf_getsockopt(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220104013153.97906-3-kuniyu@amazon.co.jp
2022-01-05bpf: Fix SO_RCVBUF/SO_SNDBUF handling in _bpf_setsockopt().Kuniyuki Iwashima
The commit 4057765f2dee ("sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF values") added a change to prevent underflow in setsockopt() around SO_SNDBUF/SO_RCVBUF. This patch adds the same change to _bpf_setsockopt(). Fixes: 4057765f2dee ("sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF values") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220104013153.97906-2-kuniyu@amazon.co.jp
2022-01-05bpf, sockmap: Fix double bpf_prog_put on error case in map_linkJohn Fastabend
sock_map_link() is called to update a sockmap entry with a sk. But, if the sock_map_init_proto() call fails then we return an error to the map_update op against the sockmap. In the error path though we need to cleanup psock and dec the refcnt on any programs associated with the map, because we refcnt them early in the update process to ensure they are pinned for the psock. (This avoids a race where user deletes programs while also updating the map with new socks.) In current code we do the prog refcnt dec explicitely by calling bpf_prog_put() when the program was found in the map. But, after commit '38207a5e81230' in this error path we've already done the prog to psock assignment so the programs have a reference from the psock as well. This then causes the psock tear down logic, invoked by sk_psock_put() in the error path, to similarly call bpf_prog_put on the programs there. To be explicit this logic does the prog->psock assignment: if (msg_*) psock_set_prog(...) Then the error path under the out_progs label does a similar check and dec with: if (msg_*) bpf_prog_put(...) And the teardown logic sk_psock_put() does ... psock_set_prog(msg_*, NULL) ... triggering another bpf_prog_put(...). Then KASAN gives us this splat, found by syzbot because we've created an inbalance between bpf_prog_inc and bpf_prog_put calling put twice on the program. BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline] BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline] kernel/bpf/syscall.c:1829 BUG: KASAN: vmalloc-out-of-bounds in bpf_prog_put+0x8c/0x4f0 kernel/bpf/syscall.c:1829 kernel/bpf/syscall.c:1829 Read of size 8 at addr ffffc90000e76038 by task syz-executor020/3641 To fix clean up error path so it doesn't try to do the bpf_prog_put in the error path once progs are assigned then it relies on the normal psock tear down logic to do complete cleanup. For completness we also cover the case whereh sk_psock_init_strp() fails, but this is not expected because it indicates an incorrect socket type and should be caught earlier. Fixes: 38207a5e8123 ("bpf, sockmap: Attach map progs to psock early for feature probes") Reported-by: syzbot+bb73e71cf4b8fd376a4f@syzkaller.appspotmail.com Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220104214645.290900-1-john.fastabend@gmail.com
2022-01-05bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()John Fastabend
Applications can be confused slightly because we do not always return the same error code as expected, e.g. what the TCP stack normally returns. For example on a sock err sk->sk_err instead of returning the sock_error we return EAGAIN. This usually means the application will 'try again' instead of aborting immediately. Another example, when a shutdown event is received we should immediately abort instead of waiting for data when the user provides a timeout. These tend to not be fatal, applications usually recover, but introduces bogus errors to the user or introduces unexpected latency. Before 'c5d2177a72a16' we fell back to the TCP stack when no data was available so we managed to catch many of the cases here, although with the extra latency cost of calling tcp_msg_wait_data() first. To fix lets duplicate the error handling in TCP stack into tcp_bpf so that we get the same error codes. These were found in our CI tests that run applications against sockmap and do longer lived testing, at least compared to test_sockmap that does short-lived ping/pong tests, and in some of our test clusters we deploy. Its non-trivial to do these in a shorter form CI tests that would be appropriate for BPF selftests, but we are looking into it so we can ensure this keeps working going forward. As a preview one idea is to pull in the packetdrill testing which catches some of this. Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com
2022-01-05netlink: do not allocate a device refcount tracker in ethnl_default_notify()Eric Dumazet
As reported by Johannes, the tracker allocated in ethnl_default_notify() is not really needed, as this function is not expected to change a device reference count. Fixes: e4b8954074f6 ("netlink: add net device refcount tracker to struct ethnl_req_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Johannes Berg <johannes@sipsolutions.net> Tested-by: Johannes Berg <johannes@sipsolutions.net> Link: https://lore.kernel.org/r/20220105170849.2610470-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net/sched: add missing tracker information in qdisc_create()Eric Dumazet
qdisc_create() error path needs to use dev_put_track() because qdisc_alloc() allocated the tracker. Fixes: 606509f27f67 ("net/sched: add net device refcount tracker to struct Qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220104170439.3790052-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-05net: dsa: remove cross-chip support for HSRVladimir Oltean
The cross-chip notifiers for HSR are bypass operations, meaning that even though all switches in a tree are notified, only the switch specified in the info structure is targeted. We can eliminate the unnecessary complexity by deleting the cross-chip notifier logic and calling the ds->ops straight from port.c. Cc: George McCollister <george.mccollister@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: remove cross-chip support for MRPVladimir Oltean
The cross-chip notifiers for MRP are bypass operations, meaning that even though all switches in a tree are notified, only the switch specified in the info structure is targeted. We can eliminate the unnecessary complexity by deleting the cross-chip notifier logic and calling the ds->ops straight from port.c. Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: fix incorrect function pointer check for MRP ring rolesVladimir Oltean
The cross-chip notifier boilerplate code meant to check the presence of ds->ops->port_mrp_add_ring_role before calling it, but checked ds->ops->port_mrp_add instead, before calling ds->ops->port_mrp_add_ring_role. Therefore, a driver which implements one operation but not the other would trigger a NULL pointer dereference. There isn't any such driver in DSA yet, so there is no reason to backport the change. Issue found through code inspection. Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Fixes: c595c4330da0 ("net: dsa: add MRP support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05net: dsa: make dsa_switch :: num_ports an unsigned intVladimir Oltean
Currently, num_ports is declared as size_t, which is defined as __kernel_ulong_t, therefore it occupies 8 bytes of memory. Even switches with port numbers in the range of tens are exotic, so there is no need for this amount of storage. Additionally, because the max_num_bridges member right above it is also 4 bytes, it means the compiler needs to add padding between the last 2 fields. By reducing the size, we don't need that padding and can reduce the struct size. Before: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ /* XXX 4 bytes hole, try to pack */ size_t num_ports; /* 136 8 */ /* size: 144, cachelines: 3, members: 27 */ /* sum members: 132, holes: 2, sum holes: 8 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 16 bytes */ }; After: pahole -C dsa_switch net/dsa/slave.o struct dsa_switch { struct device * dev; /* 0 8 */ struct dsa_switch_tree * dst; /* 8 8 */ unsigned int index; /* 16 4 */ u32 setup:1; /* 20: 0 4 */ u32 vlan_filtering_is_global:1; /* 20: 1 4 */ u32 needs_standalone_vlan_filtering:1; /* 20: 2 4 */ u32 configure_vlan_while_not_filtering:1; /* 20: 3 4 */ u32 untag_bridge_pvid:1; /* 20: 4 4 */ u32 assisted_learning_on_cpu_port:1; /* 20: 5 4 */ u32 vlan_filtering:1; /* 20: 6 4 */ u32 pcs_poll:1; /* 20: 7 4 */ u32 mtu_enforcement_ingress:1; /* 20: 8 4 */ /* XXX 23 bits hole, try to pack */ struct notifier_block nb; /* 24 24 */ /* XXX last struct has 4 bytes of padding */ void * priv; /* 48 8 */ void * tagger_data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dsa_chip_data * cd; /* 64 8 */ const struct dsa_switch_ops * ops; /* 72 8 */ u32 phys_mii_mask; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ struct mii_bus * slave_mii_bus; /* 88 8 */ unsigned int ageing_time_min; /* 96 4 */ unsigned int ageing_time_max; /* 100 4 */ struct dsa_8021q_context * tag_8021q_ctx; /* 104 8 */ struct devlink * devlink; /* 112 8 */ unsigned int num_tx_queues; /* 120 4 */ unsigned int num_lag_ids; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int max_num_bridges; /* 128 4 */ unsigned int num_ports; /* 132 4 */ /* size: 136, cachelines: 3, members: 27 */ /* sum members: 128, holes: 1, sum holes: 4 */ /* sum bitfield members: 9 bits, bit holes: 1, sum bit holes: 23 bits */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 8 bytes */ }; Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04Merge tag 'mac80211-for-net-2022-01-04' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two more changes: - mac80211: initialize a variable to avoid using it uninitialized - mac80211 mesh: put some data structures into the container to fix bugs with and not have to deal with allocation failures * tag 'mac80211-for-net-2022-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211: mac80211: mesh: embedd mesh_paths and mpp_paths into ieee80211_if_mesh mac80211: initialize variable have_higher_than_11mbit ==================== Link: https://lore.kernel.org/r/20220104144449.64937-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-04mac80211: use ieee80211_bss_get_elem()Johannes Berg
Instead of ieee80211_bss_get_ie(), use the more typed ieee80211_bss_get_elem(). Link: https://lore.kernel.org/r/20211220113609.56f8e2a70152.Id5a56afb8a4f9b38d10445e5a1874e93e84b5251@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: Add stations iterator where the iterator function may sleepMartin Blumenstingl
ieee80211_iterate_active_interfaces() and ieee80211_iterate_active_interfaces_atomic() already exist, where the former allows the iterator function to sleep. Add ieee80211_iterate_stations() which is similar to ieee80211_iterate_stations_atomic() but allows the iterator to sleep. This is needed for adding SDIO support to the rtw88 driver. Some interators there are reading or writing registers. With the SDIO ops (sdio_readb, sdio_writeb and friends) this means that the iterator function may sleep. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Link: https://lore.kernel.org/r/20211228211501.468981-2-martin.blumenstingl@googlemail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: allow non-standard VHT MCS-10/11Ping-Ke Shih
Some AP can possibly try non-standard VHT rate and mac80211 warns and drops packets, and leads low TCP throughput. Rate marked as a VHT rate but data is invalid: MCS: 10, NSS: 2 WARNING: CPU: 1 PID: 7817 at net/mac80211/rx.c:4856 ieee80211_rx_list+0x223/0x2f0 [mac8021 Since commit c27aa56a72b8 ("cfg80211: add VHT rate entries for MCS-10 and MCS-11") has added, mac80211 adds this support as well. After this patch, throughput is good and iw can get the bitrate: rx bitrate: 975.1 MBit/s VHT-MCS 10 80MHz short GI VHT-NSS 2 or rx bitrate: 1083.3 MBit/s VHT-MCS 11 80MHz short GI VHT-NSS 2 Buglink: https://bugzilla.suse.com/show_bug.cgi?id=1192891 Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20220103013623.17052-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: mesh: embedd mesh_paths and mpp_paths into ieee80211_if_meshPavel Skripkin
Syzbot hit NULL deref in rhashtable_free_and_destroy(). The problem was in mesh_paths and mpp_paths being NULL. mesh_pathtbl_init() could fail in case of memory allocation failure, but nobody cared, since ieee80211_mesh_init_sdata() returns void. It led to leaving 2 pointers as NULL. Syzbot has found null deref on exit path, but it could happen anywhere else, because code assumes these pointers are valid. Since all ieee80211_*_setup_sdata functions are void and do not fail, let's embedd mesh_paths and mpp_paths into parent struct to avoid adding error handling on higher levels and follow the pattern of others setup_sdata functions Fixes: 60854fd94573 ("mac80211: mesh: convert path table to rhashtable") Reported-and-tested-by: syzbot+860268315ba86ea6b96b@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Link: https://lore.kernel.org/r/20211230195547.23977-1-paskripkin@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04mac80211: initialize variable have_higher_than_11mbitTom Rix
Clang static analysis reports this warnings mlme.c:5332:7: warning: Branch condition evaluates to a garbage value have_higher_than_11mbit) ^~~~~~~~~~~~~~~~~~~~~~~ have_higher_than_11mbit is only set to true some of the time in ieee80211_get_rates() but is checked all of the time. So have_higher_than_11mbit needs to be initialized to false. Fixes: 5d6a1b069b7f ("mac80211: set basic rates earlier") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20211223162848.3243702-1-trix@redhat.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-01-04Namespaceify mtu_expires sysctlxu xin
This patch enables the sysctl mtu_expires to be configured per net namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04Namespaceify min_pmtu sysctlxu xin
This patch enables the sysctl min_pmtu to be configured per net namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04sch_qfq: prevent shift-out-of-bounds in qfq_init_qdiscEric Dumazet
tx_queue_len can be set to ~0U, we need to be more careful about overflows. __fls(0) is undefined, as this report shows: UBSAN: shift-out-of-bounds in net/sched/sch_qfq.c:1430:24 shift exponent 51770272 is too large for 32-bit type 'int' CPU: 0 PID: 25574 Comm: syz-executor.0 Not tainted 5.16.0-rc7-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x201/0x2d8 lib/dump_stack.c:106 ubsan_epilogue lib/ubsan.c:151 [inline] __ubsan_handle_shift_out_of_bounds+0x494/0x530 lib/ubsan.c:330 qfq_init_qdisc+0x43f/0x450 net/sched/sch_qfq.c:1430 qdisc_create+0x895/0x1430 net/sched/sch_api.c:1253 tc_modify_qdisc+0x9d9/0x1e20 net/sched/sch_api.c:1660 rtnetlink_rcv_msg+0x934/0xe60 net/core/rtnetlink.c:5571 netlink_rcv_skb+0x200/0x470 net/netlink/af_netlink.c:2496 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x814/0x9f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0xaea/0xe60 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg net/socket.c:724 [inline] ____sys_sendmsg+0x5b9/0x910 net/socket.c:2409 ___sys_sendmsg net/socket.c:2463 [inline] __sys_sendmsg+0x280/0x370 net/socket.c:2492 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04netrom: fix copying in user data in nr_setsockoptChristoph Hellwig
This code used to copy in an unsigned long worth of data before the sockptr_t conversion, so restore that. Fixes: a7b75c5a8c41 ("net: pass a sockptr_t into ->setsockopt") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04udp6: Use Segment Routing Header for dest address if presentAndrew Lunn
When finding the socket to report an error on, if the invoking packet is using Segment Routing, the IPv6 destination address is that of an intermediate router, not the end destination. Extract the ultimate destination address from the segment address. This change allows traceroute to function in the presence of Segment Routing. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04icmp: ICMPV6: Examine invoking packet for Segment Route Headers.Andrew Lunn
RFC8754 says: ICMP error packets generated within the SR domain are sent to source nodes within the SR domain. The invoking packet in the ICMP error message may contain an SRH. Since the destination address of a packet with an SRH changes as each segment is processed, it may not be the destination used by the socket or application that generated the invoking packet. For the source of an invoking packet to process the ICMP error message, the ultimate destination address of the IPv6 header may be required. The following logic is used to determine the destination address for use by protocol-error handlers. * Walk all extension headers of the invoking IPv6 packet to the routing extension header preceding the upper-layer header. - If routing header is type 4 Segment Routing Header (SRH) o The SID at Segment List[0] may be used as the destination address of the invoking packet. Mangle the skb so the network header points to the invoking packet inside the ICMP packet. The seg6 helpers can then be used on the skb to find any segment routing headers. If found, mark this fact in the IPv6 control block of the skb, and store the offset into the packet of the SRH. Then restore the skb back to its old state. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04seg6: export get_srh() for ICMP handlingAndrew Lunn
An ICMP error message can contain in its message body part of an IPv6 packet which invoked the error. Such a packet might contain a segment router header. Export get_srh() so the ICMP code can make use of it. Since his changes the scope of the function from local to global, add the seg6_ prefix to keep the namespace clean. And move it into seg6.c so it is always available, not just when IPV6_SEG6_LWTUNNEL is enabled. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04net: openvswitch: Fill act ct extensionPaul Blakey
To give drivers the originating device information for optimized connection tracking offload, fill in act ct extension with ifindex from skb. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-04net/sched: act_ct: Fill offloading tuple iifidxPaul Blakey
Driver offloading ct tuples can use the information of which devices received the packets that created the offloaded connections, to more efficiently offload them only to the relevant device. Add new act_ct nf conntrack extension, which is used to store the skb devices before offloading the connection, and then fill in the tuple iifindex so drivers can get the device via metadata dissector match. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-03Merge tag 'batadv-next-pullrequest-20220103' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - allow netlink usage in unprivileged containers, by Linus Lüssing - remove unneeded variable, by Minghao Chi * tag 'batadv-next-pullrequest-20220103' of git://git.open-mesh.org/linux-merge: batman-adv: remove unneeded variable in batadv_nc_init batman-adv: allow netlink usage in unprivileged containers batman-adv: Start new development cycle ==================== Link: https://lore.kernel.org/r/20220103171722.1126109-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-03Merge tag 'batadv-net-pullrequest-20220103' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - avoid sending link-local multicast to multicast routers, by Linus Lüssing * tag 'batadv-net-pullrequest-20220103' of git://git.open-mesh.org/linux-merge: batman-adv: mcast: don't send link-local multicast to mcast routers ==================== Link: https://lore.kernel.org/r/20220103171203.1124980-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-03ipv6: Do cleanup if attribute validation fails in multipath routeDavid Ahern
As Nicolas noted, if gateway validation fails walking the multipath attribute the code should jump to the cleanup to free previously allocated memory. Fixes: 1ff15a710a86 ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route") Signed-off-by: David Ahern <dsahern@kernel.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20220103170555.94638-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-03ipv6: Continue processing multipath route even if gateway attribute is invalidDavid Ahern
ip6_route_multipath_del loop continues processing the multipath attribute even if delete of a nexthop path fails. For consistency, do the same if the gateway attribute is invalid. Fixes: 1ff15a710a86 ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route") Signed-off-by: David Ahern <dsahern@kernel.org> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20220103171911.94739-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-02net/smc: add comments for smc_link_{usable|sendable}Dust Li
Add comments for both smc_link_sendable() and smc_link_usable() to help better distinguish and use them. No function changes. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02sctp: hold endpoint before calling cb in sctp_transport_lookup_processXin Long
The same fix in commit 5ec7d18d1813 ("sctp: use call_rcu to free endpoint") is also needed for dumping one asoc and sock after the lookup. Fixes: 86fdb3448cc1 ("sctp: ensure ep is not destroyed before doing the dump") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-02net: socket.c: style fixHamish MacDonald
Removed spaces and added a tab that was causing an error on checkpatch Signed-off-by: Hamish MacDonald <elusivenode@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>