summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-01-06Merge branch 'dpaa2-eth-small-cleanup'Jakub Kicinski
Ioana Ciornei says: ==================== dpaa2-eth: small cleanup These 3 patches are just part of a small cleanup on the dpaa2-eth and the dpaa2-switch drivers. In case we are hitting a case in which the fwnode of the root dprc device we initiate a deferred probe. On the dpaa2-switch side, if we are on the remove path, make sure that we check for a non-NULL pointer before accessing the port private structure. ==================== Link: https://lore.kernel.org/r/20220106135905.81923-1-ioana.ciornei@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06dpaa2-switch: check if the port priv is validIoana Ciornei
Before accessing the port private structure make sure that there is still a non-NULL pointer there. A NULL pointer access can happen when we are on the remove path, some switch ports are unregistered and some are in the process of unregistering. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06dpaa2-mac: return -EPROBE_DEFER from dpaa2_mac_open in case the fwnode is ↵Ioana Ciornei
not set We could get into a situation when the fwnode of the parent device is not yet set because its probe didn't yet finish. When this happens, any caller of the dpaa2_mac_open() will not have the fwnode available, thus cause problems at the PHY connect time. Avoid this by just returning -EPROBE_DEFER from the dpaa2_mac_open when this happens. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06dpaa2-mac: bail if the dpmacs fwnode is not foundRobert-Ionut Alexa
The parent pointer node handler must be declared with a NULL initializer. Before using it, a check must be performed to make sure that a valid address has been assigned to it. Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-01-06 We've added 41 non-merge commits during the last 2 day(s) which contain a total of 36 files changed, 1214 insertions(+), 368 deletions(-). The main changes are: 1) Various fixes in the verifier, from Kris and Daniel. 2) Fixes in sockmap, from John. 3) bpf_getsockopt fix, from Kuniyuki. 4) INET_POST_BIND fix, from Menglong. 5) arm64 JIT fix for bpf pseudo funcs, from Hou. 6) BPF ISA doc improvements, from Christoph. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits) bpf: selftests: Add bind retry for post_bind{4, 6} bpf: selftests: Use C99 initializers in test_sock.c net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() bpf/selftests: Test bpf_d_path on rdonly_mem. libbpf: Add documentation for bpf_map batch operations selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3 xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames xdp: Move conversion to xdp_frame out of map functions page_pool: Store the XDP mem id page_pool: Add callback to init pages when they are allocated xdp: Allow registering memory model without rxq reference samples/bpf: xdpsock: Add timestamp for Tx-only operation samples/bpf: xdpsock: Add time-out for cleaning Tx samples/bpf: xdpsock: Add sched policy and priority support samples/bpf: xdpsock: Add cyclic TX operation capability samples/bpf: xdpsock: Add clockid selection support samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation samples/bpf: xdpsock: Add VLAN support for Tx-only operation libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API libbpf 1.0: Deprecate bpf_map__is_offload_neutral() ... ==================== Link: https://lore.kernel.org/r/20220107013626.53943-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06Merge branch 'net: bpf: handle return value of post_bind{4,6} and add ↵Alexei Starovoitov
selftests for it' Menglong Dong says: ==================== From: Menglong Dong <imagedong@tencent.com> The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. The second patch use C99 initializers in test_sock.c The third patch is the selftests for this modification. Changes since v4: - use C99 initializers in test_sock.c before adding the test case Changes since v3: - add the third patch which use C99 initializers in test_sock.c Changes since v2: - NULL check for sk->sk_prot->put_port Changes since v1: - introduce 'put_port' field for 'struct proto' - add selftests for it ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-06bpf: selftests: Add bind retry for post_bind{4, 6}Menglong Dong
With previous patch, kernel is able to 'put_port' after sys_bind() fails. Add the test for that case: rebind another port after sys_bind() fails. If the bind success, it means previous bind operation is already undoed. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-4-imagedong@tencent.com
2022-01-06bpf: selftests: Use C99 initializers in test_sock.cMenglong Dong
Use C99 initializers for the initialization of 'tests' in test_sock.c. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-3-imagedong@tencent.com
2022-01-06net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()Menglong Dong
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06Documentation: devlink: mlx5.rst: Fix htmldoc build warningSaeed Mahameed
Fix the following build warning: Documentation/networking/devlink/mlx5.rst:13: WARNING: Error parsing content block for the "list-table" directive: +uniform two-level bullet list expected, but row 2 does not contain the same number of items as row 1 (2 vs 3). ... Add the missing item in the first row. Fixes: 0844fa5f7b89 ("net/mlx5: Let user configure io_eq_size param") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Add recovery flow in case of error CQEGal Pressman
The rep legacy RQ completion handling was missing the appropriate handling of error CQEs (dump the CQE and queue a recover work), fix it by calling trigger_report() when needed. Since all CQE handling flows do the exact same error CQE handling, extract it to a common helper function. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: TC, Remove redundant error loggingRoi Dayan
Remove redundant and trivial error logging when trying to offload mirred device with unsupported devices. Using OVS could hit those a lot and the errors are still logged in extack. Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Refactor set_pflag_cqe_based_moderSaeed Mahameed
Rearrange the code and use cqe_mode_to_period_mode() helper. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Move HW-GRO and CQE compression check to fix features flowGal Pressman
Feature dependencies should be resolved in fix features rather than in set features flow. Move the check that disables HW-GRO in case CQE compression is enabled from set_feature_hw_gro() to mlx5e_fix_features(). Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Fix feature check per profileAya Levin
Remove redundant space when constructing the feature's enum. Validate against the indented enum value. Fixes: 6c72cb05d4b8 ("net/mlx5e: Use bitmap field for profile features") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Unblock setting vid 0 for VF in case PF isn't eswitch managerMaor Dickman
When using libvirt to passthrough VF to VM it will always set the VF vlan to 0 even if user didn’t request it, this will cause libvirt to fail to boot in case the PF isn't eswitch owner. Example of such case is the DPU host PF which isn't eswitch manager, so any attempt to passthrough VF of it using libvirt will fail. Fix it by not returning error in case set VF vlan is called with vid 0. Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Expose FEC counters via ethtoolLama Kayal
Add FEC counters' statistics of corrected_blocks and uncorrectable_blocks, along with their lanes via ethtool. HW supports corrected_blocks and uncorrectable_blocks counters both for RS-FEC mode and FC-FEC mode. In FC mode these counters are accumulated per lane, while in RS mode the correction method crosses lanes, thus only total corrected_blocks and uncorrectable_blocks are reported in this mode. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Update log_max_qp value to FW max capabilityMaher Sanalla
log_max_qp in driver's default profile #2 was set to 18, but FW actually supports 17 at the most - a situation that led to the concerning print when the driver is loaded: "log_max_qp value in current profile is 18, changing to HCA capabaility limit (17)" The expected behavior from mlx5_profile #2 is to match the maximum FW capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is initialized to a defined static value (0xff) - which basically means that when loading this profile, log_max_qp value will be what the currently installed FW supports at most. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: SF, Use all available cpu for setting cpu affinityShay Drory
Currently all SFs are using the same CPUs. Spreading SF over CPUs, in round-robin manner, in order to achieve better distribution of the SFs over available CPUs. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Introduce API for bulk request and release of IRQsShay Drory
Currently IRQs are requested one by one. To balance spreading IRQs among cpus using such scheme requires remembering cpu mask for the cpus used for a given device. This complicates the IRQ allocation scheme in subsequent patch. Hence, prepare the code for bulk IRQs allocation. This enables spreading IRQs among cpus in subsequent patch. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Split irq_pool_affinity logic to new fileShay Drory
The downstream patches add more functionality to irq_pool_affinity. Move the irq_pool_affinity logic to a new file in order to ease the coding and maintenance of it. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Move affinity assignment into irq_requestShay Drory
Move affinity binding of the IRQ to irq_request function in order to bind the IRQ before inserting it to the xarray. After this change, the IRQ is ready for use when inserted to the xarray. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Introduce control IRQ request APIShay Drory
Currently, IRQ layer have a separate flow for ctrl and comp IRQs, and the distinction between ctrl and comp IRQs is done in the IRQ layer. In order to ease the coding and maintenance of the IRQ layer, introduce a new API for requesting control IRQs - mlx5_ctrl_irq_request(struct mlx5_core_dev *dev). Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: mlx5e_hv_vhca_stats_create return type to voidSaeed Mahameed
Callers of this functions ignore its return value, as reported by Wang Qing, in one of the return paths, it returns positive values. Since return value is ignored anyways, void out the return type of the function. Reported-by: Wang Qing <wangqing@vivo.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06bpf/selftests: Test bpf_d_path on rdonly_mem.Hao Luo
The second parameter of bpf_d_path() can only accept writable memories. Rdonly_mem obtained from bpf_per_cpu_ptr() can not be passed into bpf_d_path for modification. This patch adds a selftest to verify this behavior. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220106205525.2116218-1-haoluo@google.com
2022-01-06libbpf: Add documentation for bpf_map batch operationsGrant Seltzer
This adds documention for: - bpf_map_delete_batch() - bpf_map_lookup_batch() - bpf_map_lookup_and_delete_batch() - bpf_map_update_batch() This also updates the public API for the `keys` parameter of `bpf_map_delete_batch()`, and both the `keys` and `values` parameters of `bpf_map_update_batch()` to be constants. Signed-off-by: Grant Seltzer <grantseltzer@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220106201304.112675-1-grantseltzer@gmail.com
2022-01-06selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3Andrii Nakryiko
PT_REGS*() macro on some architectures force-cast struct pt_regs to other types (user_pt_regs, etc) and might drop volatile modifiers, if any. Volatile isn't really required as pt_regs value isn't supposed to change during the BPF program run, so this is correct behavior. But progs/loop3.c relies on that volatile modifier to ensure that loop is preserved. Fix loop3.c by declaring i and sum variables as volatile instead. It preserves the loop and makes the test pass on all architectures (including s390x which is currently broken). Fixes: 3cc31d794097 ("libbpf: Normalize PT_REGS_xxx() macro definitions") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220106205156.955373-1-andrii@kernel.org
2022-01-06ice: Use bitmap_free() to free bitmapChristophe JAILLET
kfree() and bitmap_free() are the same. But using the latter is more consistent when freeing memory allocated with bitmap_zalloc(). Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06ice: Optimize a few bitmap operationsChristophe JAILLET
When a bitmap is local to a function, it is safe to use the non-atomic __[set|clear]_bit(). No concurrent accesses can occur. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06ice: Slightly simply ice_find_free_recp_res_idxChristophe JAILLET
The 'possible_idx' bitmap is set just after it is zeroed, so we can save the first step. The 'free_idx' bitmap is used only at the end of the function as the result of a bitmap xor operation. So there is no need to explicitly zero it before. So, slightly simply the code and remove 2 useless 'bitmap_zero()' call Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06ice: improve switchdev's slow-pathWojciech Drewek
In current switchdev implementation, every VF PR is assigned to individual ring on switchdev ctrl VSI. For slow-path traffic, there is a mapping VF->ring done in software based on src_vsi value (by calling ice_eswitch_get_target_netdev function). With this change, HW solution is introduced which is more efficient. For each VF, src MAC (VF's MAC) filter will be created, which forwards packets to the corresponding switchdev ctrl VSI queue based on src MAC address. This filter has to be removed and then replayed in case of resetting one VF. Keep information about this rule in repr->mac_rule, thanks to that we know which rule has to be removed and replayed for a given VF. In case of CORE/GLOBAL all rules are removed automatically. We have to take care of readding them. This is done by ice_replay_vsi_adv_rule. When driver leaves switchdev mode, remove all advanced rules from switchdev ctrl VSI. This is done by ice_rem_adv_rule_for_vsi. Flag repr->rule_added is needed because in some cases reset might be triggered before VF sends request to add MAC. Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06ice: replay advanced rules after resetVictor Raj
ice_replay_vsi_adv_rule will replay advanced rules for a given VSI. Exit this function when list of rules for given recipe is empty. Do not add rule when given vsi_handle does not match vsi_handle from the rule info. Use ICE_MAX_NUM_RECIPES instead of ICE_SW_LKUP_LAST in order to find advanced rules as well. Signed-off-by: Victor Raj <victor.raj@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-01-06veth: Do not record rx queue hint in veth_xmitDaniel Borkmann
Laurent reported that they have seen a significant amount of TCP retransmissions at high throughput from applications residing in network namespaces talking to the outside world via veths. The drops were seen on the qdisc layer (fq_codel, as per systemd default) of the phys device such as ena or virtio_net due to all traffic hitting a _single_ TX queue _despite_ multi-queue device. (Note that the setup was _not_ using XDP on veths as the issue is generic.) More specifically, after edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence") which made it all the way back to v4.19.184+, skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX queue by default for veths) instead of leaving at 0. This is eventually retained and callbacks like ena_select_queue() will also pick single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the traffic is forwarded to that device via upper stack or other means. Similarly, for others not implementing ndo_select_queue() if XPS is disabled, netdev_pick_tx() might call into the skb_tx_hash() and check for prior skb_rx_queue_recorded() as well. In general, it is a _bad_ idea for virtual devices like veth to mess around with queue selection [by default]. Given dev->real_num_tx_queues is by default 1, the skb->queue_mapping was left untouched, and so prior to edbea9220251 the netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys device. Unbreak this and restore prior behavior by removing the skb_record_rx_queue() from veth_xmit() altogether. If the veth peer has an XDP program attached, then it would return the first RX queue index in xdp_md->rx_queue_index (unless configured in non-default manner). However, this is still better than breaking the generic case. Fixes: edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence") Fixes: 638264dc9022 ("veth: Support per queue XDP ring") Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Cc: Toshiaki Makita <toshiaki.makita1@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06ethernet: ibmveth: use default_groups in kobj_typeGreg Kroah-Hartman
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the ibmveth sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Cristobal Forno <cforno12@linux.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: netdev@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06sfc: Use swap() instead of open coding itJiapeng Chong
Clean the following coccicheck warning: ./drivers/net/ethernet/sfc/efx_channels.c:870:36-37: WARNING opportunity for swap(). ./drivers/net/ethernet/sfc/efx_channels.c:824:36-37: WARNING opportunity for swap(). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06ethtool: use phydev variableTom Rix
In ethtool_get_phy_stats(), the phydev varaible is set to dev->phydev but dev->phydev is still used. Replace dev->phydev uses with phydev. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: macb: use .mac_select_pcs() interfaceRussell King (Oracle)
Convert the PCS selection to use mac_select_pcs, which allows the PCS to perform any validation it needs. We must use separate phylink_pcs instances for the USX and SGMII PCS, rather than just changing the "ops" pointer before re-setting it to phylink as this interface queries the PCS, rather than requesting it to be changed. Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06gro: add ability to control gro max packet sizeCoco Li
Eric Dumazet suggested to allow users to modify max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some work arounds in sch_cake, to split GRO/GSO packets. Instead of disabling GRO completely, one can chose to limit the maximum packet size of GRO packets, depending on their latency constraints. This patch adds a per device gro_max_size attribute that can be changed with ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: fix SOF_TIMESTAMPING_BIND_PHC to work with multiple socketsMiroslav Lichvar
When multiple sockets using the SOF_TIMESTAMPING_BIND_PHC flag received a packet with a hardware timestamp (e.g. multiple PTP instances in different PTP domains using the UDPv4/v6 multicast or L2 transport), the timestamps received on some sockets were corrupted due to repeated conversion of the same timestamp (by the same or different vclocks). Fix ptp_convert_timestamp() to not modify the shared skb timestamp and return the converted timestamp as a ktime_t instead. If the conversion fails, return 0 to not confuse the application with timestamps corresponding to an unexpected PHC. Fixes: d7c088265588 ("net: socket: support hardware timestamp conversion to PHC bound") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: warn about dsa_port and dsa_switch bit fields being non atomicVladimir Oltean
As discussed during review here: https://patchwork.kernel.org/project/netdevbpf/patch/20220105132141.2648876-3-vladimir.oltean@nxp.com/ we should inform developers about pitfalls of concurrent access to the boolean properties of dsa_switch and dsa_port, now that they've been converted to bit fields. No other measure than a comment needs to be taken, since the code paths that update these bit fields are not concurrent with each other. Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: don't enumerate dsa_switch and dsa_port bit fields using commasVladimir Oltean
This is a cosmetic incremental fixup to commits 7787ff776398 ("net: dsa: merge all bools of struct dsa_switch into a single u32") bde82f389af1 ("net: dsa: merge all bools of struct dsa_port into a single u8") The desire to make this change was enunciated after posting these patches here: https://patchwork.kernel.org/project/netdevbpf/cover/20220105132141.2648876-1-vladimir.oltean@nxp.com/ but due to a slight timing overlap (message posted at 2:28 p.m. UTC, merge commit is at 2:46 p.m. UTC), that comment was missed and the changes were applied as-is. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'dsa-init-cleanups'David S. Miller
Vladimir Oltean says: ==================== DSA initialization cleanups These patches contain miscellaneous work that makes the DSA init code path symmetric with the teardown path, and some additional patches carried by Ansuel Smith for his register access over Ethernet work, but those patches can be applied as-is too. https://patchwork.kernel.org/project/netdevbpf/patch/20211214224409.5770-3-ansuelsmth@gmail.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: setup master before portsVladimir Oltean
It is said that as soon as a network interface is registered, all its resources should have already been prepared, so that it is available for sending and receiving traffic. One of the resources needed by a DSA slave interface is the master. dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> register_netdevice -> dsa_tree_setup_master -> dsa_master_setup -> sets up master->dsa_ptr, which enables reception Therefore, there is a short period of time after register_netdevice() during which the master isn't prepared to pass traffic to the DSA layer (master->dsa_ptr is checked by eth_type_trans). Same thing during unregistration, there is a time frame in which packets might be missed. Note that this change opens us to another race: dsa_master_find_slave() will get invoked potentially earlier than the slave creation, and later than the slave deletion. Since dp->slave starts off as a NULL pointer, the earlier calls aren't a problem, but the later calls are. To avoid use-after-free, we should zeroize dp->slave before calling dsa_slave_destroy(). In practice I cannot really test real life improvements brought by this change, since in my systems, netdevice creation races with PHY autoneg which takes a few seconds to complete, and that masks quite a few races. Effects might be noticeable in a setup with fixed links all the way to an external system. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: first set up shared ports, then non-shared portsVladimir Oltean
After commit a57d8c217aad ("net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports"), the port setup and teardown procedure became asymmetric. The fact of the matter is that user ports need the shared ports to be up before they can be used for CPU-initiated termination. And since we register net devices for the user ports, those won't be functional until we also call the setup for the shared (CPU, DSA) ports. But we may do that later, depending on the port numbering scheme of the hardware we are dealing with. It just makes sense that all shared ports are brought up before any user port is. I can't pinpoint any issue due to the current behavior, but let's change it nonetheless, for consistency's sake. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}Vladimir Oltean
DSA needs to simulate master tracking events when a binding is first with a DSA master established and torn down, in order to give drivers the simplifying guarantee that ->master_state_change calls are made only when the master's readiness state to pass traffic changes. master_state_change() provide a operational bool that DSA driver can use to understand if DSA master is operational or not. To avoid races, we need to block the reception of NETDEV_UP/NETDEV_CHANGE/NETDEV_GOING_DOWN events in the netdev notifier chain while we are changing the master's dev->dsa_ptr (this changes what netdev_uses_dsa(dev) reports). The dsa_master_setup() and dsa_master_teardown() functions optionally require the rtnl_mutex to be held, if the tagger needs the master to be promiscuous, these functions call dev_set_promiscuity(). Move the rtnl_lock() from that function and make it top-level. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: stop updating master MTU from master.cVladimir Oltean
At present there are two paths for changing the MTU of the DSA master. The first is: dsa_tree_setup -> dsa_tree_setup_ports -> dsa_port_setup -> dsa_slave_create -> dsa_slave_change_mtu -> dev_set_mtu(master) The second is: dsa_tree_setup -> dsa_tree_setup_master -> dsa_master_setup -> dev_set_mtu(dev) So the dev_set_mtu() call from dsa_master_setup() has been effectively superseded by the dsa_slave_change_mtu(slave_dev, ETH_DATA_LEN) that is done from dsa_slave_create() for each user port. The later function also updates the master MTU according to the largest user port MTU from the tree. Therefore, updating the master MTU through a separate code path isn't needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: merge rtnl_lock sections in dsa_slave_createVladimir Oltean
Currently dsa_slave_create() has two sequences of rtnl_lock/rtnl_unlock in a row. Remove the rtnl_unlock() and rtnl_lock() in between, such that the operation can execute slighly faster. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06net: dsa: reorder PHY initialization with MTU setup in slave.cVladimir Oltean
In dsa_slave_create() there are 2 sections that take rtnl_lock(): MTU change and netdev registration. They are separated by PHY initialization. There isn't any strict ordering requirement except for the fact that netdev registration should be last. Therefore, we can perform the MTU change a bit later, after the PHY setup. A future change will then be able to merge the two rtnl_lock sections into one. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-01-06 1) Fix some clang_analyzer warnings about never read variables. From luo penghao. 2) Check for pols[0] only once in xfrm_expand_policies(). From Jean Sacren. 3) The SA curlft.use_time was updated only on SA cration time. Update whenever the SA is used. From Antony Antony 4) Add support for SM3 secure hash. From Xu Jia. 5) Add support for SM4 symmetric cipher algorithm. From Xu Jia. 6) Add a rate limit for SA mapping change messages. From Antony Antony. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-05xdp: Add xdp_do_redirect_frame() for pre-computed xdp_framesToke Høiland-Jørgensen
Add an xdp_do_redirect_frame() variant which supports pre-computed xdp_frame structures. This will be used in bpf_prog_run() to avoid having to write to the xdp_frame structure when the XDP program doesn't modify the frame boundaries. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220103150812.87914-6-toke@redhat.com