summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-10-14netfilter: ipvs: merge ipv4 + ipv6 icmp reply handlersFlorian Westphal
Similar to earlier patches: allow ipv4 and ipv6 to use the same handler. ipv4 and ipv6 specific actions can be done by checking state->pf. v2: split the pf == NFPROTO_IPV4 check (Julian Anastasov) Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ipvs: remove unneeded input wrappersFlorian Westphal
After earlier patch ip_vs_hook_in can be used directly. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ipvs: remove unneeded output wrappersFlorian Westphal
After earlier patch we can use ip_vs_out_hook directly. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ipvs: prepare for hook function reductionFlorian Westphal
ipvs has multiple one-line wrappers for hooks, compact them. To avoid a large patch make the two most common helpers use the same function signature as hooks. Next patches can then remove the oneline wrappers. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ebtables: allow use of ebt_do_table as hookfnFlorian Westphal
This is possible now that the xt_table structure is passed via *priv. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ip6tables: allow use of ip6t_do_table as hookfnFlorian Westphal
This is possible now that the xt_table structure is passed via *priv. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: arp_tables: allow use of arpt_do_table as hookfnFlorian Westphal
This is possible now that the xt_table structure is passed in via *priv. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: iptables: allow use of ipt_do_table as hookfnFlorian Westphal
This is possible now that the xt_table structure is passed in via *priv. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14af_packet: Introduce egress hookPablo Neira Ayuso
Add egress hook for AF_PACKET sockets that have the PACKET_QDISC_BYPASS socket option set to on, which allows packets to escape without being filtered in the egress path. This patch only updates the AF_PACKET path, it does not update dev_direct_xmit() so the XDP infrastructure has a chance to bypass Netfilter. [lukas: acquire rcu_read_lock, fix typos, rebase] Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: Introduce egress hookLukas Wunner
Support classifying packets with netfilter on egress to satisfy user requirements such as: * outbound security policies for containers (Laura) * filtering and mangling intra-node Direct Server Return (DSR) traffic on a load balancer (Laura) * filtering locally generated traffic coming in through AF_PACKET, such as local ARP traffic generated for clustering purposes or DHCP (Laura; the AF_PACKET plumbing is contained in a follow-up commit) * L2 filtering from ingress and egress for AVB (Audio Video Bridging) and gPTP with nftables (Pablo) * in the future: in-kernel NAT64/NAT46 (Pablo) The egress hook introduced herein complements the ingress hook added by commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). A patch for nftables to hook up egress rules from user space has been submitted separately, so users may immediately take advantage of the feature. Alternatively or in addition to netfilter, packets can be classified with traffic control (tc). On ingress, packets are classified first by tc, then by netfilter. On egress, the order is reversed for symmetry. Conceptually, tc and netfilter can be thought of as layers, with netfilter layered above tc. Traffic control is capable of redirecting packets to another interface (man 8 tc-mirred). E.g., an ingress packet may be redirected from the host namespace to a container via a veth connection: tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container) In this case, netfilter egress classifying is not performed when leaving the host namespace! That's because the packet is still on the tc layer. If tc redirects the packet to a physical interface in the host namespace such that it leaves the system, the packet is never subjected to netfilter egress classifying. That is only logical since it hasn't passed through netfilter ingress classifying either. Packets can alternatively be redirected at the netfilter layer using nft fwd. Such a packet *is* subjected to netfilter egress classifying since it has reached the netfilter layer. Internally, the skb->nf_skip_egress flag controls whether netfilter is invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may be called recursively by tunnel drivers such as vxlan, the flag is reverted to false after sch_handle_egress(). This ensures that netfilter is applied both on the overlay and underlying network. Interaction between tc and netfilter is possible by setting and querying skb->mark. If netfilter egress classifying is not enabled on any interface, it is patched out of the data path by way of a static_key and doesn't make a performance difference that is discernible from noise: Before: 1537 1538 1538 1537 1538 1537 Mb/sec After: 1536 1534 1539 1539 1539 1540 Mb/sec Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec When netfilter egress classifying is enabled on at least one interface, a minimal performance penalty is incurred for every egress packet, even if the interface it's transmitted over doesn't have any netfilter egress rules configured. That is caused by checking dev->nf_hooks_egress against NULL. Measurements were performed on a Core i7-3615QM. Commands to reproduce: ip link add dev foo type dummy ip link set dev foo up modprobe pktgen echo "add_device foo" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1 Accept all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,' Drop all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,' Apply this patch when measuring packet drops to avoid errors in dmesg: https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/ Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Laura García Liébana <nevola@gmail.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: Generalize ingress hook include fileLukas Wunner
Prepare for addition of a netfilter egress hook by generalizing the ingress hook include file. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: Rename ingress hook include fileLukas Wunner
Prepare for addition of a netfilter egress hook by renaming <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>. The egress hook also necessitates a refactoring of the include file, but that is done in a separate commit to ease reviewing. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14ethernet: replace netdev->dev_addr assignment loopsJakub Kicinski
A handful of drivers contains loops assigning the mac addr byte by byte. Convert those to eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14icmp: fix icmp_ext_echo_iio parsing in icmp_build_probeXin Long
In icmp_build_probe(), the icmp_ext_echo_iio parsing should be done step by step and skb_header_pointer() return value should always be checked, this patch fixes 3 places in there: - On case ICMP_EXT_ECHO_CTYPE_NAME, it should only copy ident.name from skb by skb_header_pointer(), its len is ident_len. Besides, the return value of skb_header_pointer() should always be checked. - On case ICMP_EXT_ECHO_CTYPE_INDEX, move ident_len check ahead of skb_header_pointer(), and also do the return value check for skb_header_pointer(). - On case ICMP_EXT_ECHO_CTYPE_ADDR, before accessing iio->ident.addr. ctype3_hdr.addrlen, skb_header_pointer() should be called first, then check its return value and ident_len. On subcases ICMP_AFI_IP and ICMP_AFI_IP6, also do check for ident. addr.ctype3_hdr.addrlen and skb_header_pointer()'s return value. On subcase ICMP_AFI_IP, the len for skb_header_pointer() should be "sizeof(iio->extobj_hdr) + sizeof(iio->ident.addr.ctype3_hdr) + sizeof(struct in_addr)" or "ident_len". v1->v2: - To make it more clear, call skb_header_pointer() once only for iio->indent's parsing as Jakub Suggested. v2->v3: - The extobj_hdr.length check against sizeof(_iio) should be done before calling skb_header_pointer(), as Eric noticed. Fixes: d329ea5bd884 ("icmp: add response to RFC 8335 PROBE messages") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/31628dd76657ea62f5cf78bb55da6b35240831f1.1634205050.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14sctp: account stream padding length for reconf chunkEiichi Tsukata
sctp_make_strreset_req() makes repeated calls to sctp_addto_chunk() which will automatically account for padding on each call. inreq and outreq are already 4 bytes aligned, but the payload is not and doing SCTP_PAD4(a + b) (which _sctp_make_chunk() did implicitly here) is different from SCTP_PAD4(a) + SCTP_PAD4(b) and not enough. It led to possible attempt to use more buffer than it was allocated and triggered a BUG_ON. Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Greg KH <gregkh@linuxfoundation.org> Fixes: cc16f00f6529 ("sctp: add support for generating stream reconf ssn reset request chunk") Reported-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/b97c1f8b0c7ff79ac4ed206fc2c49d3612e0850c.1634156849.git.mleitner@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13NFC: digital: fix possible memory leak in digital_in_send_sdd_req()Ziyang Xuan
'skb' is allocated in digital_in_send_sdd_req(), but not free when digital_in_send_cmd() failed, which will cause memory leak. Fix it by freeing 'skb' if digital_in_send_cmd() return failed. Fixes: 2c66daecc409 ("NFC Digital: Add NFC-A technology support") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13NFC: digital: fix possible memory leak in digital_tg_listen_mdaa()Ziyang Xuan
'params' is allocated in digital_tg_listen_mdaa(), but not free when digital_send_cmd() failed, which will cause memory leak. Fix it by freeing 'params' if digital_send_cmd() return failed. Fixes: 1c7a4c24fbfd ("NFC Digital: Add target NFC-DEP support") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13nfc: fix error handling of nfc_proto_register()Ziyang Xuan
When nfc proto id is using, nfc_proto_register() return -EBUSY error code, but forgot to unregister proto. Fix it by adding proto_unregister() in the error handling case. Fixes: c7fe3b52c128 ("NFC: add NFC socket family") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20211013034932.2833737-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13Revert "net: procfs: add seq_puts() statement for dev_mcast"Vladimir Oltean
This reverts commit ec18e8455484370d633a718c6456ddbf6eceef21. It turns out that there are user space programs which got broken by that change. One example is the "ifstat" program shipped by Debian: https://packages.debian.org/source/bullseye/ifstat which, confusingly enough, seems to not have anything in common with the much more familiar (at least to me) ifstat program from iproute2: https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/misc/ifstat.c root@debian:~# ifstat ifstat: /proc/net/dev: unsupported format. This change modified the header (first two lines of text) in /proc/net/dev so that it looks like this: root@debian:~# cat /proc/net/dev Interface| Receive | Transmit | bytes packets errs drop fifo frame compressed multicast| bytes packets errs drop fifo colls carrier compressed lo: 97400 1204 0 0 0 0 0 0 97400 1204 0 0 0 0 0 0 bond0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sit0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eno2: 5002206 6651 0 0 0 0 0 0 105518642 1465023 0 0 0 0 0 0 swp0: 134531 2448 0 0 0 0 0 0 99599598 1464381 0 0 0 0 0 0 swp1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 swp2: 4867675 4203 0 0 0 0 0 0 58134 631 0 0 0 0 0 0 sw0p0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw0p1: 124739 2448 0 1422 0 0 0 0 93741184 1464369 0 0 0 0 0 0 sw0p2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p0: 4850863 4203 0 0 0 0 0 0 54722 619 0 0 0 0 0 0 sw2p1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 br0: 10508 212 0 212 0 0 0 212 61369558 958857 0 0 0 0 0 0 whereas before it looked like this: root@debian:~# cat /proc/net/dev Inter-| Receive | Transmit face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed lo: 13160 164 0 0 0 0 0 0 13160 164 0 0 0 0 0 0 bond0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sit0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 eno2: 30824 268 0 0 0 0 0 0 3332 37 0 0 0 0 0 0 swp0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 swp1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 swp2: 30824 268 0 0 0 0 0 0 2428 27 0 0 0 0 0 0 sw0p0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw0p1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw0p2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p0: 29752 268 0 0 0 0 0 0 1564 17 0 0 0 0 0 0 sw2p1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 sw2p3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 The reason why the ifstat shipped by Debian (v1.1, with a Debian patch upgrading it to 1.1-8.1 at the time of writing) is broken is because its "proc" driver/backend parses the header very literally: main/drivers.c#L825 if (!data->checked && strncmp(buf, "Inter-|", 7)) goto badproc; and there's no way in which the header can be changed such that programs parsing like that would not get broken. Even if we fix this ancient and very "lightly" maintained program to parse the text output of /proc/net/dev in a more sensible way, this story seems bound to repeat again with other programs, and modifying them all could cause more trouble than it's worth. On the other hand, the reverted patch had no other reason than an aesthetic one, so reverting it is the simplest way out. I don't know what other distributions would be affected; the fact that Debian doesn't ship the iproute2 version of the program (a different code base altogether, which uses netlink and not /proc/net/dev) is surprising in itself. Fixes: ec18e8455484 ("net: procfs: add seq_puts() statement for dev_mcast") Link: https://lore.kernel.org/netdev/20211009163511.vayjvtn3rrteglsu@skbuf/ Cc: Yajun Deng <yajun.deng@linux.dev> Cc: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20211013001909.3164185-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13net: dsa: unregister cross-chip notifier after ds->ops->teardownVladimir Oltean
To be symmetric with the error unwind path of dsa_switch_setup(), call dsa_switch_unregister_notifier() after ds->ops->teardown. The implication is that ds->ops->teardown cannot emit cross-chip notifiers. For example, currently the dsa_tag_8021q_unregister() call from sja1105_teardown() does not propagate to the entire tree due to this reason. However I cannot find an actual issue caused by this, observed using code inspection. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20211012123735.2545742-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13ip: use dev_addr_set() in tunnelsJakub Kicinski
Use dev_addr_set() instead of writing to netdev->dev_addr directly in ip tunnels drivers. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13tipc: constify dev_addr passingJakub Kicinski
In preparation for netdev->dev_addr being constant make all relevant arguments in tipc constant. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13ipv6: constify dev_addr passingJakub Kicinski
In preparation for netdev->dev_addr being constant make all relevant arguments in ndisc constant. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13llc/snap: constify dev_addr passingJakub Kicinski
In preparation for netdev->dev_addr being constant make all relevant arguments in LLC and SNAP constant. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13rose: constify dev_addr passingJakub Kicinski
In preparation for netdev->dev_addr being constant make all relevant arguments in rose constant. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-13ax25: constify dev_addr passingJakub Kicinski
In preparation for netdev->dev_addr being constant make all relevant arguments in AX25 constant. Modify callers as well (netrom, rose). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12net: dsa: tag_ocelot_8021q: fix inability to inject STP BPDUs into BLOCKING ↵Vladimir Oltean
ports When setting up a bridge with stp_state 1, topology changes are not detected and loops are not blocked. This is because the standard way of transmitting a packet, based on VLAN IDs redirected by VCAP IS2 to the right egress port, does not override the port STP state (in the case of Ocelot switches, that's really the PGID_SRC masks). To force a packet to be injected into a port that's BLOCKING, we must send it as a control packet, which means in the case of this tagger to send it using the manual register injection method. We already do this for PTP frames, extend the logic to apply to any link-local MAC DA. Fixes: 7c83a7c539ab ("net: dsa: add a second tagger for Ocelot switches based on tag_8021q") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12net: dsa: tag_ocelot_8021q: break circular dependency with ocelot switch libVladimir Oltean
Michael reported that when using the "ocelot-8021q" tagging protocol, the switch driver module must be manually loaded before the tagging protocol can be loaded/is available. This appears to be the same problem described here: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/ where due to the fact that DSA tagging protocols make use of symbols exported by the switch drivers, circular dependencies appear and this breaks module autoloading. The ocelot_8021q driver needs the ocelot_can_inject() and ocelot_port_inject_frame() functions from the switch library. Previously the wrong approach was taken to solve that dependency: shims were provided for the case where the ocelot switch library was compiled out, but that turns out to be insufficient, because the dependency when the switch lib _is_ compiled is problematic too. We cannot declare ocelot_can_inject() and ocelot_port_inject_frame() as static inline functions, because these access I/O functions like __ocelot_write_ix() which is called by ocelot_write_rix(). Making those static inline basically means exposing the whole guts of the ocelot switch library, not ideal... We already have one tagging protocol driver which calls into the switch driver during xmit but not using any exported symbol: sja1105_defer_xmit. We can do the same thing here: create a kthread worker and one work item per skb, and let the switch driver itself do the register accesses to send the skb, and then consume it. Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping") Reported-by: Michael Walle <michael@walle.cc> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12net: dsa: tag_ocelot: break circular dependency with ocelot switch lib driverVladimir Oltean
As explained here: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/ DSA tagging protocol drivers cannot depend on symbols exported by switch drivers, because this creates a circular dependency that breaks module autoloading. The tag_ocelot.c file depends on the ocelot_ptp_rew_op() function exported by the common ocelot switch lib. This function looks at OCELOT_SKB_CB(skb) and computes how to populate the REW_OP field of the DSA tag, for PTP timestamping (the command: one-step/two-step, and the TX timestamp identifier). None of that requires deep insight into the driver, it is quite stateless, as it only depends upon the skb->cb. So let's make it a static inline function and put it in include/linux/dsa/ocelot.h, a file that despite its name is used by the ocelot switch driver for populating the injection header too - since commit 40d3f295b5fe ("net: mscc: ocelot: use common tag parsing code with DSA"). With that function declared as static inline, its body is expanded inside each call site, so the dependency is broken and the DSA tagger can be built without the switch library, upon which the felix driver depends. Fixes: 39e5308b3250 ("net: mscc: ocelot: support PTP Sync one-step timestamping") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12net: dsa: sja1105: break dependency between dsa_port_is_sja1105 and switch ↵Vladimir Oltean
driver It's nice to be able to test a tagging protocol with dsa_loop, but not at the cost of losing the ability of building the tagging protocol and switch driver as modules, because as things stand, there is a circular dependency between the two. Tagging protocol drivers cannot depend on switch drivers, that is a hard fact. The reasoning behind the blamed patch was that accessing dp->priv should first make sure that the structure behind that pointer is what we really think it is. Currently the "sja1105" and "sja1110" tagging protocols only operate with the sja1105 switch driver, just like any other tagging protocol and switch combination. The only way to mix and match them is by modifying the code, and this applies to dsa_loop as well (by default that uses DSA_TAG_PROTO_NONE). So while in principle there is an issue, in practice there isn't one. Until we extend dsa_loop to allow user space configuration, treat the problem as a non-issue and just say that DSA ports found by tag_sja1105 are always sja1105 ports, which is in fact true. But keep the dsa_port_is_sja1105 function so that it's easy to patch it during testing, and rely on dead code elimination. Fixes: 994d2cbb08ca ("net: dsa: tag_sja1105: be dsa_loop-safe") Link: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12net: dsa: move sja1110_process_meta_tstamp inside the tagging protocol driverVladimir Oltean
The problem is that DSA tagging protocols really must not depend on the switch driver, because this creates a circular dependency at insmod time, and the switch driver will effectively not load when the tagging protocol driver is missing. The code was structured in the way it was for a reason, though. The DSA driver-facing API for PTP timestamping relies on the assumption that two-step TX timestamps are provided by the hardware in an out-of-band manner, typically by raising an interrupt and making that timestamp available inside some sort of FIFO which is to be accessed over SPI/MDIO/etc. So the API puts .port_txtstamp into dsa_switch_ops, because it is expected that the switch driver needs to save some state (like put the skb into a queue until its TX timestamp arrives). On SJA1110, TX timestamps are provided by the switch as Ethernet packets, so this makes them be received and processed by the tagging protocol driver. This in itself is great, because the timestamps are full 64-bit and do not require reconstruction, and since Ethernet is the fastest I/O method available to/from the switch, PTP timestamps arrive very quickly, no matter how bottlenecked the SPI connection is, because SPI interaction is not needed at all. DSA's code structure and strict isolation between the tagging protocol driver and the switch driver break the natural code organization. When the tagging protocol driver receives a packet which is classified as a metadata packet containing timestamps, it passes those timestamps one by one to the switch driver, which then proceeds to compare them based on the recorded timestamp ID that was generated in .port_txtstamp. The communication between the tagging protocol and the switch driver is done through a method exported by the switch driver, sja1110_process_meta_tstamp. To satisfy build requirements, we force a dependency to build the tagging protocol driver as a module when the switch driver is a module. However, as explained in the first paragraph, that causes the circular dependency. To solve this, move the skb queue from struct sja1105_private :: struct sja1105_ptp_data to struct sja1105_private :: struct sja1105_tagger_data. The latter is a data structure for which hacks have already been put into place to be able to create persistent storage per switch that is accessible from the tagging protocol driver (see sja1105_setup_ports). With the skb queue directly accessible from the tagging protocol driver, we can now move sja1110_process_meta_tstamp into the tagging driver itself, and avoid exporting a symbol. Fixes: 566b18c8b752 ("net: dsa: sja1105: implement TX timestamping for SJA1110") Link: https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12devlink: Delete reload enable/disable interfaceLeon Romanovsky
Commit a0c76345e3d3 ("devlink: disallow reload operation during device cleanup") added devlink_reload_{enable,disable}() APIs to prevent reload operation from racing with device probe/dismantle. After recent changes to move devlink_register() to the end of device probe and devlink_unregister() to the beginning of device dismantle, these races can no longer happen. Reload operations will be denied if the devlink instance is unregistered and devlink_unregister() will block until all in-flight operations are done. Therefore, remove these devlink_reload_{enable,disable}() APIs. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12devlink: Allow control devlink ops behavior through feature maskLeon Romanovsky
Introduce new devlink call to set feature mask to control devlink behavior during device initialization phase after devlink_alloc() is already called. This allows us to set reload ops based on device property which is not known at the beginning of driver initialization. For the sake of simplicity, this API lacks any type of locking and needs to be called before devlink_register() to make sure that no parallel access to the ops is possible at this stage. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12devlink: Annotate devlink API callsLeon Romanovsky
Initial annotation patch to separate calls that needs to be executed before or after devlink_register(). Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12devlink: Move netdev_to_devlink helpers to devlink.cLeon Romanovsky
Both netdev_to_devlink and netdev_to_devlink_port are used in devlink.c only, so move them in order to reduce their scope. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12devlink: Reduce struct devlink exposureLeon Romanovsky
The declaration of struct devlink in general header provokes the situation where internal fields can be accidentally used by the driver authors. In order to reduce such possible situations, let's reduce the namespace exposure of struct devlink. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12net: dsa: fix spurious error message when unoffloaded port leaves bridgeAlvin Šipraga
Flip the sign of a return value check, thereby suppressing the following spurious error: port 2 failed to notify DSA_NOTIFIER_BRIDGE_LEAVE: -EOPNOTSUPP ... which is emitted when removing an unoffloaded DSA switch port from a bridge. Fixes: d371b7c92d19 ("net: dsa: Unset vlan_filtering when ports leave the bridge") Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20211012112730.3429157-1-alvin@pqrs.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-12SUNRPC: De-duplicate .pc_release() call sitesChuck Lever
There was some spaghetti in svc_process_common() that had evolved over time such that there was still one case that needed a call to .pc_release() but never made it. That issue was removed in the previous patch. As additional insurance against missing this important callout, ensure that the .pc_release() method is always called, no matter what the reply_stat is. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-12SUNRPC: Simplify the SVC dispatch code pathChuck Lever
Micro-optimization: The last user of the generic SVC dispatch code path has been removed, so svc_process_common() can be simplified. This declutters the hot path so that the by-far most common case (a dispatch function exists) is made the /only/ path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-12ipv6: ioam: move the check for undefined bitsJustin Iurman
The check for undefined bits in the trace type is moved from the input side to the output side, while the input side is relaxed and now inserts default empty values when an undefined bit is set. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-12net, neigh: Add NTF_MANAGED flag for managed neighbor entriesDaniel Borkmann
Allow a user space control plane to insert entries with a new NTF_EXT_MANAGED flag. The flag then indicates to the kernel that the neighbor entry should be periodically probed for keeping the entry in NUD_REACHABLE state iff possible. The use case for this is targeting XDP or tc BPF load-balancers which use the bpf_fib_lookup() BPF helper in order to piggyback on neighbor resolution for their backends. Given they cannot be resolved in fast-path, a control plane inserts the L3 (without L2) entries manually into the neighbor table and lets the kernel do the neighbor resolution either on the gateway or on the backend directly in case the latter resides in the same L2. This avoids to deal with L2 in the control plane and to rebuild what the kernel already does best anyway. NTF_EXT_MANAGED can be combined with NTF_EXT_LEARNED in order to avoid GC eviction. The kernel then adds NTF_MANAGED flagged entries to a per-neighbor table which gets triggered by the system work queue to periodically call neigh_event_send() for performing the resolution. The implementation allows migration from/to NTF_MANAGED neighbor entries, so that already existing entries can be converted by the control plane if needed. Potentially, we could make the interval for periodically calling neigh_event_send() configurable; right now it's set to DELAY_PROBE_TIME which is also in line with mlxsw which has similar driver-internal infrastructure c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table"). In future, the latter could possibly reuse the NTF_MANAGED neighbors as well. Example: # ./ip/ip n replace 192.168.178.30 dev enp5s0 managed extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a managed extern_learn REACHABLE [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Link: https://linuxplumbersconf.org/event/11/contributions/953/ Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-12net, neigh: Extend neigh->flags to 32 bit to allow for extensionsRoopa Prabhu
Currently, all bits in struct ndmsg's ndm_flags are used up with the most recent addition of 435f2e7cc0b7 ("net: bridge: add support for sticky fdb entries"). This makes it impossible to extend the neighboring subsystem with new NTF_* flags: struct ndmsg { __u8 ndm_family; __u8 ndm_pad1; __u16 ndm_pad2; __s32 ndm_ifindex; __u16 ndm_state; __u8 ndm_flags; __u8 ndm_type; }; There are ndm_pad{1,2} attributes which are not used. However, due to uncareful design, the kernel does not enforce them to be zero upon new neighbor entry addition, and given they've been around forever, it is not possible to reuse them today due to risk of breakage. One option to overcome this limitation is to add a new NDA_FLAGS_EXT attribute for extended flags. In struct neighbour, there is a 3 byte hole between protocol and ha_lock, which allows neigh->flags to be extended from 8 to 32 bits while still being on the same cacheline as before. This also allows for all future NTF_* flags being in neigh->flags rather than yet another flags field. Unknown flags in NDA_FLAGS_EXT will be rejected by the kernel. Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-12net, neigh: Enable state migration between NUD_PERMANENT and NTF_USEDaniel Borkmann
Currently, it is not possible to migrate a neighbor entry between NUD_PERMANENT state and NTF_USE flag with a dynamic NUD state from a user space control plane. Similarly, it is not possible to add/remove NTF_EXT_LEARNED flag from an existing neighbor entry in combination with NTF_USE flag. This is due to the latter directly calling into neigh_event_send() without any meta data updates as happening in __neigh_update(). Thus, to enable this use case, extend the latter with a NEIGH_UPDATE_F_USE flag where we break the NUD_PERMANENT state in particular so that a latter neigh_event_send() is able to re-resolve a neighbor entry. Before fix, NUD_PERMANENT -> NUD_* & NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] As can be seen, despite the admin-triggered replace, the entry remains in the NUD_PERMANENT state. After fix, NUD_PERMANENT -> NUD_* & NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn STALE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT [...] After the fix, the admin-triggered replace switches to a dynamic state from the NTF_USE flag which triggered a new neighbor resolution. Likewise, we can transition back from there, if needed, into NUD_PERMANENT. Similar before/after behavior can be observed for below transitions: Before fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] After fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] # ./ip/ip n replace 192.168.178.30 dev enp5s0 use # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [..] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-12net, neigh: Fix NTF_EXT_LEARNED in combination with NTF_USEDaniel Borkmann
The NTF_EXT_LEARNED neigh flag is usually propagated back to user space upon dump of the neighbor table. However, when used in combination with NTF_USE flag this is not the case despite exempting the entry from the garbage collector. This results in inconsistent state since entries are typically marked in neigh->flags with NTF_EXT_LEARNED, but here they are not. Fix it by propagating the creation flag to ___neigh_create(). Before fix: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE [...] After fix: # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE [...] Fixes: 9ce33e46531d ("neighbour: support for NTF_EXT_LEARNED flag") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-12af_unix: Rename UNIX-DGRAM to UNIX to maintain backwards compatabilityStephen Boyd
Then name of this protocol changed in commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") because that commit added stream support to the af_unix protocol. Renaming the existing protocol makes a ChromeOS protocol test[1] fail now that the name has changed in /proc/net/protocols from "UNIX" to "UNIX-DGRAM". Let's put the name back to how it was while keeping the stream protocol as "UNIX-STREAM" so that the procfs interface doesn't change. This fixes the test and maintains backwards compatibility in proc. Cc: Jiang Wang <jiang.wang@bytedance.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Cong Wang <cong.wang@bytedance.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Dmitry Osipenko <digetx@gmail.com> Link: https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/tast-tests/src/chromiumos/tast/local/bundles/cros/network/supported_protocols.go;l=50;drc=e8b1c3f94cb40a054f4aa1ef1aff61e75dc38f18 [1] Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-11nfc: nci: replace GPLv2 boilerplate with SPDXKrzysztof Kozlowski
Replace standard GPLv2 license text with SPDX tag. Although the comment mentions GPLv2-only, it refers to the full license file which allows later GPL versions. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-11nfc: drop unneeded debug printsKrzysztof Kozlowski
ftrace is a preferred and standard way to debug entering and exiting functions so drop useless debug prints. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-10net: make dev_get_port_parent_id slightly more readableAntoine Tenart
Cosmetic commit making dev_get_port_parent_id slightly more readable. There is no need to split the condition to return after calling devlink_compat_switch_id_get and after that 'recurse' is always true. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-10SUNRPC: Per-rpc_clnt task PIDsChuck Lever
The current range of RPC task PIDs is 0..65535. That's not adequate for distinguishing tasks across multiple rpc_clnts running high throughput workloads. To help relieve this situation and to reduce the bottleneck of having a single atomic for assigning all RPC task PIDs, assign task PIDs per rpc_clnt. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-09net: dsa: hold rtnl_lock in dsa_switch_setup_tag_protocolVladimir Oltean
It was a documented fact that ds->ops->change_tag_protocol() offered rtnetlink mutex protection to the switch driver, since there was an ASSERT_RTNL right before the call in dsa_switch_change_tag_proto() (initiated from sysfs). The blamed commit introduced another call path for ds->ops->change_tag_protocol() which does not hold the rtnl_mutex. This is: dsa_tree_setup -> dsa_tree_setup_switches -> dsa_switch_setup -> dsa_switch_setup_tag_protocol -> ds->ops->change_tag_protocol() -> dsa_port_setup -> dsa_slave_create -> register_netdevice(slave_dev) -> dsa_tree_setup_master -> dsa_master_setup -> dev->dsa_ptr = cpu_dp The reason why the rtnl_mutex is held in the sysfs call path is to ensure that, once the master and all the DSA interfaces are down (which is required so that no packets flow), they remain down during the tagging protocol change. The above calling order illustrates the fact that it should not be risky to change the initial tagging protocol to the one specified in the device tree at the given time: - packets cannot enter the dsa_switch_rcv() packet type handler since netdev_uses_dsa() for the master will not yet return true, since dev->dsa_ptr has not yet been populated - packets cannot enter the dsa_slave_xmit() function because no DSA interface has yet been registered So from the DSA core's perspective, holding the rtnl_mutex is indeed not necessary. Yet, drivers may need to do things which need rtnl_mutex protection. For example: felix_set_tag_protocol -> felix_setup_tag_8021q -> dsa_tag_8021q_register -> dsa_tag_8021q_setup -> dsa_tag_8021q_port_setup -> vlan_vid_add -> ASSERT_RTNL These drivers do not really have a choice to take the rtnl_mutex themselves, since in the sysfs case, the rtnl_mutex is already held. Fixes: deff710703d8 ("net: dsa: Allow default tag protocol to be overridden from DT") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>