summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2020-03-19net: bridge: vlan: include stats in dumps if requestedNikolay Aleksandrov
This patch adds support for vlan stats to be included when dumping vlan information. We have to dump them only when explicitly requested (thus the flag below) because that disables the vlan range compression and will make the dump significantly larger. In order to request the stats to be included we add a new dump attribute called BRIDGE_VLANDB_DUMP_FLAGS which can affect dumps with the following first flag: - BRIDGE_VLANDB_DUMPF_STATS The stats are intentionally nested and put into separate attributes to make it easier for extending later since we plan to add per-vlan mcast stats, drop stats and possibly STP stats. This is the last missing piece from the new vlan API which makes the dumped vlan information complete. A dump request which should include stats looks like: [BRIDGE_VLANDB_DUMP_FLAGS] |= BRIDGE_VLANDB_DUMPF_STATS A vlandb entry attribute with stats looks like: [BRIDGE_VLANDB_ENTRY] = { [BRIDGE_VLANDB_ENTRY_STATS] = { [BRIDGE_VLANDB_STATS_RX_BYTES] [BRIDGE_VLANDB_STATS_RX_PACKETS] ... } } Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19bpf: Support llvm-objcopy for vmlinux BTFFangrui Song
Simplify gen_btf logic to make it work with llvm-objcopy. The existing 'file format' and 'architecture' parsing logic is brittle and does not work with llvm-objcopy/llvm-objdump. 'file format' output of llvm-objdump>=11 will match GNU objdump, but 'architecture' (bfdarch) may not. .BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag because it is part of vmlinux image used for introspection. C code can reference the section via linker script defined __start_BTF and __stop_BTF. This fixes a small problem that previous .BTF had the SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data). Additionally, `objcopy -I binary` synthesized symbols _binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not used elsewhere) are replaced with more commonplace __start_BTF and __stop_BTF. Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns "empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?" We use a dd command to change the e_type field in the ELF header from ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting ET_EXEC as an input file is an extremely rare GNU ld feature that lld does not intend to support, because this is error-prone. The output section description .BTF in include/asm-generic/vmlinux.lds.h avoids potential subtle orphan section placement issues and suppresses --orphan-handling=warn warnings. Fixes: df786c9b9476 ("bpf: Force .BTF section start to zero when dumping from vmlinux") Fixes: cb0cc635c7a9 ("powerpc: Include .BTF section") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Stanislav Fomichev <sdf@google.com> Tested-by: Andrii Nakryiko <andriin@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://github.com/ClangBuiltLinux/linux/issues/871 Link: https://lore.kernel.org/bpf/20200318222746.173648-1-maskray@google.com
2020-03-19netfilter: nf_tables: allow to specify stateful expression in set definitionPablo Neira Ayuso
This patch allows users to specify the stateful expression for the elements in this set via NFTA_SET_EXPR. This new feature allows you to turn on counters for all of the elements in this set. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19netfilter: nf_tables: move nft_expr_clone() to nf_tables_api.cPablo Neira Ayuso
Move the nft_expr_clone() helper function to the core. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-18Merge tag 'mlx5-updates-2020-03-17' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2020-03-17 1) Compiler warnings and cleanup for the connection tracking series 2) Bug fixes for the connection tracking series 3) Fix devlink port register sequence 4) Last five patches in the series, By Eli cohen Add the support for forwarding traffic between two eswitch uplink representors (Hairpin for eswitch), using mlx5 termination tables to change the direction of a packet in hw from RX to TX pipeline. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-18netfilter: revert introduction of egress hookDaniel Borkmann
This reverts the following commits: 8537f78647c0 ("netfilter: Introduce egress hook") 5418d3881e1f ("netfilter: Generalize ingress hook") b030f194aed2 ("netfilter: Rename ingress hook include file") >From the discussion in [0], the author's main motivation to add a hook in fast path is for an out of tree kernel module, which is a red flag to begin with. Other mentioned potential use cases like NAT{64,46} is on future extensions w/o concrete code in the tree yet. Revert as suggested [1] given the weak justification to add more hooks to critical fast-path. [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/ [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Miller <davem@davemloft.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Alexei Starovoitov <ast@kernel.org> Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey. 2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi. Follow up patch to clean up this new mode, from Dan Carpenter. 3) Add support for geneve tunnel options, from Xin Long. 4) Make sets built-in and remove modular infrastructure for sets, from Florian Westphal. 5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing. 6) Statify nft_pipapo_get, from Chen Wandun. 7) Use C99 flexible-array member, from Gustavo A. R. Silva. 8) More descriptive variable names for bitwise, from Jeremy Sowden. 9) Four patches to add tunnel device hardware offload to the flowtable infrastructure, from wenxu. 10) pipapo set supports for 8-bit grouping, from Stefano Brivio. 11) pipapo can switch between nibble and byte grouping, also from Stefano. 12) Add AVX2 vectorized version of pipapo, from Stefano Brivio. 13) Update pipapo to be use it for single ranges, from Stefano. 14) Add stateful expression support to elements via control plane, eg. counter per element. 15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal. 15) Add new egress hook, from Lukas Wunner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: phylink: pcs: add 802.3 clause 45 helpersRussell King
Implement helpers for PCS accessed via the MII bus using 802.3 clause 45 cycles for 10GBASE-R. Only link up/down is supported, 10G full duplex is assumed. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: phylink: pcs: add 802.3 clause 22 helpersRussell King
Implement helpers for PCS accessed via the MII bus using 802.3 clause 22 cycles, conforming to 802.3 clause 37 and Cisco SGMII specifications for the advertisement word. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: mdiobus: add APIs for modifying a MDIO device registerRussell King
Add APIs for modifying a MDIO device register, similar to the existing phy_modify() group of functions, but at mdiobus level instead. Adapt __phy_modify_changed() to use the new mdiobus level helper. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: bridge: vlan options: add support for tunnel mapping set/delNikolay Aleksandrov
This patch adds support for manipulating vlan/tunnel mappings. The tunnel ids are globally unique and are one per-vlan. There were two trickier issues - first in order to support vlan ranges we have to compute the current tunnel id in the following way: - base tunnel id (attr) + current vlan id - starting vlan id This is in line how the old API does vlan/tunnel mapping with ranges. We already have the vlan range present, so it's redundant to add another attribute for the tunnel range end. It's simply base tunnel id + vlan range. And second to support removing mappings we need an out-of-band way to tell the option manipulating function because there are no special/reserved tunnel id values, so we use a vlan flag to denote the operation is tunnel mapping removal. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: bridge: vlan options: add support for tunnel id dumpingNikolay Aleksandrov
Add a new option - BRIDGE_VLANDB_ENTRY_TUNNEL_ID which is used to dump the tunnel id mapping. Since they're unique per vlan they can enter a vlan range if they're consecutive, thus we can calculate the tunnel id range map simply as: vlan range end id - vlan range start id. The starting point is the tunnel id in BRIDGE_VLANDB_ENTRY_TUNNEL_ID. This is similar to how the tunnel entries can be created in a range via the old API (a vlan range maps to a tunnel range). Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net_sched: sch_fq: enable use of hrtimer slackEric Dumazet
Add a new attribute to control the fq qdisc hrtimer slack. Default is set to 10 usec. When/if packets are throttled, fq set up an hrtimer that can lead to one interrupt per packet in the throttled queue. By using a timer slack, we allow better use of timer interrupts, by giving them a chance to call multiple timer callbacks at each hardware interrupt. Also, giving a slack allows FQ to dequeue batches of packets instead of a single one, thus increasing xmit_more efficiency. This has no negative effect on the rate a TCP flow can sustain, since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns) v2: added strict netlink checking (as feedback from Jakub Kicinski) Tested: 1000 concurrent flows all using paced packets. 1,000,000 packets sent per second. Before the patch : $ vmstat 2 10 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 60726784 23628 3485992 0 0 138 1 977 535 0 12 87 0 0 0 0 0 60714700 23628 3485628 0 0 0 0 1568827 26462 0 22 78 0 0 1 0 0 60716012 23628 3485656 0 0 0 0 1570034 26216 0 22 78 0 0 0 0 0 60722420 23628 3485492 0 0 0 0 1567230 26424 0 22 78 0 0 0 0 0 60727484 23628 3485556 0 0 0 0 1568220 26200 0 22 78 0 0 2 0 0 60718900 23628 3485380 0 0 0 40 1564721 26630 0 22 78 0 0 2 0 0 60718096 23628 3485332 0 0 0 0 1562593 26432 0 22 78 0 0 0 0 0 60719608 23628 3485064 0 0 0 0 1563806 26238 0 22 78 0 0 1 0 0 60722876 23628 3485236 0 0 0 130 1565874 26566 0 22 78 0 0 1 0 0 60722752 23628 3484908 0 0 0 0 1567646 26247 0 22 78 0 0 After the patch, slack of 10 usec, we can see a reduction of interrupts per second, and a small decrease of reported cpu usage. $ vmstat 2 10 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 0 60722564 23628 3484728 0 0 133 1 696 545 0 13 87 0 0 1 0 0 60722568 23628 3484824 0 0 0 0 977278 25469 0 20 80 0 0 0 0 0 60716396 23628 3484764 0 0 0 0 979997 25326 0 20 80 0 0 0 0 0 60713844 23628 3484960 0 0 0 0 981394 25249 0 20 80 0 0 2 0 0 60720468 23628 3484916 0 0 0 0 982860 25062 0 20 80 0 0 1 0 0 60721236 23628 3484856 0 0 0 0 982867 25100 0 20 80 0 0 1 0 0 60722400 23628 3484456 0 0 0 8 982698 25303 0 20 80 0 0 0 0 0 60715396 23628 3484428 0 0 0 0 981777 25176 0 20 80 0 0 0 0 0 60716520 23628 3486544 0 0 0 36 978965 27857 0 21 79 0 0 0 0 0 60719592 23628 3486516 0 0 0 22 977318 25106 0 20 80 0 0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net_sched: add qdisc_watchdog_schedule_range_ns()Eric Dumazet
Some packet schedulers might want to add a slack when programming hrtimers. This can reduce number of interrupts and increase batch sizes and thus give good xmit_more savings. This commit adds qdisc_watchdog_schedule_range_ns() helper, with an extra delta_ns parameter. Legacy qdisc_watchdog_schedule_n() becomes an inline passing a zero slack. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: rename flow_action_hw_stats_types* -> flow_action_hw_stats*Jakub Kicinski
flow_action_hw_stats_types_check() helper takes one of the FLOW_ACTION_HW_STATS_*_BIT values as input. If we align the arguments to the opening bracket of the helper there is no way to call this helper and stay under 80 characters. Remove the "types" part from the new flow_action helpers and enum values. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: phy: improve phy_driver callback handle_interruptHeiner Kallweit
did_interrupt() clears the interrupt, therefore handle_interrupt() can not check which event triggered the interrupt. To overcome this constraint and allow more flexibility for customer interrupt handlers, let's decouple handle_interrupt() from parts of the phylib interrupt handling. Custom interrupt handlers now have to implement the did_interrupt() functionality in handle_interrupt() if needed. Fortunately we have just one custom interrupt handler so far (in the mscc PHY driver), convert it to the changed API. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: ethtool: require drivers to set supported_coalesce_paramsJakub Kicinski
Now that all in-tree drivers have been updated we can make the supported_coalesce_params mandatory. To save debugging time in case some driver was missed (or is out of tree) add a warning when netdev is registered with set_coalesce but without supported_coalesce_params. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17Input: allocate keycode for "Selective Screenshot" keyRajat Jain
New Chrome OS keyboards have a "snip" key that is basically a selective screenshot (allows a user to select an area of screen to be copied). Allocate a keycode for it. Signed-off-by: Rajat Jain <rajatja@google.com> Reviewed-by: Harry Cutts <hcutts@chromium.org> Link: https://lore.kernel.org/r/20200313180333.75011-1-rajatja@google.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2020-03-17net/mlx5: Eswitch, enable forwarding back to uplink portEli Cohen
Add dependencny on cap termination_table_raw_traffic to allow non encapsulated packets received from uplink to be forwarded back to the received uplink port. Refactor the conditions into a separate function. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-18netfilter: Introduce egress hookLukas Wunner
Commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key") introduced the ability to classify packets on ingress. Allow the same on egress. Position the hook immediately before a packet is handed to tc and then sent out on an interface, thereby mirroring the ingress order. This order allows marking packets in the netfilter egress hook and subsequently using the mark in tc. Another benefit of this order is consistency with a lot of existing documentation which says that egress tc is performed after netfilter hooks. Egress hooks already exist for the most common protocols, such as NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because they are executed earlier during packet processing. However for more exotic protocols, there is currently no provision to apply netfilter on egress. A common workaround is to enslave the interface to a bridge and use ebtables, or to resort to tc. But when the ingress hook was introduced, consensus was that users should be given the choice to use netfilter or tc, whichever tool suits their needs best: https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/ This hook is also useful for NAT46/NAT64, tunneling and filtering of locally generated af_packet traffic such as dhclient. There have also been occasional user requests for a netfilter egress hook in the past, e.g.: https://www.spinics.net/lists/netfilter/msg50038.html Performance measurements with pktgen surprisingly show a speedup rather than a slowdown with this commit: * Without this commit: Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags) 2920481pps 1401Mb/sec (1401830880bps) errors: 0 * With this commit: Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags) 2941410pps 1411Mb/sec (1411876800bps) errors: 0 * Without this commit + tc egress: Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags) 2562631pps 1230Mb/sec (1230062880bps) errors: 0 * With this commit + tc egress: Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags) 2659259pps 1276Mb/sec (1276444320bps) errors: 0 * With this commit + nft egress: Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags) 2413320pps 1158Mb/sec (1158393600bps) errors: 0 Tested on a bare-metal Core i7-3615QM, each measurement was performed three times to verify that the numbers are stable. Commands to perform a measurement: modprobe pktgen echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000 Commands for testing tc egress: tc qdisc add dev lo clsact tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32 Commands for testing nft egress: nft add table netdev t nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \} nft add rule netdev t co ip daddr 4.3.2.1/32 drop All testing was performed on the loopback interface to avoid distorting measurements by the packet handling in the low-level Ethernet driver. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-18netfilter: Generalize ingress hookLukas Wunner
Prepare for addition of a netfilter egress hook by generalizing the ingress hook introduced by commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). In particular, rename and refactor the ingress hook's static inlines such that they can be reused for an egress hook. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-18netfilter: Rename ingress hook include fileLukas Wunner
Prepare for addition of a netfilter egress hook by renaming <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>. The egress hook also necessitates a refactoring of the include file, but that is done in a separate commit to ease reviewing. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-17bpf: Sanitize the bpf_struct_ops tcp-cc nameMartin KaFai Lau
The bpf_struct_ops tcp-cc name should be sanitized in order to avoid problematic chars (e.g. whitespaces). This patch reuses the bpf_obj_name_cpy() for accepting the same set of characters in order to keep a consistent bpf programming experience. A "size" param is added. Also, the strlen is returned on success so that the caller (like the bpf_tcp_ca here) can error out on empty name. The existing callers of the bpf_obj_name_cpy() only need to change the testing statement to "if (err < 0)". For all these existing callers, the err will be overwritten later, so no extra change is needed for the new strlen return value. v3: - reverse xmas tree style v2: - Save the orig_src to avoid "end - size" (Andrii) Fixes: 0baf26b0fcd7 ("bpf: tcp: Support tcp_congestion_ops in bpf") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200314010209.1131542-1-kafai@fb.com
2020-03-16netlink: add nl_set_extack_cookie_u32()Michal Kubecek
Similar to existing nl_set_extack_cookie_u64(), add new helper nl_set_extack_cookie_u32() which sets extack cookie to a u32 value. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)Era Mayflower
Netlink support of extended packet number cipher suites, allows adding and updating XPN macsec interfaces. Added support in: * Creating interfaces with GCM-AES-XPN-128 and GCM-AES-XPN-256 suites. * Setting and getting 64bit packet numbers with of SAs. * Setting (only on SA creation) and getting ssci of SAs. * Setting salt when installing a SAK. Added 2 cipher suite identifiers according to 802.1AE-2018 table 14-1: * MACSEC_CIPHER_ID_GCM_AES_XPN_128 * MACSEC_CIPHER_ID_GCM_AES_XPN_256 In addition, added 2 new netlink attribute types: * MACSEC_SA_ATTR_SSCI * MACSEC_SA_ATTR_SALT Depends on: macsec: Support XPN frame handling - IEEE 802.1AEbw. Signed-off-by: Era Mayflower <mayflowerera@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16macsec: Support XPN frame handling - IEEE 802.1AEbwEra Mayflower
Support extended packet number cipher suites (802.1AEbw) frames handling. This does not include the needed netlink patches. * Added xpn boolean field to `struct macsec_secy`. * Added ssci field to `struct_macsec_tx_sa` (802.1AE figure 10-5). * Added ssci field to `struct_macsec_rx_sa` (802.1AE figure 10-5). * Added salt field to `struct macsec_key` (802.1AE 10.7 NOTE 1). * Created pn_t type for easy access to lower and upper halves. * Created salt_t type for easy access to the "ssci" and "pn" parts. * Created `macsec_fill_iv_xpn` function to create IV in XPN mode. * Support in PN recovery and preliminary replay check in XPN mode. In addition, according to IEEE 802.1AEbw figure 10-5, the PN of incoming frame can be 0 when XPN cipher suite is used, so fixed the function `macsec_validate_skb` to fail on PN=0 only if XPN is off. Signed-off-by: Era Mayflower <mayflowerera@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: mii: add linkmode_adv_to_mii_adv_x()Russell King
Add a helper to convert a linkmode advertisement to a clause 37 advertisement value for 1000base-x and 2500base-x. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: mii: convert mii_lpa_to_ethtool_lpa_x() to linkmode variantRussell King
Add a LPA to linkmode decoder for 1000BASE-X protocols; this decoder only provides the modify semantics similar to other such decoders. This replaces the unused mii_lpa_to_ethtool_lpa_x() helper. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15Merge tag 'locking-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fix from Thomas Gleixner: "Fix for yet another subtle futex issue. The futex code used ihold() to prevent inodes from vanishing, but ihold() does not guarantee inode persistence. Replace the inode pointer with a per boot, machine wide, unique inode identifier. The second commit fixes the breakage of the hash mechanism which causes a 100% performance regression" * tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Unbreak futex hashing futex: Fix inode life-time issue
2020-03-15Merge tag 'iommu-fixes-v5.6-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU fixes from Joerg Roedel: - Intel VT-d fixes: - RCU list handling fixes - Replace WARN_TAINT with pr_warn + add_taint for reporting firmware issues - DebugFS fixes - Fix for hugepage handling in iova_to_phys implementation - Fix for handling VMD devices, which have a domain number which doesn't fit into 16 bits - Warning message fix - MSI allocation fix for iommu-dma code - Sign-extension fix for io page-table code - Fix for AMD-Vi to properly update the is-running bit when AVIC is used * tag 'iommu-fixes-v5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/vt-d: Populate debugfs if IOMMUs are detected iommu/amd: Fix IOMMU AVIC not properly update the is_run bit in IRTE iommu/vt-d: Ignore devices with out-of-spec domain number iommu/vt-d: Fix the wrong printing in RHSA parsing iommu/vt-d: Fix debugfs register reads iommu/vt-d: quirk_ioat_snb_local_iommu: replace WARN_TAINT with pr_warn + add_taint iommu/vt-d: dmar_parse_one_rmrr: replace WARN_TAINT with pr_warn + add_taint iommu/vt-d: dmar: replace WARN_TAINT with pr_warn + add_taint iommu/vt-d: Silence RCU-list debugging warnings iommu/vt-d: Fix RCU-list bugs in intel_iommu_init() iommu/dma: Fix MSI reservation allocation iommu/io-pgtable-arm: Fix IOVA validation for 32-bit iommu/vt-d: Fix a bug in intel_iommu_iova_to_phys() for huge page iommu/vt-d: Fix RCU list debugging warnings
2020-03-15netfilter: nf_tables: add nft_set_elem_update_expr() helper functionPablo Neira Ayuso
This helper function runs the eval path of the stateful expression of an existing set element. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: statify nft_expr_init()Pablo Neira Ayuso
Not exposed anymore to modules, statify this function. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: add nft_set_elem_expr_alloc()Pablo Neira Ayuso
Add helper function to create stateful expression. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Introduce AVX2-based lookup implementationStefano Brivio
If the AVX2 set is available, we can exploit the repetitive characteristic of this algorithm to provide a fast, vectorised version by using 256-bit wide AVX2 operations for bucket loads and bitwise intersections. In most cases, this implementation consistently outperforms rbtree set instances despite the fact they are configured to use a given, single, ranged data type out of the ones used for performance measurements by the nft_concat_range.sh kselftest. That script, injecting packets directly on the ingoing device path with pktgen, reports, averaged over five runs on a single AMD Epyc 7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below. CONFIG_RETPOLINE was not set here. Note that this is not a fair comparison over hash and rbtree set types: non-ranged entries (used to have a reference for hash types) would be matched faster than this, and matching on a single field only (which is the case for rbtree) is also significantly faster. However, it's not possible at the moment to choose this set type for non-ranged entries, and the current implementation also needs a few minor adjustments in order to match on less than two fields. ---------------.-----------------------------------.------------. AMD Epyc 7402 | baselines, Mpps | this patch | 1 thread |___________________________________|____________| 3.35GHz | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | | ---------------| hook | no | single | | pipapo | type entries | drop | ranges | field | pipapo | AVX2 | ---------------|--------|--------|--------|--------|------------| net,port | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 4.0 | 7.5 +87% | ---------------|--------|--------|--------|--------|------------| port,net | | | | | | 100 | 18.8 | 10.3 | 5.8 | 6.3 | 8.1 +29% | ---------------|--------|--------|--------|--------|------------| net6,port | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 2.1 | 4.8 +128% | ---------------|--------|--------|--------|--------|------------| port,proto | | | | | | 30000 | 19.6 | 11.6 | 3.9 | 0.5 | 2.6 +420% | ---------------|--------|--------|--------|--------|------------| net6,port,mac | | | | | | 10 | 16.5 | 5.4 | 4.3 | 3.4 | 4.7 +38% | ---------------|--------|--------|--------|--------|------------| net6,port,mac, | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 1.4 | 3.6 +26% | ---------------|--------|--------|--------|--------|------------| net,mac | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 2.5 | 6.4 +156% | ---------------'--------'--------'--------'--------'------------' A similar strategy could be easily reused to implement specialised versions for other SIMD sets, and I plan to post at least a NEON version at a later time. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: flowtable: add tunnel match offload supportwenxu
This patch support both ipv4 and ipv6 tunnel_id, tunnel_src and tunnel_dst match for flowtable offload Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] Lastly, fix checkpatch.pl warning WARNING: __aligned(size) is preferred over __attribute__((aligned(size))) in net/bridge/netfilter/ebtables.c This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: make all set structs constFlorian Westphal
They do not need to be writeable anymore. v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano) Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nf_tables: make sets built-inFlorian Westphal
Placing nftables set support in an extra module is pointless: 1. nf_tables needs dynamic registeration interface for sake of one module 2. nft heavily relies on sets, e.g. even simple rule like "nft ... tcp dport { 80, 443 }" will not work with _SETS=n. IOW, either nftables isn't used or both nf_tables and nf_tables_set modules are needed anyway. With extra module: 307K net/netfilter/nf_tables.ko 79K net/netfilter/nf_tables_set.ko text data bss dec filename 146416 3072 545 150033 nf_tables.ko 35496 1817 0 37313 nf_tables_set.ko This patch: 373K net/netfilter/nf_tables.ko 178563 4049 545 183157 nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: nft_tunnel: add support for geneve optsXin Long
Like vxlan and erspan opts, geneve opts should also be supported in nft_tunnel. The difference is geneve RFC (draft-ietf-nvo3-geneve-14) allows a geneve packet to carry multiple geneve opts. So with this patch, nftables/libnftnl would do: # nft add table ip filter # nft add chain ip filter input { type filter hook input priority 0 \; } # nft add tunnel filter geneve_02 { type geneve\; id 2\; \ ip saddr 192.168.1.1\; ip daddr 192.168.1.2\; \ sport 9000\; dport 9001\; dscp 1234\; ttl 64\; flags 1\; \ opts \"1:1:34567890,2:2:12121212,3:3:1212121234567890\"\; } # nft list tunnels table filter table ip filter { tunnel geneve_02 { id 2 ip saddr 192.168.1.1 ip daddr 192.168.1.2 sport 9000 dport 9001 tos 18 ttl 64 flags 1 geneve opts 1:1:34567890,2:2:12121212,3:3:1212121234567890 } } v1->v2: - no changes, just post it separately. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15netfilter: xtables: Add snapshot of hardidletimer targetManoj Basapathi
This is a snapshot of hardidletimer netfilter target. This patch implements a hardidletimer Xtables target that can be used to identify when interfaces have been idle for a certain period of time. Timers are identified by labels and are created when a rule is set with a new label. The rules also take a timeout value (in seconds) as an option. If more than one rule uses the same timer label, the timer will be restarted whenever any of the rules get a hit. One entry for each timer is created in sysfs. This attribute contains the timer remaining for the timer to expire. The attributes are located under the xt_idletimer class: /sys/class/xt_idletimer/timers/<label> When the timer expires, the target module sends a sysfs notification to the userspace, which can then decide what to do (eg. disconnect to save power) Compared to IDLETIMER, HARDIDLETIMER can send notifications when CPU is in suspend too, to notify the timer expiry. v1->v2: Moved all functionality into IDLETIMER module to avoid code duplication per comment from Florian. Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-14net: sched: RED: Introduce an ECN nodrop modePetr Machata
When the RED Qdisc is currently configured to enable ECN, the RED algorithm is used to decide whether a certain SKB should be marked. If that SKB is not ECN-capable, it is early-dropped. It is also possible to keep all traffic in the queue, and just mark the ECN-capable subset of it, as appropriate under the RED algorithm. Some switches support this mode, and some installations make use of it. To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is configured with this flag, non-ECT traffic is enqueued instead of being early-dropped. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14net: sched: Allow extending set of supported RED flagsPetr Machata
The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool of global RED flags. These are passed in tc_red_qopt.flags. However none of these qdiscs validate the flag field, and just copy it over wholesale to internal structures, and later dump it back. (An exception is GRED, which does validate for VQs -- however not for the main setup.) A broken userspace can therefore configure a qdisc with arbitrary unsupported flags, and later expect to see the flags on qdisc dump. The current ABI therefore allows storage of several bits of custom data to qdisc instances of the types mentioned above. How many bits, depends on which flags are meaningful for the qdisc in question. E.g. SFQ recognizes flags ECN and HARDDROP, and the rest is not interpreted. If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it, and at the same time it needs to retain the possibility to store 6 bits of uninterpreted data. Likewise RED, which adds a new flag later in this patchset. To that end, this patch adds a new function, red_get_flags(), to split the passed flags of RED-like qdiscs to flags and user bits, and red_validate_flags() to validate the resulting configuration. It further adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14net: phy: Add XLGMII interface defineJose Abreu
Add a define for XLGMII interface. Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "A small collection of fixes. I'll make another sweep soon to look for more fixes for this -rc series. - Mark device node const in of_clk_get_parent APIs to ease landing changes in users later - Fix flag for Qualcomm SC7180 video clocks where we thought it would never turn off but actually hardware takes care of it - Remove disp_cc_mdss_rscc_ahb_clk on Qualcomm SC7180 SoCs because this clk is always on anyway - Correct some bad dt-binding numbers for i.MX8MN SoCs" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: imx8mn: Fix incorrect clock defines clk: qcom: dispcc: Remove support of disp_cc_mdss_rscc_ahb_clk clk: qcom: videocc: Update the clock flag for video_cc_vcodec0_core_clk of: clk: Make of_clk_get_parent_{count,name}() parameter const
2020-03-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-03-13 The following pull-request contains BPF updates for your *net-next* tree. We've added 86 non-merge commits during the last 12 day(s) which contain a total of 107 files changed, 5771 insertions(+), 1700 deletions(-). The main changes are: 1) Add modify_return attach type which allows to attach to a function via BPF trampoline and is run after the fentry and before the fexit programs and can pass a return code to the original caller, from KP Singh. 2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher objects to be visible in /proc/kallsyms so they can be annotated in stack traces, from Jiri Olsa. 3) Extend BPF sockmap to allow for UDP next to existing TCP support in order in order to enable this for BPF based socket dispatch, from Lorenz Bauer. 4) Introduce a new bpftool 'prog profile' command which attaches to existing BPF programs via fentry and fexit hooks and reads out hardware counters during that period, from Song Liu. Example usage: bpftool prog profile id 337 duration 3 cycles instructions llc_misses 4228 run_cnt 3403698 cycles (84.08%) 3525294 instructions # 1.04 insn per cycle (84.05%) 13 llc_misses # 3.69 LLC misses per million isns (83.50%) 5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition of a new bpf_link abstraction to keep in particular BPF tracing programs attached even when the applicaion owning them exits, from Andrii Nakryiko. 6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering and which returns the PID as seen by the init namespace, from Carlos Neira. 7) Refactor of RISC-V JIT code to move out common pieces and addition of a new RV32G BPF JIT compiler, from Luke Nelson. 8) Add gso_size context member to __sk_buff in order to be able to know whether a given skb is GSO or not, from Willem de Bruijn. 9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output implementation but can be called from tracepoint programs, from Eelco Chaudron. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13afs: Fix client call Rx-phase signal handlingDavid Howells
Fix the handling of signals in client rxrpc calls made by the afs filesystem. Ignore signals completely, leaving call abandonment or connection loss to be detected by timeouts inside AF_RXRPC. Allowing a filesystem call to be interrupted after the entire request has been transmitted and an abort sent means that the server may or may not have done the action - and we don't know. It may even be worse than that for older servers. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13afs: Fix some tracing detailsDavid Howells
Fix a couple of tracelines to indicate the usage count after the atomic op, not the usage count before it to be consistent with other afs and rxrpc trace lines. Change the wording of the afs_call_trace_work trace ID label from "WORK" to "QUEUE" to reflect the fact that it's queueing work, not doing work. Fixes: 341f741f04be ("afs: Refcount the afs_call struct") Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13rxrpc: Fix call interruptibility handlingDavid Howells
Fix the interruptibility of kernel-initiated client calls so that they're either only interruptible when they're waiting for a call slot to come available or they're not interruptible at all. Either way, they're not interruptible during transmission. This should help prevent StoreData calls from being interrupted when writeback is in progress. It doesn't, however, handle interruption during the receive phase. Userspace-initiated calls are still interruptable. After the signal has been handled, sendmsg() will return the amount of data copied out of the buffer and userspace can perform another sendmsg() call to continue transmission. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13bpf: Remove bpf_image treeJiri Olsa
Now that we have all the objects (bpf_prog, bpf_trampoline, bpf_dispatcher) linked in bpf_tree, there's no need to have separate bpf_image tree for images. Reverting the bpf_image tree together with struct bpf_image, because it's no longer needed. Also removing bpf_image_alloc function and adding the original bpf_jit_alloc_exec_page interface instead. The kernel_text_address function can now rely only on is_bpf_text_address, because it checks the bpf_tree that contains all the objects. Keeping bpf_image_ksym_add and bpf_image_ksym_del because they are useful wrappers with perf's ksymbol interface calls. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-13-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-13bpf: Add dispatchers to kallsymsJiri Olsa
Adding dispatchers to kallsyms. It's displayed as bpf_dispatcher_<NAME> where NAME is the name of dispatcher. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200312195610.346362-12-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>