summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2022-02-10net: mac802154: Set durations automaticallyMiquel Raynal
As depicted in the IEEE 802.15.4 specification, modulation/bands are tight to a number of page/channels so we can for most of them derive the durations automatically. The two locations that must call this new helper to set the variou symbol durations are: - when manually requesting a channel change though the netlink interface - at PHY creation, once the device driver has set the default page/channel If an information is missing, the symbol duration is not touched, a debug message is eventually printed. This keeps the compatibility with the unconverted drivers for which it was too complicated for me to find their precise information. If they initially provided a symbol duration, it would be kept. If they don't, the symbol duration value is left untouched. Once the symbol duration derived, the lifs and sifs durations are updated as well. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220201180629.93410-4-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-02-10net: mac802154: Convert the symbol duration into nanosecondsMiquel Raynal
Tdsym is often given in the spec as pretty small numbers in microseconds and hence was reflected in the code as symbol_duration and was stored as a u8. Actually, for UWB PHYs, the symbol duration is given in nanoseconds and are as precise as picoseconds. In order to handle better these PHYs, change the type of symbol_duration to u32 and store this value in nanoseconds. All the users of this variable are updated in a mechanical way. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220201180629.93410-3-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-02-10Merge tag 'ieee802154-for-davem-2022-02-10' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next Stefan Schmidt says: ==================== pull-request: ieee802154-next 2022-02-10 An update from ieee802154 for your *net-next* tree. There is more ongoing in ieee802154 than usual. This will be the first pull request for this cycle, but I expect one more. Depending on review and rework times. Pavel Skripkin ported the atusb driver over to the new USB api to avoid unint problems as well as making use of the modern api without kmalloc() needs in he driver. Miquel Raynal landed some changes to ensure proper frame checksum checking with hwsim, documenting our use of wake and stop_queue and eliding a magic value by using the proper define. David Girault documented the address struct used in ieee802154. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next 1) Conntrack sets on CHECKSUM_UNNECESSARY for UDP packet with no checksum, from Kevin Mitchell. 2) skb->priority support for nfqueue, from Nicolas Dichtel. 3) Remove conntrack extension register API, from Florian Westphal. 4) Move nat destroy hook to nf_nat_hook instead, to remove nf_ct_ext_destroy(), also from Florian. 5) Wrap pptp conntrack NAT hooks into single structure, from Florian Westphal. 6) Support for tcp option set to noop for nf_tables, also from Florian. 7) Do not run x_tables comment match from packet path in nf_tables, from Florian Westphal. 8) Replace spinlock by cmpxchg() loop to update missed ct event, from Florian Westphal. 9) Wrap cttimeout hooks into single structure, from Florian. 10) Add fast nft_cmp expression for up to 16-bytes. 11) Use cb->ctx to store context in ctnetlink dump, instead of using cb->args[], from Florian Westphal. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: ctnetlink: use dump structure instead of raw args nfqueue: enable to set skb->priority netfilter: nft_cmp: optimize comparison for 16-bytes netfilter: cttimeout: use option structure netfilter: ecache: don't use nf_conn spinlock netfilter: nft_compat: suppress comment match netfilter: exthdr: add support for tcp option removal netfilter: conntrack: pptp: use single option structure netfilter: conntrack: remove extension register api netfilter: conntrack: handle ->destroy hook via nat_ops instead netfilter: conntrack: move extension sizes into core netfilter: conntrack: make all extensions 8-byte alignned netfilter: nfqueue: enable to get skb->priority netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY ==================== Link: https://lore.kernel.org/r/20220209133616.165104-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-02-09 We've added 126 non-merge commits during the last 16 day(s) which contain a total of 201 files changed, 4049 insertions(+), 2215 deletions(-). The main changes are: 1) Add custom BPF allocator for JITs that pack multiple programs into a huge page to reduce iTLB pressure, from Song Liu. 2) Add __user tagging support in vmlinux BTF and utilize it from BPF verifier when generating loads, from Yonghong Song. 3) Add per-socket fast path check guarding from cgroup/BPF overhead when used by only some sockets, from Pavel Begunkov. 4) Continued libbpf deprecation work of APIs/features and removal of their usage from samples, selftests, libbpf & bpftool, from Andrii Nakryiko and various others. 5) Improve BPF instruction set documentation by adding byte swap instructions and cleaning up load/store section, from Christoph Hellwig. 6) Switch BPF preload infra to light skeleton and remove libbpf dependency from it, from Alexei Starovoitov. 7) Fix architecture-agnostic macros in libbpf for accessing syscall arguments from BPF progs for non-x86 architectures, from Ilya Leoshkevich. 8) Rework port members in struct bpf_sk_lookup and struct bpf_sock to be of 16-bit field with anonymous zero padding, from Jakub Sitnicki. 9) Add new bpf_copy_from_user_task() helper to read memory from a different task than current. Add ability to create sleepable BPF iterator progs, from Kenny Yu. 10) Implement XSK batching for ice's zero-copy driver used by AF_XDP and utilize TX batching API from XSK buffer pool, from Maciej Fijalkowski. 11) Generate temporary netns names for BPF selftests to avoid naming collisions, from Hangbin Liu. 12) Implement bpf_core_types_are_compat() with limited recursion for in-kernel usage, from Matteo Croce. 13) Simplify pahole version detection and finally enable CONFIG_DEBUG_INFO_DWARF5 to be selected with CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor. 14) Misc minor fixes to libbpf and selftests from various folks. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (126 commits) selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide libbpf: Fix compilation warning due to mismatched printf format selftests/bpf: Test BPF_KPROBE_SYSCALL macro libbpf: Add BPF_KPROBE_SYSCALL macro libbpf: Fix accessing the first syscall argument on s390 libbpf: Fix accessing the first syscall argument on arm64 libbpf: Allow overriding PT_REGS_PARM1{_CORE}_SYSCALL selftests/bpf: Skip test_bpf_syscall_macro's syscall_arg1 on arm64 and s390 libbpf: Fix accessing syscall arguments on riscv libbpf: Fix riscv register names libbpf: Fix accessing syscall arguments on powerpc selftests/bpf: Use PT_REGS_SYSCALL_REGS in bpf_syscall_macro libbpf: Add PT_REGS_SYSCALL_REGS macro selftests/bpf: Fix an endianness issue in bpf_syscall_macro test bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE bpf: Fix leftover header->pages in sparc and powerpc code. libbpf: Fix signedness bug in btf_dump_array_data() selftests/bpf: Do not export subtest as standalone test bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failures ... ==================== Link: https://lore.kernel.org/r/20220209210050.8425-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag controlMatt Johnston
This change adds a couple of new ioctls for mctp sockets: SIOCMCTPALLOCTAG and SIOCMCTPDROPTAG. These ioctls provide facilities for explicit allocation / release of tags, overriding the automatic allocate-on-send/release-on-reply and timeout behaviours. This allows userspace more control over messages that may not fit a simple request/response model. In order to indicate a pre-allocated tag to the sendmsg() syscall, we introduce a new flag to the struct sockaddr_mctp.smctp_tag value: MCTP_TAG_PREALLOC. Additional changes from Jeremy Kerr <jk@codeconstruct.com.au>. Contains a fix that was: Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09mctp: Add helper for address match checkingJeremy Kerr
Currently, we have a couple of paths that check that an EID matches, or the match value is MCTP_ADDR_ANY. Rather than open coding this, add a little helper. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09net: fix a memleak when uncloning an skb dst and its metadataAntoine Tenart
When uncloning an skb dst and its associated metadata, a new dst+metadata is allocated and later replaces the old one in the skb. This is helpful to have a non-shared dst+metadata attached to a specific skb. The issue is the uncloned dst+metadata is initialized with a refcount of 1, which is increased to 2 before attaching it to the skb. When tun_dst_unclone returns, the dst+metadata is only referenced from a single place (the skb) while its refcount is 2. Its refcount will never drop to 0 (when the skb is consumed), leading to a memory leak. Fix this by removing the call to dst_hold in tun_dst_unclone, as the dst+metadata refcount is already 1. Fixes: fc4099f17240 ("openvswitch: Fix egress tunnel info.") Cc: Pravin B Shelar <pshelar@ovn.org> Reported-by: Vlad Buslov <vladbu@nvidia.com> Tested-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09net: do not keep the dst cache when uncloning an skb dst and its metadataAntoine Tenart
When uncloning an skb dst and its associated metadata a new dst+metadata is allocated and the tunnel information from the old metadata is copied over there. The issue is the tunnel metadata has references to cached dst, which are copied along the way. When a dst+metadata refcount drops to 0 the metadata is freed including the cached dst entries. As they are also referenced in the initial dst+metadata, this ends up in UaFs. In practice the above did not happen because of another issue, the dst+metadata was never freed because its refcount never dropped to 0 (this will be fixed in a subsequent patch). Fix this by initializing the dst cache after copying the tunnel information from the old metadata to also unshare the dst cache. Fixes: d71785ffc7e7 ("net: add dst_cache to ovs vxlan lwtunnel") Cc: Paolo Abeni <pabeni@redhat.com> Reported-by: Vlad Buslov <vladbu@nvidia.com> Tested-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09netfilter: nft_cmp: optimize comparison for 16-bytesPablo Neira Ayuso
Allow up to 16-byte comparisons with a new cmp fast version. Use two 64-bit words and calculate the mask representing the bits to be compared. Make sure the comparison is 64-bit aligned and avoid out-of-bound memory access on registers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09netfilter: cttimeout: use option structureFlorian Westphal
Instead of two exported functions, export a single option structure. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09netfilter: ecache: don't use nf_conn spinlockFlorian Westphal
For updating eache missed value we can use cmpxchg. This also avoids need to disable BH. kernel robot reported build failure on v1 because not all arches support cmpxchg for u16, so extend this to u32. This doesn't increase struct size, existing padding is used. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-08net: dsa: typo in commentLuiz Angelo Daros de Luca
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220208053210.14831-1-luizluca@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08ipv6/addrconf: use one delayed work per netnsEric Dumazet
Next step for using per netns inet6_addr_lst is to have per netns work item to ultimately call addrconf_verify_rtnl() and addrconf_verify() with a new 'struct net*' argument. Everything is still using the global inet6_addr_lst[] table. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08ipv6/addrconf: allocate a per netns hash tableEric Dumazet
Add a per netns hash table and a dedicated spinlock, first step to get rid of the global inet6_addr_lst[] one. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-07ipv4: Reject routes specifying ECN bits in rtm_tosGuillaume Nault
Use the new dscp_t type to replace the fc_tos field of fib_config, to ensure IPv4 routes aren't influenced by ECN bits when configured with non-zero rtm_tos. Before this patch, IPv4 routes specifying an rtm_tos with some of the ECN bits set were accepted. However they wouldn't work (never match) as IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB lookup (although a few buggy code paths don't). After this patch, IPv4 routes specifying an rtm_tos with any ECN bit set is rejected. Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted, but treated as if it were 0. Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-07ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rulesGuillaume Nault
Define a dscp_t type and its appropriate helpers that ensure ECN bits are not taken into account when handling DSCP. Use this new type to replace the tclass field of struct fib6_rule, so that fib6-rules don't get influenced by ECN bits anymore. Before this patch, fib6-rules didn't make any distinction between the DSCP and ECN bits. Therefore, rules specifying a DSCP (tos or dsfield options in iproute2) stopped working as soon a packets had at least one of its ECN bits set (as a work around one could create four rules for each DSCP value to match, one for each possible ECN value). After this patch fib6-rules only compare the DSCP bits. ECN doesn't influence the result anymore. Also, fib6-rules now must have the ECN bits cleared or they will be rejected. Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-06net: initialize init_net earlierEric Dumazet
While testing a patch that will follow later ("net: add netns refcount tracker to struct nsproxy") I found that devtmpfs_init() was called before init_net was initialized. This is a bug, because devtmpfs_setup() calls ksys_unshare(CLONE_NEWNS); This has the effect of increasing init_net refcount, which will be later overwritten to 1, as part of setup_net(&init_net) We had too many prior patches [1] trying to work around the root cause. Really, make sure init_net is in BSS section, and that net_ns_init() is called earlier at boot time. Note that another patch ("vfs: add netns refcount tracker to struct fs_context") also will need net_ns_init() being called before vfs_caches_init() As a bonus, this patch saves around 4KB in .data section. [1] f8c46cb39079 ("netns: do not call pernet ops for not yet set up init_net namespace") b5082df8019a ("net: Initialise init_net.count to 1") 734b65417b24 ("net: Statically initialize init_net.dev_base_head") v2: fixed a build error reported by kernel build bots (CONFIG_NET=n) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05net: gro: minor optimization for dev_gro_receive()Paolo Abeni
While inspecting some perf report, I noticed that the compiler emits suboptimal code for the napi CB initialization, fetching and storing multiple times the memory for flags bitfield. This is with gcc 10.3.1, but I observed the same with older compiler versions. We can help the compiler to do a nicer work clearing several fields at once using an u32 alias. The generated code is quite smaller, with the same number of conditional. Before: objdump -t net/core/gro.o | grep " F .text" 0000000000000bb0 l F .text 0000000000000357 dev_gro_receive After: 0000000000000bb0 l F .text 000000000000033c dev_gro_receive v1 -> v2: - use struct_group (Alexander and Alex) RFC -> v1: - use __struct_group to delimit the zeroed area (Alexander) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05net/sched: Enable tc skb ext allocation on chain miss only when neededPaul Blakey
Currently tc skb extension is used to send miss info from tc to ovs datapath module, and driver to tc. For the tc to ovs miss it is currently always allocated even if it will not be used by ovs datapath (as it depends on a requested feature). Export the static key which is used by openvswitch module to guard this code path as well, so it will be skipped if ovs datapath doesn't need it. Enable this code path once ovs datapath needs it. Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04net: don't include ndisc.h from ipv6.hJakub Kicinski
Nothing in ipv6.h needs ndisc.h, drop it. Link: https://lore.kernel.org/r/20220203043457.2222388-1-kuba@kernel.org Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Link: https://lore.kernel.org/r/20220203231240.2297588-1-kuba@kernel.org Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04mac80211: fix struct ieee80211_tx_info sizeAvraham Stern
The size of the status_driver_data field was not adjusted when the is_valid_ack_signal field was added. Since the size of struct ieee80211_tx_info is limited, replace the is_valid_ack_signal field with a flags field, and adjust the struct size accordingly. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20220202104617.0ff363d4fa56.I45792c0187034a6d0e1c99a7db741996ef7caba3@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-02-04ieee80211: radiotap: fix -Wcast-qual warningsJohannes Berg
When enabling -Wcast-qual e.g. via W=3, we get a lot of warnings from this file, whenever it's included. Since the fixes are simple, just do that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20220202104617.cc733aeb1a18.I03396e1bf7a1af364cbd0916037f65d800035039@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-02-04cfg80211: fix -Wcast-qual warningsJohannes Berg
When enabling -Wcast-qual e.g. via W=3, we get a lot of warnings from this file, whenever it's included. Since the fixes are simple, just do that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20220202104617.6a1d52213019.I92d82e7251cf712faa43fd09db3142327a3bce3d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-02-04netfilter: conntrack: remove extension register apiFlorian Westphal
These no longer register/unregister a meaningful structure so remove it. Cc: Paul Blakey <paulb@nvidia.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04netfilter: conntrack: handle ->destroy hook via nat_ops insteadFlorian Westphal
The nat module already exposes a few functions to the conntrack core. Move the nat extension destroy hook to it. After this, no conntrack extension needs a destroy hook. 'struct nf_ct_ext_type' and the register/unregister api can be removed in a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04netfilter: conntrack: move extension sizes into coreFlorian Westphal
No need to specify this in the registration modules, we already collect all sizes for build-time checks on the maximum combined size. After this change, all extensions except nat have no meaningful content in their nf_ct_ext_type struct definition. Next patch handles nat, this will then allow to remove the dynamic register api completely. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04netfilter: conntrack: make all extensions 8-byte alignnedFlorian Westphal
All extensions except one need 8 byte alignment, so just make that the default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03ax25: fix reference count leaks of ax25_devDuoming Zhou
The previous commit d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") introduces refcount into ax25_dev, but there are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(), ax25_rt_add(), ax25_rt_del() and ax25_rt_opt(). This patch uses ax25_dev_put() and adjusts the position of ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev. Fixes: d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03net: dsa: mv88e6xxx: Improve multichip isolation of standalone portsTobias Waldekranz
Given that standalone ports are now configured to bypass the ATU and forward all frames towards the upstream port, extend the ATU bypass to multichip systems. Load VID 0 (standalone) into the VTU with the policy bit set. Since VID 4095 (bridged) is already loaded, we now know that all VIDs in use are always available in all VTUs. Therefore, we can safely enable 802.1Q on DSA ports. Setting the DSA ports' VTU policy to TRAP means that all incoming frames on VID 0 will be classified as MGMT - as a result, the ATU is bypassed on all subsequent switches. With this isolation in place, we are able to support configurations that are simultaneously very quirky and very useful. Quirky because it involves looping cables between local switchports like in this example: CPU | .------. .---0---. | .----0----. | sw0 | | | sw1 | '-1-2-3-' | '-1-2-3-4-' $ @ '---' $ @ % % We have three physically looped pairs ($, @, and %). This is very useful because it allows us to run the kernel's kselftests for the bridge on mv88e6xxx hardware. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-03net: dsa: mv88e6xxx: Improve isolation of standalone portsTobias Waldekranz
Clear MapDA on standalone ports to bypass any ATU lookup that might point the packet in the wrong direction. This means that all packets are flooded using the PVT config. So make sure that standalone ports are only allowed to communicate with the local upstream port. Here is a scenario in which this is needed: CPU | .----. .---0---. | .--0--. | sw0 | | | sw1 | '-1-2-3-' | '-1-2-' '---' - sw0p1 and sw1p1 are bridged - sw0p2 and sw1p2 are in standalone mode - Learning must be enabled on sw0p3 in order for hardware forwarding to work properly between bridged ports 1. A packet with SA :aa comes in on sw1p2 1a. Egresses sw1p0 1b. Ingresses sw0p3, ATU adds an entry for :aa towards port 3 1c. Egresses sw0p0 2. A packet with DA :aa comes in on sw0p2 2a. If an ATU lookup is done at this point, the packet will be incorrectly forwarded towards sw0p3. With this change in place, the ATU is bypassed and the packet is forwarded in accordance with the PVT, which only contains the CPU port. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-03page_pool: Refactor page_pool to enable fragmenting after allocationAlexander Duyck
This change is meant to permit a driver to perform "fragmenting" of the page from within the driver instead of the current model which requires pre-partitioning the page. The main motivation behind this is to support use cases where the page will be split up by the driver after DMA instead of before. With this change it becomes possible to start using page pool to replace some of the existing use cases where multiple references were being used for a single page, but the number needed was unknown as the size could be dynamic. For example, with this code it would be possible to do something like the following to handle allocation: page = page_pool_alloc_pages(); if (!page) return NULL; page_pool_fragment_page(page, DRIVER_PAGECNT_BIAS_MAX); rx_buf->page = page; rx_buf->pagecnt_bias = DRIVER_PAGECNT_BIAS_MAX; Then we would process a received buffer by handling it with: rx_buf->pagecnt_bias--; Once the page has been fully consumed we could then flush the remaining instances with: if (page_pool_defrag_page(page, rx_buf->pagecnt_bias)) continue; page_pool_put_defragged_page(pool, page -1, !!budget); The general idea is that we want to have the ability to allocate a page with excess fragment count and then trim off the unneeded fragments. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02net, neigh: Do not trigger immediate probes on NUD_FAILED from ↵Daniel Borkmann
neigh_managed_work syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]: kworker/0:16/14617 is trying to acquire lock: ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652 [...] but task is already holding lock: ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572 The neighbor entry turned to NUD_FAILED state, where __neigh_event_send() triggered an immediate probe as per commit cd28ca0a3dd1 ("neigh: reduce arp latency") via neigh_probe() given table lock was held. One option to fix this situation is to defer the neigh_probe() back to the neigh_timer_handler() similarly as pre cd28ca0a3dd1. For the case of NTF_MANAGED, this deferral is acceptable given this only happens on actual failure state and regular / expected state is NUD_VALID with the entry already present. The fix adds a parameter to __neigh_event_send() in order to communicate whether immediate probe is allowed or disallowed. Existing call-sites of neigh_event_send() default as-is to immediate probe. However, the neigh_managed_work() disables it via use of neigh_event_send_probe(). [0] <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_deadlock_bug kernel/locking/lockdep.c:2956 [inline] check_deadlock kernel/locking/lockdep.c:2999 [inline] validate_chain kernel/locking/lockdep.c:3788 [inline] __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027 lock_acquire kernel/locking/lockdep.c:5639 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604 __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline] _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334 ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652 ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170 ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:451 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508 ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650 ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742 neigh_probe+0xc2/0x110 net/core/neighbour.c:1040 __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201 neigh_event_send include/net/neighbour.h:470 [inline] neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307 worker_thread+0x657/0x1110 kernel/workqueue.c:2454 kthread+0x2e9/0x3a0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> Fixes: 7482e3841d52 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries") Reported-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Roopa Prabhu <roopa@nvidia.com> Tested-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-02tcp: Use BPF timeout setting for SYN ACK RTOAkhmat Karakotov
When setting RTO through BPF program, some SYN ACK packets were unaffected and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT and is reassigned through BPF using tcp_timeout_init call. SYN ACK retransmits now use newly added timeout option. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Acked-by: Martin KaFai Lau <kafai@fb.com> v2: - Add timeout option to struct request_sock. Do not call tcp_timeout_init on every syn ack retransmit. v3: - Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX. v4: - Refactor duplicate code by adding reqsk_timeout function. Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02net: dsa: provide switch operations for tracking the master stateVladimir Oltean
Certain drivers may need to send management traffic to the switch for things like register access, FDB dump, etc, to accelerate what their slow bus (SPI, I2C, MDIO) can already do. Ethernet is faster (especially in bulk transactions) but is also more unreliable, since the user may decide to bring the DSA master down (or not bring it up), therefore severing the link between the host and the attached switch. Drivers needing Ethernet-based register access already should have fallback logic to the slow bus if the Ethernet method fails, but that fallback may be based on a timeout, and the I/O to the switch may slow down to a halt if the master is down, because every Ethernet packet will have to time out. The driver also doesn't have the option to turn off Ethernet-based I/O momentarily, because it wouldn't know when to turn it back on. Which is where this change comes in. By tracking NETDEV_CHANGE, NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know the exact interval of time during which this interface is reliably available for traffic. Provide this information to switches so they can use it as they wish. An helper is added dsa_port_master_is_operational() to check if a master port is operational. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-01net: ieee802154: Provide a kdoc to the address structureDavid Girault
Give this structure a header to better explain its content. Signed-off-by: David Girault <david.girault@qorvo.com> [miquel.raynal@bootlin.com: Isolate this change from a bigger commit and reword the comment] Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/r/20220201180956.93581-1-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-01-31txhash: Add socket option to control TX hash rethink behaviorAkhmat Karakotov
Add the SO_TXREHASH socket option to control hash rethink behavior per socket. When default mode is set, sockets disable rehash at initialization and use sysctl option when entering listen state. setsockopt() overrides default behavior. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31txhash: Make rethinking txhash behavior configurable via sysctlAkhmat Karakotov
Add a per ns sysctl that controls the txhash rethink behavior: net.core.txrehash. When enabled, the same behavior is retained, when disabled, rethink is not performed. Sysctl is enabled by default. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31ipv4: Make ip_idents_reserve staticDavid Ahern
ip_idents_reserve is only used in net/ipv4/route.c. Make it static and remove the export. Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-29ipv4: drop fragmentation code from ip_options_build()Jakub Kicinski
Since v2.5.44 and addition of ip_options_fragment() ip_options_build() does not render headers for fragments directly. @is_frag is always 0. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28Merge tag 'for-net-next-2022-01-28' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for RTL8822C hci_ver 0x08 - Add support for RTL8852AE part 0bda:2852 - Fix WBS setting for Intel legacy ROM products - Enable SCO over I2S ib mt7921s - Increment management interface revision * tag 'for-net-next-2022-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (30 commits) Bluetooth: Increment management interface revision Bluetooth: hci_sync: Fix queuing commands when HCI_UNREGISTER is set Bluetooth: hci_h5: Add power reset via gpio in h5_btrtl_open Bluetooth: btrtl: Add support for RTL8822C hci_ver 0x08 Bluetooth: hci_event: Fix HCI_EV_VENDOR max_len Bluetooth: hci_core: Rate limit the logging of invalid SCO handle Bluetooth: hci_event: Ignore multiple conn complete events Bluetooth: msft: fix null pointer deref on msft_monitor_device_evt Bluetooth: btmtksdio: mask out interrupt status Bluetooth: btmtksdio: run sleep mode by default Bluetooth: btmtksdio: lower log level in btmtksdio_runtime_[resume|suspend]() Bluetooth: mt7921s: fix btmtksdio_[drv|fw]_pmctrl() Bluetooth: mt7921s: fix bus hang with wrong privilege Bluetooth: btmtksdio: refactor btmtksdio_runtime_[suspend|resume]() Bluetooth: mt7921s: fix firmware coredump retrieve Bluetooth: hci_serdev: call init_rwsem() before p->open() Bluetooth: Remove kernel-doc style comment block Bluetooth: btusb: Whitespace fixes for btusb_setup_csr() Bluetooth: btusb: Add one more Bluetooth part for the Realtek RTL8852AE Bluetooth: btintel: Fix WBS setting for Intel legacy ROM products ... ==================== Link: https://lore.kernel.org/r/20220128205915.3995760-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-28ax25: add refcount in ax25_dev to avoid UAF bugsDuoming Zhou
If we dereference ax25_dev after we call kfree(ax25_dev) in ax25_dev_device_down(), it will lead to concurrency UAF bugs. There are eight syscall functions suffer from UAF bugs, include ax25_bind(), ax25_release(), ax25_connect(), ax25_ioctl(), ax25_getname(), ax25_sendmsg(), ax25_getsockopt() and ax25_info_show(). One of the concurrency UAF can be shown as below: (USE) | (FREE) | ax25_device_event | ax25_dev_device_down ax25_bind | ... ... | kfree(ax25_dev) ax25_fillin_cb() | ... ax25_fillin_cb_from_dev() | ... | The root cause of UAF bugs is that kfree(ax25_dev) in ax25_dev_device_down() is not protected by any locks. When ax25_dev, which there are still pointers point to, is released, the concurrency UAF bug will happen. This patch introduces refcount into ax25_dev in order to guarantee that there are no pointers point to it when ax25_dev is released. Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28net: mac802154: Explain the use of ieee802154_wake/stop_queue()Miquel Raynal
It is not straightforward to the newcomer that a single skb can currently be sent at a time and that the internal process is to stop the queue when processing a frame before re-enabling it. Make this clear by documenting the ieee802154_wake/stop_queue() helpers. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220125122540.855604-4-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-01-27ipv6: partially inline ipv6_fixup_optionsPavel Begunkov
Inline a part of ipv6_fixup_options() to avoid extra overhead on function call if opt is NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27udp6: pass flow in ip6_make_skb together with corkPavel Begunkov
Another preparation patch. inet_cork_full already contains a field for iflow, so we can avoid passing a separate struct iflow6 into __ip6_append_data() and ip6_make_skb(), and use the flow stored in inet_cork_full. Make sure callers set cork->fl, i.e. we init it in ip6_append_data() and before calling ip6_make_skb(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv4: remove sparse error in ip_neigh_gw4()Eric Dumazet
./include/net/route.h:373:48: warning: incorrect type in argument 2 (different base types) ./include/net/route.h:373:48: expected unsigned int [usertype] key ./include/net/route.h:373:48: got restricted __be32 [usertype] daddr Fixes: 5c9f7c1dfc2e ("ipv4: Add helpers for neigh lookup for nexthop") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220127013404.1279313-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv4: avoid using shared IP generator for connected socketsEric Dumazet
ip_select_ident_segs() has been very conservative about using the connected socket private generator only for packets with IP_DF set, claiming it was needed for some VJ compression implementations. As mentioned in this referenced document, this can be abused. (Ref: Off-Path TCP Exploits of the Mixed IPID Assignment) Before switching to pure random IPID generation and possibly hurt some workloads, lets use the private inet socket generator. Not only this will remove one vulnerability, this will also improve performance of TCP flows using pmtudisc==IP_PMTUDISC_DONT Fixes: 73f156a6e8c1 ("inetpeer: get rid of ip_id_count") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reported-by: Ray Che <xijiache@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27i40e: xsk: Move tmp desc array from driver to poolMagnus Karlsson
Move desc_array from the driver to the pool. The reason behind this is that we can then reuse this array as a temporary storage for descriptors in all zero-copy drivers that use the batched interface. This will make it easier to add batching to more drivers. i40e is the only driver that has a batched Tx zero-copy implementation, so no need to touch any other driver. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com