summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2023-04-21sctp: delete the nested flexible array skipXin Long
This patch deletes the flexible-array skip[] from the structure sctp_ifwdtsn/fwdtsn_hdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/stream_interleave.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:611:32: warning: nested flexible array ./include/linux/sctp.h:628:33: warning: nested flexible array Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21sctp: delete the nested flexible array paramsXin Long
This patch deletes the flexible-array params[] from the structure sctp_inithdr, sctp_addiphdr and sctp_reconf_chunk to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/input.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:278:29: warning: nested flexible array ./include/linux/sctp.h:675:30: warning: nested flexible array This warning is reported if a structure having a flexible array member is included by other structures. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-20net: move dropreason.h to dropreason-core.hJohannes Berg
This will, after the next patch, hold only the core drop reasons and minimal infrastructure. Fix a small kernel-doc issue while at it, to avoid the move triggering a checker. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20net: skbuff: update and rename __kfree_skb_defer()Jakub Kicinski
__kfree_skb_defer() uses the old naming where "defer" meant slab bulk free/alloc APIs. In the meantime we also made __kfree_skb_defer() feed the per-NAPI skb cache, which implies bulk APIs. So take away the 'defer' and add 'napi'. While at it add a drop reason. This only matters on the tx_action path, if the skb has a frag_list. But getting rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net benefit so why not. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Adjacent changes: net/mptcp/protocol.h 63740448a32e ("mptcp: fix accept vs worker race") 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20Merge tag 'net-6.3-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and bpf. There are a few fixes for new code bugs, including the Mellanox one noted in the last networking pull. No known regressions outstanding. Current release - regressions: - sched: clear actions pointer in miss cookie init fail - mptcp: fix accept vs worker race - bpf: fix bpf_arch_text_poke() with new_addr == NULL on s390 - eth: bnxt_en: fix a possible NULL pointer dereference in unload path - eth: veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag Current release - new code bugs: - eth: revert "net/mlx5: Enable management PF initialization" Previous releases - regressions: - netfilter: fix recent physdev match breakage - bpf: fix incorrect verifier pruning due to missing register precision taints - eth: virtio_net: fix overflow inside xdp_linearize_page() - eth: cxgb4: fix use after free bugs caused by circular dependency problem - eth: mlxsw: pci: fix possible crash during initialization Previous releases - always broken: - sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg - netfilter: validate catch-all set elements - bridge: don't notify FDB entries with "master dynamic" - eth: bonding: fix memory leak when changing bond type to ethernet - eth: i40e: fix accessing vsi->active_filters without holding lock Misc: - Mat is back as MPTCP co-maintainer" * tag 'net-6.3-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits) net: bridge: switchdev: don't notify FDB entries with "master dynamic" Revert "net/mlx5: Enable management PF initialization" MAINTAINERS: Resume MPTCP co-maintainer role mailmap: add entries for Mat Martineau e1000e: Disable TSO on i219-LM card to increase speed bnxt_en: fix free-runnig PHC mode net: dsa: microchip: ksz8795: Correctly handle huge frame configuration bpf: Fix incorrect verifier pruning due to missing register precision taints hamradio: drop ISA_DMA_API dependency mlxsw: pci: Fix possible crash during initialization mptcp: fix accept vs worker race mptcp: stops worker on unaccepted sockets at listener close net: rpl: fix rpl header size calculation net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete() bonding: Fix memory leak when changing bond type to Ethernet veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag mlxfw: fix null-ptr-deref in mlxfw_mfa2_tlv_next() bnxt_en: Fix a possible NULL pointer dereference in unload path bnxt_en: Do not initialize PTP on older P3/P4 chips netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements ...
2023-04-19Revert "net/mlx5: Enable management PF initialization"Jakub Kicinski
This reverts commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4. Paul reports that it causes a regression with IB on CX4 and FW 12.18.1000. In addition I think that the concept of "management PF" is not fully accepted and requires a discussion. Fixes: fe998a3c77b9 ("net/mlx5: Enable management PF initialization") Reported-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/all/CAHC9VhQ7A4+msL38WpbOMYjAqLp0EtOjeLh4Dc6SQtD6OUvCQg@mail.gmail.com/ Link: https://lore.kernel.org/r/20230413222547.56901-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19Merge tag 'mm-hotfixes-stable-2023-04-19-16-36' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "22 hotfixes. 19 are cc:stable and the remainder address issues which were introduced during this merge cycle, or aren't considered suitable for -stable backporting. 19 are for MM and the remainder are for other subsystems" * tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) nilfs2: initialize unused bytes in segment summary blocks mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages mm/mmap: regression fix for unmapped_area{_topdown} maple_tree: fix mas_empty_area() search maple_tree: make maple state reusable after mas_empty_area_rev() mm: kmsan: handle alloc failures in kmsan_ioremap_page_range() mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush() tools/Makefile: do missed s/vm/mm/ mm: fix memory leak on mm_init error handling mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock kernel/sys.c: fix and improve control flow in __sys_setres[ug]id() Revert "userfaultfd: don't fail on unrecognized features" writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used mailmap: update jtoppins' entry to reference correct email mm/mempolicy: fix use-after-free of VMA iterator mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO mm/mprotect: fix do_mprotect_pkey() return on error mm/khugepaged: check again on anon uffd-wp during isolation ...
2023-04-19net: skbuff: hide nf_trace and ipvs_propertyJakub Kicinski
Accesses to nf_trace and ipvs_property are already wrapped by ifdefs where necessary. Don't allocate the bits for those fields at all if possible. Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: skbuff: push nf_trace down the bitfieldJakub Kicinski
nf_trace is a debug feature, AFAIU, and yet it sits oddly high in the sk_buff bitfield. Move it down, pushing up dst_pending_confirm and inner_protocol_type. Next change will make nf_trace optional (under Kconfig) and all optional fields should be placed after 2b fields to avoid 2b fields straddling bytes. dst_pending_confirm is L3, so it makes sense next to ignore_df. inner_protocol_type goes up just to keep the balance. Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: skbuff: move alloc_cpu into a potential holeJakub Kicinski
alloc_cpu is currently between 4 byte fields, so it's almost guaranteed to create a 2B hole. It has a knock on effect of creating a 4B hole after @end (and @end and @tail being in different cachelines). None of this matters hugely, but for kernel configs which don't enable all the features there may well be a 2B hole after the bitfield. Move alloc_cpu there. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: skbuff: hide csum_not_inet when CONFIG_IP_SCTP not setJakub Kicinski
SCTP is not universally deployed, allow hiding its bit from the skb. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: skbuff: hide wifi_acked when CONFIG_WIRELESS not setJakub Kicinski
Datacenter kernel builds will very likely not include WIRELESS, so let them shave 2 bits off the skb by hiding the wifi fields. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: phy: phy_device: Call into the PHY driver to set LED blinkingAndrew Lunn
Linux LEDs can be requested to perform hardware accelerated blinking. Pass this to the PHY driver, if it implements the op. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: phy: phy_device: Call into the PHY driver to set LED brightnessAndrew Lunn
Linux LEDs can be software controlled via the brightness file in /sys. LED drivers need to implement a brightness_set function which the core will call. Implement an intermediary in phy_device, which will call into the phy driver if it implements the necessary function. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net: phy: Add a binding for PHY LEDsAndrew Lunn
Define common binding parsing for all PHY drivers with LEDs using phylib. Parse the DT as part of the phy_probe and add LEDs to the linux LED class infrastructure. For the moment, provide a dummy brightness function, which will later be replaced with a call into the PHY driver. This allows testing since the LED core might otherwise reject an LED whose brightness cannot be set. Add a dependency on LED_CLASS. It either needs to be built in, or not enabled, since a modular build can result in linker errors. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19leds: Provide stubs for when CLASS_LED & NEW_LEDS are disabledAndrew Lunn
Provide stubs for devm_led_classdev_register_ext() and led_init_default_state_get() so that LED drivers embedded within other drivers such as PHYs and Ethernet switches still build when LEDS_CLASS or NEW_LEDS are disabled. This also helps with Kconfig dependencies, which are somewhat hairy for phylib and mdio and only get worse when adding a dependency on LED_CLASS. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-18bonding: add software tx timestamping supportHangbin Liu
Currently, bonding only obtain the timestamp (ts) information of the active slave, which is available only for modes 1, 5, and 6. For other modes, bonding only has software rx timestamping support. However, some users who use modes such as LACP also want tx timestamp support. To address this issue, let's check the ts information of each slave. If all slaves support tx timestamping, we can enable tx timestamping support for the bond. Add a note that the get_ts_info may be called with RCU, or rtnl or reference on the device in ethtool.h> Suggested-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://lore.kernel.org/r/20230418034841.2566262-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Unbreak br_netfilter physdev match support, from Florian Westphal. 2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian. 3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal. 4) Fix validation of catch-all set element. 5) Tighten requirements for catch-all set elements. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements netfilter: nf_tables: validate catch-all set elements netfilter: nf_tables: fix ifdef to also consider nf_tables=m netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT netfilter: br_netfilter: fix recent physdev match breakage ==================== Link: https://lore.kernel.org/r/20230418145048.67270-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()Alexander Potapenko
Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range() must also properly handle allocation/mapping failures. In the case of such, it must clean up the already created metadata mappings and return an error code, so that the error can be propagated to ioremap_page_range(). Without doing so, KMSAN may silently fail to bring the metadata for the page range into a consistent state, which will result in user-visible crashes when trying to access them. Link: https://lkml.kernel.org/r/20230413131223.4135168-2-glider@google.com Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/ Reviewed-by: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()Alexander Potapenko
As reported by Dipanjan Das, when KMSAN is used together with kernel fault injection (or, generally, even without the latter), calls to kcalloc() or __vmap_pages_range_noflush() may fail, leaving the metadata mappings for the virtual mapping in an inconsistent state. When these metadata mappings are accessed later, the kernel crashes. To address the problem, we return a non-zero error code from kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping failure inside it, and make vmap_pages_range_noflush() return an error if KMSAN fails to allocate the metadata. This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(), as these allocation failures are not fatal anymore. Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/ Reviewed-by: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-17net/mlx5e: Add IPsec packet offload tunnel bitsLeon Romanovsky
Extend packet reformat types and flow table capabilities with IPsec packet offload tunnel bits. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-17netfilter: nf_tables: fix ifdef to also consider nf_tables=mFlorian Westphal
nftables can be built as a module, so fix the preprocessor conditional accordingly. Fixes: 478b360a47b7 ("netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n") Reported-by: Florian Fainelli <f.fainelli@gmail.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-14page_pool: allow caching from safely localized NAPIJakub Kicinski
Recent patches to mlx5 mentioned a regression when moving from driver local page pool to only using the generic page pool code. Page pool has two recycling paths (1) direct one, which runs in safe NAPI context (basically consumer context, so producing can be lockless); and (2) via a ptr_ring, which takes a spin lock because the freeing can happen from any CPU; producer and consumer may run concurrently. Since the page pool code was added, Eric introduced a revised version of deferred skb freeing. TCP skbs are now usually returned to the CPU which allocated them, and freed in softirq context. This places the freeing (producing of pages back to the pool) enticingly close to the allocation (consumer). If we can prove that we're freeing in the same softirq context in which the consumer NAPI will run - lockless use of the cache is perfectly fine, no need for the lock. Let drivers link the page pool to a NAPI instance. If the NAPI instance is scheduled on the same CPU on which we're freeing - place the pages in the direct cache. With that and patched bnxt (XDP enabled to engage the page pool, sigh, bnxt really needs page pool work :() I see a 2.6% perf boost with a TCP stream test (app on a different physical core than softirq). The CPU use of relevant functions decreases as expected: page_pool_refill_alloc_cache 1.17% -> 0% _raw_spin_lock 2.41% -> 0.98% Only consider lockless path to be safe when NAPI is scheduled - in practice this should cover majority if not all of steady state workloads. It's usually the NAPI kicking in that causes the skb flush. The main case we'll miss out on is when application runs on the same CPU as NAPI. In that case we don't use the deferred skb free path. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13Merge tag 'mlx5-updates-2023-04-11' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-04-11 1) Vlad adds the support for linux bridge multicast offload support Patches #1 through #9 Synopsis Vlad Says: ============== Implement support of bridge multicast offload in mlx5. Handle port object attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast offload and bridge snooping support on bridge. Handle port object SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB. Steering architecture Existing offload infrastructure relies on two levels of flow tables - bridge ingress and egress. For multicast offload the architecture is extended with additional layer of per-port multicast replication tables. Such tables filter loopback traffic (so packets are not replicated to their source port) and pop VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in egress table. MDB egress rule can point to multiple per-port multicast tables, which causes matching multicast traffic to be replicated to all of them, and, consecutively, to several bridge ports: +--------+--+ +---------------------------------------> Port 1 | | | +-^------+--+ | | | | +-----------------------------------------+ | +---------------------------+ | | EGRESS table | | +--> PORT 1 multicast table | | +----------------------------------+ +-----------------------------------------+ | | +---------------------------+ | | INGRESS table | | | | | | | | +----------------------------------+ | dst_mac=P1,vlan=X -> pop vlan, goto P1 +--+ | | FG0: | | | | | dst_mac=P1,vlan=Y -> pop vlan, goto P1 | | | src_port=dst_port -> drop | | | src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2 +--+ | | FG1: | | | ... | | dst_mac=P2,vlan=Y -> goto P2 | | | | VLAN X -> pop, goto port | | | | | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+ | ... | | +----------------------------------+ | | | | | VLAN Y -> pop, goto port +-------+ +-----------------------------------------+ | | | FG3: | | | | matchall -> goto port | | | | | | | +---------------------------+ | | | | | | +--------+--+ +---------------------------------------> Port 2 | | | +-^------+--+ | | | | | +---------------------------+ | +--> PORT 2 multicast table | | +---------------------------+ | | | | | FG0: | | | src_port=dst_port -> drop | | | FG1: | | | VLAN X -> pop, goto port | | | ... | | | | | | FG3: | | | matchall -> goto port +-------+ | | +---------------------------+ Patches overview: - Patch 1 adds hardware definition bits for capabilities required to replicate multicast packets to multiple per-port tables. These bits are used by following patches to only attempt multicast offload if firmware and hardware provide necessary support. - Pathces 2-4 patches are preparations and refactoring. - Patch 5 implements necessary infrastructure to toggle multicast offload via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification. This also enabled IGMP and MLD snooping. - Patch 6 implements per-port multicast replication tables. It only supports filtering of loopback packets. - Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged' VLANs. - Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It creates MDB replication rules in egress table that can replicate packets to multiple per-port multicast tables. - Patch 9 adds tracepoints for MDB events. ============== 2) Parav Create a new allocation profile for SFs, to save on memory 3) Yevgeny provides some initial patches for upcoming software steering support new pattern/arguments type of modify_header actions. Starting with ConnectX-6 DX, we use a new design of modify_header FW object. The current modify_header object allows for having only limited number of these FW objects, which means that we are limited in the number of offloaded flows that require modify_header action. As a preparation Yevgeny provides the following 4 patches: - Patch 1: Add required mlx5_ifc HW bits - Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg support and adds appropriate support in dr_send.c - Patch 4: Add ICM pool for modify-header-pattern objects and implement patterns cache, allowing patterns reuse for different flows * tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: DR, Add modify-header-pattern ICM pool net/mlx5: DR, Prepare sending new WQE type net/mlx5: Add new WQE for updating flow table net/mlx5: Add mlx5_ifc bits for modify header argument net/mlx5: DR, Set counter ID on the last STE for STEv1 TX net/mlx5: Create a new profile for SFs net/mlx5: Bridge, add tracepoints for multicast net/mlx5: Bridge, implement mdb offload net/mlx5: Bridge, support multicast VLAN pop net/mlx5: Bridge, add per-port multicast replication tables net/mlx5: Bridge, snoop igmp/mld packets net/mlx5: Bridge, extract code to lookup parent bridge of port net/mlx5: Bridge, move additional data structures to priv header net/mlx5: Bridge, increase bridge tables sizes net/mlx5: Add mlx5_ifc definitions for bridge multicast support ==================== Link: https://lore.kernel.org/r/20230412040752.14220-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13net: ethtool: create and export ethtool_dev_mm_supported()Vladimir Oltean
Create a wrapper over __ethtool_dev_mm_supported() which also calls ethnl_ops_begin() and ethnl_ops_complete(). It can be used by other code layers, such as tc, to make sure that preemptible TCs are supported (this is true if an underlying MAC Merge layer exists). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13Daniel Borkmann says:Jakub Kicinski
==================== pull-request: bpf-next 2023-04-13 We've added 260 non-merge commits during the last 36 day(s) which contain a total of 356 files changed, 21786 insertions(+), 11275 deletions(-). The main changes are: 1) Rework BPF verifier log behavior and implement it as a rotating log by default with the option to retain old-style fixed log behavior, from Andrii Nakryiko. 2) Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params, from Christian Ehrig. 3) Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton, from Alexei Starovoitov. 4) Optimize hashmap lookups when key size is multiple of 4, from Anton Protopopov. 5) Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps, from David Vernet. 6) Add support for stashing local BPF kptr into a map value via bpf_kptr_xchg(). This is useful e.g. for rbtree node creation for new cgroups, from Dave Marchevsky. 7) Fix BTF handling of is_int_ptr to skip modifiers to work around tracing issues where a program cannot be attached, from Feng Zhou. 8) Migrate a big portion of test_verifier unit tests over to test_progs -a verifier_* via inline asm to ease {read,debug}ability, from Eduard Zingerman. 9) Several updates to the instruction-set.rst documentation which is subject to future IETF standardization (https://lwn.net/Articles/926882/), from Dave Thaler. 10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register known bits information propagation, from Daniel Borkmann. 11) Add skb bitfield compaction work related to BPF with the overall goal to make more of the sk_buff bits optional, from Jakub Kicinski. 12) BPF selftest cleanups for build id extraction which stand on its own from the upcoming integration work of build id into struct file object, from Jiri Olsa. 13) Add fixes and optimizations for xsk descriptor validation and several selftest improvements for xsk sockets, from Kal Conley. 14) Add BPF links for struct_ops and enable switching implementations of BPF TCP cong-ctls under a given name by replacing backing struct_ops map, from Kui-Feng Lee. 15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable offset stack read as earlier Spectre checks cover this, from Luis Gerhorst. 16) Fix issues in copy_from_user_nofault() for BPF and other tracers to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner and Alexei Starovoitov. 17) Add --json-summary option to test_progs in order for CI tooling to ease parsing of test results, from Manu Bretelle. 18) Batch of improvements and refactoring to prep for upcoming bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator, from Martin KaFai Lau. 19) Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations, from Quentin Monnet. 20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting the module name from BTF of the target and searching kallsyms of the correct module, from Viktor Malik. 21) Improve BPF verifier handling of '<const> <cond> <non_const>' to better detect whether in particular jmp32 branches are taken, from Yonghong Song. 22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock. A built-in cc or one from a kernel module is already able to write to app_limited, from Yixin Shen. Conflicts: Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") 0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc") https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/ include/net/ip_tunnels.h bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts") ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices") https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/ net/bpf/test_run.c e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption") 294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES") https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/ ==================== Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: tools/testing/selftests/net/config 62199e3f1658 ("selftests: net: Add VXLAN MDB test") 3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13Merge tag 'net-6.3-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, and bluetooth. Not all that quiet given spring celebrations, but "current" fixes are thinning out, which is encouraging. One outstanding regression in the mlx5 driver when using old FW, not blocking but we're pushing for a fix. Current release - new code bugs: - eth: enetc: workaround for unresponsive pMAC after receiving express traffic Previous releases - regressions: - rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the pid/seq fields 0 for backward compatibility Previous releases - always broken: - sctp: fix a potential overflow in sctp_ifwdtsn_skip - mptcp: - use mptcp_schedule_work instead of open-coding it and make the worker check stricter, to avoid scheduling work on closed sockets - fix NULL pointer dereference on fastopen early fallback - skbuff: fix memory corruption due to a race between skb coalescing and releasing clones confusing page_pool reference counting - bonding: fix neighbor solicitation validation on backup slaves - bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp - bpf: arm64: fixed a BTI error on returning to patched function - openvswitch: fix race on port output leading to inf loop - sfp: initialize sfp->i2c_block_size at sfp allocation to avoid returning a different errno than expected - phy: nxp-c45-tja11xx: unregister PTP, purge queues on remove - Bluetooth: fix printing errors if LE Connection times out - Bluetooth: assorted UaF, deadlock and data race fixes - eth: macb: fix memory corruption in extended buffer descriptor mode Misc: - adjust the XDP Rx flow hash API to also include the protocol layers over which the hash was computed" * tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits) selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type veth: bpf_xdp_metadata_rx_hash add xdp rss hash type mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type xdp: rss hash types representation selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters skbuff: Fix a race between coalescing and releasing SKBs net: macb: fix a memory corruption in extended buffer descriptor mode selftests: add the missing CONFIG_IP_SCTP in net config udp6: fix potential access to stale information selftests: openvswitch: adjust datapath NL message declaration selftests: mptcp: userspace pm: uniform verify events mptcp: fix NULL pointer dereference on fastopen early fallback mptcp: stricter state check in mptcp_worker mptcp: use mptcp_schedule_work instead of open-coding it net: enetc: workaround for unresponsive pMAC after receiving express traffic sctp: fix a potential overflow in sctp_ifwdtsn_skip net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume() rtnetlink: Restore RTM_NEW/DELLINK notification behavior net: ti/cpsw: Add explicit platform_device.h and of_platform.h includes ...
2023-04-13mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash typeJesper Dangaard Brouer
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type via mapping table. The mlx5 hardware can also identify and RSS hash IPSEC. This indicate hash includes SPI (Security Parameters Index) as part of IPSEC hash. Extend xdp core enum xdp_rss_hash_type with IPSEC hash type. Fixes: bc8d405b1ba9 ("net/mlx5e: Support RX XDP metadata") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13xdp: rss hash types representationJesper Dangaard Brouer
The RSS hash type specifies what portion of packet data NIC hardware used when calculating RSS hash value. The RSS types are focused on Internet traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4 primarily TCP vs UDP, but some hardware supports SCTP. Hardware RSS types are differently encoded for each hardware NIC. Most hardware represent RSS hash type as a number. Determining L3 vs L4 often requires a mapping table as there often isn't a pattern or sorting according to ISO layer. The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that contains both BITs for the L3/L4 types, and combinations to be used by drivers for their mapping tables. The enum xdp_rss_type_bits get exposed to BPF via BTF, and it is up to the BPF-programmer to match using these defines. This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding a pointer value argument for provide the RSS hash type. Change signature for all xmo_rx_hash calls in drivers to make it compile. The RSS type implementations for each driver comes as separate patches. Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13net: stmmac: dwmac4: Allow platforms to specify some DMA/MTL offsetsAndrew Halaney
Some platforms have dwmac4 implementations that have a different address space layout than the default, resulting in the need to define their own DMA/MTL offsets. Extend the functions to allow a platform driver to indicate what its addresses are, overriding the defaults. Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Brian Masney <bmasney@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-12rtnetlink: Restore RTM_NEW/DELLINK notification behaviorMartin Willi
The commits referenced below allows userspace to use the NLM_F_ECHO flag for RTM_NEW/DELLINK operations to receive unicast notifications for the affected link. Prior to these changes, applications may have relied on multicast notifications to learn the same information without specifying the NLM_F_ECHO flag. For such applications, the mentioned commits changed the behavior for requests not using NLM_F_ECHO. Multicast notifications are still received, but now use the portid of the requester and the sequence number of the request instead of zero values used previously. For the application, this message may be unexpected and likely handled as a response to the NLM_F_ACKed request, especially if it uses the same socket to handle requests and notifications. To fix existing applications relying on the old notification behavior, set the portid and sequence number in the notification only if the request included the NLM_F_ECHO flag. This restores the old behavior for applications not using it, but allows unicasted notifications for others. Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link") Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create") Signed-off-by: Martin Willi <martin@strongswan.org> Acked-by: Guillaume Nault <gnault@redhat.com> Acked-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-11net/mlx5: Add new WQE for updating flow tableYevgeny Kliteynik
Add new WQE type: FLOW_TBL_ACCESS, which will be used for writing modify header arguments. This type has specific control segment and special data segment. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-11net/mlx5: Add mlx5_ifc bits for modify header argumentYevgeny Kliteynik
Add enum value for modify-header argument object and mlx5_bits for the related capabilities. Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-11net/mlx5: Create a new profile for SFsParav Pandit
Create a new profile for SFs in order to disable the command cache. Each function command cache consumes ~500KB of memory, when using a large number of SFs this savings is notable on memory constarined systems. Use a new profile to provide for future differences between SFs and PFs. The mr_cache not used for non-PF functions, so it is excluded from the new profile. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Bodong Wang <bodong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-11net/mlx5: Add mlx5_ifc definitions for bridge multicast supportVlad Buslov
Add the required hardware definitions to mlx5_ifc: fdb_uplink_hairpin, fdb_multi_path_any_table_limit_regc, fdb_multi_path_any_table. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-11Merge tag 'pci-v6.3-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci fixes from Bjorn Helgaas: - Provide pci_msix_can_alloc_dyn() stub when CONFIG_PCI_MSI unset to avoid build errors (Reinette Chatre) - Quirk AMD XHCI controller that loses MSI-X state in D3hot to avoid broken USB after hotplug or suspend/resume (Basavaraj Natikar) - Fix use-after-free in pci_bus_release_domain_nr() (Rob Herring) * tag 'pci-v6.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: PCI: Fix use-after-free in pci_bus_release_domain_nr() x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot PCI/MSI: Provide missing stub for pci_msix_can_alloc_dyn()
2023-04-11bpf: Simplify internal verifier log interfaceAndrii Nakryiko
Simplify internal verifier log API down to bpf_vlog_init() and bpf_vlog_finalize(). The former handles input arguments validation in one place and makes it easier to change it. The latter subsumes -ENOSPC (truncation) and -EFAULT handling and simplifies both caller's code (bpf_check() and btf_parse()). For btf_parse(), this patch also makes sure that verifier log finalization happens even if there is some error condition during BTF verification process prior to normal finalization step. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/bpf/20230406234205.323208-14-andrii@kernel.org
2023-04-11bpf: Add log_true_size output field to return necessary log buffer sizeAndrii Nakryiko
Add output-only log_true_size and btf_log_true_size field to BPF_PROG_LOAD and BPF_BTF_LOAD commands, respectively. It will return the size of log buffer necessary to fit in all the log contents at specified log_level. This is very useful for BPF loader libraries like libbpf to be able to size log buffer correctly, but could be used by users directly, if necessary, as well. This patch plumbs all this through the code, taking into account actual bpf_attr size provided by user to determine if these new fields are expected by users. And if they are, set them from kernel on return. We refactory btf_parse() function to accommodate this, moving attr and uattr handling inside it. The rest is very straightforward code, which is split from the logging accounting changes in the previous patch to make it simpler to review logic vs UAPI changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/bpf/20230406234205.323208-13-andrii@kernel.org
2023-04-11bpf: Keep track of total log content size in both fixed and rolling modesAndrii Nakryiko
Change how we do accounting in BPF_LOG_FIXED mode and adopt log->end_pos as *logical* log position. This means that we can go beyond physical log buffer size now and be able to tell what log buffer size should be to fit entire log contents without -ENOSPC. To do this for BPF_LOG_FIXED mode, we need to remove a short-circuiting logic of not vsnprintf()'ing further log content once we filled up user-provided buffer, which is done by bpf_verifier_log_needed() checks. We modify these checks to always keep going if log->level is non-zero (i.e., log is requested), even if log->ubuf was NULL'ed out due to copying data to user-space, or if entire log buffer is physically full. We adopt bpf_verifier_vlog() routine to work correctly with log->ubuf == NULL condition, performing log formatting into temporary kernel buffer, doing all the necessary accounting, but just avoiding copying data out if buffer is full or NULL'ed out. With these changes, it's now possible to do this sort of determination of log contents size in both BPF_LOG_FIXED and default rolling log mode. We need to keep in mind bpf_vlog_reset(), though, which shrinks log contents after successful verification of a particular code path. This log reset means that log->end_pos isn't always increasing, so to return back to users what should be the log buffer size to fit all log content without causing -ENOSPC even in the presence of log resetting, we need to keep maximum over "lifetime" of logging. We do this accounting in bpf_vlog_update_len_max() helper. A related and subtle aspect is that with this logical log->end_pos even in BPF_LOG_FIXED mode we could temporary "overflow" buffer, but then reset it back with bpf_vlog_reset() to a position inside user-supplied log_buf. In such situation we still want to properly maintain terminating zero. We will eventually return -ENOSPC even if final log buffer is small (we detect this through log->len_max check). This behavior is simpler to reason about and is consistent with current behavior of verifier log. Handling of this required a small addition to bpf_vlog_reset() logic to avoid doing put_user() beyond physical log buffer dimensions. Another issue to keep in mind is that we limit log buffer size to 32-bit value and keep such log length as u32, but theoretically verifier could produce huge log stretching beyond 4GB. Instead of keeping (and later returning) 64-bit log length, we cap it at UINT_MAX. Current UAPI makes it impossible to specify log buffer size bigger than 4GB anyways, so we don't really loose anything here and keep everything consistently 32-bit in UAPI. This property will be utilized in next patch. Doing the same determination of maximum log buffer for rolling mode is trivial, as log->end_pos and log->start_pos are already logical positions, so there is nothing new there. These changes do incidentally fix one small issue with previous logging logic. Previously, if use provided log buffer of size N, and actual log output was exactly N-1 bytes + terminating \0, kernel logic coun't distinguish this condition from log truncation scenario which would end up with truncated log contents of N-1 bytes + terminating \0 as well. But now with log->end_pos being logical position that could go beyond actual log buffer size, we can distinguish these two conditions, which we do in this patch. This plays nicely with returning log_size_actual (implemented in UAPI in the next patch), as we can now guarantee that if user takes such log_size_actual and provides log buffer of that exact size, they will not get -ENOSPC in return. All in all, all these changes do conceptually unify fixed and rolling log modes much better, and allow a nice feature requested by users: knowing what should be the size of the buffer to avoid -ENOSPC. We'll plumb this through the UAPI and the code in the next patch. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/bpf/20230406234205.323208-12-andrii@kernel.org
2023-04-11bpf: Switch BPF verifier log to be a rotating log by defaultAndrii Nakryiko
Currently, if user-supplied log buffer to collect BPF verifier log turns out to be too small to contain full log, bpf() syscall returns -ENOSPC, fails BPF program verification/load, and preserves first N-1 bytes of the verifier log (where N is the size of user-supplied buffer). This is problematic in a bunch of common scenarios, especially when working with real-world BPF programs that tend to be pretty complex as far as verification goes and require big log buffers. Typically, it's when debugging tricky cases at log level 2 (verbose). Also, when BPF program is successfully validated, log level 2 is the only way to actually see verifier state progression and all the important details. Even with log level 1, it's possible to get -ENOSPC even if the final verifier log fits in log buffer, if there is a code path that's deep enough to fill up entire log, even if normally it would be reset later on (there is a logic to chop off successfully validated portions of BPF verifier log). In short, it's not always possible to pre-size log buffer. Also, what's worse, in practice, the end of the log most often is way more important than the beginning, but verifier stops emitting log as soon as initial log buffer is filled up. This patch switches BPF verifier log behavior to effectively behave as rotating log. That is, if user-supplied log buffer turns out to be too short, verifier will keep overwriting previously written log, effectively treating user's log buffer as a ring buffer. -ENOSPC is still going to be returned at the end, to notify user that log contents was truncated, but the important last N bytes of the log would be returned, which might be all that user really needs. This consistent -ENOSPC behavior, regardless of rotating or fixed log behavior, allows to prevent backwards compatibility breakage. The only user-visible change is which portion of verifier log user ends up seeing *if buffer is too small*. Given contents of verifier log itself is not an ABI, there is no breakage due to this behavior change. Specialized tools that rely on specific contents of verifier log in -ENOSPC scenario are expected to be easily adapted to accommodate old and new behaviors. Importantly, though, to preserve good user experience and not require every user-space application to adopt to this new behavior, before exiting to user-space verifier will rotate log (in place) to make it start at the very beginning of user buffer as a continuous zero-terminated string. The contents will be a chopped off N-1 last bytes of full verifier log, of course. Given beginning of log is sometimes important as well, we add BPF_LOG_FIXED (which equals 8) flag to force old behavior, which allows tools like veristat to request first part of verifier log, if necessary. BPF_LOG_FIXED flag is also a simple and straightforward way to check if BPF verifier supports rotating behavior. On the implementation side, conceptually, it's all simple. We maintain 64-bit logical start and end positions. If we need to truncate the log, start position will be adjusted accordingly to lag end position by N bytes. We then use those logical positions to calculate their matching actual positions in user buffer and handle wrap around the end of the buffer properly. Finally, right before returning from bpf_check(), we rotate user log buffer contents in-place as necessary, to make log contents contiguous. See comments in relevant functions for details. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/bpf/20230406234205.323208-4-andrii@kernel.org
2023-04-11bpf: Split off basic BPF verifier log into separate fileAndrii Nakryiko
kernel/bpf/verifier.c file is large and growing larger all the time. So it's good to start splitting off more or less self-contained parts into separate files to keep source code size (somewhat) somewhat under control. This patch is a one step in this direction, moving some of BPF verifier log routines into a separate kernel/bpf/log.c. Right now it's most low-level and isolated routines to append data to log, reset log to previous position, etc. Eventually we could probably move verifier state printing logic here as well, but this patch doesn't attempt to do that yet. Subsequent patches will add more logic to verifier log management, so having basics in a separate file will make sure verifier.c doesn't grow more with new changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/bpf/20230406234205.323208-2-andrii@kernel.org
2023-04-10net: piggy back on the memory barrier in bql when waking queuesJakub Kicinski
Drivers call netdev_tx_completed_queue() right before netif_txq_maybe_wake(). If BQL is enabled netdev_tx_completed_queue() should issue a memory barrier, so we can depend on that separating the stop check from the consumer index update, instead of adding another barrier in netif_txq_maybe_wake(). This matters more than the barriers on the xmit path, because the wake condition is almost always true. So we issue the consumer side barrier often. Wrap netdev_tx_completed_queue() in a local helper to issue the barrier even if BQL is disabled. Keep the same semantics as netdev_tx_completed_queue() (barrier only if bytes != 0) to make it clear that the barrier is conditional. Plus since macro gets pkt/byte counts as arguments now - we can skip waking if there were no packets completed. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10net: provide macros for commonly copied lockless queue stop/wake codeJakub Kicinski
A lot of drivers follow the same scheme to stop / start queues without introducing locks between xmit and NAPI tx completions. I'm guessing they all copy'n'paste each other's code. The original code dates back all the way to e1000 and Linux 2.6.19. Smaller drivers shy away from the scheme and introduce a lock which may cause deadlocks in netpoll. Provide macros which encapsulate the necessary logic. The macros do not prevent false wake ups, the extra barrier required to close that race is not worth it. See discussion in: https://lore.kernel.org/all/c39312a2-4537-14b4-270c-9fe1fbb91e89@gmail.com/ Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-09Merge branch 'hwmon-const' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull in pre-requisite patches from Guenter Roeck to constify pointers to hwmon_channel_info. * 'hwmon-const' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: constify pointers to hwmon_channel_info Link: https://lore.kernel.org/all/3a0391e7-21f6-432a-9872-329e298e1582@roeck-us.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-09Merge tag 'cxl-fixes-6.3-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull compute express link (cxl) fixes from Dan Williams: "Several fixes for driver startup regressions that landed during the merge window as well as some older bugs. The regressions were due to a lack of testing with what the CXL specification calls Restricted CXL Host (RCH) topologies compared to the testing with Virtual Host (VH) CXL topologies. A VH topology is typical PCIe while RCH topologies map CXL endpoints as Root Complex Integrated endpoints. The impact is some driver crashes on startup. This merge window also added compatibility for range registers (the mechanism that CXL 1.1 defined for mapping memory) to treat them like HDM decoders (the mechanism that CXL 2.0 defined for mapping Host-managed Device Memory). That work collided with the new region enumeration code that was tested with CXL 2.0 setups, and fails with crashes at startup. Lastly, the DOE (Data Object Exchange) implementation for retrieving an ACPI-like data table from CXL devices is being reworked for v6.4. Several fixes fell out of that work that are suitable for v6.3. All of this has been in linux-next for a while, and all reported issues [1] have been addressed. Summary: - Fix several issues with region enumeration in RCH topologies that can trigger crashes on driver startup or shutdown. - Fix CXL DVSEC range register compatibility versus region enumeration that leads to startup crashes - Fix CDAT endiannes handling - Fix multiple buffer handling boundary conditions - Fix Data Object Exchange (DOE) workqueue usage vs CONFIG_DEBUG_OBJECTS warn splats" Link: http://lore.kernel.org/r/20230405075704.33de8121@canb.auug.org.au [1] * tag 'cxl-fixes-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/hdm: Extend DVSEC range register emulation for region enumeration cxl/hdm: Limit emulation to the number of range registers cxl/region: Move coherence tracking into cxl_region_attach() cxl/region: Fix region setup/teardown for RCDs cxl/port: Fix find_cxl_root() for RCDs and simplify it cxl/hdm: Skip emulation when driver manages mem_enable cxl/hdm: Fix double allocation of @cxlhdm PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y PCI/DOE: Silence WARN splat with CONFIG_DEBUG_OBJECTS=y cxl/pci: Handle excessive CDAT length cxl/pci: Handle truncated CDAT entries cxl/pci: Handle truncated CDAT header cxl/pci: Fix CDAT retrieval on big endian
2023-04-09net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stubVladimir Oltean
There was a sort of rush surrounding commit 88c0a6b503b7 ("net: create a netdev notifier for DSA to reject PTP on DSA master"), due to a desire to convert DSA's attempt to deny TX timestamping on a DSA master to something that doesn't block the kernel-wide API conversion from ndo_eth_ioctl() to ndo_hwtstamp_set(). What was required was a mechanism that did not depend on ndo_eth_ioctl(), and what was provided was a mechanism that did not depend on ndo_eth_ioctl(), while at the same time introducing something that wasn't absolutely necessary - a new netdev notifier. There have been objections from Jakub Kicinski that using notifiers in general when they are not absolutely necessary creates complications to the control flow and difficulties to maintainers who look at the code. So there is a desire to not use notifiers. In addition to that, the notifier chain gets called even if there is no DSA in the system and no one is interested in applying any restriction. Take the model of udp_tunnel_nic_ops and introduce a stub mechanism, through which net/core/dev_ioctl.c can call into DSA even when CONFIG_NET_DSA=m. Compared to the code that existed prior to the notifier conversion, aka what was added in commits: - 4cfab3566710 ("net: dsa: Add wrappers for overloaded ndo_ops") - 3369afba1e46 ("net: Call into DSA netdevice_ops wrappers") this is different because we are not overloading any struct net_device_ops of the DSA master anymore, but rather, we are exposing a rather specific functionality which is orthogonal to which API is used to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set(). Also, what is similar is that both approaches use function pointers to get from built-in code to DSA. There is no point in replicating the function pointers towards __dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr). Instead, it is sufficient to introduce a singleton struct dsa_stubs, built into the kernel, which contains a single function pointer to __dsa_master_hwtstamp_validate(). I find this approach preferable to what we had originally, because dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any attempts to add any data encapsulation and hide DSA data structures from the outside world. Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-08Merge tag 'mm-hotfixes-stable-2023-04-07-16-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "28 hotfixes. 23 are cc:stable and the other five address issues which were introduced during this merge cycle. 20 are for MM and the remainder are for other subsystems" * tag 'mm-hotfixes-stable-2023-04-07-16-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits) maple_tree: fix a potential concurrency bug in RCU mode maple_tree: fix get wrong data_end in mtree_lookup_walk() mm/swap: fix swap_info_struct race between swapoff and get_swap_pages() nilfs2: fix sysfs interface lifetime mm: take a page reference when removing device exclusive entries mm: vmalloc: avoid warn_alloc noise caused by fatal signal nilfs2: initialize "struct nilfs_binfo_dat"->bi_pad field nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread() zsmalloc: document freeable stats zsmalloc: document new fullness grouping fsdax: force clear dirty mark if CoW mm/hugetlb: fix uffd wr-protection for CoW optimization path mm: enable maple tree RCU mode by default maple_tree: add RCU lock checking to rcu callback functions maple_tree: add smp_rmb() to dead node detection maple_tree: fix write memory barrier of nodes once dead for RCU mode maple_tree: remove extra smp_wmb() from mas_dead_leaves() maple_tree: fix freeing of nodes in rcu mode maple_tree: detect dead nodes in mas_start() maple_tree: be more cautious about dead nodes ...
2023-04-07hwmon: constify pointers to hwmon_channel_infoKrzysztof Kozlowski
HWmon core receives an array of pointers to hwmon_channel_info and it does not modify it, thus it can be array of const pointers for safety. This allows drivers to make them also const. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net>