summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2024-02-08wifi: mac80211: update beacon counters per link basisAditya Kumar Singh
Currently, function to update beacon counter uses deflink to fetch the beacon and then update the counter. However, with MLO, there is a need to update the counter for the beacon in a particular link. Add support to use link_id in order to fetch the beacon from a particular link data during beacon update counter. Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com> Link: https://msgid.link/20240130140918.1172387-3-quic_adisi@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-08wifi: cfg80211: send link id in channel_switch opsAditya Kumar Singh
Currently, during channel switch, no link id information is passed down. In order to support channel switch during Multi Link Operation, it is required to pass link id as well. Add changes to pass link id in the channel_switch cfg80211_ops. Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com> Link: https://msgid.link/20240130140918.1172387-2-quic_adisi@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-08wifi: cfg80211: Add utility for converting op_class into chandefMichael-CY Lee
This utility is used in STA CSA handling. The op_class in the ECSA Element can be converted into chandef. Co-developed-by: Money Wang <money.wang@mediatek.com> Signed-off-by: Michael-CY Lee <michael-cy.lee@mediatek.com> Link: https://msgid.link/20231222010914.6521-2-michael-cy.lee@mediatek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-08wifi: cfg80211/mac80211: move puncturing into chandefJohannes Berg
Aloka originally suggested that puncturing should be part of the chandef, so that it's treated correctly. At the time, I disagreed and it ended up not part of the chandef, but I've now realized that this was wrong. Even for clients, the RX, and perhaps more importantly, CCA configuration needs to take puncturing into account. Move puncturing into the chandef, and adjust all the code accordingly. Also add a few tests for puncturing in chandef compatibility checking. Link: https://lore.kernel.org/linux-wireless/20220214223051.3610-1-quic_alokad@quicinc.com/ Suggested-by: Aloka Dixit <quic_alokad@quicinc.com> Link: https://msgid.link/20240129194108.307183a5d2e5.I4d7fe2f126b2366c1312010e2900dfb2abffa0f6@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-08wifi: cfg80211: simplify cfg80211_chandef_compatible()Johannes Berg
Simplify cfg80211_chandef_compatible() a bit by switching c1 and c2 around so that c1 is always the narrower one (once they're not identical or narrow/S1G). Then we can just check the various primary channels and exit with the wider one (c2), or NULL. Also refactor the primary 40/80/160 function to not have all the calculations hard-coded, and use a wrapper around it to check primary 40/80/160 compatibility. While at it, add some kunit tests for this functionality. Also expose the new cfg80211_chandef_primary_freq() to drivers, mac80211 will use it. Link: https://msgid.link/20240129194108.be3e6eccaba3.I8399c2ff1435d7378e5837794cb5aa6dd2ee1416@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-08wifi: mac80211: support wider bandwidth OFDMA configJohannes Berg
EHT requires that stations are able to participate in wider bandwidth OFDMA, i.e. parse downlink OFDMA and uplink OFDMA triggers when they're not capable of (or not connected at) the (wider) bandwidth that the AP is using. This requires hardware configuration, since the entity responsible for parsing (possibly hardware) needs to know the AP bandwidth. To support this, change the channel request to have the AP's bandwidth for clients, and track that in the channel context in mac80211. This means that the same chandef might need to be split up into two different contexts, if the APs are different. Interfaces other than client are not participating in OFDMA the same way, so they don't request any AP setting. Note that this doesn't introduce any API to split a channel context, so that there are cases where this might lead to a disconnect, e.g. if there are two client interfaces using the same channel context, e.g. both 160 MHz connected to different 320 MHz APs, and one of the APs switches to 160 MHz. Note also there are possible cases where this can be optimised, e.g. when using the upper or lower 160 Mhz, but I haven't been able to really fully understand the spec and/or hardware limitations. If, for some reason, there are no hardware limits on this because the OFDMA (downlink/trigger) parsing is done in firmware and can take the transmitter into account, then drivers can set the new flag IEEE80211_VIF_IGNORE_OFDMA_WIDER_BW on interfaces to not have them request any AP bandwidth in the channel context and ignore this issue entirely. The bss_conf still contains the AP configuration (if any, i.e. EHT) in the chanreq. Link: https://msgid.link/20240129194108.d3d5b35dd783.I939d04674f4ff06f39934b1591c8d36a30ce74c2@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-08wifi: mac80211: introduce 'channel request'Johannes Berg
For channel contexts, mac80211 currently uses the cfg80211 chandef struct (control channel, center freq(s), width) to define towards drivers and internally how these behave. In fact, there are _two_ such structs used, where the min_def can reduce bandwidth according to the stations connected. Unfortunately, with EHT this is longer be sufficient, at least not for all hardware. EHT requires that non-AP STAs that are connected to an AP with a lower bandwidth than it (the AP) advertises (e.g. 160 MHz STA connected to 320 MHz AP) still be able to receive downlink OFDMA and respond to trigger frames for uplink OFDMA that specify the position and bandwidth for the non-AP STA relative to the channel the AP is using. Therefore, they need to be aware of this, and at least for some hardware (e.g. Intel) this awareness is in the hardware. As a result, use of the "same" channel may need to be split over two channel contexts where they differ by the AP being used. As a first step, introduce a concept of a channel request ('chanreq') for each interface, to control the context it requests. This step does nothing but reorganise the code, so that later the AP's chandef can be added to the request in order to handle the EHT case described above. Link: https://msgid.link/20240129194108.2e88e48bd2e9.I4256183debe975c5ed71621611206fdbb69ba330@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-08wifi: mac80211: simplify non-chanctx driversJohannes Berg
There are still surprisingly many non-chanctx drivers, but in mac80211 that code is a bit awkward. Simplify this by having those drivers assign 'emulated' ops, so that the mac80211 code can be more unified between non-chanctx/chanctx drivers. This cuts the number of places caring about it by about 15, which are scattered across - now they're fewer and no longer in the channel context handling. Link: https://msgid.link/20240129194108.6d0ead50f5cf.I60d093b2fc81ca1853925a4d0ac3a2337d5baa5b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-08Merge tag 'nf-24-02-08' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Narrow down target/match revision to u8 in nft_compat. 2) Bail out with unused flags in nft_compat. 3) Restrict layer 4 protocol to u16 in nft_compat. 4) Remove static in pipapo get command that slipped through when reducing set memory footprint. 5) Follow up incremental fix for the ipset performance regression, this includes the missing gc cancellation, from Jozsef Kadlecsik. 6) Allow to filter by zone 0 in ctnetlink, do not interpret zone 0 as no filtering, from Felix Huettner. 7) Reject direction for NFT_CT_ID. 8) Use timestamp to check for set element expiration while transaction is handled to prevent garbage collection from removing set elements that were just added by this transaction. Packet path and netlink dump/get path still use current time to check for expiration. 9) Restore NF_REPEAT in nfnetlink_queue, from Florian Westphal. 10) map_index needs to be percpu and per-set, not just percpu. At this time its possible for a pipapo set to fill the all-zero part with ones and take the 'might have bits set' as 'start-from-zero' area. From Florian Westphal. This includes three patches: - Change scratchpad area to a structure that provides space for a per-set-and-cpu toggle and uses it of the percpu one. - Add a new free helper to prepare for the next patch. - Remove the scratch_aligned pointer and makes AVX2 implementation use the exact same memory addresses for read/store of the matching state. netfilter pull request 24-02-08 * tag 'nf-24-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_set_pipapo: remove scratch_aligned pointer netfilter: nft_set_pipapo: add helper to release pcpu scratch area netfilter: nft_set_pipapo: store index in scratch maps netfilter: nft_set_rbtree: skip end interval element from gc netfilter: nfnetlink_queue: un-break NF_REPEAT netfilter: nf_tables: use timestamp to check for set element timeout netfilter: nft_ct: reject direction for ct id netfilter: ctnetlink: fix filtering for zone 0 netfilter: ipset: Missing gc cancellations fixed netfilter: nft_set_pipapo: remove static in nft_pipapo_get() netfilter: nft_compat: restrict match/target protocol to u16 netfilter: nft_compat: reject unused compat flag netfilter: nft_compat: narrow down revision to unsigned 8-bits ==================== Link: https://lore.kernel.org/r/20240208112834.1433-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-08netfilter: nf_tables: use timestamp to check for set element timeoutPablo Neira Ayuso
Add a timestamp field at the beginning of the transaction, store it in the nftables per-netns area. Update set backend .insert, .deactivate and sync gc path to use the timestamp, this avoids that an element expires while control plane transaction is still unfinished. .lookup and .update, which are used from packet path, still use the current time to check if the element has expired. And .get path and dump also since this runs lockless under rcu read size lock. Then, there is async gc which also needs to check the current time since it runs asynchronously from a workqueue. Fixes: c3e1b005ed1c ("netfilter: nf_tables: add set element timeout support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-08Merge wireless into wireless-nextJohannes Berg
There are some changes coming to wireless-next that will otherwise cause conflicts, pull wireless in first to be able to resolve that when applying the individual changes rather than having to do merge resolution later. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-07ip_tunnel: use exit_batch_rtnl() methodEric Dumazet
exit_batch_rtnl() is called while RTNL is held, and devices to be unregistered can be queued in the dev_kill_list. This saves one rtnl_lock()/rtnl_unlock() pair and one unregister_netdevice_many() call. This patch takes care of ipip, ip_vti, and ip_gre tunnels. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://lore.kernel.org/r/20240206144313.2050392-15-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-07ipv4: add __unregister_nexthop_notifier()Eric Dumazet
unregister_nexthop_notifier() assumes the caller does not hold rtnl. We need in the following patch to use it from a context already holding rtnl. Add __unregister_nexthop_notifier(). unregister_nexthop_notifier() becomes a wrapper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://lore.kernel.org/r/20240206144313.2050392-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-07net: add exit_batch_rtnl() methodEric Dumazet
Many (struct pernet_operations)->exit_batch() methods have to acquire rtnl. In presence of rtnl mutex pressure, this makes cleanup_net() very slow. This patch adds a new exit_batch_rtnl() method to reduce number of rtnl acquisitions from cleanup_net(). exit_batch_rtnl() handlers are called while rtnl is locked, and devices to be killed can be queued in a list provided as their second argument. A single unregister_netdevice_many() is called right before rtnl is released. exit_batch_rtnl() handlers are called before ->exit() and ->exit_batch() handlers. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://lore.kernel.org/r/20240206144313.2050392-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-07Merge tag 'mlx5-updates-2024-02-01' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2024-02-01 1) IPSec global stats for xfrm and mlx5 2) XSK memory improvements for non-linear SKBs 3) Software steering debug dump to use seq_file ops 4) Various code clean-ups * tag 'mlx5-updates-2024-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: XDP, Exclude headroom and tailroom from memory calculations net/mlx5e: XSK, Exclude tailroom from non-linear SKBs memory calculations net/mlx5: DR, Change SWS usage to debug fs seq_file interface net/mlx5: Change missing SyncE capability print to debug net/mlx5: Remove initial segmentation duplicate definitions net/mlx5: Return specific error code for timeout on wait_fw_init net/mlx5: SF, Stop waiting for FW as teardown was called net/mlx5: remove fw reporter dump option for non PF net/mlx5: remove fw_fatal reporter dump option for non PF net/mlx5: Rename mlx5_sf_dev_remove Documentation: Fix counter name of mlx5 vnic reporter net/mlx5e: Delete obsolete IPsec code net/mlx5e: Connect mlx5 IPsec statistics with XFRM core xfrm: get global statistics from the offloaded device xfrm: generalize xdo_dev_state_update_curlft to allow statistics update ==================== Link: https://lore.kernel.org/r/20240206005527.1353368-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-07Merge tag 'wireless-2024-02-06' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Kalle Valo says: ==================== wireless fixes for v6.8-rc4 This time we have unusually large wireless pull request. Several functionality fixes to both stack and iwlwifi. Lots of fixes to warnings, especially to MODULE_DESCRIPTION(). * tag 'wireless-2024-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (31 commits) wifi: mt76: mt7996: fix fortify warning wifi: brcmfmac: Adjust n_channels usage for __counted_by wifi: iwlwifi: do not announce EPCS support wifi: iwlwifi: exit eSR only after the FW does wifi: iwlwifi: mvm: fix a battery life regression wifi: mac80211: accept broadcast probe responses on 6 GHz wifi: mac80211: adding missing drv_mgd_complete_tx() call wifi: mac80211: fix waiting for beacons logic wifi: mac80211: fix unsolicited broadcast probe config wifi: mac80211: initialize SMPS mode correctly wifi: mac80211: fix driver debugfs for vif type change wifi: mac80211: set station RX-NSS on reconfig wifi: mac80211: fix RCU use in TDLS fast-xmit wifi: mac80211: improve CSA/ECSA connection refusal wifi: cfg80211: detect stuck ECSA element in probe resp wifi: iwlwifi: remove extra kernel-doc wifi: fill in MODULE_DESCRIPTION()s for mt76 drivers wifi: fill in MODULE_DESCRIPTION()s for wilc1000 wifi: fill in MODULE_DESCRIPTION()s for wl18xx wifi: fill in MODULE_DESCRIPTION()s for p54spi ... ==================== Link: https://lore.kernel.org/r/20240206095722.CD9D2C433F1@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-07netlabel: cleanup struct netlbl_lsm_catmapGeorge Guo
Simplify the code from macro NETLBL_CATMAP_MAPTYPE to u64, and fix warning "Macros with complex values should be enclosed in parentheses" on "#define NETLBL_CATMAP_BIT (NETLBL_CATMAP_MAPTYPE)0x01", which is modified to "#define NETLBL_CATMAP_BIT ((u64)0x01)". Signed-off-by: George Guo <guodongtai@kylinos.cn> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-06bonding: Add independent control state machineAahil Awatramani
Add support for the independent control state machine per IEEE 802.1AX-2008 5.4.15 in addition to the existing implementation of the coupled control state machine. Introduces two new states, AD_MUX_COLLECTING and AD_MUX_DISTRIBUTING in the LACP MUX state machine for separated handling of an initial Collecting state before the Collecting and Distributing state. This enables a port to be in a state where it can receive incoming packets while not still distributing. This is useful for reducing packet loss when a port begins distributing before its partner is able to collect. Added new functions such as bond_set_slave_tx_disabled_flags and bond_set_slave_rx_enabled_flags to precisely manage the port's collecting and distributing states. Previously, there was no dedicated method to disable TX while keeping RX enabled, which this patch addresses. Note that the regular flow process in the kernel's bonding driver remains unaffected by this patch. The extension requires explicit opt-in by the user (in order to ensure no disruptions for existing setups) via netlink support using the new bonding parameter coupled_control. The default value for coupled_control is set to 1 so as to preserve existing behaviour. Signed-off-by: Aahil Awatramani <aahila@google.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240202175858.1573852-1-aahila@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-06net: dst: Make dst_destroy() static and return void.Sebastian Andrzej Siewior
Since commit 52df157f17e56 ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle") dst_destroy() returns only NULL and no caller cares about the return value. There are no in in-tree users of dst_destroy() outside of the file. Make dst_destroy() static and return void. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240202163746.2489150-1-bigeasy@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-05xfrm: get global statistics from the offloaded deviceLeon Romanovsky
Iterate over all SAs in order to fill global IPsec statistics. Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-05xfrm: generalize xdo_dev_state_update_curlft to allow statistics updateLeon Romanovsky
In order to allow drivers to fill all statistics, change the name of xdo_dev_state_update_curlft to be xdo_dev_state_update_stats. Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-05sctp: preserve const qualifier in sctp_sk()Eric Dumazet
We can change sctp_sk() to propagate its argument const qualifier, thanks to container_of_const(). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-04net: make dev_unreg_count globalEric Dumazet
We can use a global dev_unreg_count counter instead of a per netns one. As a bonus we can factorize the changes done on it for bulk device removals. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-02net/sched: Add helper macros with module namesMichal Koutný
The macros are preparation for adding module aliases en mass in a separate commit. Although it would be tempting to create aliases like cls-foo for name cls_foo, this could not be used because modprobe utilities treat '-' and '_' interchangeably. In the end, the naming follows pattern of proto modules in linux/net.h. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240201130943.19536-2-mkoutny@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-02wifi: cfg80211: detect stuck ECSA element in probe respJohannes Berg
We recently added some validation that we don't try to connect to an AP that is currently in a channel switch process, since that might want the channel to be quiet or we might not be able to connect in time to hear the switching in a beacon. This was in commit c09c4f31998b ("wifi: mac80211: don't connect to an AP while it's in a CSA process"). However, we promptly got a report that this caused new connection failures, and it turns out that the AP that we now cannot connect to is permanently advertising an extended channel switch announcement, even with quiet. The AP in question was an Asus RT-AC53, with firmware 3.0.0.4.380_10760-g21a5898. As a first step, attempt to detect that we're dealing with such a situation, so mac80211 can use this later. Reported-by: coldolt <andypalmadi@gmail.com> Closes: https://lore.kernel.org/linux-wireless/CAJvGw+DQhBk_mHXeu6RTOds5iramMW2FbMB01VbKRA4YbHHDTA@mail.gmail.com/ Fixes: c09c4f31998b ("wifi: mac80211: don't connect to an AP while it's in a CSA process") Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240129131413.246972c8775e.Ibf834d7f52f9951a353b6872383da710a7358338@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-02-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-01Merge tag 'nf-24-01-31' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) TCP conntrack now only evaluates window negotiation for packets in the REPLY direction, from Ryan Schaefer. Otherwise SYN retransmissions trigger incorrect window scale negotiation. From Ryan Schaefer. 2) Restrict tunnel objects to NFPROTO_NETDEV which is where it makes sense to use this object type. 3) Fix conntrack pick up from the middle of SCTP_CID_SHUTDOWN_ACK packets. From Xin Long. 4) Another attempt from Jozsef Kadlecsik to address the slow down of the swap command in ipset. 5) Replace a BUG_ON by WARN_ON_ONCE in nf_log, and consolidate check for the case that the logger is NULL from the read side lock section. 6) Address lack of sanitization for custom expectations. Restrict layer 3 and 4 families to what it is supported by userspace. * tag 'nf-24-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_ct: sanitize layer 3 and 4 protocol number in custom expectations netfilter: nf_log: replace BUG_ON by WARN_ON_ONCE when putting logger netfilter: ipset: fix performance regression in swap operation netfilter: conntrack: check SCTP_CID_SHUTDOWN_ACK for vtag setting in sctp_new netfilter: nf_tables: restrict tunnel object to NFPROTO_NETDEV netfilter: conntrack: correct window scaling with retransmitted SYN ==================== Link: https://lore.kernel.org/r/20240131225943.7536-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-31af_unix: fix lockdep positive in sk_diag_dump_icons()Eric Dumazet
syzbot reported a lockdep splat [1]. Blamed commit hinted about the possible lockdep violation, and code used unix_state_lock_nested() in an attempt to silence lockdep. It is not sufficient, because unix_state_lock_nested() is already used from unix_state_double_lock(). We need to use a separate subclass. This patch adds a distinct enumeration to make things more explicit. Also use swap() in unix_state_double_lock() as a clean up. v2: add a missing inline keyword to unix_state_lock_nested() [1] WARNING: possible circular locking dependency detected 6.8.0-rc1-syzkaller-00356-g8a696a29c690 #0 Not tainted syz-executor.1/2542 is trying to acquire lock: ffff88808b5df9e8 (rlock-AF_UNIX){+.+.}-{2:2}, at: skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863 but task is already holding lock: ffff88808b5dfe70 (&u->lock/1){+.+.}-{2:2}, at: unix_dgram_sendmsg+0xfc7/0x2200 net/unix/af_unix.c:2089 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&u->lock/1){+.+.}-{2:2}: lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 _raw_spin_lock_nested+0x31/0x40 kernel/locking/spinlock.c:378 sk_diag_dump_icons net/unix/diag.c:87 [inline] sk_diag_fill+0x6ea/0xfe0 net/unix/diag.c:157 sk_diag_dump net/unix/diag.c:196 [inline] unix_diag_dump+0x3e9/0x630 net/unix/diag.c:220 netlink_dump+0x5c1/0xcd0 net/netlink/af_netlink.c:2264 __netlink_dump_start+0x5d7/0x780 net/netlink/af_netlink.c:2370 netlink_dump_start include/linux/netlink.h:338 [inline] unix_diag_handler_dump+0x1c3/0x8f0 net/unix/diag.c:319 sock_diag_rcv_msg+0xe3/0x400 netlink_rcv_skb+0x1df/0x430 net/netlink/af_netlink.c:2543 sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280 netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline] netlink_unicast+0x7e6/0x980 net/netlink/af_netlink.c:1367 netlink_sendmsg+0xa37/0xd70 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] sock_write_iter+0x39a/0x520 net/socket.c:1160 call_write_iter include/linux/fs.h:2085 [inline] new_sync_write fs/read_write.c:497 [inline] vfs_write+0xa74/0xca0 fs/read_write.c:590 ksys_write+0x1a0/0x2c0 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b -> #0 (rlock-AF_UNIX){+.+.}-{2:2}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x1909/0x5ab0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162 skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863 unix_dgram_sendmsg+0x15d9/0x2200 net/unix/af_unix.c:2112 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] ____sys_sendmsg+0x592/0x890 net/socket.c:2584 ___sys_sendmsg net/socket.c:2638 [inline] __sys_sendmmsg+0x3b2/0x730 net/socket.c:2724 __do_sys_sendmmsg net/socket.c:2753 [inline] __se_sys_sendmmsg net/socket.c:2750 [inline] __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2750 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&u->lock/1); lock(rlock-AF_UNIX); lock(&u->lock/1); lock(rlock-AF_UNIX); *** DEADLOCK *** 1 lock held by syz-executor.1/2542: #0: ffff88808b5dfe70 (&u->lock/1){+.+.}-{2:2}, at: unix_dgram_sendmsg+0xfc7/0x2200 net/unix/af_unix.c:2089 stack backtrace: CPU: 1 PID: 2542 Comm: syz-executor.1 Not tainted 6.8.0-rc1-syzkaller-00356-g8a696a29c690 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x366/0x490 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x1909/0x5ab0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162 skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863 unix_dgram_sendmsg+0x15d9/0x2200 net/unix/af_unix.c:2112 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] ____sys_sendmsg+0x592/0x890 net/socket.c:2584 ___sys_sendmsg net/socket.c:2638 [inline] __sys_sendmmsg+0x3b2/0x730 net/socket.c:2724 __do_sys_sendmmsg net/socket.c:2753 [inline] __se_sys_sendmmsg net/socket.c:2750 [inline] __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2750 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7f26d887cda9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f26d95a60c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00007f26d89abf80 RCX: 00007f26d887cda9 RDX: 000000000000003e RSI: 00000000200bd000 RDI: 0000000000000004 RBP: 00007f26d88c947a R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000008c0 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007f26d89abf80 R15: 00007ffcfe081a68 Fixes: 2aac7a2cb0d9 ("unix_diag: Pending connections IDs NLA") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240130184235.1620738-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-31af_unix: Remove CONFIG_UNIX_SCM.Kuniyuki Iwashima
Originally, the code related to garbage collection was all in garbage.c. Commit f4e65870e5ce ("net: split out functions related to registering inflight socket files") moved some functions to scm.c for io_uring and added CONFIG_UNIX_SCM just in case AF_UNIX was built as module. However, since commit 97154bcf4d1b ("af_unix: Kconfig: make CONFIG_UNIX bool"), AF_UNIX is no longer built separately. Also, io_uring does not support SCM_RIGHTS now. Let's move the functions back to garbage.c Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20240129190435.57228-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-31af_unix: Remove io_uring code for GC.Kuniyuki Iwashima
Since commit 705318a99a13 ("io_uring/af_unix: disable sending io_uring over sockets"), io_uring's unix socket cannot be passed via SCM_RIGHTS, so it does not contribute to cyclic reference and no longer be candidate for garbage collection. Also, commit 6e5e6d274956 ("io_uring: drop any code related to SCM_RIGHTS") cleaned up SCM_RIGHTS code in io_uring. Let's do it in AF_UNIX as well by reverting commit 0091bfc81741 ("io_uring/af_unix: defer registered files gc to io_uring release") and commit 10369080454d ("net: reclaim skb->scm_io_uring bit"). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20240129190435.57228-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-31netfilter: nf_tables: restrict tunnel object to NFPROTO_NETDEVPablo Neira Ayuso
Bail out on using the tunnel dst template from other than netdev family. Add the infrastructure to check for the family in objects. Fixes: af308b94a2a4 ("netfilter: nf_tables: add tunnel support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-31Merge tag 'nf-next-24-01-29' of ↵David S. Miller
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== nf-next pr 2024-01-29 This batch contains updates for your *next* tree. First three changes, from Phil Sutter, allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon may re-attach/reassume ownership. Next patch, from Pablo, passes already-validated flags variable around rather than having called code re-fetch it from netlnik message. Patches 5 and 6 update ipvs and nf_conncount to use the recently introduced KMEM_CACHE() macro. Last three patches, from myself, tweak kconfig logic a little to permit a kernel configuration that can run iptables-over-nftables but not classic (setsockopt) iptables. Such builds lack the builtin-filter/mangle/raw/nat/security tables, the set/getsockopt interface and the "old blob format" interpreter/traverser. For now, this is 'oldconfig friendly', users need to manually deselect existing config options for this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-31ethtool: replace struct ethtool_eee with a new struct ethtool_keee on kernel ↵Heiner Kallweit
side In order to pass EEE link modes beyond bit 32 to userspace we have to complement the 32 bit bitmaps in struct ethtool_eee with linkmode bitmaps. Therefore, similar to ethtool_link_settings and ethtool_link_ksettings, add a struct ethtool_keee. In a first step it's an identical copy of ethtool_eee. This patch simply does a s/ethtool_eee/ethtool_keee/g for all users. No functional change intended. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-29netfilter: nf_tables: Implement table adoption supportPhil Sutter
Allow a new process to take ownership of a previously owned table, useful mostly for firewall management services restarting or suspending when idle. By extending __NFT_TABLE_F_UPDATE, the on/off/on check in nf_tables_updtable() also covers table adoption, although it is actually not needed: Table adoption is irreversible because nf_tables_updtable() rejects attempts to drop NFT_TABLE_F_OWNER so table->nlpid setting can happen just once within the transaction. If the transaction commences, table's nlpid and flags fields are already set and no further action is required. If it aborts, the table returns to orphaned state. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
2024-01-26Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-01-26 We've added 107 non-merge commits during the last 4 day(s) which contain a total of 101 files changed, 6009 insertions(+), 1260 deletions(-). The main changes are: 1) Add BPF token support to delegate a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. With addressed changes from Christian and Linus' reviews, from Andrii Nakryiko. 2) Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type, from Kui-Feng Lee. 3) Add support for retrieval of cookies for perf/kprobe multi links, from Jiri Olsa. 4) Bigger batch of prep-work for the BPF verifier to eventually support preserving boundaries and tracking scalars on narrowing fills, from Maxim Mikityanskiy. 5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help with the scenario of SYN floods, from Kuniyuki Iwashima. 6) Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects, from Hou Tao. 7) Extend BPF verifier to track aligned ST stores as imprecise spilled registers, from Yonghong Song. 8) Several fixes to BPF selftests around inline asm constraints and unsupported VLA code generation, from Jose E. Marchesi. 9) Various updates to the BPF IETF instruction set draft document such as the introduction of conformance groups for instructions, from Dave Thaler. 10) Fix BPF verifier to make infinite loop detection in is_state_visited() exact to catch some too lax spill/fill corner cases, from Eduard Zingerman. 11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly instead of implicitly for various register types, from Hao Sun. 12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness in neighbor advertisement at setup time, from Martin KaFai Lau. 13) Change BPF selftests to skip callback tests for the case when the JIT is disabled, from Tiezhu Yang. 14) Add a small extension to libbpf which allows to auto create a map-in-map's inner map, from Andrey Grafin. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits) selftests/bpf: Add missing line break in test_verifier bpf, docs: Clarify definitions of various instructions bpf: Fix error checks against bpf_get_btf_vmlinux(). bpf: One more maintainer for libbpf and BPF selftests selftests/bpf: Incorporate LSM policy to token-based tests selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar selftests/bpf: Add tests for BPF object load with implicit token selftests/bpf: Add BPF object loading tests with explicit token passing libbpf: Wire up BPF token support at BPF object level libbpf: Wire up token_fd into feature probing logic libbpf: Move feature detection code into its own file libbpf: Further decouple feature checking logic from bpf_object libbpf: Split feature detectors definitions from cached results selftests/bpf: Utilize string values for delegate_xxx mount options bpf: Support symbolic BPF FS delegation mount options bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS bpf,selinux: Allocate bpf_security_struct per BPF token selftests/bpf: Add BPF token-enabled tests libbpf: Add BPF token support to bpf_prog_load() API ... ==================== Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26ipmr: fix kernel panic when forwarding mcast packetsNicolas Dichtel
The stacktrace was: [ 86.305548] BUG: kernel NULL pointer dereference, address: 0000000000000092 [ 86.306815] #PF: supervisor read access in kernel mode [ 86.307717] #PF: error_code(0x0000) - not-present page [ 86.308624] PGD 0 P4D 0 [ 86.309091] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 86.309883] CPU: 2 PID: 3139 Comm: pimd Tainted: G U 6.8.0-6wind-knet #1 [ 86.311027] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 [ 86.312728] RIP: 0010:ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985) [ 86.313399] Code: f9 1f 0f 87 85 03 00 00 48 8d 04 5b 48 8d 04 83 49 8d 44 c5 00 48 8b 40 70 48 39 c2 0f 84 d9 00 00 00 49 8b 46 58 48 83 e0 fe <80> b8 92 00 00 00 00 0f 84 55 ff ff ff 49 83 47 38 01 45 85 e4 0f [ 86.316565] RSP: 0018:ffffad21c0583ae0 EFLAGS: 00010246 [ 86.317497] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 86.318596] RDX: ffff9559cb46c000 RSI: 0000000000000000 RDI: 0000000000000000 [ 86.319627] RBP: ffffad21c0583b30 R08: 0000000000000000 R09: 0000000000000000 [ 86.320650] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 [ 86.321672] R13: ffff9559c093a000 R14: ffff9559cc00b800 R15: ffff9559c09c1d80 [ 86.322873] FS: 00007f85db661980(0000) GS:ffff955a79d00000(0000) knlGS:0000000000000000 [ 86.324291] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 86.325314] CR2: 0000000000000092 CR3: 000000002f13a000 CR4: 0000000000350ef0 [ 86.326589] Call Trace: [ 86.327036] <TASK> [ 86.327434] ? show_regs (/build/work/knet/arch/x86/kernel/dumpstack.c:479) [ 86.328049] ? __die (/build/work/knet/arch/x86/kernel/dumpstack.c:421 /build/work/knet/arch/x86/kernel/dumpstack.c:434) [ 86.328508] ? page_fault_oops (/build/work/knet/arch/x86/mm/fault.c:707) [ 86.329107] ? do_user_addr_fault (/build/work/knet/arch/x86/mm/fault.c:1264) [ 86.329756] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.330350] ? __irq_work_queue_local (/build/work/knet/kernel/irq_work.c:111 (discriminator 1)) [ 86.331013] ? exc_page_fault (/build/work/knet/./arch/x86/include/asm/paravirt.h:693 /build/work/knet/arch/x86/mm/fault.c:1515 /build/work/knet/arch/x86/mm/fault.c:1563) [ 86.331702] ? asm_exc_page_fault (/build/work/knet/./arch/x86/include/asm/idtentry.h:570) [ 86.332468] ? ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985) [ 86.333183] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.333920] ipmr_mfc_add (/build/work/knet/./include/linux/rcupdate.h:782 /build/work/knet/net/ipv4/ipmr.c:1009 /build/work/knet/net/ipv4/ipmr.c:1273) [ 86.334583] ? __pfx_ipmr_hash_cmp (/build/work/knet/net/ipv4/ipmr.c:363) [ 86.335357] ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470) [ 86.336135] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.336854] ? ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470) [ 86.337679] do_ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:944) [ 86.338408] ? __pfx_unix_stream_read_actor (/build/work/knet/net/unix/af_unix.c:2862) [ 86.339232] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.339809] ? aa_sk_perm (/build/work/knet/security/apparmor/include/cred.h:153 /build/work/knet/security/apparmor/net.c:181) [ 86.340342] ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:1415) [ 86.340859] raw_setsockopt (/build/work/knet/net/ipv4/raw.c:836) [ 86.341408] ? security_socket_setsockopt (/build/work/knet/security/security.c:4561 (discriminator 13)) [ 86.342116] sock_common_setsockopt (/build/work/knet/net/core/sock.c:3716) [ 86.342747] do_sock_setsockopt (/build/work/knet/net/socket.c:2313) [ 86.343363] __sys_setsockopt (/build/work/knet/./include/linux/file.h:32 /build/work/knet/net/socket.c:2336) [ 86.344020] __x64_sys_setsockopt (/build/work/knet/net/socket.c:2340) [ 86.344766] do_syscall_64 (/build/work/knet/arch/x86/entry/common.c:52 /build/work/knet/arch/x86/entry/common.c:83) [ 86.345433] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.346161] ? syscall_exit_work (/build/work/knet/./include/linux/audit.h:357 /build/work/knet/kernel/entry/common.c:160) [ 86.346938] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.347657] ? syscall_exit_to_user_mode (/build/work/knet/kernel/entry/common.c:215) [ 86.348538] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223) [ 86.349262] ? do_syscall_64 (/build/work/knet/./arch/x86/include/asm/cpufeature.h:171 /build/work/knet/arch/x86/entry/common.c:98) [ 86.349971] entry_SYSCALL_64_after_hwframe (/build/work/knet/arch/x86/entry/entry_64.S:129) The original packet in ipmr_cache_report() may be queued and then forwarded with ip_mr_forward(). This last function has the assumption that the skb dst is set. After the below commit, the skb dst is dropped by ipv4_pktinfo_prepare(), which causes the oops. Fixes: bb7403655b3c ("ipmr: support IP_PKTINFO on cache report IGMP msg") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240125141847.1931933-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26af_unix: Try to run GC async.Kuniyuki Iwashima
If more than 16000 inflight AF_UNIX sockets exist and the garbage collector is not running, unix_(dgram|stream)_sendmsg() call unix_gc(). Also, they wait for unix_gc() to complete. In unix_gc(), all inflight AF_UNIX sockets are traversed at least once, and more if they are the GC candidate. Thus, sendmsg() significantly slows down with too many inflight AF_UNIX sockets. However, if a process sends data with no AF_UNIX FD, the sendmsg() call does not need to wait for GC. After this change, only the process that meets the condition below will be blocked under such a situation. 1) cmsg contains AF_UNIX socket 2) more than 32 AF_UNIX sent by the same user are still inflight Note that even a sendmsg() call that does not meet the condition but has AF_UNIX FD will be blocked later in unix_scm_to_skb() by the spinlock, but we allow that as a bonus for sane users. The results below are the time spent in unix_dgram_sendmsg() sending 1 byte of data with no FD 4096 times on a host where 32K inflight AF_UNIX sockets exist. Without series: the sane sendmsg() needs to wait gc unreasonably. $ sudo /usr/share/bcc/tools/funclatency -p 11165 unix_dgram_sendmsg Tracing 1 functions for "unix_dgram_sendmsg"... Hit Ctrl-C to end. ^C nsecs : count distribution [...] 524288 -> 1048575 : 0 | | 1048576 -> 2097151 : 3881 |****************************************| 2097152 -> 4194303 : 214 |** | 4194304 -> 8388607 : 1 | | avg = 1825567 nsecs, total: 7477526027 nsecs, count: 4096 With series: the sane sendmsg() can finish much faster. $ sudo /usr/share/bcc/tools/funclatency -p 8702 unix_dgram_sendmsg Tracing 1 functions for "unix_dgram_sendmsg"... Hit Ctrl-C to end. ^C nsecs : count distribution [...] 128 -> 255 : 0 | | 256 -> 511 : 4092 |****************************************| 512 -> 1023 : 2 | | 1024 -> 2047 : 0 | | 2048 -> 4095 : 0 | | 4096 -> 8191 : 1 | | 8192 -> 16383 : 1 | | avg = 410 nsecs, total: 1680510 nsecs, count: 4096 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240123170856.41348-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26af_unix: Return struct unix_sock from unix_get_socket().Kuniyuki Iwashima
Currently, unix_get_socket() returns struct sock, but after calling it, we always cast it to unix_sk(). Let's return struct unix_sock from unix_get_socket(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240123170856.41348-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26af_unix: Do not use atomic ops for unix_sk(sk)->inflight.Kuniyuki Iwashima
When touching unix_sk(sk)->inflight, we are always under spin_lock(&unix_gc_lock). Let's convert unix_sk(sk)->inflight to the normal unsigned long. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240123170856.41348-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26Revert "nl80211/cfg80211: Specify band specific min RSSI thresholds with ↵Jeff Johnson
sched scan" This *mostly* reverts commit 1e1b11b6a111 ("nl80211/cfg80211: Specify band specific min RSSI thresholds with sched scan"). During the review of a new patch [1] it was observed that the functionality being modified was not actually being used by any in-tree driver. Further research determined that the functionality was originally introduced to support a new Android interface, but that interface was subsequently abandoned. Since the functionality has apparently never been used, remove it. However, to mantain the sanctity of the UABI, keep the nl80211.h assignments, but clearly mark them as obsolete. Cc: Lin Ma <linma@zju.edu.cn> Cc: Vamsi Krishna <quic_vamsin@quicinc.com> Link: https://lore.kernel.org/linux-wireless/20240119151201.8670-1-linma@zju.edu.cn/ [1] Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://msgid.link/20240125-for-next-v1-1-fd79e01c6c09@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-01-26wifi: mac80211: add support for SPP A-MSDUsJohannes Berg
If software crypto is used, simply add support for SPP A-MSDUs (and use it whenever enabled as required by the cfg80211 API). If hardware crypto is used, leave it up to the driver to set the NL80211_EXT_FEATURE_SPP_AMSDU_SUPPORT flag and then check sta->spp_amsdu or the IEEE80211_KEY_FLAG_SPP_AMSDU key flag. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Daniel Gabay <daniel.gabay@intel.com> Reviewed-by: Gregory Greenman <gregory.greenman@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240102213313.b8ada4514e2b.I1ac25d5f158165b5a88062a5a5e4c4fbeecf9a5d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-01-26wifi: cfg80211: add support for SPP A-MSDUsJohannes Berg
Add SPP (signaling and payload protected) AMSDU support. Since userspace has to build the RSNX element, add an extended feature flag to indicate that this is supported. In order to avoid downgrade/mismatch attacks, add a flag to the assoc command on the station side, so that we can be sure that the value of the flag comes from the same RSNX element that will be validated by the supplicant against the 4-way-handshake. If we just pulled the data out of a beacon/probe response, we could theoretically look an RSNX element from a different frame, with a different value for this flag, than the supplicant is using to validate in the 4-way-handshake. Note that this patch is only geared towards software crypto implementations or hardware ones that can perfectly implement SPP A-MSDUs, i.e. are able to switch the AAD construction on the fly for each TX/RX frame. For more limited hardware implementations, more capability advertisement would be required, e.g. if the hardware has no way to switch this on the fly but has only a global configuration that must apply to all stations. The driver could of course *reject* mismatches, but the supplicant must know so it can do things like not negotiating SPP A-MSDUs on a T-DLS link when connected to an AP that doesn't support it, or similar. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Daniel Gabay <daniel.gabay@intel.com> Reviewed-by: Gregory Greenman <gregory.greenman@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240102213313.fadac8df7030.I9240aebcba1be49636a73c647ed0af862713fc6f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-01-26wifi: mac80211: add support for negotiated TTLM requestAyala Beker
Update neg_ttlm and active_links according to the new mapping, and send a negotiated TID-to-link map request with the new mapping. Signed-off-by: Ayala Beker <ayala.beker@intel.com> Reviewed-by: Gregory Greenman <gregory.greenman@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240102213313.eeb385d771df.I2a5441c14421de884dbd93d1624ce7bb2c944833@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-01-26wifi: mac80211: process and save negotiated TID to Link mapping requestAyala Beker
An MLD may send TID-to-Link mapping request frame to negotiate TID to link mapping with a peer MLD. Support handling negotiated TID-to-Link mapping request frame by parsing the frame, asking the driver whether it supports the received mapping or not, and sending a TID-to-Link mapping response to the AP MLD. Theoretically, links that became inactive due to the received TID-to-Link mapping request, can be selected to be activated but this would require tearing down the negotiated TID-to-Link mapping, which is still not supported. Signed-off-by: Ayala Beker <ayala.beker@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Gregory Greenman <gregory.greenman@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240102213313.0bc1a24fcc9d.Ie72e47dc6f8c77d4a2f0947b775ef6367fe0edac@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-01-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25Merge tag 'for-netdev' of ↵Paolo Abeni
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-01-25 The following pull-request contains BPF updates for your *net* tree. We've added 12 non-merge commits during the last 2 day(s) which contain a total of 13 files changed, 190 insertions(+), 91 deletions(-). The main changes are: 1) Fix bpf_xdp_adjust_tail() in context of XSK zero-copy drivers which support XDP multi-buffer. The former triggered a NULL pointer dereference upon shrinking, from Maciej Fijalkowski & Tirthendu Sarkar. 2) Fix a bug in riscv64 BPF JIT which emitted a wrong prologue and epilogue for struct_ops programs, from Pu Lehui. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue i40e: set xdp_rxq_info::frag_size xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers ice: remove redundant xdp_rxq_info registration i40e: handle multi-buffer packets that are shrunk by xdp prog ice: work on pre-XDP prog frag count xsk: fix usage of multi-buffer BPF helpers for ZC XDP xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags xsk: recycle buffer in case Rx queue was full riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops ==================== Link: https://lore.kernel.org/r/20240125084416.10876-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-24xsk: fix usage of multi-buffer BPF helpers for ZC XDPMaciej Fijalkowski
Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens: [1136314.192256] BUG: kernel NULL pointer dereference, address: 0000000000000034 [1136314.203943] #PF: supervisor read access in kernel mode [1136314.213768] #PF: error_code(0x0000) - not-present page [1136314.223550] PGD 0 P4D 0 [1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI [1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257 [1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210 [1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86 [1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246 [1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX: 0000000000000000 [1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9003168c000 [1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09: 0000000000010000 [1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12: 0000000000000001 [1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15: 0000000000000001 [1136314.373298] FS: 00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000) knlGS:0000000000000000 [1136314.386105] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4: 00000000007706f0 [1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1136314.431890] PKRU: 55555554 [1136314.439143] Call Trace: [1136314.446058] <IRQ> [1136314.452465] ? __die+0x20/0x70 [1136314.459881] ? page_fault_oops+0x15b/0x440 [1136314.468305] ? exc_page_fault+0x6a/0x150 [1136314.476491] ? asm_exc_page_fault+0x22/0x30 [1136314.484927] ? __xdp_return+0x6c/0x210 [1136314.492863] bpf_xdp_adjust_tail+0x155/0x1d0 [1136314.501269] bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60 [1136314.511263] ice_clean_rx_irq_zc+0x206/0xc60 [ice] [1136314.520222] ? ice_xmit_zc+0x6e/0x150 [ice] [1136314.528506] ice_napi_poll+0x467/0x670 [ice] [1136314.536858] ? ttwu_do_activate.constprop.0+0x8f/0x1a0 [1136314.546010] __napi_poll+0x29/0x1b0 [1136314.553462] net_rx_action+0x133/0x270 [1136314.561619] __do_softirq+0xbe/0x28e [1136314.569303] do_softirq+0x3f/0x60 This comes from __xdp_return() call with xdp_buff argument passed as NULL which is supposed to be consumed by xsk_buff_free() call. To address this properly, in ZC case, a node that represents the frag being removed has to be pulled out of xskb_list. Introduce appropriate xsk helpers to do such node operation and use them accordingly within bpf_xdp_adjust_tail(). Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-4-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24xsk: make xsk_buff_pool responsible for clearing xdp_buff::flagsMaciej Fijalkowski
XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is used by drivers to notify data path whether xdp_buff contains fragments or not. Data path looks up mentioned flag on first buffer that occupies the linear part of xdp_buff, so drivers only modify it there. This is sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on stack or it resides within struct representing driver's queue and fragments are carried via skb_frag_t structs. IOW, we are dealing with only one xdp_buff. ZC mode though relies on list of xdp_buff structs that is carried via xsk_buff_pool::xskb_list, so ZC data path has to make sure that fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise, xsk_buff_free() could misbehave if it would be executed against xdp_buff that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can take place when within supplied XDP program bpf_xdp_adjust_tail() is used with negative offset that would in turn release the tail fragment from multi-buffer frame. Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would result in releasing all the nodes from xskb_list that were produced by driver before XDP program execution, which is not what is intended - only tail fragment should be deleted from xskb_list and then it should be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never make it up to user space, so from AF_XDP application POV there would be no traffic running, however due to free_list getting constantly new nodes, driver will be able to feed HW Rx queue with recycled buffers. Bottom line is that instead of traffic being redirected to user space, it would be continuously dropped. To fix this, let us clear the mentioned flag on xsk_buff_pool side during xdp_buff initialization, which is what should have been done right from the start of XSK multi-buffer support. Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support") Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support") Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24netfilter: nf_tables: cleanup documentationGeorge Guo
- Correct comments for nlpid, family, udlen and udata in struct nft_table, and afinfo is no longer a member of enum nft_set_class. - Add comment for data in struct nft_set_elem. - Add comment for flags in struct nft_ctx. - Add comments for timeout in struct nft_set_iter, and flags is not a member of struct nft_set_iter, remove the comment for it. - Add comments for commit, abort, estimate and gc_init in struct nft_set_ops. - Add comments for pending_update, num_exprs, exprs and catchall_list in struct nft_set. - Add comment for ext_len in struct nft_set_ext_tmpl. - Add comment for inner_ops in struct nft_expr_type. - Add comments for clone, destroy_clone, reduce, gc, offload, offload_action, offload_stats in struct nft_expr_ops. - Add comments for blob_gen_0, blob_gen_1, bound, genmask, udlen, udata, blob_next in struct nft_chain. - Add comment for flags in struct nft_base_chain. - Add comments for udlen, udata in struct nft_object. - Add comment for type in struct nft_object_ops. - Add comment for hook_list in struct nft_flowtable, and remove comments for dev_name and ops which are not members of struct nft_flowtable. Signed-off-by: George Guo <guodongtai@kylinos.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24net/sched: flower: Fix chain template offloadIdo Schimmel
When a qdisc is deleted from a net device the stack instructs the underlying driver to remove its flow offload callback from the associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack then continues to replay the removal of the filters in the block for this driver by iterating over the chains in the block and invoking the 'reoffload' operation of the classifier being used. In turn, the classifier in its 'reoffload' operation prepares and emits a 'FLOW_CLS_DESTROY' command for each filter. However, the stack does not do the same for chain templates and the underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when a qdisc is deleted. This results in a memory leak [1] which can be reproduced using [2]. Fix by introducing a 'tmplt_reoffload' operation and have the stack invoke it with the appropriate arguments as part of the replay. Implement the operation in the sole classifier that supports chain templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}' command based on whether a flow offload callback is being bound to a filter block or being unbound from one. As far as I can tell, the issue happens since cited commit which reordered tcf_block_offload_unbind() before tcf_block_flush_all_chains() in __tcf_block_put(). The order cannot be reversed as the filter block is expected to be freed after flushing all the chains. [1] unreferenced object 0xffff888107e28800 (size 2048): comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s) hex dump (first 32 bytes): b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff ..|......[...... 01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff ................ backtrace: [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff81ab374e>] __kmalloc+0x4e/0x90 [<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0 [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180 [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280 [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340 [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0 [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170 [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0 [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440 [<ffffffff83ac6270>] netlink_unicast+0x540/0x820 [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0 [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80 [<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0 [<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0 [<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0 unreferenced object 0xffff88816d2c0400 (size 1024): comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s) hex dump (first 32 bytes): 40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00 @.......W.8..... 10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff ..,m......,m.... backtrace: [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90 [<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0 [<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460 [<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0 [<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0 [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180 [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280 [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340 [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0 [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170 [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0 [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440 [<ffffffff83ac6270>] netlink_unicast+0x540/0x820 [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0 [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80 [2] # tc qdisc add dev swp1 clsact # tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32 # tc qdisc del dev swp1 clsact # devlink dev reload pci/0000:06:00.0 Fixes: bbf73830cd48 ("net: sched: traverse chains in block with tcf_get_next_chain()") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>