summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-01-22mptcp: re-enable sndbuf autotunePaolo Abeni
After commit 6e628cd3a8f7 ("mptcp: use mptcp release_cb for delayed tasks"), MPTCP never sets the flag bit SOCK_NOSPACE on its subflow. As a side effect, autotune never takes place, as it happens inside tcp_new_space(), which in turn is called only when the mentioned bit is set. Let's sendmsg() set the subflows NOSPACE bit when looking for more memory and use the subflow write_space callback to propagate the snd buf update and wake-up the user-space. Additionally, this allows dropping a bunch of duplicate code and makes the SNDBUF_LIMITED chrono relevant again for MPTCP subflows. Fixes: 6e628cd3a8f7 ("mptcp: use mptcp release_cb for delayed tasks") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22mptcp: always graft subflow socket to parentPaolo Abeni
Currently, incoming subflows link to the parent socket, while outgoing ones link to a per subflow socket. The latter is not really needed, except at the initial connect() time and for the first subflow. Always graft the outgoing subflow to the parent socket and free the unneeded ones early. This allows some code cleanup, reduces the amount of memory used and will simplify the next patch Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22tcp: add TTL to SCM_TIMESTAMPING_OPT_STATSYousuk Seung
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports the time-to-live or hop limit of the latest incoming packet with SCM_TSTAMP_ACK. The value exported may not be from the packet that acks the sequence when incoming packets are aggregated. Exporting the time-to-live or hop limit value of incoming packets helps to estimate the hop count of the path of the flow that may change over time. Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22Merge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A patch to zero out sensitive cryptographic data and two minor cleanups prompted by the fact that a bunch of code was moved in this cycle" * tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client: libceph: fix "Boolean result is used in bitwise operation" warning libceph, ceph: disambiguate ceph_connection_operations handlers libceph: zero out session key and connection secret
2021-01-22devlink: Support get and set state of port functionParav Pandit
devlink port function can be in active or inactive state. Allow users to get and set port function's state. When the port function it activated, its operational state may change after a while when the device is created and driver binds to it. Similarly on deactivation flow. To clearly describe the state of the port function and its device's operational state in the host system, define state and opstate attributes. Example of a PCI SF port which supports a port function: $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev $ devlink port show pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached $ devlink port show pci/0000:06:00.0/32768 pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:88:88 state inactive opstate detached $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active $ devlink port show pci/0000:06:00.0/32768 -jp { "port": { "pci/0000:06:00.0/32768": { "type": "eth", "netdev": "ens2f0npf0sf88", "flavour": "pcisf", "controller": 0, "pfnum": 0, "sfnum": 88, "external": false, "splittable": false, "function": { "hw_addr": "00:00:00:00:88:88", "state": "active", "opstate": "attached" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22devlink: Support add and delete devlink portParav Pandit
Extended devlink interface for the user to add and delete a port. Extend devlink to connect user requests to driver to add/delete a port in the device. Driver routines are invoked without holding devlink instance lock. This enables driver to perform several devlink objects registration, unregistration such as (port, health reporter, resource etc) by using existing devlink APIs. This also helps to uniformly use the code for port unregistration during driver unload and during port deletion initiated by user. Examples of add, show and delete commands: $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev $ devlink port show pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached $ devlink port show pci/0000:06:00.0/32768 pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached $ udevadm test-builtin net_id /sys/class/net/eth6 Load module index Parsed configuration file /usr/lib/systemd/network/99-default.link Created link configuration context. Using default interface naming scheme 'v245'. ID_NET_NAMING_SCHEME=v245 ID_NET_NAME_PATH=enp6s0f0npf0sf88 ID_NET_NAME_SLOT=ens2f0npf0sf88 Unload module index Unloaded link configuration context. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22devlink: Introduce PCI SF port flavour and port attributeParav Pandit
A PCI sub-function (SF) represents a portion of the device similar to PCI VF. In an eswitch, PCI SF may have port which is normally represented using a representor netdevice. To have better visibility of eswitch port, its association with SF, and its representor netdevice, introduce a PCI SF port flavour. When devlink port flavour is PCI SF, fill up PCI SF attributes of the port. Extend port name creation using PCI PF and SF number scheme on best effort basis, so that vendor drivers can skip defining their own scheme. This is done as cApfNSfM, where A, N and M are controller, PCI PF and PCI SF number respectively. This is similar to existing naming for PCI PF and PCI VF ports. An example view of a PCI SF port: $ devlink port show pci/0000:06:00.0/32768 pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:88:88 state active opstate attached $ devlink port show pci/0000:06:00.0/32768 -jp { "port": { "pci/0000:06:00.0/32768": { "type": "eth", "netdev": "ens2f0npf0sf88", "flavour": "pcisf", "controller": 0, "pfnum": 0, "sfnum": 88, "splittable": false, "function": { "hw_addr": "00:00:00:00:88:88", "state": "active", "opstate": "attached" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22devlink: Prepare code to fill multiple port function attributesParav Pandit
Prepare code to fill zero or more port function optional attributes. Subsequent patch makes use of this to fill more port function attributes. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22cfg80211: change netdev registration/unregistration semanticsJohannes Berg
We used to not require anything in terms of registering netdevs with cfg80211, using a netdev notifier instead. However, in the next patch reducing RTNL locking, this causes big problems, and the simplest way is to just require drivers to do things better. Change the registration/unregistration semantics to require the drivers to call cfg80211_(un)register_netdevice() when this is happening due to a cfg80211 request, i.e. add_virtual_intf() or del_virtual_intf() (or if it somehow has to happen in any other cfg80211 callback). Otherwise, in other contexts, drivers may continue to use the normal netdev (un)registration functions as usual. Internally, we still use the netdev notifier and track (by the new wdev->registered bool) if the wdev had already been added to cfg80211 or not. Link: https://lore.kernel.org/r/20210122161942.cf2f4b65e4e9.Ida8234e50da13eb675b557bac52a713ad4eddf71@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: fix rounding error in throughput calculationFelix Fietkau
On lower data rates, the throughput calculation has a significant rounding error, causing rates like 48M and 54M OFDM to share the same throughput value with >= 90% success probablity. This is because the result of the division (prob_avg * 1000) / nsecs is really small (8 in this example). Improve accuracy by moving over some zeroes, making better use of the full range of u32 before the division. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-10-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: increase stats update intervalFelix Fietkau
The shorter interval was leading to too many frames being used for probing Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-9-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: fix max probability rate selectionFelix Fietkau
- do not select rates faster than the max throughput rate if probability is lower - reset previous rate before sorting again This ensures that the max prob rate gets set to a more reliable rate Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-8-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: improve sample rate selectionFelix Fietkau
Always allow sampling of rates faster than the primary max throughput rate. When the second max_tp_rate is higher than the first one, sample attempts were previously skipped, potentially causing rate control to get stuck at a slightly lower rate Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-7-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: improve ampdu length estimationFelix Fietkau
If the driver does not report A-MPDU length, estimate it based on the rate. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-6-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: remove old ewma based rate average codeFelix Fietkau
The new noise filter has been the default for a while now with no reported downside and significant improvement compared to the old code. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-5-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: remove legacy minstrel rate controlFelix Fietkau
Now that minstrel_ht supports legacy rates, it is no longer needed Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-4-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: add support for OFDM rates on non-HT clientsFelix Fietkau
The legacy minstrel code is essentially unmaintained and receives only very little testing. In order to bring the significant algorithm improvements from minstrel_ht to legacy clients, this patch adds support for OFDM rates to minstrel_ht and removes the fallback to the legacy codepath. This also makes it work much better on hardware with rate selection constraints, e.g. mt76. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-3-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: clean up CCK codeFelix Fietkau
- move ack overhead out of rate duration table - remove cck_supported, cck_supported_short Preparation for adding OFDM legacy rates support Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-2-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: introduce aql_enable node in debugfsLorenzo Bianconi
Introduce aql_enable node in debugfs in order to enable/disable aql. This is useful for debugging purpose. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/e7a934d5d84e4796c4f97ea5de4e66c824296b07.1610214851.git.lorenzo@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22cfg80211: Add phyrate conversion support for extended MCS in 60GHz bandMax Chen
The current phyrate conversion does not include extended MCS and provides incorrect rates. Add a flag for extended MCS in DMG and add corresponding phyrate table for the correct conversions using base MCS in DMG specs. Signed-off-by: Max Chen <mxchen@codeaurora.org> Link: https://lore.kernel.org/r/1609977050-7089-2-git-send-email-mxchen@codeaurora.org [reduce data size, make a single WARN] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22cfg80211: add VHT rate entries for MCS-10 and MCS-11Arend van Spriel
Observed the warning in cfg80211_calculate_bitrate_vht() using an 11ac chip reporting MCS-11. Since devices reporting non-standard MCS-9 is already supported add similar entries for MCS-10 and MCS-11. Actually, the value of MCS-9@20MHz is slightly off so corrected that. Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com> Link: https://lore.kernel.org/r/20210105105839.3795-1-arend.vanspriel@broadcom.com [fix array size] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: reduce peer HE MCS/NSS to own capabilitiesWen Gong
For VHT capbility, we do intersection of MCS and NSS for peers in mac80211, to simplify drivers. Add this for HE as well. Signed-off-by: Wen Gong <wgong@codeaurora.org> Link: https://lore.kernel.org/r/1609816120-9411-3-git-send-email-wgong@codeaurora.org [reword commit message, style cleanups, fix endian annotations] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21Merge branch 'master' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2021-01-21 1) Fix a rare panic on SMP systems when packet reordering happens between anti replay check and update. From Shmulik Ladkani. 2) Fix disable_xfrm sysctl when used on xfrm interfaces. From Eyal Birger. 3) Fix a race in PF_KEY when the availability of crypto algorithms is set. From Cong Wang. 4) Fix a return value override in the xfrm policy selftests. From Po-Hsu Lin. 5) Fix an integer wraparound in xfrm_policy_addr_delta. From Visa Hankala. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Fix wraparound in xfrm_policy_addr_delta() selftests: xfrm: fix test return value override issue in xfrm_policy.sh af_key: relax availability checks for skb size calculation xfrm: fix disable_xfrm sysctl when used on xfrm interfaces xfrm: Fix oops in xfrm_replay_advance_bmp ==================== Link: https://lore.kernel.org/r/20210121121558.621339-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-21libceph: fix "Boolean result is used in bitwise operation" warningIlya Dryomov
This line dates back to 2013, but cppcheck complained because commit 2f713615ddd9 ("libceph: move msgr1 protocol implementation to its own file") moved it. Add parenthesis to silence the warning. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-01-21mac80211: remove NSS number of 160MHz if not support 160MHz for HEWen Gong
When it does not support 160MHz in HE phy capabilities information, it should not treat the NSS number of 160MHz as a valid number, otherwise the final NSS will be set to 0. Signed-off-by: Wen Gong <wgong@codeaurora.org> Link: https://lore.kernel.org/r/1609816120-9411-2-git-send-email-wgong@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21mac80211: 160MHz with extended NSS BW in CSAShay Bar
Upon receiving CSA with 160MHz extended NSS BW from associated AP, STA should set the HT operation_mode based on new_center_freq_seg1 because it is later used as ccfs2 in ieee80211_chandef_vht_oper(). Signed-off-by: Aviad Brikman <aviad.brikman@celeno.com> Signed-off-by: Shay Bar <shay.bar@celeno.com> Link: https://lore.kernel.org/r/20201222064714.24888-1-shay.bar@celeno.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21mac80211: add LDPC encoding to ieee80211_parse_tx_radiotapPhilipp Borgers
This patch adds support for LDPC encoding to the radiotap tx parse function. Piror to this change adding the LDPC flag to the radiotap header did not encode frames with LDPC. Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de> Link: https://lore.kernel.org/r/20201219170710.11706-1-borgers@mi.fu-berlin.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21mac80211: add rx decapsulation offload supportFelix Fietkau
This allows drivers to pass 802.3 frames to mac80211, with some restrictions: - the skb must be passed with a valid sta - fast-rx needs to be active for the sta - monitor mode needs to be disabled mac80211 will tell the driver when it is safe to enable rx decap offload for a particular station. In order to implement support, a driver must: - call ieee80211_hw_set(hw, SUPPORTS_RX_DECAP_OFFLOAD) - implement ops->sta_set_decap_offload - mark 802.3 frames with RX_FLAG_8023 If it doesn't want to enable offload for some vif types, it can mask out IEEE80211_OFFLOAD_DECAP_ENABLED in vif->offload_flags from within the .add_interface or .update_vif_offload driver ops Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20201218184718.93650-6-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21net/fq_impl: do not maintain a backlog-sorted list of flowsFelix Fietkau
A sorted flow list is only needed to drop packets in the biggest flow when hitting the overmemory condition. By scanning flows only when needed, we can avoid paying the cost of maintaining the list under normal conditions In order to avoid scanning lots of empty flows and touching too many cold cache lines, a bitmap of flows with backlog is maintained Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20201218184718.93650-3-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21net/fq_impl: drop get_default_func, move default flow to fq_tinFelix Fietkau
Simplifies the code and prepares for a rework of scanning for flows on overmemory drop. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20201218184718.93650-2-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21Merge branch 'tty-splice' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into tty-next Fixes both the "splice/sendfile to a tty" and "splice/sendfile from a tty" regression from 5.10. * 'tty-splice' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux: tty: teach the n_tty ICANON case about the new "cookie continuations" too tty: teach n_tty line discipline about the new "cookie continuations" tty: clean up legacy leftovers from n_tty line discipline tty: implement read_iter tty: convert tty_ldisc_ops 'read()' function to take a kernel pointer tty: implement write_iter
2021-01-20ip_gre: remove CRC flag from dev features in gre_gso_segmentXin Long
This patch is to let it always do CRC checksum in sctp_gso_segment() by removing CRC flag from the dev features in gre_gso_segment() for SCTP over GRE, just as it does in Commit 527beb8ef9c0 ("udp: support sctp over udp in skb_udp_tunnel_segment") for SCTP over UDP. It could set csum/csum_start in GSO CB properly in sctp_gso_segment() after that commit, so it would do checksum with gso_make_checksum() in gre_gso_segment(), and Commit 622e32b7d4a6 ("net: gre: recompute gre csum for sctp over gre tunnels") can be reverted now. Note that when need_csum is false, we can still leave CRC checksum of SCTP to HW by not clearing this CRC flag if it's supported, as Jakub and Alex noticed. v1->v2: - improve the changelog. - fix "rev xmas tree" in varibles declaration. v2->v3: - remove CRC flag from dev features only when need_csum is true. Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/00439f24d5f69e2c6fa2beadc681d056c15c258f.1610772251.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20udp: not remove the CRC flag from dev features when need_csum is falseXin Long
In __skb_udp_tunnel_segment(), when it's a SCTP over VxLAN/GENEVE packet and need_csum is false, which means the outer udp checksum doesn't need to be computed, csum_start and csum_offset could be used by the inner SCTP CRC CSUM for SCTP HW CRC offload. So this patch is to not remove the CRC flag from dev features when need_csum is false. Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/1e81b700642498546eaa3f298e023fd7ad394f85.1610776757.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20net/sched: cls_flower add CT_FLAGS_INVALID flag supportwenxu
This patch add the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to match the ct_state with invalid for conntrack. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20net: inline rollback_registered_many()Jakub Kicinski
Similar to the change for rollback_registered() - rollback_registered_many() was a part of unregister_netdevice_many() minus the net_set_todo(), which is no longer needed. Functionally this patch moves the list_empty() check back after: BUG_ON(dev_boot_phase); ASSERT_RTNL(); but I can't find any reason why that would be an issue. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20net: move rollback_registered_many()Jakub Kicinski
Move rollback_registered_many() and add a temporary forward declaration to make merging the code into unregister_netdevice_many() easier to review. No functional changes. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20net: inline rollback_registered()Jakub Kicinski
rollback_registered() is a local helper, it's common for driver code to call unregister_netdevice_queue(dev, NULL) when they want to unregister netdevices under rtnl_lock. Inline rollback_registered() and adjust the only remaining caller. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20net: move net_set_todo inside rollback_registered()Jakub Kicinski
Commit 93ee31f14f6f ("[NET]: Fix free_netdev on register_netdev failure.") moved net_set_todo() outside of rollback_registered() so that rollback_registered() can be used in the failure path of register_netdevice() but without risking a double free. Since commit cf124db566e6 ("net: Fix inconsistent teardown and release of private netdev state."), however, we have a better way of handling that condition, since destructors don't call free_netdev() directly. After the change in commit c269a24ce057 ("net: make free_netdev() more lenient with unregistering devices") we can now move net_set_todo() back. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20nexthop: Specialize rtm_nh_policyPetr Machata
This policy is currently only used for creation of new next hops and new next hop groups. Rename it accordingly and remove the two attributes that are not valid in that context: NHA_GROUPS and NHA_MASTER. For consistency with other policies, do not mention policy array size in the declarator, and replace NHA_MAX for ARRAY_SIZE as appropriate. Note that with this commit, NHA_MAX and __NHA_MAX are not used anymore. Leave them in purely as a user API. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20nexthop: Use a dedicated policy for nh_valid_dump_req()Petr Machata
This function uses the global nexthop policy, but only accepts four particular attributes. Create a new policy that only includes the four supported attributes, and use it. Convert the loop to a series of ifs. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20nexthop: Use a dedicated policy for nh_valid_get_del_req()Petr Machata
This function uses the global nexthop policy only to then bounce all arguments except for NHA_ID. Instead, just create a new policy that only includes the one allowed attribute. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20tty: convert tty_ldisc_ops 'read()' function to take a kernel pointerLinus Torvalds
The tty line discipline .read() function was passed the final user pointer destination as an argument, which doesn't match the 'write()' function, and makes it very inconvenient to do a splice method for ttys. This is a conversion to use a kernel buffer instead. NOTE! It does this by passing the tty line discipline ->read() function an additional "cookie" to fill in, and an offset into the cookie data. The line discipline can fill in the cookie data with its own private information, and then the reader will repeat the read until either the cookie is cleared or it runs out of data. The only real user of this is N_HDLC, which can use this to handle big packets, even if the kernel buffer is smaller than the whole packet. Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-20bpf: Split cgroup_bpf_enabled per attach typeStanislav Fomichev
When we attach any cgroup hook, the rest (even if unused/unattached) start to contribute small overhead. In particular, the one we want to avoid is __cgroup_bpf_run_filter_skb which does two redirections to get to the cgroup and pushes/pulls skb. Let's split cgroup_bpf_enabled to be per-attach to make sure only used attach types trigger. I've dropped some existing high-level cgroup_bpf_enabled in some places because BPF_PROG_CGROUP_XXX_RUN macros usually have another cgroup_bpf_enabled check. I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for GETPEERNAME/GETSOCKNAME because type for cgroup_bpf_enabled[type] has to be constant and known at compile time. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210115163501.805133-4-sdf@google.com
2021-01-20bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVEStanislav Fomichev
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE. We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom call in do_tcp_getsockopt using the on-stack data. This removes 3% overhead for locking/unlocking the socket. Without this patch: 3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt | --3.30%--__cgroup_bpf_run_filter_getsockopt | --0.81%--__kmalloc With the patch applied: 0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern Note, exporting uapi/tcp.h requires removing netinet/tcp.h from test_progs.h because those headers have confliciting definitions. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
2021-01-20net, xdp: Introduce xdp_build_skb_from_frame utility routineLorenzo Bianconi
Introduce xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Respect to __xdp_build_skb_from_frame, xdp_build_skb_from_frame will allocate the skb object. Rely on xdp_build_skb_from_frame in veth driver. Introduce missing xdp metadata support in veth_xdp_rcv_one routine. Add missing metadata support in veth_xdp_rcv_one(). Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-20net, xdp: Introduce __xdp_build_skb_from_frame utility routineLorenzo Bianconi
Introduce __xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Rely on __xdp_build_skb_from_frame in cpumap code. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/4f9f4c6b3dd3933770c617eb6689dbc0c6e25863.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/can/dev.c commit 03f16c5075b2 ("can: dev: can_restart: fix use after free bug") commit 3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir") Code move. drivers/net/dsa/b53/b53_common.c commit 8e4052c32d6b ("net: dsa: b53: fix an off by one in checking "vlan->vid"") commit b7a9e0da2d1c ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects") Field rename. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20Merge tag 'net-5.11-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.11-rc5, including fixes from bpf, wireless, and can trees. Current release - regressions: - nfc: nci: fix the wrong NCI_CORE_INIT parameters Current release - new code bugs: - bpf: allow empty module BTFs Previous releases - regressions: - bpf: fix signed_{sub,add32}_overflows type handling - tcp: do not mess with cloned skbs in tcp_add_backlog() - bpf: prevent double bpf_prog_put call from bpf_tracing_prog_attach - bpf: don't leak memory in bpf getsockopt when optlen == 0 - tcp: fix potential use-after-free due to double kfree() - mac80211: fix encryption issues with WEP - devlink: use right genl user_ptr when handling port param get/set - ipv6: set multicast flag on the multicast route - tcp: fix TCP_USER_TIMEOUT with zero window Previous releases - always broken: - bpf: local storage helpers should check nullness of owner ptr passed - mac80211: fix incorrect strlen of .write in debugfs - cls_flower: call nla_ok() before nla_next() - skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too" * tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits) net: systemport: free dev before on error path net: usb: cdc_ncm: don't spew notifications net: mscc: ocelot: Fix multicast to the CPU port tcp: Fix potential use-after-free due to double kfree() bpf: Fix signed_{sub,add32}_overflows type handling can: peak_usb: fix use after free bugs can: vxcan: vxcan_xmit: fix use after free bug can: dev: can_restart: fix use after free bug tcp: fix TCP socket rehash stats mis-accounting net: dsa: b53: fix an off by one in checking "vlan->vid" tcp: do not mess with cloned skbs in tcp_add_backlog() selftests: net: fib_tests: remove duplicate log test net: nfc: nci: fix the wrong NCI_CORE_INIT parameters sh_eth: Fix power down vs. is_opened flag ordering net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled netfilter: rpfilter: mask ecn bits before fib lookup udp: mask TOS bits in udp_v4_early_demux() xsk: Clear pool even for inactive queues bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback sh_eth: Make PHY access aware of Runtime PM to fix reboot crash ...
2021-01-20tcp: Fix potential use-after-free due to double kfree()Kuniyuki Iwashima
Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct request_sock and then can allocate inet_rsk(req)->ireq_opt. After that, tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full socket into ehash and sets NULL to ireq_opt. Otherwise, tcp_v4_syn_recv_sock() has to reset inet_opt by NULL and free the full socket. The commit 01770a1661657 ("tcp: fix race condition when creating child sockets from syncookies") added a new path, in which more than one cores create full sockets for the same SYN cookie. Currently, the core which loses the race frees the full socket without resetting inet_opt, resulting in that both sock_put() and reqsk_put() call kfree() for the same memory: sock_put sk_free __sk_free sk_destruct __sk_destruct sk->sk_destruct/inet_sock_destruct kfree(rcu_dereference_protected(inet->inet_opt, 1)); reqsk_put reqsk_free __reqsk_free req->rsk_ops->destructor/tcp_v4_reqsk_destructor kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1)); Calling kmalloc() between the double kfree() can lead to use-after-free, so this patch fixes it by setting NULL to inet_opt before sock_put(). As a side note, this kind of issue does not happen for IPv6. This is because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which correspond to ireq_opt in IPv4. Fixes: 01770a166165 ("tcp: fix race condition when creating child sockets from syncookies") CC: Ricardo Dias <rdias@singlestore.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-01-20 1) Fix wrong bpf_map_peek_elem_proto helper callback, from Mircea Cirjaliu. 2) Fix signed_{sub,add32}_overflows type truncation, from Daniel Borkmann. 3) Fix AF_XDP to also clear pools for inactive queues, from Maxim Mikityanskiy. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix signed_{sub,add32}_overflows type handling xsk: Clear pool even for inactive queues bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback ==================== Link: https://lore.kernel.org/r/20210120163439.8160-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>