summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-12-12net: add IFF_NO_ADDRCONF and use it in bonding to prevent ipv6 addrconfXin Long
Currently, in bonding it reused the IFF_SLAVE flag and checked it in ipv6 addrconf to prevent ipv6 addrconf. However, it is not a proper flag to use for no ipv6 addrconf, for bonding it has to move IFF_SLAVE flag setting ahead of dev_open() in bond_enslave(). Also, IFF_MASTER/SLAVE are historical flags used in bonding and eql, as Jiri mentioned, the new devices like Team, Failover do not use this flag. So as Jiri suggested, this patch adds IFF_NO_ADDRCONF in priv_flags of the device to indicate no ipv6 addconf, and uses it in bonding and moves IFF_SLAVE flag setting back to its original place. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12net: tso: inline tso_count_descs()Yunsheng Lin
tso_count_descs() is a small function doing simple calculation, and tso_count_descs() is used in fast path, so inline it to reduce the overhead of calls. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Link: https://lore.kernel.org/r/20221212032426.16050-1-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12net: dsa: don't call ptp_classify_raw() if switch doesn't provide RX ↵Vladimir Oltean
timestamping ptp_classify_raw() is not exactly cheap, since it invokes a BPF program for every skb in the receive path. For switches which do not provide ds->ops->port_rxtstamp(), running ptp_classify_raw() provides precisely nothing, so check for the presence of the function pointer first, since that is much cheaper. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Link: https://lore.kernel.org/r/20221209175840.390707-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12Merge tag 'for-net-next-2022-12-12' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add a new VID/PID 0489/e0f2 for MT7922 - Add Realtek RTL8852BE support ID 0x0cb8:0xc559 - Add a new PID/VID 13d3/3549 for RTL8822CU - Add support for broadcom BCM43430A0 & BCM43430A1 - Add CONFIG_BT_HCIBTUSB_POLL_SYNC - Add CONFIG_BT_LE_L2CAP_ECRED - Add support for CYW4373A0 - Add support for RTL8723DS - Add more device IDs for WCN6855 - Add Broadcom BCM4377 family PCIe Bluetooth * tag 'for-net-next-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (51 commits) Bluetooth: Wait for HCI_OP_WRITE_AUTH_PAYLOAD_TO to complete Bluetooth: ISO: Avoid circular locking dependency Bluetooth: RFCOMM: don't call kfree_skb() under spin_lock_irqsave() Bluetooth: hci_core: don't call kfree_skb() under spin_lock_irqsave() Bluetooth: hci_bcsp: don't call kfree_skb() under spin_lock_irqsave() Bluetooth: hci_h5: don't call kfree_skb() under spin_lock_irqsave() Bluetooth: hci_ll: don't call kfree_skb() under spin_lock_irqsave() Bluetooth: hci_qca: don't call kfree_skb() under spin_lock_irqsave() Bluetooth: btusb: don't call kfree_skb() under spin_lock_irqsave() Bluetooth: btintel: Fix missing free skb in btintel_setup_combined() Bluetooth: hci_conn: Fix crash on hci_create_cis_sync Bluetooth: btintel: Fix existing sparce warnings Bluetooth: btusb: Fix existing sparce warning Bluetooth: btusb: Fix new sparce warnings Bluetooth: btusb: Add a new PID/VID 13d3/3549 for RTL8822CU Bluetooth: btusb: Add Realtek RTL8852BE support ID 0x0cb8:0xc559 dt-bindings: net: realtek-bluetooth: Add RTL8723DS Bluetooth: btusb: Add a new VID/PID 0489/e0f2 for MT7922 dt-bindings: bluetooth: broadcom: add BCM43430A0 & BCM43430A1 Bluetooth: hci_bcm4377: Fix missing pci_disable_device() on error in bcm4377_probe() ... ==================== Link: https://lore.kernel.org/r/20221212222322.1690780-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter. 2) Add DATA_SENT state to SCTP connection tracking helper, from Sriram Yagnaraman. 3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal. 4) Add bitmask support for ipset, from Vishwanath Pai. 5) Handle icmpv6 redirects as RELATED, from Florian Westphal. 6) Add WARN_ON_ONCE() to impossible case in flowtable datapath, from Li Qiong. 7) A large batch of IPVS updates to replace timer-based estimators by kthreads to scale up wrt. CPUs and workload (millions of estimators). Julian Anastasov says: This patchset implements stats estimation in kthread context. It replaces the code that runs on single CPU in timer context every 2 seconds and causing latency splats as shown in reports [1], [2], [3]. The solution targets setups with thousands of IPVS services, destinations and multi-CPU boxes. Spread the estimation on multiple (configured) CPUs and multiple time slots (timer ticks) by using multiple chains organized under RCU rules. When stats are not needed, it is recommended to use run_estimation=0 as already implemented before this change. RCU Locking: - As stats are now RCU-locked, tot_stats, svc and dest which hold estimator structures are now always freed from RCU callback. This ensures RCU grace period after the ip_vs_stop_estimator() call. Kthread data: - every kthread works over its own data structure and all such structures are attached to array. For now we limit kthreads depending on the number of CPUs. - even while there can be a kthread structure, its task may not be running, eg. before first service is added or while the sysctl var is set to an empty cpulist or when run_estimation is set to 0 to disable the estimation. - the allocated kthread context may grow from 1 to 50 allocated structures for timer ticks which saves memory for setups with small number of estimators - a task and its structure may be released if all estimators are unlinked from its chains, leaving the slot in the array empty - every kthread data structure allows limited number of estimators. Kthread 0 is also used to initially calculate the max number of estimators to allow in every chain considering a sub-100 microsecond cond_resched rate. This number can be from 1 to hundreds. - kthread 0 has an additional job of optimizing the adding of estimators: they are first added in temp list (est_temp_list) and later kthread 0 distributes them to other kthreads. The optimization is based on the fact that newly added estimator should be estimated after 2 seconds, so we have the time to offload the adding to chain from controlling process to kthread 0. - to add new estimators we use the last added kthread context (est_add_ktid). The new estimators are linked to the chains just before the estimated one, based on add_row. This ensures their estimation will start after 2 seconds. If estimators are added in bursts, common case if all services and dests are initially configured, we may spread the estimators to more chains and as result, reducing the initial delay below 2 seconds. Many thanks to Jiri Wiesner for his valuable comments and for spending a lot of time reviewing and testing the changes on different platforms with 48-256 CPUs and 1-8 NUMA nodes under different cpufreq governors. The new IPVS estimators do not use workqueue infrastructure because: - The estimation can take long time when using multiple IPVS rules (eg. millions estimator structures) and especially when box has multiple CPUs due to the for_each_possible_cpu usage that expects packets from any CPU. With est_nice sysctl we have more control how to prioritize the estimation kthreads compared to other processes/kthreads that have latency requirements (such as servers). As a benefit, we can see these kthreads in top and decide if we will need some further control to limit their CPU usage (max number of structure to estimate per kthread). - with kthreads we run code that is read-mostly, no write/lock operations to process the estimators in 2-second intervals. - work items are one-shot: as estimators are processed every 2 seconds, they need to be re-added every time. This again loads the timers (add_timer) if we use delayed works, as there are no kthreads to do the timings. [1] Report from Yunhong Jiang: https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/ [2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2 [3] Report from Dust: https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: ipvs: run_estimation should control the kthread tasks ipvs: add est_cpulist and est_nice sysctl vars ipvs: use kthreads for stats estimation ipvs: use u64_stats_t for the per-cpu counters ipvs: use common functions for stats allocation ipvs: add rcu protection to stats netfilter: flowtable: add a 'default' case to flowtable datapath netfilter: conntrack: set icmpv6 redirects as RELATED netfilter: ipset: Add support for new bitmask parameter netfilter: conntrack: merge ipv4+ipv6 confirm functions netfilter: conntrack: add sctp DATA_SENT state netfilter: nft_inner: fix IS_ERR() vs NULL check ==================== Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12Bluetooth: Wait for HCI_OP_WRITE_AUTH_PAYLOAD_TO to completeLuiz Augusto von Dentz
This make sure HCI_OP_WRITE_AUTH_PAYLOAD_TO completes before notifying the encryption change just as is done with HCI_OP_READ_ENC_KEY_SIZE. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: ISO: Avoid circular locking dependencyLuiz Augusto von Dentz
This attempts to avoid circular locking dependency between sock_lock and hdev_lock: WARNING: possible circular locking dependency detected 6.0.0-rc7-03728-g18dd8ab0a783 #3 Not tainted ------------------------------------------------------ kworker/u3:2/53 is trying to acquire lock: ffff888000254130 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}, at: iso_conn_del+0xbd/0x1d0 but task is already holding lock: ffffffff9f39a080 (hci_cb_list_lock){+.+.}-{3:3}, at: hci_le_cis_estabilished_evt+0x1b5/0x500 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (hci_cb_list_lock){+.+.}-{3:3}: __mutex_lock+0x10e/0xfe0 hci_le_remote_feat_complete_evt+0x17f/0x320 hci_event_packet+0x39c/0x7d0 hci_rx_work+0x2bf/0x950 process_one_work+0x569/0x980 worker_thread+0x2a3/0x6f0 kthread+0x153/0x180 ret_from_fork+0x22/0x30 -> #1 (&hdev->lock){+.+.}-{3:3}: __mutex_lock+0x10e/0xfe0 iso_connect_cis+0x6f/0x5a0 iso_sock_connect+0x1af/0x710 __sys_connect+0x17e/0x1b0 __x64_sys_connect+0x37/0x50 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x62/0xcc -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}: __lock_acquire+0x1b51/0x33d0 lock_acquire+0x16f/0x3b0 lock_sock_nested+0x32/0x80 iso_conn_del+0xbd/0x1d0 iso_connect_cfm+0x226/0x680 hci_le_cis_estabilished_evt+0x1ed/0x500 hci_event_packet+0x39c/0x7d0 hci_rx_work+0x2bf/0x950 process_one_work+0x569/0x980 worker_thread+0x2a3/0x6f0 kthread+0x153/0x180 ret_from_fork+0x22/0x30 other info that might help us debug this: Chain exists of: sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> &hdev->lock --> hci_cb_list_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(hci_cb_list_lock); lock(&hdev->lock); lock(hci_cb_list_lock); lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO); *** DEADLOCK *** 4 locks held by kworker/u3:2/53: #0: ffff8880021d9130 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work+0x4ad/0x980 #1: ffff888002387de0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work+0x4ad/0x980 #2: ffff888001ac0070 (&hdev->lock){+.+.}-{3:3}, at: hci_le_cis_estabilished_evt+0xc3/0x500 #3: ffffffff9f39a080 (hci_cb_list_lock){+.+.}-{3:3}, at: hci_le_cis_estabilished_evt+0x1b5/0x500 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: RFCOMM: don't call kfree_skb() under spin_lock_irqsave()Yang Yingliang
It is not allowed to call kfree_skb() from hardware interrupt context or with interrupts being disabled. So replace kfree_skb() with dev_kfree_skb_irq() under spin_lock_irqsave(). Fixes: 81be03e026dc ("Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: hci_core: don't call kfree_skb() under spin_lock_irqsave()Yang Yingliang
It is not allowed to call kfree_skb() from hardware interrupt context or with interrupts being disabled. So replace kfree_skb() with dev_kfree_skb_irq() under spin_lock_irqsave(). Fixes: 9238f36a5a50 ("Bluetooth: Add request cmd_complete and cmd_status functions") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: hci_conn: Fix crash on hci_create_cis_syncLuiz Augusto von Dentz
When attempting to connect multiple ISO sockets without using DEFER_SETUP may result in the following crash: BUG: KASAN: null-ptr-deref in hci_create_cis_sync+0x18b/0x2b0 Read of size 2 at addr 0000000000000036 by task kworker/u3:1/50 CPU: 0 PID: 50 Comm: kworker/u3:1 Not tainted 6.0.0-rc7-02243-gb84a13ff4eda #4373 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014 Workqueue: hci0 hci_cmd_sync_work Call Trace: <TASK> dump_stack_lvl+0x19/0x27 kasan_report+0xbc/0xf0 ? hci_create_cis_sync+0x18b/0x2b0 hci_create_cis_sync+0x18b/0x2b0 ? get_link_mode+0xd0/0xd0 ? __ww_mutex_lock_slowpath+0x10/0x10 ? mutex_lock+0xe0/0xe0 ? get_link_mode+0xd0/0xd0 hci_cmd_sync_work+0x111/0x190 process_one_work+0x427/0x650 worker_thread+0x87/0x750 ? process_one_work+0x650/0x650 kthread+0x14e/0x180 ? kthread_exit+0x50/0x50 ret_from_fork+0x22/0x30 </TASK> Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: Add quirk to disable MWS Transport ConfigurationSven Peter
Broadcom 4378/4387 controllers found in Apple Silicon Macs claim to support getting MWS Transport Layer Configuration, < HCI Command: Read Local Supported... (0x04|0x0002) plen 0 > HCI Event: Command Complete (0x0e) plen 68 Read Local Supported Commands (0x04|0x0002) ncmd 1 Status: Success (0x00) [...] Get MWS Transport Layer Configuration (Octet 30 - Bit 3)] [...] , but then don't actually allow the required command: > HCI Event: Command Complete (0x0e) plen 15 Get MWS Transport Layer Configuration (0x05|0x000c) ncmd 1 Status: Command Disallowed (0x0c) Number of transports: 0 Baud rate list: 0 entries 00 00 00 00 00 00 00 00 00 00 Signed-off-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: hci_event: Ignore reserved bits in LE Extended Adv ReportSven Peter
Broadcom controllers present on Apple Silicon devices use the upper 8 bits of the event type in the LE Extended Advertising Report for the channel on which the frame has been received. These bits are reserved according to the Bluetooth spec anyway such that we can just drop them to ensure that the advertising results are parsed correctly. The following excerpt from a btmon trace shows a report received on channel 37 by these controllers: > HCI Event: LE Meta Event (0x3e) plen 55 LE Extended Advertising Report (0x0d) Num reports: 1 Entry 0 Event type: 0x2513 Props: 0x0013 Connectable Scannable Use legacy advertising PDUs Data status: Complete Reserved (0x2500) Legacy PDU Type: Reserved (0x2513) Address type: Public (0x00) Address: XX:XX:XX:XX:XX:XX (Shenzhen Jingxun Software [...]) Primary PHY: LE 1M Secondary PHY: No packets SID: no ADI field (0xff) TX power: 127 dBm RSSI: -76 dBm (0xb4) Periodic advertising interval: 0.00 msec (0x0000) Direct address type: Public (0x00) Direct address: 00:00:00:00:00:00 (OUI 00-00-00) Data length: 0x1d [...] Flags: 0x18 Simultaneous LE and BR/EDR (Controller) Simultaneous LE and BR/EDR (Host) Company: Harman International Industries, Inc. (87) Data: [...] Service Data (UUID 0xfddf): Name (complete): JBL Flip 5 Signed-off-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: Use kzalloc instead of kmalloc/memsetKang Minchul
Replace kmalloc+memset by kzalloc for better readability and simplicity. This addresses the cocci warning below: WARNING: kzalloc should be used for d, instead of kmalloc/memset Signed-off-by: Kang Minchul <tegongkang@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: Fix EALREADY and ELOOP cases in bt_status()Christophe JAILLET
'err' is known to be <0 at this point. So, some cases can not be reached because of a missing "-". Add it. Fixes: ca2045e059c3 ("Bluetooth: Add bt_status") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: Add CONFIG_BT_LE_L2CAP_ECREDLuiz Augusto von Dentz
This adds CONFIG_BT_LE_L2CAP_ECRED which can be used to enable L2CAP Enhanced Credit Flow Control Mode by default, previously it was only possible to set it via module parameter (e.g. bluetooth.enable_ecred=1). Since L2CAP ECRED mode is required by the likes of EATT which is recommended for LE Audio this enables it by default. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-By: Tedd Ho-Jeong An <tedd.an@intel.com>
2022-12-12Bluetooth: MGMT: Fix error report for ADD_EXT_ADV_PARAMSInga Stotland
When validating the parameter length for MGMT_OP_ADD_EXT_ADV_PARAMS command, use the correct op code in error status report: was MGMT_OP_ADD_ADVERTISING, changed to MGMT_OP_ADD_EXT_ADV_PARAMS. Fixes: 12410572833a2 ("Bluetooth: Break add adv into two mgmt commands") Signed-off-by: Inga Stotland <inga.stotland@intel.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: hci_core: fix error handling in hci_register_dev()Yang Yingliang
If hci_register_suspend_notifier() returns error, the hdev and rfkill are leaked. We could disregard the error and print a warning message instead to avoid leaks, as it just means we won't be handing suspend requests. Fixes: 9952d90ea288 ("Bluetooth: Handle PM_SUSPEND_PREPARE and PM_POST_SUSPEND") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: Use kzalloc instead of kmalloc/memsetJiapeng Chong
Use kzalloc rather than duplicating its implementation, which makes code simple and easy to understand. ./net/bluetooth/hci_conn.c:2038:6-13: WARNING: kzalloc should be used for cp, instead of kmalloc/memset. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2406 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: hci_conn: use HCI dst_type values also for BISPauli Virtanen
For ISO BIS related functions in hci_conn.c, make dst_type values be HCI address type values, not ISO socket address type values. This makes it consistent with CIS functions. Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: hci_sync: cancel cmd_timer if hci_open failedArchie Pusaka
If a command is already sent, we take care of freeing it, but we also need to cancel the timeout as well. Signed-off-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-12-12Bluetooth: hci_sync: Fix not able to set force_static_addressLuiz Augusto von Dentz
force_static_address shall be writable while hdev is initing but is not considered powered yet since the static address is written only when powered. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Brian Gix <brian.gix@intel.com>
2022-12-12Bluetooth: hci_sync: Fix not setting static addressLuiz Augusto von Dentz
This attempts to program the address stored in hdev->static_addr after the init sequence has been complete: @ MGMT Command: Set Static A.. (0x002b) plen 6 Address: C0:55:44:33:22:11 (Static) @ MGMT Event: Command Complete (0x0001) plen 7 Set Static Address (0x002b) plen 4 Status: Success (0x00) Current settings: 0x00008200 Low Energy Static Address @ MGMT Event: New Settings (0x0006) plen 4 Current settings: 0x00008200 Low Energy Static Address < HCI Command: LE Set Random.. (0x08|0x0005) plen 6 Address: C0:55:44:33:22:11 (Static) > HCI Event: Command Complete (0x0e) plen 4 LE Set Random Address (0x08|0x0005) ncmd 1 Status: Success (0x00) @ MGMT Event: Command Complete (0x0001) plen 7 Set Powered (0x0005) plen 4 Status: Success (0x00) Current settings: 0x00008201 Powered Low Energy Static Address @ MGMT Event: New Settings (0x0006) plen 4 Current settings: 0x00008201 Powered Low Energy Static Address Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Brian Gix <brian.gix@intel.com>
2022-12-12sctp: sysctl: make extra pointers netns awareFiro Yang
Recently, a customer reported that from their container whose net namespace is different to the host's init_net, they can't set the container's net.sctp.rto_max to any value smaller than init_net.sctp.rto_min. For instance, Host: sudo sysctl net.sctp.rto_min net.sctp.rto_min = 1000 Container: echo 100 > /mnt/proc-net/sctp/rto_min echo 400 > /mnt/proc-net/sctp/rto_max echo: write error: Invalid argument This is caused by the check made from this'commit 4f3fdf3bc59c ("sctp: add check rto_min and rto_max in sysctl")' When validating the input value, it's always referring the boundary value set for the init_net namespace. Having container's rto_max smaller than host's init_net.sctp.rto_min does make sense. Consider that the rto between two containers on the same host is very likely smaller than it for two hosts. So to fix this problem, as suggested by Marcelo, this patch makes the extra pointers of rto_min, rto_max, pf_retrans, and ps_retrans point to the corresponding variables from the newly created net namespace while the new net namespace is being registered in sctp_sysctl_net_register. Fixes: 4f3fdf3bc59c ("sctp: add check rto_min and rto_max in sysctl") Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Firo Yang <firo.yang@suse.com> Link: https://lore.kernel.org/r/20221209054854.23889-1-firo.yang@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-12-11 We've added 74 non-merge commits during the last 11 day(s) which contain a total of 88 files changed, 3362 insertions(+), 789 deletions(-). The main changes are: 1) Decouple prune and jump points handling in the verifier, from Andrii. 2) Do not rely on ALLOW_ERROR_INJECTION for fmod_ret, from Benjamin. Merged from hid tree. 3) Do not zero-extend kfunc return values. Necessary fix for 32-bit archs, from Björn. 4) Don't use rcu_users to refcount in task kfuncs, from David. 5) Three reg_state->id fixes in the verifier, from Eduard. 6) Optimize bpf_mem_alloc by reusing elements from free_by_rcu, from Hou. 7) Refactor dynptr handling in the verifier, from Kumar. 8) Remove the "/sys" mount and umount dance in {open,close}_netns in bpf selftests, from Martin. 9) Enable sleepable support for cgrp local storage, from Yonghong. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (74 commits) selftests/bpf: test case for relaxed prunning of active_lock.id selftests/bpf: Add pruning test case for bpf_spin_lock bpf: use check_ids() for active_lock comparison selftests/bpf: verify states_equal() maintains idmap across all frames bpf: states_equal() must build idmap for all function frames selftests/bpf: test cases for regsafe() bug skipping check_id() bpf: regsafe() must not skip check_ids() docs/bpf: Add documentation for BPF_MAP_TYPE_SK_STORAGE selftests/bpf: Add test for dynptr reinit in user_ringbuf callback bpf: Use memmove for bpf_dynptr_{read,write} bpf: Move PTR_TO_STACK alignment check to process_dynptr_func bpf: Rework check_func_arg_reg_off bpf: Rework process_dynptr_func bpf: Propagate errors from process_* checks in check_func_arg bpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_func bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true bpf: Reuse freed element in free_by_rcu during allocation selftests/bpf: Bring test_offload.py back to life bpf: Fix comment error in fixup_kfunc_call function bpf: Do not zero-extend kfunc return values ... ==================== Link: https://lore.kernel.org/r/20221212024701.73809-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12Merge tag 'linux-can-next-for-6.2-20221212' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== linux-can-next-for-6.2-20221212 this is a pull request of 39 patches for net-next/master. The first 2 patches are by me fix a warning and coding style in the kvaser_usb driver. Vivek Yadav's patch sorts the includes of the m_can driver. Biju Das contributes 5 patches for the rcar_canfd driver improve the support for different IP core variants. Jean Delvare's patch for the ctucanfd drops the dependency on COMPILE_TEST. Vincent Mailhol's patch sorts the includes of the etas_es58x driver. Haibo Chen's contributes 2 patches that add i.MX93 support to the flexcan driver. Lad Prabhakar's patch updates the dt-bindings documentation of the rcar_canfd driver. Minghao Chi's patch converts the c_can platform driver to devm_platform_get_and_ioremap_resource(). In the next 7 patches Vincent Mailhol adds devlink support to the etas_es58x driver to report firmware, bootloader and hardware version. Xu Panda's patch converts a strncpy() -> strscpy() in the ucan driver. Ye Bin's patch removes a useless parameter from the AF_CAN protocol. The next 2 patches by Vincent Mailhol and remove unneeded or unused pointers to struct usb_interface in device's priv struct in the ucan and gs_usb driver. Vivek Yadav's patch cleans up the usage of the RAM initialization in the m_can driver. A patch by me add support for SO_MARK to the AF_CAN protocol. Geert Uytterhoeven's patch fixes the number of CAN channels in the rcan_canfd bindings documentation. In the last 11 patches Markus Schneider-Pargmann optimizes the register access in the t_can driver and cleans up the tcan glue driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-12can: raw: add support for SO_MARKMarc Kleine-Budde
Add support for SO_MARK to the CAN_RAW protocol. This makes it possible to add traffic control filters based on the fwmark. Link: https://lore.kernel.org/all/20221210113653.170346-1-mkl@pengutronix.de Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-12net: af_can: remove useless parameter 'err' in 'can_rx_register()'Ye Bin
Since commit bdfb5765e45b remove NULL-ptr checks from users of can_dev_rcv_lists_find(). 'err' parameter is useless, so remove it. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/all/20221208090940.3695670-1-yebin@huaweicloud.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-12net: move the nat function to nf_nat_ovs for ovs and tcXin Long
There are two nat functions are nearly the same in both OVS and TC code, (ovs_)ct_nat_execute() and ovs_ct_nat/tcf_ct_act_nat(). This patch creates nf_nat_ovs.c under netfilter and moves them there then exports nf_ct_nat() so that it can be shared by both OVS and TC, and keeps the nat (type) check and nat flag update in OVS and TC's own place, as these parts are different between OVS and TC. Note that in OVS nat function it was using skb->protocol to get the proto as it already skips vlans in key_extract(), while it doesn't in TC, and TC has to call skb_protocol() to get proto. So in nf_ct_nat_execute(), we keep using skb_protocol() which works for both OVS and TC contrack. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-12net: sched: update the nat flag for icmp error packets in ct_nat_executeXin Long
In ovs_ct_nat_execute(), the packet flow key nat flags are updated when it processes ICMP(v6) error packets translation successfully. In ct_nat_execute() when processing ICMP(v6) error packets translation successfully, it should have done the same in ct_nat_execute() to set post_ct_s/dnat flag, which will be used to update flow key nat flags in OVS module later. Reviewed-by: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-12openvswitch: return NF_DROP when fails to add nat ext in ovs_ct_natXin Long
When it fails to allocate nat ext, the packet should be dropped, like the memory allocation failures in other places in ovs_ct_nat(). This patch changes to return NF_DROP when fails to add nat ext before doing NAT in ovs_ct_nat(), also it would keep consistent with tc action ct' processing in tcf_ct_act_nat(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-12openvswitch: return NF_ACCEPT when OVS_CT_NAT is not set in info natXin Long
Either OVS_CT_SRC_NAT or OVS_CT_DST_NAT is set, OVS_CT_NAT must be set in info->nat. Thus, if OVS_CT_NAT is not set in info->nat, it will definitely not do NAT but returns NF_ACCEPT in ovs_ct_nat(). This patch changes nothing funcational but only makes this return earlier in ovs_ct_nat() to keep consistent with TC's processing in tcf_ct_act_nat(). Reviewed-by: Saeed Mahameed <saeed@kernel.org> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-12openvswitch: delete the unncessary skb_pull_rcsum call in ovs_ct_nat_executeXin Long
The calls to ovs_ct_nat_execute() are as below: ovs_ct_execute() ovs_ct_lookup() __ovs_ct_lookup() ovs_ct_nat() ovs_ct_nat_execute() ovs_ct_commit() __ovs_ct_lookup() ovs_ct_nat() ovs_ct_nat_execute() and since skb_pull_rcsum() and skb_push_rcsum() are already called in ovs_ct_execute(), there's no need to do it again in ovs_ct_nat_execute(). Reviewed-by: Saeed Mahameed <saeed@kernel.org> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-12udp: allow header check for dodgy GSO_UDP_L4 packets.Andrew Melnychenko
Allow UDP_L4 for robust packets. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Andrew Melnychenko <andrew@daynix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-10ipvs: run_estimation should control the kthread tasksJulian Anastasov
Change the run_estimation flag to start/stop the kthread tasks. Signed-off-by: Julian Anastasov <ja@ssi.bg> Cc: yunhong-cgl jiang <xintian1976@gmail.com> Cc: "dust.li" <dust.li@linux.alibaba.com> Reviewed-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-10ipvs: add est_cpulist and est_nice sysctl varsJulian Anastasov
Allow the kthreads for stats to be configured for specific cpulist (isolation) and niceness (scheduling priority). Signed-off-by: Julian Anastasov <ja@ssi.bg> Cc: yunhong-cgl jiang <xintian1976@gmail.com> Cc: "dust.li" <dust.li@linux.alibaba.com> Reviewed-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-10ipvs: use kthreads for stats estimationJulian Anastasov
Estimating all entries in single list in timer context by single CPU causes large latency with multiple IPVS rules as reported in [1], [2], [3]. Spread the estimator structures in multiple chains and use kthread(s) for the estimation. The chains are processed in multiple (50) timer ticks to ensure the 2-second interval between estimations with some accuracy. Every chain is processed under RCU lock. Every kthread works over its own data structure and all such contexts are attached to array. The contexts can be preserved while the kthread tasks are stopped or restarted. When estimators are removed, unused kthread contexts are released and the slots in array are left empty. First kthread determines parameters to use, eg. maximum number of estimators to process per kthread based on chain's length (chain_max), allowing sub-100us cond_resched rate and estimation taking up to 1/8 of the CPU capacity to avoid any problems if chain_max is not correctly calculated. chain_max is calculated taking into account factors such as CPU speed and memory/cache speed where the cache_factor (4) is selected from real tests with current generation of CPU/NUMA configurations to correct the difference in CPU usage between cached (during calc phase) and non-cached (working) state of the estimated per-cpu data. First kthread also plays the role of distributor of added estimators to all kthreads, keeping low the time to add estimators. The optimization is based on the fact that newly added estimator should be estimated after 2 seconds, so we have the time to offload the adding to chain from controlling process to kthread 0. The allocated kthread context may grow from 1 to 50 allocated structures for timer ticks which saves memory for setups with small number of estimators. We also add delayed work est_reload_work that will make sure the kthread tasks are properly started/stopped. ip_vs_start_estimator() is changed to report errors which allows to safely store the estimators in allocated structures. Many thanks to Jiri Wiesner for his valuable comments and for spending a lot of time reviewing and testing the changes on different platforms with 48-256 CPUs and 1-8 NUMA nodes under different cpufreq governors. [1] Report from Yunhong Jiang: https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/ [2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2 [3] Report from Dust: https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html Signed-off-by: Julian Anastasov <ja@ssi.bg> Cc: yunhong-cgl jiang <xintian1976@gmail.com> Cc: "dust.li" <dust.li@linux.alibaba.com> Reviewed-by: Jiri Wiesner <jwiesner@suse.de> Tested-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-10ipvs: use u64_stats_t for the per-cpu countersJulian Anastasov
Use the provided u64_stats_t type to avoid load/store tearing. Fixes: 316580b69d0a ("u64_stats: provide u64_stats_t type") Signed-off-by: Julian Anastasov <ja@ssi.bg> Cc: yunhong-cgl jiang <xintian1976@gmail.com> Cc: "dust.li" <dust.li@linux.alibaba.com> Reviewed-by: Jiri Wiesner <jwiesner@suse.de> Tested-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-10ipvs: use common functions for stats allocationJulian Anastasov
Move alloc_percpu/free_percpu logic in new functions Signed-off-by: Julian Anastasov <ja@ssi.bg> Cc: yunhong-cgl jiang <xintian1976@gmail.com> Cc: "dust.li" <dust.li@linux.alibaba.com> Reviewed-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-10ipvs: add rcu protection to statsJulian Anastasov
In preparation to using RCU locking for the list with estimators, make sure the struct ip_vs_stats are released after RCU grace period by using RCU callbacks. This affects ipvs->tot_stats where we can not use RCU callbacks for ipvs, so we use allocated struct ip_vs_stats_rcu. For services and dests we force RCU callbacks for all cases. Signed-off-by: Julian Anastasov <ja@ssi.bg> Cc: yunhong-cgl jiang <xintian1976@gmail.com> Cc: "dust.li" <dust.li@linux.alibaba.com> Reviewed-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-09Merge tag 'ipsec-next-2022-12-09' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next 2022-12-09 1) Add xfrm packet offload core API. From Leon Romanovsky. 2) Add xfrm packet offload support for mlx5. From Leon Romanovsky and Raed Salem. 3) Fix a typto in a error message. From Colin Ian King. * tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: (38 commits) xfrm: Fix spelling mistake "oflload" -> "offload" net/mlx5e: Open mlx5 driver to accept IPsec packet offload net/mlx5e: Handle ESN update events net/mlx5e: Handle hardware IPsec limits events net/mlx5e: Update IPsec soft and hard limits net/mlx5e: Store all XFRM SAs in Xarray net/mlx5e: Provide intermediate pointer to access IPsec struct net/mlx5e: Skip IPsec encryption for TX path without matching policy net/mlx5e: Add statistics for Rx/Tx IPsec offloaded flows net/mlx5e: Improve IPsec flow steering autogroup net/mlx5e: Configure IPsec packet offload flow steering net/mlx5e: Use same coding pattern for Rx and Tx flows net/mlx5e: Add XFRM policy offload logic net/mlx5e: Create IPsec policy offload tables net/mlx5e: Generalize creation of default IPsec miss group and rule net/mlx5e: Group IPsec miss handles into separate struct net/mlx5e: Make clear what IPsec rx_err does net/mlx5e: Flatten the IPsec RX add rule path net/mlx5e: Refactor FTE setup code to be more clear net/mlx5e: Move IPsec flow table creation to separate function ... ==================== Link: https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09net: devlink: Add missing error check to devlink_resource_put()Gavrilov Ilia
When the resource size changes, the return value of the 'nla_put_u64_64bit' function is not checked. That has been fixed to avoid rechecking at the next step. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Note that this is harmless, we'd error out at the next put(). Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru> Link: https://lore.kernel.org/r/20221208082821.3927937-1-Ilia.Gavrilov@infotecs.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09skbuff: Introduce slab_build_skb()Kees Cook
syzkaller reported: BUG: KASAN: slab-out-of-bounds in __build_skb_around+0x235/0x340 net/core/skbuff.c:294 Write of size 32 at addr ffff88802aa172c0 by task syz-executor413/5295 For bpf_prog_test_run_skb(), which uses a kmalloc()ed buffer passed to build_skb(). When build_skb() is passed a frag_size of 0, it means the buffer came from kmalloc. In these cases, ksize() is used to find its actual size, but since the allocation may not have been made to that size, actually perform the krealloc() call so that all the associated buffer size checking will be correctly notified (and use the "new" pointer so that compiler hinting works correctly). Split this logic out into a new interface, slab_build_skb(), but leave the original 0 checking for now to catch any stragglers. Reported-by: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com Link: https://groups.google.com/g/syzkaller-bugs/c/UnIKxTtU5-0/m/-wbXinkgAQAJ Fixes: 38931d8989b5 ("mm: Make ksize() a reporting-only function") Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: pepsipu <soopthegoop@gmail.com> Cc: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com Cc: Vlastimil Babka <vbabka@suse.cz> Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: ast@kernel.org Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Hao Luo <haoluo@google.com> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: jolsa@kernel.org Cc: KP Singh <kpsingh@kernel.org> Cc: martin.lau@linux.dev Cc: Stanislav Fomichev <sdf@google.com> Cc: song@kernel.org Cc: Yonghong Song <yhs@fb.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221208060256.give.994-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09mptcp: return 0 instead of 'err' varMatthieu Baerts
When 'err' is 0, it looks clearer to return '0' instead of the variable called 'err'. The behaviour is then not modified, just a clearer code. By doing this, we can also avoid false positive smatch warnings like this one: net/mptcp/pm_netlink.c:1169 mptcp_pm_parse_pm_addr_attr() warn: missing error code? 'err' Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09mptcp: use nlmsg_free instead of kfree_skbGeliang Tang
Use nlmsg_free() instead of kfree_skb() in pm_netlink.c. The SKB's have been created by nlmsg_new(). The proper cleaning way should then be done with nlmsg_free(). For the moment, nlmsg_free() is simply calling kfree_skb() so we don't change the behaviour here. Suggested-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09net: openvswitch: Add support to count upcall packetswangchuanlei
Add support to count upall packets, when kmod of openvswitch upcall to count the number of packets for upcall succeed and failed, which is a better way to see how many packets upcalled on every interfaces. Signed-off-by: wangchuanlei <wangchuanlei@inspur.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09net/sched: avoid indirect classify functions on retpoline kernelsPedro Tammela
Expose the necessary tc classifier functions and wire up cls_api to use direct calls in retpoline kernels. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09net/sched: avoid indirect act functions on retpoline kernelsPedro Tammela
Expose the necessary tc act functions and wire up act_api to use direct calls in retpoline kernels. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09net/sched: add retpoline wrapper for tcPedro Tammela
On kernels using retpoline as a spectrev2 mitigation, optimize actions and filters that are compiled as built-ins into a direct call. On subsequent patches we expose the classifiers and actions functions and wire up the wrapper into tc. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09xfrm: Fix spelling mistake "oflload" -> "offload"Colin Ian King
There is a spelling mistake in a NL_SET_ERR_MSG message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-08net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCPWillem de Bruijn
Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from write_seq sockets instead of snd_una. This should have been the behavior from the start. Because processes may now exist that rely on the established behavior, do not change behavior of the existing option, but add the right behavior with a new flag. It is encouraged to always set SOF_TIMESTAMPING_OPT_ID_TCP on stream sockets along with the existing SOF_TIMESTAMPING_OPT_ID. Intuitively the contract is that the counter is zero after the setsockopt, so that the next write N results in a notification for the last byte N - 1. On idle sockets snd_una == write_seq and this holds for both. But on sockets with data in transmission, snd_una records the unacked offset in the stream. This depends on the ACK response from the peer. A process cannot learn this in a race free manner (ioctl SIOCOUTQ is one racy approach). write_seq records the offset at the last byte written by the process. This is a better starting point. It matches the intuitive contract in all circumstances, unaffected by external behavior. The new timestamp flag necessitates increasing sk_tsflags to 32 bits. Move the field in struct sock to avoid growing the socket (for some common CONFIG variants). The UAPI interface so_timestamping.flags is already int, so 32 bits wide. Reported-by: Sotirios Delimanolis <sotodel@meta.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20221207143701.29861-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>