summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-01-14x86/kexec: Fix stack and handling of re-entry point for ::preserve_contextDavid Woodhouse
A ::preserve_context kimage can be invoked more than once, and the entry point can be different every time. When the callee returns to the kernel, it leaves the address of its entry point for next time on the stack. That being the case, one might reasonably assume that the caller would allocate space for it on the stack frame before actually performing the 'call' into the callee. Apparently not, though. Ever since the kjump code was first added in 2009, it has set up a *new* stack at the top of the swap_page scratch page, then just performed the 'call' without allocating any space for the re-entry address to be returned. It then reads the re-entry point for next time from 0(%rsp) which is actually the first qword of the page *after* the swap page, which might not exist at all! And if the callee has written to that, then it will have corrupted memory it doesn't own. Correct this by pushing the entry point of the callee onto the stack before calling it. The callee may then adjust it, or not, as it sees fit, and subsequent invocations should work correctly either way. Remove a stray push of zero to the *relocate_kernel* stack, which may have been intended for this purpose, but which was actually just noise. Also, loading the stack for the callee relied on the address of the swap page being in %r10 without ever documenting that fact. Recent code changes made that no longer true, so load it directly from the local kexec_pa_swap_page variable instead. Fixes: b3adabae8a96 ("x86/kexec: Drop page_list argument from relocate_kernel()") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250109140757.2841269-5-dwmw2@infradead.org
2025-01-14x86/kexec: Use correct swap page in swap_pages functionDavid Woodhouse
The swap_pages function expects the swap page to be in %r10, but there was no documentation to that effect. Once upon a time the setup code used to load its value from a kernel virtual address and save it to an address which is accessible in the identity-mapped page tables, and *happened* to use %r10 to do so, with no comment that it was left there on *purpose* instead of just being a scratch register. Once that was no longer necessary, %r10 just holds whatever the kernel happened to leave in it. Now that the original value passed by the kernel is accessible via %rip-relative addressing, load directly from there instead of using %r10 for it. But document the other parameters that the swap_pages function *does* expect in registers. Fixes: b3adabae8a96 ("x86/kexec: Drop page_list argument from relocate_kernel()") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250109140757.2841269-4-dwmw2@infradead.org
2025-01-14x86/kexec: Ensure preserve_context flag is set on return to kernelDavid Woodhouse
The swap_pages() function will only actually *swap*, as its name implies, if the preserve_context flag in the %r11 register is non-zero. On the way back from a ::preserve_context kexec, ensure that the %r11 register is non-zero so that the pages get swapped back. Fixes: 9e5683e2d0b5 ("x86/kexec: Only swap pages for ::preserve_context mode") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250109140757.2841269-3-dwmw2@infradead.org
2025-01-14x86/kexec: Disable global pages before writing to control pageDavid Woodhouse
The kernel switches to a new set of page tables during kexec. The global mappings (_PAGE_GLOBAL==1) can remain in the TLB after this switch. This is generally not a problem because the new page tables use a different portion of the virtual address space than the normal kernel mappings. The critical exception to that generalisation (and the only mapping which isn't an identity mapping) is the kexec control page itself — which was ROX in the original kernel mapping, but should be RWX in the new page tables. If there is a global TLB entry for that in its prior read-only state, it definitely needs to be flushed before attempting to write through that virtual mapping. It would be possible to just avoid writing to the virtual address of the page and defer all writes until they can be done through the identity mapping. But there's no good reason to keep the old TLB entries around, as they can cause nothing but trouble. Clear the PGE bit in %cr4 early, before storing data in the control page. Fixes: 5a82223e0743 ("x86/kexec: Mark relocate_kernel page as ROX instead of RWX") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219592 Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com> Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com> Link: https://lore.kernel.org/r/20250109140757.2841269-2-dwmw2@infradead.org
2025-01-14RDMA/rxe: Fix the warning "__rxe_cleanup+0x12c/0x170 [rdma_rxe]"Zhu Yanjun
The Call Trace is as below: " <TASK> ? show_regs.cold+0x1a/0x1f ? __rxe_cleanup+0x12c/0x170 [rdma_rxe] ? __warn+0x84/0xd0 ? __rxe_cleanup+0x12c/0x170 [rdma_rxe] ? report_bug+0x105/0x180 ? handle_bug+0x46/0x80 ? exc_invalid_op+0x19/0x70 ? asm_exc_invalid_op+0x1b/0x20 ? __rxe_cleanup+0x12c/0x170 [rdma_rxe] ? __rxe_cleanup+0x124/0x170 [rdma_rxe] rxe_destroy_qp.cold+0x24/0x29 [rdma_rxe] ib_destroy_qp_user+0x118/0x190 [ib_core] rdma_destroy_qp.cold+0x43/0x5e [rdma_cm] rtrs_cq_qp_destroy.cold+0x1d/0x2b [rtrs_core] rtrs_srv_close_work.cold+0x1b/0x31 [rtrs_server] process_one_work+0x21d/0x3f0 worker_thread+0x4a/0x3c0 ? process_one_work+0x3f0/0x3f0 kthread+0xf0/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> " When too many rdma resources are allocated, rxe needs more time to handle these rdma resources. Sometimes with the current timeout, rxe can not release the rdma resources correctly. Compared with other rdma drivers, a bigger timeout is used. Fixes: 215d0a755e1b ("RDMA/rxe: Stop lookup of partially built objects") Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://patch.msgid.link/20250110160927.55014-1-yanjun.zhu@linux.dev Tested-by: Joe Klein <joe.klein812@gmail.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-01-14RDMA/cxgb4: Notify rdma stack for IB_EVENT_QP_LAST_WQE_REACHED eventAnumula Murali Mohan Reddy
This patch sends IB_EVENT_QP_LAST_WQE_REACHED event on a QP that is in error state and associated with an SRQ. This behaviour is incorporated in flush_qp() which is called when QP transitions to error state. Supports SRQ drain functionality added by commit 844bc12e6da3 ("IB/core: add support for draining Shared receive queues") Fixes: 844bc12e6da3 ("IB/core: add support for draining Shared receive queues") Signed-off-by: Anumula Murali Mohan Reddy <anumula@chelsio.com> Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Link: https://patch.msgid.link/20250107095053.81007-1-anumula@chelsio.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-01-14Merge branch 'vsock-some-fixes-due-to-transport-de-assignment'Paolo Abeni
Stefano Garzarella says: ==================== vsock: some fixes due to transport de-assignment v1: https://lore.kernel.org/netdev/20250108180617.154053-1-sgarzare@redhat.com/ v2: - Added patch 3 to cancel the virtio close delayed work when de-assigning the transport - Added patch 4 to clean the socket state after de-assigning the transport - Added patch 5 as suggested by Michael and Hyunwoo Kim. It's based on Hyunwoo Kim and Wongi Lee patch [1] but using WARN_ON and covering more functions - Added R-b/T-b tags This series includes two patches discussed in the thread started by Hyunwoo Kim a few weeks ago [1], plus 3 more patches added after some discussions on v1 (see changelog). All related to the case where a vsock socket is de-assigned from a transport (e.g., because the connect fails or is interrupted by a signal) and then assigned to another transport or to no-one (NULL). I tested with usual vsock test suite, plus Michal repro [2]. (Note: the repo works only if a G2H transport is not loaded, e.g. virtio-vsock driver). The first patch is a fix more appropriate to the problem reported in that thread, the second patch on the other hand is a related fix but of a different problem highlighted by Michal Luczaj. It's present only in vsock_bpf and already handled in af_vsock.c The third patch is to cancel the virtio close delayed work when de-assigning the transport, the fourth patch is to clean the socket state after de-assigning the transport, the last patch adds warnings and prevents null-ptr-deref in vsock_*[has_data|has_space]. Hyunwoo Kim, Michal, if you can test and report your Tested-by that would be great! [1] https://lore.kernel.org/netdev/Z2K%2FI4nlHdfMRTZC@v4bel-B760M-AORUS-ELITE-AX/ [2] https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/ ==================== Link: https://patch.msgid.link/20250110083511.30419-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14vsock: prevent null-ptr-deref in vsock_*[has_data|has_space]Stefano Garzarella
Recent reports have shown how we sometimes call vsock_*_has_data() when a vsock socket has been de-assigned from a transport (see attached links), but we shouldn't. Previous commits should have solved the real problems, but we may have more in the future, so to avoid null-ptr-deref, we can return 0 (no space, no data available) but with a warning. This way the code should continue to run in a nearly consistent state and have a warning that allows us to debug future problems. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/netdev/Z2K%2FI4nlHdfMRTZC@v4bel-B760M-AORUS-ELITE-AX/ Link: https://lore.kernel.org/netdev/5ca20d4c-1017-49c2-9516-f6f75fd331e9@rbox.co/ Link: https://lore.kernel.org/netdev/677f84a8.050a0220.25a300.01b3.GAE@google.com/ Co-developed-by: Hyunwoo Kim <v4bel@theori.io> Signed-off-by: Hyunwoo Kim <v4bel@theori.io> Co-developed-by: Wongi Lee <qwerty@theori.io> Signed-off-by: Wongi Lee <qwerty@theori.io> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Hyunwoo Kim <v4bel@theori.io> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14vsock: reset socket state when de-assigning the transportStefano Garzarella
Transport's release() and destruct() are called when de-assigning the vsock transport. These callbacks can touch some socket state like sock flags, sk_state, and peer_shutdown. Since we are reassigning the socket to a new transport during vsock_connect(), let's reset these fields to have a clean state with the new transport. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14vsock/virtio: cancel close work in the destructorStefano Garzarella
During virtio_transport_release() we can schedule a delayed work to perform the closing of the socket before destruction. The destructor is called either when the socket is really destroyed (reference counter to zero), or it can also be called when we are de-assigning the transport. In the former case, we are sure the delayed work has completed, because it holds a reference until it completes, so the destructor will definitely be called after the delayed work is finished. But in the latter case, the destructor is called by AF_VSOCK core, just after the release(), so there may still be delayed work scheduled. Refactor the code, moving the code to delete the close work already in the do_close() to a new function. Invoke it during destruction to make sure we don't leave any pending work. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Cc: stable@vger.kernel.org Reported-by: Hyunwoo Kim <v4bel@theori.io> Closes: https://lore.kernel.org/netdev/Z37Sh+utS+iV3+eb@v4bel-B760M-AORUS-ELITE-AX/ Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Tested-by: Hyunwoo Kim <v4bel@theori.io> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14vsock/bpf: return early if transport is not assignedStefano Garzarella
Some of the core functions can only be called if the transport has been assigned. As Michal reported, a socket might have the transport at NULL, for example after a failed connect(), causing the following trace: BUG: kernel NULL pointer dereference, address: 00000000000000a0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 12faf8067 P4D 12faf8067 PUD 113670067 PMD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 15 UID: 0 PID: 1198 Comm: a.out Not tainted 6.13.0-rc2+ RIP: 0010:vsock_connectible_has_data+0x1f/0x40 Call Trace: vsock_bpf_recvmsg+0xca/0x5e0 sock_recvmsg+0xb9/0xc0 __sys_recvfrom+0xb3/0x130 __x64_sys_recvfrom+0x20/0x30 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e So we need to check the `vsk->transport` in vsock_bpf_recvmsg(), especially for connected sockets (stream/seqpacket) as we already do in __vsock_connectible_recvmsg(). Fixes: 634f1a7110b4 ("vsock: support sockmap") Cc: stable@vger.kernel.org Reported-by: Michal Luczaj <mhal@rbox.co> Closes: https://lore.kernel.org/netdev/5ca20d4c-1017-49c2-9516-f6f75fd331e9@rbox.co/ Tested-by: Michal Luczaj <mhal@rbox.co> Reported-by: syzbot+3affdbfc986ecd9200fd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/677f84a8.050a0220.25a300.01b3.GAE@google.com/ Tested-by: syzbot+3affdbfc986ecd9200fd@syzkaller.appspotmail.com Reviewed-by: Hyunwoo Kim <v4bel@theori.io> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14vsock/virtio: discard packets if the transport changesStefano Garzarella
If the socket has been de-assigned or assigned to another transport, we must discard any packets received because they are not expected and would cause issues when we access vsk->transport. A possible scenario is described by Hyunwoo Kim in the attached link, where after a first connect() interrupted by a signal, and a second connect() failed, we can find `vsk->transport` at NULL, leading to a NULL pointer dereference. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Cc: stable@vger.kernel.org Reported-by: Hyunwoo Kim <v4bel@theori.io> Reported-by: Wongi Lee <qwerty@theori.io> Closes: https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/ Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Hyunwoo Kim <v4bel@theori.io> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14RDMA/bnxt_re: Allocate dev_attr information dynamicallyKalesh AP
In order to optimize the size of driver private structure, the memory for dev_attr is allocated dynamically during the chip context initialization. In order to make certain runtime decisions, store dev_attr in the qplib_res structure. Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1736446693-6692-3-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-01-14RDMA/bnxt_re: Pass the context for ulp_irq_stopKalesh AP
ulp_irq_stop() can be invoked from a context where FW is healthy or when FW is in a reset state. In the latter case, ULP must stop all interactions with HW/FW and also with application and stack. Added a new parameter to the ulp_irq_stop() function to achieve that. Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Chandramohan Akula <chandramohan.akula@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1736446693-6692-2-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-01-14Merge branch 'add-multicast-filtering-support-for-vlan-interface'Paolo Abeni
MD Danish Anwar says: ==================== Add Multicast Filtering support for VLAN interface This series adds Multicast filtering support for VLAN interfaces in dual EMAC and HSR offload mode for ICSSG driver. Patch 1/4 - Adds support for VLAN in dual EMAC mode Patch 2/4 - Adds MC filtering support for VLAN in dual EMAC mode Patch 3/4 - Create and export hsr_get_port_ndev() in hsr_device.c Patch 4/4 - Adds MC filtering support for VLAN in HSR mode [1] https://lore.kernel.org/all/20241216100044.577489-2-danishanwar@ti.com/ [2] https://lore.kernel.org/all/202412210336.BmgcX3Td-lkp@intel.com/#t [3] https://lore.kernel.org/all/31bb8a3e-5a1c-4c94-8c33-c0dfd6d643fb@kernel.org/ v1 https://lore.kernel.org/all/20241216100044.577489-1-danishanwar@ti.com/ v2 https://lore.kernel.org/all/20241223092557.2077526-1-danishanwar@ti.com/ v3 https://lore.kernel.org/all/20250103092033.1533374-1-danishanwar@ti.com/ ==================== Link: https://patch.msgid.link/20250110082852.3899027-1-danishanwar@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: ti: icssg-prueth: Add Support for Multicast filtering with VLAN in HSR modeMD Danish Anwar
Add multicast filtering support for VLAN interfaces in HSR offload mode for ICSSG driver. The driver calls vlan_for_each() API on the hsr device's ndev to get the list of available vlans for the hsr device. The driver then sync mc addr of vlan interface with a locally mainatined list emac->vlan_mcast_list[vid] using __hw_addr_sync_multiple() API. The driver then calls the sync / unsync callbacks. In the sync / unsync call back, driver checks if the vdev's real dev is hsr device or not. If the real dev is hsr device, driver gets the per port device using hsr_get_port_ndev() and then driver passes appropriate vid to FDB helper functions. Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: hsr: Create and export hsr_get_port_ndev()MD Danish Anwar
Create an API to get the net_device to the slave port of HSR device. The API will take hsr net_device and enum hsr_port_type for which we want the net_device as arguments. This API can be used by client drivers who support HSR and want to get the net_devcie of slave ports from the hsr device. Export this API for the same. This API needs the enum hsr_port_type to be accessible by the drivers using hsr. Move the enum hsr_port_type from net/hsr/hsr_main.h to include/linux/if_hsr.h for the same. Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: ti: icssg-prueth: Add Multicast Filtering support for VLAN in MAC modeMD Danish Anwar
Add multicast filtering support for VLAN interfaces in dual EMAC mode for ICSSG driver. The driver uses vlan_for_each() API to get the list of available vlans. The driver then sync mc addr of vlan interface with a locally mainatined list emac->vlan_mcast_list[vid] using __hw_addr_sync_multiple() API. __hw_addr_sync_multiple() is used instead of __hw_addr_sync() to sync vdev->mc with local list because the sync_cnt for addresses in vdev->mc will already be set by the vlan_dev_set_rx_mode() [net/8021q/vlan_dev.c] and __hw_addr_sync() only syncs when the sync_cnt == 0. Whereas __hw_addr_sync_multiple() can sync addresses even if sync_cnt is not 0. Export __hw_addr_sync_multiple() so that driver can use it. Once the local list is synced, driver calls __hw_addr_sync_dev() with the local list, vdev, sync and unsync callbacks. __hw_addr_sync_dev() is used with the local maintained list as the list to synchronize instead of using __dev_mc_sync() on vdev because __dev_mc_sync() on vdev will call __hw_addr_sync_dev() on vdev->mc and sync_cnt for addresses in vdev->mc will already be set by the vlan_dev_set_rx_mode() [net/8021q/vlan_dev.c] and __hw_addr_sync_dev() only syncs if the sync_cnt of addresses in the list (vdev->mc in this case) is 0. Whereas __hw_addr_sync_dev() on local list will work fine as the sync_cnt for addresses in the local list will still be 0. Based on change in addresses in the local list, sync / unsync callbacks are invoked. In the sync / unsync API in driver, based on whether the ndev is vlan or not, driver passes appropriate vid to FDB helper functions. Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: ti: icssg-prueth: Add VLAN support in EMAC modeMD Danish Anwar
Add support for VLAN filtering in dual EMAC mode. Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14Merge tag 'nf-next-25-01-11' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains a small batch of Netfilter/IPVS updates for net-next: 1) Remove unused genmask parameter in nf_tables_addchain() 2) Speed up reads from /proc/net/ip_vs_conn, from Florian Westphal. 3) Skip empty buckets in hashlimit to avoid atomic operations that results in false positive reports by syzbot with lockdep enabled, patch from Eric Dumazet. 4) Add conntrack event timestamps available via ctnetlink, from Florian Westphal. netfilter pull request 25-01-11 * tag 'nf-next-25-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: conntrack: add conntrack event timestamp netfilter: xt_hashlimit: htable_selective_cleanup() optimization ipvs: speed up reads from ip_vs_conn proc file netfilter: nf_tables: remove the genmask parameter ==================== Link: https://patch.msgid.link/20250111230800.67349-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14landlock: Use scoped guards for ruleset in landlock_add_rule()Mickaël Salaün
Simplify error handling by replacing goto statements with automatic calls to landlock_put_ruleset() when going out of scope. This change depends on the TCP support. Cc: Konstantin Meskhidze <konstantin.meskhidze@huawei.com> Cc: Mikhail Ivanov <ivanov.mikhail1@huawei-partners.com> Reviewed-by: Günther Noack <gnoack@google.com> Link: https://lore.kernel.org/r/20250113161112.452505-3-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net>
2025-01-14landlock: Use scoped guards for rulesetMickaël Salaün
Simplify error handling by replacing goto statements with automatic calls to landlock_put_ruleset() when going out of scope. This change will be easy to backport to v6.6 if needed, only the kernel.h include line conflicts. As for any other similar changes, we should be careful when backporting without goto statements. Add missing include file. Reviewed-by: Günther Noack <gnoack@google.com> Link: https://lore.kernel.org/r/20250113161112.452505-2-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net>
2025-01-14landlock: Constify get_mode_access()Mickaël Salaün
Use __attribute_const__ for get_mode_access(). Reviewed-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20250110153918.241810-2-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net>
2025-01-14landlock: Handle weird filesMickaël Salaün
A corrupted filesystem (e.g. bcachefs) might return weird files. Instead of throwing a warning and allowing access to such file, treat them as regular files. Cc: Dave Chinner <david@fromorbit.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Paul Moore <paul@paul-moore.com> Reported-by: syzbot+34b68f850391452207df@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/000000000000a65b35061cffca61@google.com Reported-by: syzbot+360866a59e3c80510a62@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/67379b3f.050a0220.85a0.0001.GAE@google.com Reported-by: Ubisectech Sirius <bugreport@ubisectech.com> Closes: https://lore.kernel.org/r/c426821d-8380-46c4-a494-7008bbd7dd13.bugreport@ubisectech.com Fixes: cb2c7d1a1776 ("landlock: Support filesystem access-control") Reviewed-by: Günther Noack <gnoack3000@gmail.com> Link: https://lore.kernel.org/r/20250110153918.241810-1-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net>
2025-01-14Merge branch 'introduce-unified-and-structured-phy'Paolo Abeni
Oleksij Rempel says: ==================== Introduce unified and structured PHY This patch set introduces a unified and well-structured interface for reporting PHY statistics. Instead of relying on arbitrary strings in PHY drivers, this interface provides a consistent and structured way to expose PHY statistics to userspace via ethtool. The initial groundwork for this effort was laid by Jakub Kicinski, who contributed patches to plumb PHY statistics to drivers and added support for structured statistics in ethtool. Building on Jakub's work, I tested the implementation with several PHYs, addressed a few issues, and added support for statistics in two specific PHY drivers. Most of changes are tracked in separate patches. changes v6: - drop ethtool_stat_add patch changes v5: - rebase against latest net-next ==================== Link: https://patch.msgid.link/20250110060517.711683-1-o.rempel@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: phy: dp83tg720: add statistics supportOleksij Rempel
Add support for reporting PHY statistics in the DP83TG720 driver. This includes cumulative tracking of link loss events, transmit/receive packet counts, and error counts. Implemented functions to update and provide statistics via ethtool, with optional polling support enabled through `PHY_POLL_STATS`. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: phy: dp83td510: add statistics supportOleksij Rempel
Add support for reporting PHY statistics in the DP83TD510 driver. This includes cumulative tracking of transmit/receive packet counts, and error counts. Implemented functions to update and provide statistics via ethtool, with optional polling support enabled through `PHY_POLL_STATS`. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: phy: introduce optional polling interface for PHY statisticsOleksij Rempel
Add an optional polling interface for PHY statistics to simplify driver implementation. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14Documentation: networking: update PHY error counter diagnostics in twisted ↵Oleksij Rempel
pair guide Replace generic instructions for monitoring error counters with a procedure using the unified PHY statistics interface (`--all-groups`). Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: ethtool: add support for structured PHY statisticsJakub Kicinski
Introduce a new way to report PHY statistics in a structured and standardized format using the netlink API. This new method does not replace the old driver-specific stats, which can still be accessed with `ethtool -S <eth name>`. The structured stats are available with `ethtool -S <eth name> --all-groups`. This new method makes it easier to diagnose problems by organizing stats in a consistent and documented way. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14net: ethtool: plumb PHY stats to PHY driversJakub Kicinski
Introduce support for standardized PHY statistics reporting in ethtool by extending the PHYLIB framework. Add the functions phy_ethtool_get_phy_stats() and phy_ethtool_get_link_ext_stats() to provide a consistent interface for retrieving PHY-level and link-specific statistics. These functions are used within the ethtool implementation to avoid direct access to the phy_device structure outside of the PHYLIB framework. A new structure, ethtool_phy_stats, is introduced to standardize PHY statistics such as packet counts, byte counts, and error counters. Drivers are updated to include callbacks for retrieving PHY and link-specific statistics, ensuring values are explicitly set only for supported fields, initialized with ETHTOOL_STAT_NOT_SET to avoid ambiguity. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14ethtool: linkstate: migrate linkstate functions to support multi-PHY setupsOleksij Rempel
Adapt linkstate_get_sqi() and linkstate_get_sqi_max() to take a phy_device argument directly, enabling support for setups with multiple PHYs. The previous assumption of a single PHY attached to a net_device no longer holds. Use ethnl_req_get_phydev() to identify the appropriate PHY device for the operation. Update linkstate_prepare_data() and related logic to accommodate this change, ensuring compatibility with multi-PHY configurations. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14xfs: add a b_iodone callback to struct xfs_bufChristoph Hellwig
Stop open coding the log item completions and instead add a callback into back into the submitter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move b_li_list based retry handling to common codeChristoph Hellwig
The dquot and inode version are very similar, which is expected given the overall b_li_list logic. The differences are that the inode version also clears the XFS_LI_FLUSHING which is defined in common but only ever set by the inode item, and that the dquot version takes the ail_lock over the list iteration. While this seems sensible given that additions and removals from b_li_list are protected by the ail_lock, log items are only added before buffer submission, and are only removed when completing the buffer, so nothing can change the list when retrying a buffer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: simplify xfsaild_resubmit_itemChristoph Hellwig
Since commit acc8f8628c37 ("xfs: attach dquot buffer to dquot log item buffer") all buf items that use bp->b_li_list are explicitly checked for in the branch to just clears XFS_LI_FAILED. Remove the dead arm that calls xfs_clear_li_failed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: always complete the buffer inline in xfs_buf_submitChristoph Hellwig
xfs_buf_submit now only completes a buffer on error, or for in-memory buftargs. There is no point in using a workqueue for the latter as the completion will just wake up the caller. Optimize this case by avoiding the workqueue roundtrip. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove the extra buffer reference in xfs_buf_submitChristoph Hellwig
Nothing touches the buffer after it has been submitted now, so the need for the extra transient reference went away as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move invalidate_kernel_vmap_range to xfs_buf_ioendChristoph Hellwig
Invalidating cache lines can be fairly expensive, so don't do it in interrupt context. Note that in practice very few setup will actually do anything here as virtually indexed caches are rather uncommon, but we might as well move the call to the proper place while touching this area. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: simplify buffer I/O submissionChristoph Hellwig
The code in _xfs_buf_ioapply is unnecessarily complicated because it doesn't take advantage of modern bio features. Simplify it by making use of bio splitting and chaining, that is build a single bio for the pages in the buffer using a simple loop, and then split that bio on the map boundaries for discontiguous multi-FSB buffers and chain the split bios to the main one so that there is only a single I/O completion. This not only simplifies the code to build the buffer, but also removes the need for the b_io_remaining field as buffer ownership is granted to the bio on submit of the final bio with no chance for a completion before that as well as the b_io_error field that is now superfluous because there always is exactly one completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move in-memory buftarg handling out of _xfs_buf_ioapplyChristoph Hellwig
No I/O to apply for in-memory buffers, so skip the function call entirely. Clean up the b_io_error initialization logic to allow for this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move write verification out of _xfs_buf_ioapplyChristoph Hellwig
Split the write verification logic out of _xfs_buf_ioapply into a new xfs_buf_verify_write helper called by xfs_buf_submit given that it isn't about applying the I/O and doesn't really fit in with the rest of _xfs_buf_ioapply. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove xfs_buf_delwri_submit_buffersChristoph Hellwig
xfs_buf_delwri_submit_buffers has two callers for synchronous and asynchronous writes that share very little logic. Split out a helper for the shared per-buffer loop and otherwise open code the submission in the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: simplify xfs_buf_delwri_pushbufChristoph Hellwig
xfs_buf_delwri_pushbuf synchronously writes a buffer that is on a delwri list already. Instead of doing a complicated dance with the delwri and wait list, just leave them alone and open code the actual buffer write. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move xfs_buf_iowait out of (__)xfs_buf_submitChristoph Hellwig
There is no good reason to pass a bool argument to wait for a buffer when the callers that want that can easily just wait themselves. This means the wait moves out of the extra hold of the buffer, but as the callers of synchronous buffer I/O need to hold a reference anyway that is perfectly fine. Because all async buffer submitters ignore the error return value, and the synchronous ones catch the error condition through b_error and xfs_buf_iowait this also means the new xfs_buf_submit doesn't have to return an error code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove the incorrect comment about the b_pag fieldChristoph Hellwig
The rbtree root is long gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove the incorrect comment above xfs_buf_free_mapsChristoph Hellwig
The comment above xfs_buf_free_maps talks about fields not even used in the function and also doesn't add any other value. Remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: fix a double completion for buffers on in-memory targetsChristoph Hellwig
__xfs_buf_submit calls xfs_buf_ioend when b_io_remaining hits zero. For in-memory buftargs b_io_remaining is never incremented from it's initial value of 1, so this always happens. Thus the extra call to xfs_buf_ioend in _xfs_buf_ioapply causes a double completion. Fortunately __xfs_buf_submit is only used for synchronous reads on in-memory buftargs due to the peculiarities of how they work, so this is mostly harmless and just causes a little extra work to be done. Fixes: 5076a6040ca1 ("xfs: support in-memory buffer cache targets") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14net: phy: microchip_t1: depend on PTP_1588_CLOCK_OPTIONALDivya Koppera
When microchip_t1_phy is built in and phyptp is module facing undefined reference issue. This get fixed when microchip_t1_phy made dependent on PTP_1588_CLOCK_OPTIONAL. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501090604.YEoJXCXi-lkp@intel.com Fixes: fa51199c5f34 ("net: phy: microchip_rds_ptp : Add rds ptp library for Microchip phys") Signed-off-by: Divya Koppera <divya.koppera@microchip.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Simon Horman <horms@kernel.org> # build-tested Link: https://patch.msgid.link/20250110054424.16807-1-divya.koppera@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14Merge branch 'gtp-pfcp-fix-use-after-free-of-udp-tunnel-socket'Paolo Abeni
Kuniyuki Iwashima says: ==================== gtp/pfcp: Fix use-after-free of UDP tunnel socket. Xiao Liang pointed out weird netns usages in ->newlink() of gtp and pfcp. This series fixes the issues. Link: https://lore.kernel.org/netdev/20250104125732.17335-1-shaw.leon@gmail.com/ Changes: v2: * Patch 1 * Fix uninit/unused local var v1: https://lore.kernel.org/netdev/20250108062834.11117-1-kuniyu@amazon.com/ ==================== Link: https://patch.msgid.link/20250110014754.33847-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14pfcp: Destroy device along with udp socket's netns dismantle.Kuniyuki Iwashima
pfcp_newlink() links the device to a list in dev_net(dev) instead of net, where a udp tunnel socket is created. Even when net is removed, the device stays alive on dev_net(dev). Then, removing net triggers the splat below. [0] In this example, pfcp0 is created in ns2, but the udp socket is created in ns1. ip netns add ns1 ip netns add ns2 ip -n ns1 link add netns ns2 name pfcp0 type pfcp ip netns del ns1 Let's link the device to the socket's netns instead. Now, pfcp_net_exit() needs another netdev iteration to remove all pfcp devices in the netns. pfcp_dev_list is not used under RCU, so the list API is converted to the non-RCU variant. pfcp_net_exit() can be converted to .exit_batch_rtnl() in net-next. [0]: ref_tracker: net notrefcnt@00000000128b34dc has 1/1 users at sk_alloc (./include/net/net_namespace.h:345 net/core/sock.c:2236) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1558) udp_sock_create4 (net/ipv4/udp_tunnel_core.c:18) pfcp_create_sock (drivers/net/pfcp.c:168) pfcp_newlink (drivers/net/pfcp.c:182 drivers/net/pfcp.c:197) rtnl_newlink (net/core/rtnetlink.c:3786 net/core/rtnetlink.c:3897 net/core/rtnetlink.c:4012) rtnetlink_rcv_msg (net/core/rtnetlink.c:6922) netlink_rcv_skb (net/netlink/af_netlink.c:2542) netlink_unicast (net/netlink/af_netlink.c:1321 net/netlink/af_netlink.c:1347) netlink_sendmsg (net/netlink/af_netlink.c:1891) ____sys_sendmsg (net/socket.c:711 net/socket.c:726 net/socket.c:2583) ___sys_sendmsg (net/socket.c:2639) __sys_sendmsg (net/socket.c:2669) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) WARNING: CPU: 1 PID: 11 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179) Modules linked in: CPU: 1 UID: 0 PID: 11 Comm: kworker/u16:0 Not tainted 6.13.0-rc5-00147-g4c1224501e9d #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:ref_tracker_dir_exit (lib/ref_tracker.c:179) Code: 00 00 00 fc ff df 4d 8b 26 49 bd 00 01 00 00 00 00 ad de 4c 39 f5 0f 85 df 00 00 00 48 8b 74 24 08 48 89 df e8 a5 cc 12 02 90 <0f> 0b 90 48 8d 6b 44 be 04 00 00 00 48 89 ef e8 80 de 67 ff 48 89 RSP: 0018:ff11000007f3fb60 EFLAGS: 00010286 RAX: 00000000000020ef RBX: ff1100000d6481e0 RCX: 1ffffffff0e40d82 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8423ee3c RBP: ff1100000d648230 R08: 0000000000000001 R09: fffffbfff0e395af R10: 0000000000000001 R11: 0000000000000000 R12: ff1100000d648230 R13: dead000000000100 R14: ff1100000d648230 R15: dffffc0000000000 FS: 0000000000000000(0000) GS:ff1100006ce80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005620e1363990 CR3: 000000000eeb2002 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn (kernel/panic.c:748) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? report_bug (lib/bug.c:201 lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:285) ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? __pfx_ref_tracker_dir_exit (lib/ref_tracker.c:158) ? kfree (mm/slub.c:4613 mm/slub.c:4761) net_free (net/core/net_namespace.c:476 net/core/net_namespace.c:467) cleanup_net (net/core/net_namespace.c:664 (discriminator 3)) process_one_work (kernel/workqueue.c:3229) worker_thread (kernel/workqueue.c:3304 kernel/workqueue.c:3391) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> Fixes: 76c8764ef36a ("pfcp: add PFCP module") Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/20250104125732.17335-1-shaw.leon@gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>