linux-arm.git - Russell King's ARM Linux kernel tree

Age	Commit message (Collapse)	Author
2025-01-27	Merge tag 'nfsd-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux	Linus Torvalds
	Pull nfsd updates from Chuck Lever: "Jeff Layton contributed an implementation of NFSv4.2+ attribute delegation, as described here: https://www.ietf.org/archive/id/draft-ietf-nfsv4-delstid-08.html This interoperates with similar functionality introduced into the Linux NFS client in v6.11. An attribute delegation permits an NFS client to manage a file's mtime, rather than flushing dirty data to the NFS server so that the file's mtime reflects the last write, which is considerably slower. Neil Brown contributed dynamic NFSv4.1 session slot table resizing. This facility enables NFSD to increase or decrease the number of slots per NFS session depending on server memory availability. More session slots means greater parallelism. Chuck Lever fixed a long-standing latent bug where NFSv4 COMPOUND encoding screws up when crossing a page boundary in the encoding buffer. This is a zero-day bug, but hitting it is rare and depends on the NFS client implementation. The Linux NFS client does not happen to trigger this issue. A variety of bug fixes and other incremental improvements fill out the list of commits in this release. Great thanks to all contributors, reviewers, testers, and bug reporters who participated during this development cycle" * tag 'nfsd-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (42 commits) sunrpc: Remove gss_{de,en}crypt_xdr_buf deadcode sunrpc: Remove gss_generic_token deadcode sunrpc: Remove unused xprt_iter_get_xprt Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION nfsd: handle delegated timestamps in SETATTR nfsd: add support for delegated timestamps nfsd: rework NFS4_SHARE_WANT_* flag handling nfsd: add support for FATTR4_OPEN_ARGUMENTS nfsd: prepare delegation code for handing out _ATTRS_DELEG delegations nfsd: rename NFS4_SHARE_WANT_ constants to OPEN4_SHARE_ACCESS_WANT_* nfsd: switch to autogenerated definitions for open_delegation_type4 nfs_common: make include/linux/nfs4.h include generated nfs4_1.h nfsd: fix handling of delegated change attr in CB_GETATTR SUNRPC: Document validity guarantees of the pointer returned by reserve_space NFSD: Insulate nfsd4_encode_fattr4() from page boundaries in the encode buffer NFSD: Insulate nfsd4_encode_secinfo() from page boundaries in the encode buffer NFSD: Refactor nfsd4_do_encode_secinfo() again NFSD: Insulate nfsd4_encode_readlink() from page boundaries in the encode buffer NFSD: Insulate nfsd4_encode_read_plus_data() from page boundaries in the encode buffer ...
2025-01-26	Merge tag 'mm-stable-2025-01-26-14-59' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable__dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...
2025-01-26	Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly individually changelogged singleton patches. The patch series in this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API - "Convert ocfs2 to use folios" from Matthew Wilcox does that - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code" * tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits) ocfs2: use str_yes_no() and str_no_yes() helper functions include/linux/lz4.h: add some missing macros Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent Xarray: remove repeat check in xas_squash_marks() Xarray: distinguish large entries correctly in xas_split_alloc() Xarray: move forward index correctly in xas_pause() Xarray: do not return sibling entries from xas_find_marked() ipc/util.c: complete the kernel-doc function descriptions gcov: clang: use correct function param names latencytop: use correct kernel-doc format for func params minmax.h: remove some #defines that are only expanded once minmax.h: simplify the variants of clamp() minmax.h: move all the clamp() definitions after the min/max() ones minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() minmax.h: reduce the #define expansion of min(), max() and clamp() minmax.h: update some comments minmax.h: add whitespace around operators and after commas nilfs2: do not update mtime of renamed directory that is not moved nilfs2: handle errors that nilfs_prepare_chunk() may return CREDITS: fix spelling mistake ...
2025-01-25	mm: alloc_pages_bulk: rename API	Luiz Capitulino
	The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function. Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones): alloc_pages_bulk_array -> alloc_pages_bulk alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy alloc_pages_bulk_array_node -> alloc_pages_bulk_node Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25	Merge tag 'hyperv-next-signed-20250123' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Introduce a new set of Hyper-V headers in include/hyperv and replace the old hyperv-tlfs.h with the new headers (Nuno Das Neves) - Fixes for the Hyper-V VTL mode (Roman Kisel) - Fixes for cpu mask usage in Hyper-V code (Michael Kelley) - Document the guest VM hibernation behaviour (Michael Kelley) - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain) * tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Documentation: hyperv: Add overview of guest VM hibernation hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id() hyperv: Do not overlap the hvcall IO areas in get_vtl() hyperv: Enable the hypercall output page for the VTL mode hv_balloon: Fallback to generic_online_page() for non-HV hot added mem Drivers: hv: vmbus: Log on missing offers if any Drivers: hv: vmbus: Wait for boot-time offers during boot and resume uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup iommu/hyper-v: Don't assume cpu_possible_mask is dense Drivers: hv: Don't assume cpu_possible_mask is dense x86/hyperv: Don't assume cpu_possible_mask is dense hyperv: Remove the now unused hyperv-tlfs.h files hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h hyperv: Add new Hyper-V headers in include/hyperv hyperv: Clean up unnecessary #includes hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-23	Merge tag 'bpf-next-6.14' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: "A smaller than usual release cycle. The main changes are: - Prepare selftest to run with GCC-BPF backend (Ihor Solodrai) In addition to LLVM-BPF runs the BPF CI now runs GCC-BPF in compile only mode. Half of the tests are failing, since support for btf_decl_tag is still WIP, but this is a great milestone. - Convert various samples/bpf to selftests/bpf/test_progs format (Alexis Lothoré and Bastien Curutchet) - Teach verifier to recognize that array lookup with constant in-range index will always succeed (Daniel Xu) - Cleanup migrate disable scope in BPF maps (Hou Tao) - Fix bpf_timer destroy path in PREEMPT_RT (Hou Tao) - Always use bpf_mem_alloc in bpf_local_storage in PREEMPT_RT (Martin KaFai Lau) - Refactor verifier lock support (Kumar Kartikeya Dwivedi) This is a prerequisite for upcoming resilient spin lock. - Remove excessive 'may_goto +0' instructions in the verifier that LLVM leaves when unrolls the loops (Yonghong Song) - Remove unhelpful bpf_probe_write_user() warning message (Marco Elver) - Add fd_array_cnt attribute for prog_load command (Anton Protopopov) This is a prerequisite for upcoming support for static_branch" * tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (125 commits) selftests/bpf: Add some tests related to 'may_goto 0' insns bpf: Remove 'may_goto 0' instruction in opt_remove_nops() bpf: Allow 'may_goto 0' instruction in verifier selftests/bpf: Add test case for the freeing of bpf_timer bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT bpf: Free element after unlock in __htab_map_lookup_and_delete_elem() bpf: Bail out early in __htab_map_lookup_and_delete_elem() bpf: Free special fields after unlock in htab_lru_map_delete_node() tools: Sync if_xdp.h uapi tooling header libbpf: Work around kernel inconsistently stripping '.llvm.' suffix bpf: selftests: verifier: Add nullness elision tests bpf: verifier: Support eliding map lookup nullness bpf: verifier: Refactor helper access type tracking bpf: tcp: Mark bpf_load_hdr_opt() arg2 as read-write bpf: verifier: Add missing newline on verbose() call selftests/bpf: Add distilled BTF test about marking BTF_IS_EMBEDDED libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDED libbpf: Fix return zero when elf_begin failed selftests/bpf: Fix btf leak on new btf alloc failure in btf_distill test veristat: Load struct_ops programs only once ...
2025-01-22	Merge tag 'net-next-6.14' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "This is slightly smaller than usual, with the most interesting work being still around RTNL scope reduction. Core: - More core refactoring to reduce the RTNL lock contention, including preparatory work for the per-network namespace RTNL lock, replacing RTNL lock with a per device-one to protect NAPI-related net device data and moving synchronize_net() calls outside such lock. - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge and more specific TCP coverage. - Reduce network namespace tear-down time by removing per-subsystems synchronize_net() in tipc and sched. - Add flow label selector support for fib rules, allowing traffic redirection based on such header field. Netfilter: - Do not remove netdev basechain when last device is gone, allowing netdev basechains without devices. - Revisit the flowtable teardown strategy, dealing better with fin, reset and re-open events. - Scale-up IP-vs connection dumping by avoiding linear search on each restart. Protocols: - A significant XDP socket refactor, consolidating and optimizing several helpers into the core - Better scaling of ICMP rate-limiting, by removing false-sharing in inet peers handling. - Introduces netlink notifications for multicast IPv4 and IPv6 address changes. - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing aggregation and fragmentation of the inner IP. - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to avoid local port exhaustion issues when the average connection lifetime is very short. - Support updating keys (re-keying) for connections using kernel TLS (for TLS 1.3 only). - Support ipv4-mapped ipv6 address clients in smc-r v2. - Add support for jumbo data packet transmission in RxRPC sockets, gluing multiple data packets in a single UDP packet. - Support RxRPC RACK-TLP to manage packet loss and retransmission in conjunction with the congestion control algorithm. Driver API: - Introduce a unified and structured interface for reporting PHY statistics, exposing consistent data across different H/W via ethtool. - Make timestamping selectable, allow the user to select the desired hwtstamp provider (PHY or MAC) administratively. - Add support for configuring a header-data-split threshold (HDS) value via ethtool, to deal with partial or buggy H/W implementation. - Consolidate DSA drivers Energy Efficiency Ethernet support. - Add EEE management to phylink, making use of the phylib implementation. - Add phylib support for in-band capabilities negotiation. - Simplify how phylib-enabled mac drivers expose the supported interfaces. Tests and tooling: - Make the YNL tool package-friendly to make it easier to deploy it separately from the kernel. - Increase TCP selftest coverage importing several packetdrill test-cases. - Regenerate the ethtool uapi header from the YNL spec, to ease maintenance and future development. - Add YNL support for decoding the link types used in net self-tests, allowing a single build to run both net and drivers/net. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - add cross E-Switch QoS support - add SW Steering support for ConnectX-8 - implement support for HW-Managed Flow Steering, improving the rule deletion/insertion rate - support for multi-host LAG - Intel (ixgbe, ice, igb): - ice: add support for devlink health events - ixgbe: add initial support for E610 chipset variant - igb: add support for AF_XDP zero-copy - Meta: - add support for basic RSS config - allow changing the number of channels - add hardware monitoring support - Broadcom (bnxt): - implement TCP data split and HDS threshold ethtool support, enabling Device Memory TCP. - Marvell Octeon: - implement egress ipsec offload support for the cn10k family - Hisilicon (HIBMC): - implement unicast MAC filtering - Ethernet NICs embedded and virtual: - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding contented atomic operations for drop counters - Freescale: - quicc: phylink conversion - enetc: support Tx and Rx checksum offload and improve TSO performances - MediaTek: - airoha: introduce support for ETS and HTB Qdisc offload - Microchip: - lan78XX USB: preparation work for phylink conversion - Synopsys (stmmac): - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45 - refactor EEE support to leverage the new driver API - optimize DMA and cache access to increase raw RX performances by 40% - TI: - icssg-prueth: add multicast filtering support for VLAN interface - netkit: - add ability to configure head/tailroom - VXLAN: - accepts packets with user-defined reserved bit - Ethernet switches: - Microchip: - lan969x: add RGMII support - lan969x: improve TX and RX performance using the FDMA engine - nVidia/Mellanox: - move Tx header handling to PCI driver, to ease XDP support - Ethernet PHYs: - Texas Instruments DP83822: - add support for GPIO2 clock output - Realtek: - 8169: add support for RTL8125D rev.b - rtl822x: add hwmon support for the temperature sensor - Microchip: - add support for RDS PTP hardware - consolidate periodic output signal generation - CAN: - several DT-bindings to DT schema conversions - tcan4x5x: - add HW standby support - support nWKRQ voltage selection - kvaser: - allowing Bus Error Reporting runtime configuration - WiFi: - the on-going Multi-Link Operation (MLO) effort continues, affecting both the stack and in drivers - mac80211/cfg80211: - Emergency Preparedness Communication Services (EPCS) station mode support - support for adding and removing station links for MLO - add support for WiFi 7/EHT mesh over 320 MHz channels - report Tx power info for each link - RealTek (rtw88): - enable USB Rx aggregation and USB 3 to improve performance - LED support - RealTek (rtw89): - refactor power save to support Multi-Link Operations - add support for RTL8922AE-VS variant - MediaTek (mt76): - single wiphy multiband support (preparation for MLO) - p2p device support - add TP-Link TXE50UH USB adapter support - Qualcomm (ath10k): - support for the QCA6698AQ IP core - Qualcomm (ath12k): - enable MLO for QCN9274 - Bluetooth: - Allow sysfs to trigger hdev reset, to allow recovering devices not responsive from user-space - MediaTek: add support for MT7922, MT7925, MT7921e devices - Realtek: add support for RTL8851BE devices - Qualcomm: add support for WCN785x devices - ISO: allow BIG re-sync" * tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits) net/rose: prevent integer overflows in rose_setsockopt() net: phylink: fix regression when binding a PHY net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL. ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL. ipv6: Move lifetime validation to inet6_rtm_newaddr(). ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr(). ipv6: Pass dev to inet6_addr_add(). ipv6: Convert inet6_ioctl() to per-netns RTNL. ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup(). ipv6: Hold rtnl_net_lock() in addrconf_dad_work(). ipv6: Hold rtnl_net_lock() in addrconf_verify_work(). ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL. ipv6: Add __in6_dev_get_rtnl_net(). net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags net: mii: Fix the Speed display when the network cable is not connected sysctl net: Remove macro checks for CONFIG_SYSCTL eth: bnxt: update header sizing defaults ...
2025-01-21	Merge tag 'lsm-pr-20250121' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Improved handling of LSM "secctx" strings through lsm_context struct The LSM secctx string interface is from an older time when only one LSM was supported, migrate over to the lsm_context struct to better support the different LSMs we now have and make it easier to support new LSMs in the future. These changes explain the Rust, VFS, and networking changes in the diffstat. - Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are enabled Small tweak to be a bit smarter about when we build the LSM's common audit helpers. - Check for absurdly large policies from userspace in SafeSetID SafeSetID policies rules are fairly small, basically just "UID:UID", it easy to impose a limit of KMALLOC_MAX_SIZE on policy writes which helps quiet a number of syzbot related issues. While work is being done to address the syzbot issues through other mechanisms, this is a trivial and relatively safe fix that we can do now. - Various minor improvements and cleanups A collection of improvements to the kernel selftests, constification of some function parameters, removing redundant assignments, and local variable renames to improve readability. * tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lockdown: initialize local array before use to quiet static analysis safesetid: check size of policy writes net: corrections for security_secid_to_secctx returns lsm: rename variable to avoid shadowing lsm: constify function parameters security: remove redundant assignment to return variable lsm: Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are set selftests: refactor the lsm `flags_overset_lsm_set_self_attr` test binder: initialize lsm_context structure rust: replace lsm context+len with lsm_context lsm: secctx provider check on release lsm: lsm_context in security_dentry_init_security lsm: use lsm_context in security_inode_getsecctx lsm: replace context+len with lsm_context lsm: ensure the correct LSM context releaser
2025-01-21	Merge tag 'kthread-for-6.14-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21	sunrpc: Remove gss_{de,en}crypt_xdr_buf deadcode	Dr. David Alan Gilbert
	Commit ec596aaf9b48 ("SUNRPC: Remove code behind CONFIG_RPCSEC_GSS_KRB5_SIMPLIFIED") was the last user of the gss_decrypt_xdr_buf() and gss_encrypt_xdr_buf() functions. Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-21	sunrpc: Remove gss_generic_token deadcode	Dr. David Alan Gilbert
	Commit ec596aaf9b48 ("SUNRPC: Remove code behind CONFIG_RPCSEC_GSS_KRB5_SIMPLIFIED") was the last user of the routines in gss_generic_token.c. Remove the routines and associated header. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-21	sunrpc: Remove unused xprt_iter_get_xprt	Dr. David Alan Gilbert
	xprt_iter_get_xprt() was added by commit 80b14d5e61ca ("SUNRPC: Add a structure to track multiple transports") but is unused. Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-21	Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"	Chuck Lever
	I noticed that a handful of NFSv3 fstests were taking an unexpectedly long time to run. Troubleshooting showed that the server's TCP window closed and never re-opened, which caused the client to trigger an RPC retransmit timeout after 180 seconds. The client's recovery action was to establish a fresh connection and retransmit the timed-out requests. This worked, but it adds a long delay. I tracked the problem to the commit that attempted to reduce the rate at which the network layer delivers TCP socket data_ready callbacks. Under most circumstances this change worked as expected, but for NFSv3, which has no session or other type of throttling, it can overwhelm the receiver on occasion. I'm sure I could tweak the lowat settings, but the small benefit doesn't seem worth the bother. Just revert it. Fixes: 2b877fc53e97 ("SUNRPC: Reduce thread wake-up rate when receiving large RPC messages") Cc: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-21	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Paolo Abeni
	No conflicts and no adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-20	net/rose: prevent integer overflows in rose_setsockopt()	Nikita Zhandarovich
	In case of possible unpredictably large arguments passed to rose_setsockopt() and multiplied by extra values on top of that, integer overflows may occur. Do the safest minimum and fix these issues by checking the contents of 'opt' and returning -EINVAL if they are too large. Also, switch to unsigned int and remove useless check for negative 'opt' in ROSE_IDLE case. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Link: https://patch.msgid.link/20250115164220.19954-1-n.zhandarovich@fintech.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL.	Kuniyuki Iwashima
	Let's register inet6_rtm_deladdr() with RTNL_FLAG_DOIT_PERNET and hold rtnl_net_lock() before inet6_addr_del(). Now that inet6_addr_del() is always called under per-netns RTNL. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-12-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL.	Kuniyuki Iwashima
	Let's register inet6_rtm_newaddr() with RTNL_FLAG_DOIT_PERNET and hold rtnl_net_lock() before __dev_get_by_index(). Now that inet6_addr_add() and inet6_addr_modify() are always called under per-netns RTNL. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-11-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Move lifetime validation to inet6_rtm_newaddr().	Kuniyuki Iwashima
	inet6_addr_add() and inet6_addr_modify() have the same code to validate IPv6 lifetime that is done under RTNL. Let's factorise it out to inet6_rtm_newaddr() so that we can validate the lifetime without RTNL later. Note that inet6_addr_add() is called from addrconf_add_ifaddr(), but the lifetime is INFINITY_LIFE_TIME in the path, so expires and flags are 0. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().	Kuniyuki Iwashima
	We will convert inet6_rtm_newaddr() to per-netns RTNL. Except for IFA_F_OPTIMISTIC, cfg.ifa_flags can be set before __dev_get_by_index(). Let's move ifa_flags setup before __dev_get_by_index() so that we can set ifa_flags without RTNL. Also, now it's moved before tb[IFA_CACHEINFO] in preparing for the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Pass dev to inet6_addr_add().	Kuniyuki Iwashima
	inet6_addr_add() is called from inet6_rtm_newaddr() and addrconf_add_ifaddr(). inet6_addr_add() looks up dev by __dev_get_by_index(), but it's already done in inet6_rtm_newaddr(). Let's move the 2nd lookup to addrconf_add_ifaddr() and pass dev to inet6_addr_add(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Convert inet6_ioctl() to per-netns RTNL.	Kuniyuki Iwashima
	These functions are called from inet6_ioctl() with a socket's netns and hold RTNL. * SIOCSIFADDR : addrconf_add_ifaddr() * SIOCDIFADDR : addrconf_del_ifaddr() * SIOCSIFDSTADDR : addrconf_set_dstaddr() Let's use rtnl_net_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().	Kuniyuki Iwashima
	addrconf_init() holds RTNL for blackhole_netdev, which is the global device in init_net. addrconf_cleanup() holds RTNL to clean up devices in init_net too. Let's use rtnl_net_lock(&init_net) there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Hold rtnl_net_lock() in addrconf_dad_work().	Kuniyuki Iwashima
	addrconf_dad_work() is per-address work and holds RTNL internally. We can fetch netns as dev_net(ifp->idev->dev). Let's use rtnl_net_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Hold rtnl_net_lock() in addrconf_verify_work().	Kuniyuki Iwashima
	addrconf_verify_work() is per-netns work to call addrconf_verify_rtnl() under RTNL. Let's use rtnl_net_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.	Kuniyuki Iwashima
	net.ipv6.conf.${DEV}.XXX sysctl are changed under RTNL: * forwarding * ignore_routes_with_linkdown * disable_ipv6 * proxy_ndp * addr_gen_mode * stable_secret * disable_policy Let's use rtnl_net_lock() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	sysctl net: Remove macro checks for CONFIG_SYSCTL	Denis Kirjanov
	Since dccp and llc makefiles already check sysctl code compilation with xxx-$(CONFIG_SYSCTL) we can drop the checks Signed-off-by: Denis Kirjanov <kirjanov@gmail.com> Link: https://patch.msgid.link/20250119134254.19250-1-kirjanov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	Merge tag 'nf-next-25-01-19' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next: 1) Unbreak set size settings for rbtree set backend, intervals in rbtree are represented as two elements, this detailed is leaked to userspace leading to bogus ENOSPC from control plane. 2) Remove dead code in br_netfilter's br_nf_pre_routing_finish() due to never matching error when looking up for route, from Antoine Tenart. 3) Simplify check for device already in use in flowtable, from Phil Sutter. 4) Three patches to restore interface name field in struct nft_hook and use it, this is to prepare for wildcard interface support. From Phil Sutter. 5) Do not remove netdev basechain when last device is gone, this is for consistency with the flowtable behaviour. This allows for netdev basechains without devices. Another patch to simplify netdev event notifier after this update. Also from Phil. 6) Two patches to add missing spinlock when flowtable updates TCP state flags, from Florian Westphal. 7) Simplify __nf_ct_refresh_acct() by removing skbuff parameter, also from Florian. 8) Flowtable gc now extends ct timeout for offloaded flow. This is to address a possible race that leads to handing over flow to classic path with long ct timeouts. 9) Tear down flow if cached rt_mtu is stale, before this patch, packet is handed over to classic path but flow entry still remained in place. 10) Revisit the flowtable teardown strategy, which was originally designed to release flowtable hardware entries early. Add a new CLOSING flag that still allows hardware to release entries when fin/rst is seen, but keeps the flow entry in place when the TCP connection is closed. Release flow after timeout or when a new syn packet is seen for TCP reopen scenario. * tag 'nf-next-25-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: flowtable: add CLOSING state netfilter: flowtable: teardown flow if cached mtu is stale netfilter: conntrack: rework offload nf_conn timeout extension logic netfilter: conntrack: remove skb argument from nf_ct_refresh netfilter: nft_flow_offload: update tcp state flags under lock netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath netfilter: nf_tables: Simplify chain netdev notifier netfilter: nf_tables: Tolerate chains with no remaining hooks netfilter: nf_tables: Compare netdev hooks based on stored name netfilter: nf_tables: Use stored ifname in netdev hook dumps netfilter: nf_tables: Store user-defined hook ifname netfilter: nf_tables: Flowtable hook's pf value never varies netfilter: br_netfilter: remove unused conditional and dead code netfilter: nf_tables: fix set size with rbtree backend ==================== Link: https://patch.msgid.link/20250119172051.8261-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	net: ethtool: populate the default HDS params in the core	Jakub Kicinski
	The core has the current HDS config, it can pre-populate the values for the drivers. While at it, remove the zero-setting in netdevsim. Zero are the default values since the config is zalloc'ed. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250119020518.1962249-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	net: provide pending ring configuration in net_device	Jakub Kicinski
	Record the pending configuration in net_device struct. ethtool core duplicates the current config and the specific handlers (for now just ringparam) can modify it. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250119020518.1962249-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	net: ethtool: store netdev in a temp variable in ethnl_default_set_doit()	Jakub Kicinski
	For ease of review of the next patch store the dev pointer on the stack, instead of referring to req_info.dev every time. No functional changes. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250119020518.1962249-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	net: move HDS config from ethtool state	Jakub Kicinski
	Separate the HDS config from the ethtool state struct. The HDS config contains just simple parameters, not state. Having it as a separate struct will make it easier to clone / copy and also long term potentially make it per-queue. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250119020518.1962249-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	af_unix: Use consume_skb() in connect() and sendmsg().	Kuniyuki Iwashima
	This is based on Donald Hunter's patch. These functions could fail for various reasons, sometimes triggering kfree_skb(). * unix_stream_connect() : connect() * unix_stream_sendmsg() : sendmsg() * queue_oob() : sendmsg(MSG_OOB) * unix_dgram_sendmsg() : sendmsg() Such kfree_skb() is tied to the errno of connect() and sendmsg(), and we need not define skb drop reasons. Let's use consume_skb() not to churn kfree_skb() events. Link: https://lore.kernel.org/netdev/eb30b164-7f86-46bf-a5d3-0f8bda5e9398@redhat.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250116053441.5758-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	af_unix: Reuse out_pipe label in unix_stream_sendmsg().	Kuniyuki Iwashima
	This is a follow-up of commit d460b04bc452 ("af_unix: Clean up error paths in unix_stream_sendmsg()."). If we initialise skb with NULL in unix_stream_sendmsg(), we can reuse the existing out_pipe label for the SEND_SHUTDOWN check. Let's rename it and adjust the existing label as out_pipe_lock. While at it, size and data_len are moved to the while loop scope. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250116053441.5758-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	af_unix: Set drop reason in unix_dgram_disconnected().	Kuniyuki Iwashima
	unix_dgram_disconnected() is called from two places: 1. when a connect()ed socket dis-connect()s or re-connect()s to another socket 2. when sendmsg() fails because the peer socket that the client has connect()ed to has been close()d Then, the client's recv queue is purged to remove all messages from the old peer socket. Let's define a new drop reason for that case. # echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable # python3 >>> from socket import * >>> >>> # s1 has a message from s2 >>> s1, s2 = socketpair(AF_UNIX, SOCK_DGRAM) >>> s2.send(b'hello world') >>> >>> # re-connect() drops the message from s2 >>> s3 = socket(AF_UNIX, SOCK_DGRAM) >>> s3.bind('') >>> s1.connect(s3.getsockname()) # cat /sys/kernel/tracing/trace_pipe python3-250 ... kfree_skb: ... location=skb_queue_purge_reason+0xdc/0x110 reason: UNIX_DISCONNECT Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250116053441.5758-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	af_unix: Set drop reason in unix_stream_read_skb().	Kuniyuki Iwashima
	unix_stream_read_skb() is called when BPF SOCKMAP reads some data from a socket in the map. SOCKMAP does not support MSG_OOB, and reading OOB results in a drop. Let's set drop reasons respectively. * SOCKET_CLOSE : the socket in SOCKMAP was close()d * UNIX_SKIP_OOB : OOB was read from the socket in SOCKMAP Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250116053441.5758-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	af_unix: Set drop reason in manage_oob().	Kuniyuki Iwashima
	AF_UNIX SOCK_STREAM socket supports MSG_OOB. When OOB data is sent to a socket, recv() will break at that point. If the next recv() does not have MSG_OOB, the normal data following the OOB data is returned. Then, the OOB skb is dropped. Let's define a new drop reason for that case in manage_oob(). # echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable # python3 >>> from socket import * >>> s1, s2 = socketpair(AF_UNIX) >>> s1.send(b'a', MSG_OOB) >>> s1.send(b'b') >>> s2.recv(2) b'b' # cat /sys/kernel/tracing/trace_pipe ... python3-223 ... kfree_skb: ... location=unix_stream_read_generic+0x59e/0xc20 reason: UNIX_SKIP_OOB Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250116053441.5758-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	af_unix: Set drop reason in __unix_gc().	Kuniyuki Iwashima
	Inflight file descriptors by SCM_RIGHTS hold references to the struct file. AF_UNIX sockets could hold references to each other, forming reference cycles. Once such sockets are close()d without the fd recv()ed, they will be unaccessible from userspace but remain in kernel. __unix_gc() garbage-collects skb with the dead file descriptors and frees them by __skb_queue_purge(). Let's set SKB_DROP_REASON_SOCKET_CLOSE there. # echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable # python3 >>> from socket import * >>> from array import array >>> >>> # Create a reference cycle >>> s1 = socket(AF_UNIX, SOCK_DGRAM) >>> s1.bind('') >>> s1.sendmsg([b"nop"], [(SOL_SOCKET, SCM_RIGHTS, array("i", [s1.fileno()]))], 0, s1.getsockname()) >>> s1.close() >>> >>> # Trigger GC >>> s2 = socket(AF_UNIX) >>> s2.close() # cat /sys/kernel/tracing/trace_pipe ... kworker/u16:2-42 ... kfree_skb: ... location=__unix_gc+0x4ad/0x580 reason: SOCKET_CLOSE Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250116053441.5758-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	af_unix: Set drop reason in unix_sock_destructor().	Kuniyuki Iwashima
	unix_sock_destructor() is called as sk->sk_destruct() just before the socket is actually freed. Let's use SKB_DROP_REASON_SOCKET_CLOSE for skb_queue_purge(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250116053441.5758-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	af_unix: Set drop reason in unix_release_sock().	Kuniyuki Iwashima
	unix_release_sock() is called when the last refcnt of struct file is released. Let's define a new drop reason SKB_DROP_REASON_SOCKET_CLOSE and set it for kfree_skb() in unix_release_sock(). # echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable # python3 >>> from socket import * >>> s1, s2 = socketpair(AF_UNIX) >>> s1.send(b'hello world') >>> s2.close() # cat /sys/kernel/tracing/trace_pipe ... python3-280 ... kfree_skb: ... protocol=0 location=unix_release_sock+0x260/0x420 reason: SOCKET_CLOSE To be precise, unix_release_sock() is also called for a new child socket in unix_stream_connect() when something fails, but the new sk does not have skb in the recv queue then and no event is logged. Note that only tcp_inbound_ao_hash() uses a similar drop reason, SKB_DROP_REASON_TCP_CLOSE, and this can be generalised later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250116053441.5758-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20	tcp_cubic: fix incorrect HyStart round start detection	Mahdi Arghavani
	I noticed that HyStart incorrectly marks the start of rounds, leading to inaccurate measurements of ACK train lengths and resetting the `ca->sample_cnt` variable. This inaccuracy can impact HyStart's functionality in terminating exponential cwnd growth during Slow-Start, potentially degrading TCP performance. The issue arises because the changes introduced in commit 4e1fddc98d25 ("tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows") moved the caller of the `bictcp_hystart_reset` function inside the `hystart_update` function. This modification added an additional condition for triggering the caller, requiring that (tcp_snd_cwnd(tp) >= hystart_low_window) must also be satisfied before invoking `bictcp_hystart_reset`. This fix ensures that `bictcp_hystart_reset` is correctly called at the start of a new round, regardless of the congestion window size. This is achieved by moving the condition (tcp_snd_cwnd(tp) >= hystart_low_window) from before calling `bictcp_hystart_reset` to after it. I tested with a client and a server connected through two Linux software routers. In this setup, the minimum RTT was 150 ms, the bottleneck bandwidth was 50 Mbps, and the bottleneck buffer size was 1 BDP, calculated as (50M / 1514 / 8) * 0.150 = 619 packets. I conducted the test twice, transferring data from the server to the client for 1.5 seconds. Before the patch was applied, HYSTART-DELAY stopped the exponential growth of cwnd when cwnd = 516, and the bottleneck link was not yet saturated (516 < 619). After the patch was applied, HYSTART-ACK-TRAIN stopped the exponential growth of cwnd when cwnd = 632, and the bottleneck link was saturated (632 > 619). In this test, applying the patch resulted in 300 KB more data delivered. Fixes: 4e1fddc98d25 ("tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows") Signed-off-by: Mahdi Arghavani <ma.arghavani@yahoo.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Haibo Zhang <haibo.zhang@otago.ac.nz> Cc: David Eyers <david.eyers@otago.ac.nz> Cc: Abbas Arghavani <abbas.arghavani@mdu.se> Reviewed-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20	tipc: re-order conditions in tipc_crypto_key_rcv()	Dan Carpenter
	On a 32bit system the "keylen + sizeof(struct tipc_aead_key)" math could have an integer wrapping issue. It doesn't matter because the "keylen" is checked on the next line, but just to make life easier for static analysis tools, let's re-order these conditions and avoid the integer overflow. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20	net: appletalk: Drop aarp_send_probe_phase1()	谢致邦 (XIE Zhibang)
	aarp_send_probe_phase1() used to work by calling ndo_do_ioctl of appletalk drivers ltpc or cops, but these two drivers have been removed since the following commits: commit 03dcb90dbf62 ("net: appletalk: remove Apple/Farallon LocalTalk PC support") commit 00f3696f7555 ("net: appletalk: remove cops support") Thus aarp_send_probe_phase1() no longer works, so drop it. (found by code inspection) Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20	net: sched: refine software bypass handling in tc_run	Xin Long
	This patch addresses issues with filter counting in block (tcf_block), particularly for software bypass scenarios, by introducing a more accurate mechanism using useswcnt. Previously, filtercnt and skipswcnt were introduced by: Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter") filtercnt tracked all tp (tcf_proto) objects added to a block, and skipswcnt counted tp objects with the skipsw attribute set. The problem is: a single tp can contain multiple filters, some with skipsw and others without. The current implementation fails in the case: When the first filter in a tp has skipsw, both skipswcnt and filtercnt are incremented, then adding a second filter without skipsw to the same tp does not modify these counters because tp->counted is already set. This results in bypass software behavior based solely on skipswcnt equaling filtercnt, even when the block includes filters without skipsw. Consequently, filters without skipsw are inadvertently bypassed. To address this, the patch introduces useswcnt in block to explicitly count tp objects containing at least one filter without skipsw. Key changes include: Whenever a filter without skipsw is added, its tp is marked with usesw and counted in useswcnt. tc_run() now uses useswcnt to determine software bypass, eliminating reliance on filtercnt and skipswcnt. This refined approach prevents software bypass for blocks containing mixed filters, ensuring correct behavior in tc_run(). Additionally, as atomic operations on useswcnt ensure thread safety and tp->lock guards access to tp->usesw and tp->counted, the broader lock down_write(&block->cb_lock) is no longer required in tc_new_tfilter(), and this resolves a performance regression caused by the filter counting mechanism during parallel filter insertions. The improvement can be demonstrated using the following script: # cat insert_tc_rules.sh tc qdisc add dev ens1f0np0 ingress for i in $(seq 16); do taskset -c $i tc -b rules_$i.txt & done wait Each of rules_$i.txt files above includes 100000 tc filter rules to a mlx5 driver NIC ens1f0np0. Without this patch: # time sh insert_tc_rules.sh real 0m50.780s user 0m23.556s sys 4m13.032s With this patch: # time sh insert_tc_rules.sh real 0m17.718s user 0m7.807s sys 3m45.050s Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-19	netfilter: flowtable: add CLOSING state	Pablo Neira Ayuso
	tcp rst/fin packet triggers an immediate teardown of the flow which results in sending flows back to the classic forwarding path. This behaviour was introduced by: da5984e51063 ("netfilter: nf_flow_table: add support for sending flows back to the slow path") b6f27d322a0a ("netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen") whose goal is to expedite removal of flow entries from the hardware table. Before these patches, the flow was released after the flow entry timed out. However, this approach leads to packet races when restoring the conntrack state as well as late flow re-offload situations when the TCP connection is ending. This patch adds a new CLOSING state that is is entered when tcp rst/fin packet is seen. This allows for an early removal of the flow entry from the hardware table. But the flow entry still remains in software, so tcp packets to shut down the flow are not sent back to slow path. If syn packet is seen from this new CLOSING state, then this flow enters teardown state, ct state is set to TCP_CONNTRACK_CLOSE state and packet is sent to slow path, so this TCP reopen scenario can be handled by conntrack. TCP_CONNTRACK_CLOSE provides a small timeout that aims at quickly releasing this stale entry from the conntrack table. Moreover, skip hardware re-offload from flowtable software packet if the flow is in CLOSING state. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19	netfilter: flowtable: teardown flow if cached mtu is stale	Pablo Neira Ayuso
	Tear down the flow entry in the unlikely case that the interface mtu changes, this gives the flow a chance to refresh the cached mtu, otherwise such refresh does not occur until flow entry expires. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19	netfilter: conntrack: rework offload nf_conn timeout extension logic	Florian Westphal
	Offload nf_conn entries may not see traffic for a very long time. To prevent incorrect 'ct is stale' checks during nf_conntrack table lookup, the gc worker extends the timeout nf_conn entries marked for offload to a large value. The existing logic suffers from a few problems. Garbage collection runs without locks, its unlikely but possible that @ct is removed right after the 'offload' bit test. In that case, the timeout of a new/reallocated nf_conn entry will be increased. Prevent this by obtaining a reference count on the ct object and re-check of the confirmed and offload bits. If those are not set, the ct is being removed, skip the timeout extension in this case. Parallel teardown is also problematic: cpu1 cpu2 gc_worker calls flow_offload_teardown() tests OFFLOAD bit, set clear OFFLOAD bit ct->timeout is repaired (e.g. set to timeout[UDP_CT_REPLIED]) nf_ct_offload_timeout() called expire value is fetched <INTERRUPT> -> NF_CT_DAY timeout for flow that isn't offloaded (and might not see any further packets). Use cmpxchg: if ct->timeout was repaired after the 2nd 'offload bit' test passed, then ct->timeout will only be updated of ct->timeout was not altered in between. As we already have a gc worker for flowtable entries, ct->timeout repair can be handled from the flowtable gc worker. This avoids having flowtable specific logic in the conntrack core and avoids checking entries that were never offloaded. This allows to remove the nf_ct_offload_timeout helper. Its safe to use in the add case, but not on teardown. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19	netfilter: conntrack: remove skb argument from nf_ct_refresh	Florian Westphal
	Its not used (and could be NULL), so remove it. This allows to use nf_ct_refresh in places where we don't have an skb without having to double-check that skb == NULL would be safe. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19	netfilter: nft_flow_offload: update tcp state flags under lock	Florian Westphal
	The conntrack entry is already public, there is a small chance that another CPU is handling a packet in reply direction and racing with the tcp state update. Move this under ct spinlock. This is done once, when ct is about to be offloaded, so this should not result in a noticeable performance hit. Fixes: 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19	netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath	Florian Westphal
	This state reset is racy, no locks are held here. Since commit 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp"), the window checks are disabled for normal data packets, but MAXACK flag is checked when validating TCP resets. Clear the flag so tcp reset validation checks are ignored. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19	netfilter: nf_tables: Simplify chain netdev notifier	Phil Sutter
	With conditional chain deletion gone, callback code simplifies: Instead of filling an nft_ctx object, just pass basechain to the per-chain function. Also plain list_for_each_entry() is safe now. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>