summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-05-26Merge tag 'vfs-6.16-rc1.pidfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: "Features: - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD socket option SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for the process that called connect() or listen(). This is heavily used to safely authenticate clients in userspace avoiding security bugs due to pid recycling races (dbus, polkit, systemd, etc.) SO_PEERPIDFD currently doesn't support handing out pidfds if the sk->sk_peer_pid thread-group leader has already been reaped. In this case it currently returns EINVAL. Userspace still wants to get a pidfd for a reaped process to have a stable handle it can pass on. This is especially useful now that it is possible to retrieve exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag Another summary has been provided by David Rheinsberg: > A pidfd can outlive the task it refers to, and thus user-space > must already be prepared that the task underlying a pidfd is > gone at the time they get their hands on the pidfd. For > instance, resolving the pidfd to a PID via the fdinfo must be > prepared to read `-1`. > > Despite user-space knowing that a pidfd might be stale, several > kernel APIs currently add another layer that checks for this. In > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was > already reaped, but returns a stale pidfd if the task is reaped > immediately after the respective alive-check. > > This has the unfortunate effect that user-space now has two ways > to check for the exact same scenario: A syscall might return > EINVAL/ESRCH/... *or* the pidfd might be stale, even though > there is no particular reason to distinguish both cases. This > also propagates through user-space APIs, which pass on pidfds. > They must be prepared to pass on `-1` *or* the pidfd, because > there is no guaranteed way to get a stale pidfd from the kernel. > > Userspace must already deal with a pidfd referring to a reaped > task as the task may exit and get reaped at any time will there > are still many pidfds referring to it In order to allow handing out reaped pidfd SO_PEERPIDFD needs to ensure that PIDFD_INFO_EXIT information is available whenever a pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi promises that reaped pidfds are only handed out if it is guaranteed that the caller sees the exit information: TEST_F(pidfd_info, success_reaped) { struct pidfd_info info = { .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT, }; /* * Process has already been reaped and PIDFD_INFO_EXIT been set. * Verify that we can retrieve the exit status of the process. */ ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0); ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS)); ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT)); ASSERT_TRUE(WIFEXITED(info.exit_code)); ASSERT_EQ(WEXITSTATUS(info.exit_code), 0); } To hand out pidfds for reaped processes we thus allocate a pidfs entry for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is stashed and drop it when the socket is destroyed. This guarantees that exit information will always be recorded for the sk->sk_peer_pid task and we can hand out pidfds for reaped processes - Hand a pidfd to the coredump usermode helper process Give userspace a way to instruct the kernel to install a pidfd for the crashing process into the process started as a usermode helper. There's still tricky race-windows that cannot be easily or sometimes not closed at all by userspace. There's various ways like looking at the start time of a process to make sure that the usermode helper process is started after the crashing process but it's all very very brittle and fraught with peril The crashed-but-not-reaped process can be killed by userspace before coredump processing programs like systemd-coredump have had time to manually open a PIDFD from the PID the kernel provides them, which means they can be tricked into reading from an arbitrary process, and they run with full privileges as they are usermode helper processes Even if that specific race-window wouldn't exist it's still the safest and cleanest way to let the kernel provide the pidfd directly instead of requiring userspace to do it manually. In parallel with this commit we already have systemd adding support for this in [1] When the usermode helper process is forked we install a pidfd file descriptor three into the usermode helper's file descriptor table so it's available to the exec'd program Since usermode helpers are either children of the system_unbound_wq workqueue or kthreadd we know that the file descriptor table is empty and can thus always use three as the file descriptor number Note, that we'll install a pidfd for the thread-group leader even if a subthread is calling do_coredump(). We know that task linkage hasn't been removed yet and even if this @current isn't the actual thread-group leader we know that the thread-group leader cannot be reaped until @current has exited - Allow telling when a task has not been found from finding the wrong task when creating a pidfd We currently report EINVAL whenever a struct pid has no tasked attached anymore thereby conflating two concepts: (1) The task has already been reaped (2) The caller requested a pidfd for a thread-group leader but the pid actually references a struct pid that isn't used as a thread-group leader This is causing issues for non-threaded workloads as in where they expect ESRCH to be reported, not EINVAL So allow userspace to reliably distinguish between (1) and (2) - Make it possible to detect when a pidfs entry would outlive the struct pid it pinned - Add a range of new selftests Cleanups: - Remove unneeded NULL check from pidfd_prepare() for passed struct pid - Avoid pointless reference count bump during release_task() Fixes: - Various fixes to the pidfd and coredump selftests - Fix error handling for replace_fd() when spawning coredump usermode helper" * tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: detect refcount bugs coredump: hand a pidfd to the usermode coredump helper coredump: fix error handling for replace_fd() pidfs: move O_RDWR into pidfs_alloc_file() selftests: coredump: Raise timeout to 2 minutes selftests: coredump: Fix test failure for slow machines selftests: coredump: Properly initialize pointer net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid pidfs: get rid of __pidfd_prepare() net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid pidfs: register pid in pidfs net, pidfd: report EINVAL for ESRCH release_task: kill the no longer needed get/put_pid(thread_pid) pidfs: ensure consistent ENOENT/ESRCH reporting exit: move wake_up_all() pidfd waiters into __unhash_process() selftest/pidfd: add test for thread-group leader pidfd open for thread pidfd: improve uapi when task isn't found pidfd: remove unneeded NULL check from pidfd_prepare() selftests/pidfd: adapt to recent changes
2025-05-26vsock/virtio: fix `rx_bytes` accounting for stream socketsStefano Garzarella
In `struct virtio_vsock_sock`, we maintain two counters: - `rx_bytes`: used internally to track how many bytes have been read. This supports mechanisms like .stream_has_data() and sock_rcvlowat(). - `fwd_cnt`: used for the credit mechanism to inform available receive buffer space to the remote peer. These counters are updated via virtio_transport_inc_rx_pkt() and virtio_transport_dec_rx_pkt(). Since the beginning with commit 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko"), we call virtio_transport_dec_rx_pkt() in virtio_transport_stream_do_dequeue() only when we consume the entire packet, so partial reads, do not update `rx_bytes` and `fwd_cnt`. This is fine for `fwd_cnt`, because we still have space used for the entire packet, and we don't want to update the credit for the other peer until we free the space of the entire packet. However, this causes `rx_bytes` to be stale on partial reads. Previously, this didn’t cause issues because `rx_bytes` was used only by .stream_has_data(), and any unread portion of a packet implied data was still available. However, since commit 93b808876682 ("virtio/vsock: fix logic which reduces credit update messages"), we now rely on `rx_bytes` to determine if a credit update should be sent when the data in the RX queue drops below SO_RCVLOWAT value. This patch fixes the accounting by updating `rx_bytes` with the number of bytes actually read, even on partial reads, while leaving `fwd_cnt` untouched until the packet is fully consumed. Also introduce a new `buf_used` counter to check that the remote peer is honoring the given credit; this was previously done via `rx_bytes`. Fixes: 93b808876682 ("virtio/vsock: fix logic which reduces credit update messages") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250521121705.196379-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'vfs-6.16-rc1.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains minor mount updates for this cycle: - mnt->mnt_devname can never be NULL so simplify the code handling that case - Add a comment about concurrent changes during statmount() and listmount() - Update the STATMOUNT_SUPPORTED macro - Convert mount flags to an enum" * tag 'vfs-6.16-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: statmount: update STATMOUNT_SUPPORTED macro fs: convert mount flags to enum ->mnt_devname is never NULL mount: add a comment about concurrent changes with statmount()/listmount()
2025-05-26Merge tag 'nf-next-25-05-23' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next, specifically 26 patches: 5 patches adding/updating selftests, 4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables): 1) Improve selftest coverage for pipapo 4 bit group format, from Florian Westphal. 2) Fix incorrect dependencies when compiling a kernel without legacy ip{6}tables support, also from Florian. 3) Two patches to fix nft_fib vrf issues, including selftest updates to improve coverage, also from Florian Westphal. 4) Fix incorrect nesting in nft_tunnel's GENEVE support, from Fernando F. Mancera. 5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure and nft_inner to match in inner headers, from Sebastian Andrzej Siewior. 6) Integrate conntrack information into nft trace infrastructure, from Florian Westphal. 7) A series of 13 patches to allow to specify wildcard netdevice in netdev basechain and flowtables, eg. table netdev filter { chain ingress { type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept; } } This also allows for runtime hook registration on NETDEV_{UN}REGISTER event, from Phil Sutter. netfilter pull request 25-05-23 * tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits) selftests: netfilter: Torture nftables netdev hooks netfilter: nf_tables: Add notifications for hook changes netfilter: nf_tables: Support wildcard netdev hook specs netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc() netfilter: nf_tables: Handle NETDEV_CHANGENAME events netfilter: nf_tables: Wrap netdev notifiers netfilter: nf_tables: Respect NETDEV_REGISTER events netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook() netfilter: nf_tables: Introduce nft_register_flowtable_ops() netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}() netfilter: nf_tables: Introduce functions freeing nft_hook objects netfilter: nf_tables: add packets conntrack state to debug trace info netfilter: conntrack: make nf_conntrack_id callable without a module dependency netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx netfilter: nf_dup{4, 6}: Move duplication check to task_struct netfilter: nft_tunnel: fix geneve_opt dump selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs ... ==================== Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge branches 'acpi-processor' and 'acpi-cppc'Rafael J. Wysocki
Merge ACPI processor driver updates and ACPI CPPC library updates for 6.16-rc1: - Clean up the initialization of CPU data structures in the ACPI processor driver (Zhang Rui). - Remove an obsolete comment regarding the C-states handling in the ACPI processor driver (Giovanni Gherdovich). - Simplify PCC shared memory region handling (Sudeep Holla). - Rework and extend functions for reading CPPC register values and for updating CPPC registers (Lifeng Zheng). - Add three functions related to autonomous CPU performance state selection to the CPPC library (Lifeng Zheng). * acpi-processor: ACPI: processor: idle: Remove redundant pr->power.count assignment ACPI: processor: idle: Set pr->flags.power unconditionally ACPI: processor: idle: Remove obsolete comment * acpi-cppc: ACPI: CPPC: Add three functions related to autonomous selection ACPI: CPPC: Modify cppc_get_auto_sel_caps() to cppc_get_auto_sel() ACPI: CPPC: Refactor register value get and set ABIs ACPI: CPPC: Add cppc_set_reg_val() ACPI: CPPC: Extract cppc_get_reg_val_in_pcc() ACPI: CPPC: Rename cppc_get_perf() to cppc_get_reg_val() ACPI: CPPC: Optimize cppc_get_perf() ACPI: CPPC: Add IS_OPTIONAL_CPC_REG macro to judge if a cpc_reg is optional ACPI: CPPC: Simplify PCC shared memory region handling ACPI: PCC: Simplify PCC shared memory region handling
2025-05-26Merge branch 'acpi-tables'Rafael J. Wysocki
Merge updates related to the handling of static (data-only) ACPI tables for 6.16-rc1: - Add __nonstring annotations for unterminated strings in the static ACPI tables parsing code (Kees Cook). - Add support for parsing the MRRM ACPI table and sysfs files to describe memory regions listed in it (Tony Luck, Anil Keshavamurthy). - Remove an (explicitly) unused header file include from the VIOT ACPI table parser file (Andy Shevchenko). - Improve logging around acpi_initialize_tables() (Bartosz Szczepanek). * acpi-tables: ACPI: MRRM: Fix default max memory region ACPI: tables: Improve logging around acpi_initialize_tables() ACPI: VIOT: Remove (explicitly) unused header ACPI: Add documentation for exposing MRRM data ACPI: MRRM: Add /sys files to describe memory ranges ACPI: MRRM: Minimal parse of ACPI MRRM table ACPI: tables: Add __nonstring annotations for unterminated strings
2025-05-26Merge branch 'acpica'Rafael J. Wysocki
Merge ACPICA updates, including two upstream releases 20241212 and 20250404, for 6.16-rc1: - Fix two ACPICA SLAB cache leaks (Seunghun Han). - Add EINJv2 get error type action and define Error Injection Actions in hex values to avoid inconsistencies between the specification and the code (Zaid Alali). - Fix typo in comments for SRAT structures (Adam Lackorzynski). - Prevent possible loss of data in ACPICA because of u32 to u8 conversions (Saket Dumbre). - Fix reading FFixedHW operation regions in ACPICA (Daniil Tatianin). - Add support for printing AML arguments when the ACPICA debug level is ACPI_LV_TRACE_POINT (Mario Limonciello). - Drop a stale comment about the file content from actbl2.h (Sudeep Holla). - Apply pack(1) to union aml_resource (Tamir Duberstein). - Fix overflow check in the ACPICA version of vsnprintf() (gldrk). - Interpret SIDP structures in DMAR added revision 3.4 of the VT-d specification (Alexey Neyman). - Add typedef and other definitions related to MRRM to ACPICA (Tony Luck). - Add definitions for RIMT to ACPICA (Sunil V L). - Fix spelling mistake "Incremement" -> "Increment" in the ACPICA utilities code (Colin Ian King). - Add typedef and other definitions for ERDT to ACPICA (Tony Luck). - Introduce ACPI_NONSTRING and use it (Kees Cook, Ahmed Salem). - Rename structure and field names of the RAS2 table in actbl2.h (Shiju Jose). - Fix up whitespace in acpica/utcache.c (Zhe Qiao). - Avoid sequence overread in a call to strncmp() in ap_get_table_length() and replace strncpy() with memcpy() in ACPICA in some places (Ahmed Salem). - Update copyright year in all ACPICA files (Saket Dumbre). * acpica: (30 commits) ACPICA: Update copyright year ACPICA: Logfile: Changes for version 20250404 ACPICA: Replace strncpy() with memcpy() ACPICA: Apply ACPI_NONSTRING in more places ACPICA: Avoid sequence overread in call to strncmp() ACPICA: Adjust the position of code lines ACPICA: actbl2.h: ACPI 6.5: RAS2: Rename structure and field names of the RAS2 table ACPICA: Apply ACPI_NONSTRING ACPICA: Introduce ACPI_NONSTRING ACPICA: actbl2.h: ERDT: Add typedef and other definitions ACPICA: infrastructure: Add new DMT_BUF types and shorten a long name ACPICA: Utilities: Fix spelling mistake "Incremement" -> "Increment" ACPICA: MRRM: Some cleanups ACPICA: actbl2: Add definitions for RIMT ACPICA: actbl2.h: MRRM: Add typedef and other definitions ACPICA: infrastructure: Add new header and ACPI_DMT_BUF26 types ACPICA: Interpret SIDP structures in DMAR ACPICA: utilities: Fix overflow check in vsnprintf() ACPICA: Apply pack(1) to union aml_resource ACPICA: Drop stale comment about the header file content ...
2025-05-26Merge tag 'vfs-6.16-rc1.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs freezing updates from Christian Brauner: "This contains various filesystem freezing related work for this cycle: - Allow the power subsystem to support filesystem freeze for suspend and hibernate. Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. - Allow efivars to support freeze and thaw Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care" * tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: f2fs: fix freezing filesystem during resize kernfs: add warning about implementing freeze/thaw efivarfs: support freeze/thaw power: freeze filesystems during suspend/resume libfs: export find_next_child() super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks fs: allow all writers to be frozen locking/percpu-rwsem: add freezable alternative to down_read
2025-05-26Merge tag 'ipsec-next-2025-05-23' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Remove some unnecessary strscpy_pad() size arguments. From Thorsten Blum. 2) Correct use of xso.real_dev on bonding offloads. Patchset from Cosmin Ratiu. 3) Add hardware offload configuration to XFRM_MSG_MIGRATE. From Chiachang Wang. 4) Refactor migration setup during cloning. This was done after the clone was created. Now it is done in the cloning function itself. From Chiachang Wang. 5) Validate assignment of maximal possible SEQ number. Prevent from setting to the maximum sequrnce number as this would cause for traffic drop. From Leon Romanovsky. 6) Prevent configuration of interface index when offload is used. Hardware can't handle this case.i From Leon Romanovsky. 7) Always use kfree_sensitive() for SA secret zeroization. From Zilin Guan. ipsec-next-2025-05-23 * tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: use kfree_sensitive() for SA secret zeroization xfrm: prevent configuration of interface index when offload is used xfrm: validate assignment of maximal possible SEQ number xfrm: Refactor migration setup during the cloning process xfrm: Migrate offload configuration bonding: Fix multiple long standing offload races bonding: Mark active offloaded xfrm_states xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free} xfrm: Remove unneeded device check from validate_xmit_xfrm xfrm: Use xdo.dev instead of xdo.real_dev net/mlx5: Avoid using xso.real_dev unnecessarily xfrm: Remove unnecessary strscpy_pad() size arguments ==================== Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'linux-can-next-for-6.16-20250522' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2025-05-22 this is a pull request of 22 patches for net-next/main. The series by Biju Das contains 19 patches and adds RZ/G3E CANFD support to the rcar_canfd driver. The patch by Vincent Mailhol adds a struct data_bittiming_params to group FD parameters as a preparation patch for CAN-XL support. Felix Maurer's patch imports tst-filter from can-tests into the kernel self tests and Vincent Mailhol adds support for physical CAN interfaces. linux-can-next-for-6.16-20250522 * tag 'linux-can-next-for-6.16-20250522' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (22 commits) selftests: can: test_raw_filter.sh: add support of physical interfaces selftests: can: Import tst-filter from can-tests can: dev: add struct data_bittiming_params to group FD parameters can: rcar_canfd: Add RZ/G3E support can: rcar_canfd: Enhance multi_channel_irqs handling can: rcar_canfd: Add external_clk variable to struct rcar_canfd_hw_info can: rcar_canfd: Add sh variable to struct rcar_canfd_hw_info can: rcar_canfd: Add struct rcanfd_regs variable to struct rcar_canfd_hw_info can: rcar_canfd: Add shared_can_regs variable to struct rcar_canfd_hw_info can: rcar_canfd: Add ch_interface_mode variable to struct rcar_canfd_hw_info can: rcar_canfd: Add {nom,data}_bittiming variables to struct rcar_canfd_hw_info can: rcar_canfd: Add max_cftml variable to struct rcar_canfd_hw_info can: rcar_canfd: Add max_aflpn variable to struct rcar_canfd_hw_info can: rcar_canfd: Add rnc_field_width variable to struct rcar_canfd_hw_info can: rcar_canfd: Update RCANFD_GAFLCFG macro can: rcar_canfd: Add rcar_canfd_setrnc() can: rcar_canfd: Drop the mask operation in RCANFD_GAFLCFG_SETRNC macro can: rcar_canfd: Update RCANFD_GERFL_ERR macro can: rcar_canfd: Drop RCANFD_GAFLCFG_GETRNC macro can: rcar_canfd: Use of_get_available_child_by_name() ... ==================== Link: https://patch.msgid.link/20250522084128.501049-1-mkl@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26octeontx2-af: Send Link events one by oneSubbaraya Sundeep
Send link events one after another otherwise new message is overwriting the message which is being processed by PF. Fixes: a88e0f936ba9 ("octeontx2: Detect the mbox up or down message via register") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1747823443-404-1-git-send-email-sbhatta@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'vfs-6.16-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Use folios for symlinks in the page cache FUSE already uses folios for its symlinks. Mirror that conversion in the generic code and the NFS code. That lets us get rid of a few folio->page->folio conversions in this path, and some of the few remaining users of read_cache_page() / read_mapping_page() - Try and make a few filesystem operations killable on the VFS inode->i_mutex level - Add sysctl vfs_cache_pressure_denom for bulk file operations Some workloads need to preserve more dentries than we currently allow through out sysctl interface A HDFS servers with 12 HDDs per server, on a HDFS datanode startup involves scanning all files and caching their metadata (including dentries and inodes) in memory. Each HDD contains approximately 2 million files, resulting in a total of ~20 million cached dentries after initialization To minimize dentry reclamation, they set vfs_cache_pressure to 1. Despite this configuration, memory pressure conditions can still trigger reclamation of up to 50% of cached dentries, reducing the cache from 20 million to approximately 10 million entries. During the subsequent cache rebuild period, any HDFS datanode restart operation incurs substantial latency penalties until full cache recovery completes To maintain service stability, more dentries need to be preserved during memory reclamation. The current minimum reclaim ratio (1/100 of total dentries) remains too aggressive for such workload. This patch introduces vfs_cache_pressure_denom for more granular cache pressure control The configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000] effectively maintains the full 20 million dentry cache under memory pressure, preventing datanode restart performance degradation - Avoid some jumps in inode_permission() using likely()/unlikely() - Avid a memory access which is most likely a cache miss when descending into devcgroup_inode_permission() - Add fastpath predicts for stat() and fdput() - Anonymous inodes currently don't come with a proper mode causing issues in the kernel when we want to add useful VFS debug assert. Fix that by giving them a proper mode and masking it off when we report it to userspace which relies on them not having any mode - Anonymous inodes currently allow to change inode attributes because the VFS falls back to simple_setattr() if i_op->setattr isn't implemented. This means the ownership and mode for every single user of anon_inode_inode can be changed. Block that as it's either useless or actively harmful. If specific ownership is needed the respective subsystem should allocate anonymous inodes from their own private superblock - Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock - Add proper tests for anonymous inode behavior - Make it easy to detect proper anonymous inodes and to ensure that we can detect them in codepaths such as readahead() Cleanups: - Port pidfs to the new anon_inode_{g,s}etattr() helpers - Try to remove the uselib() system call - Add unlikely branch hint return path for poll - Add unlikely branch hint on return path for core_sys_select - Don't allow signals to interrupt getdents copying for fuse - Provide a size hint to dir_context for during readdir() - Use writeback_iter directly in mpage_writepages - Update compression and mtime descriptions in initramfs documentation - Update main netfs API document - Remove useless plus one in super_cache_scan() - Remove unnecessary NULL-check guards during setns() - Add separate separate {get,put}_cgroup_ns no-op cases Fixes: - Fix typo in root= kernel parameter description - Use KERN_INFO for infof()|info_plog()|infofc() - Correct comments of fs_validate_description() - Mark an unlikely if condition with unlikely() in vfs_parse_monolithic_sep() - Delete macro fsparam_u32hex() - Remove unused and problematic validate_constant_table() - Fix potential unsigned integer underflow in fs_name() - Make file-nr output the total allocated file handles" * tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits) fs: Pass a folio to page_put_link() nfs: Use a folio in nfs_get_link() fs: Convert __page_get_link() to use a folio fs/read_write: make default_llseek() killable fs/open: make do_truncate() killable fs/open: make chmod_common() and chown_common() killable include/linux/fs.h: add inode_lock_killable() readdir: supply dir_context.count as readdir buffer size hint vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations fuse: don't allow signals to interrupt getdents copying Documentation: fix typo in root= kernel parameter description include/cgroup: separate {get,put}_cgroup_ns no-op case kernel/nsproxy: remove unnecessary guards fs: use writeback_iter directly in mpage_writepages fs: remove useless plus one in super_cache_scan() fs: add S_ANON_INODE fs: remove uselib() system call device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission() fs/fs_parse: Remove unused and problematic validate_constant_table() fs: touch up predicts in inode_permission() ...
2025-05-26Merge tag 'vfs-6.16-rc1.mount.api' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount api conversions from Christian Brauner: "This converts the bfs and omfs filesystems to the new mount api" * tag 'vfs-6.16-rc1.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: omfs: convert to new mount API bfs: convert bfs to use the new mount api
2025-05-26net: mctp: use nlmsg_payload() for netlink message data extractionJeremy Kerr
Jakub suggests: > I have a different request :) Matt, once this ends up in net-next > (end of this week) could you refactor it to use nlmsg_payload() ? > It doesn't exist in net but this is exactly why it was added. This refactors the additions to both mctp_dump_addrinfo(), and mctp_rtm_getneigh() - two cases where we're calling nlh_data() on an an incoming netlink message, without a prior nlmsg_parse(). For the neigh.c case, we cannot hit the failure where the nlh does not contain a full ndmsg at present, as the core handler (net/core/neighbour.c, neigh_get()) has already validated the size through neigh_valid_req_get(), and would have failed the get operation before the MCTP hander is called. However, relying on that is a bit fragile, so apply the nlmsg_payload refector here too. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250521-mctp-nlmsg-payload-v2-1-e85df160c405@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge branch ↵Paolo Abeni
'add-the-capability-to-consume-sram-for-hwfd-descriptor-queue-in-airoha_eth-driver' Lorenzo Bianconi says: ==================== Add the capability to consume SRAM for hwfd descriptor queue in airoha_eth driver In order to improve packet processing and packet forwarding performances, EN7581 SoC supports consuming SRAM instead of DRAM for hw forwarding descriptors queue. For downlink hw accelerated traffic request to consume SRAM memory for hw forwarding descriptors queue. Moreover, in some configurations QDMA blocks require a contiguous block of system memory for hwfd buffers queue. Introduce the capability to allocate hw buffers forwarding queue via the reserved-memory DTS property instead of running dmam_alloc_coherent(). v2: https://lore.kernel.org/r/20250509-airopha-desc-sram-v2-0-9dc3d8076dfb@kernel.org v1: https://lore.kernel.org/r/20250507-airopha-desc-sram-v1-0-d42037431bfa@kernel.org ==================== Link: https://patch.msgid.link/20250521-airopha-desc-sram-v3-0-a6e9b085b4f0@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: airoha: Add the capability to allocate hfwd descriptors in SRAMLorenzo Bianconi
In order to improve packet processing and packet forwarding performances, EN7581 SoC supports consuming SRAM instead of DRAM for hw forwarding descriptors queue. For downlink hw accelerated traffic request to consume SRAM memory for hw forwarding descriptors queue. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250521-airopha-desc-sram-v3-4-a6e9b085b4f0@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: airoha: Add the capability to allocate hwfd buffers via reserved-memoryLorenzo Bianconi
In some configurations QDMA blocks require a contiguous block of system memory for hwfd buffers queue. Introduce the capability to allocate hw buffers forwarding queue via the reserved-memory DTS property instead of running dmam_alloc_coherent(). Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250521-airopha-desc-sram-v3-3-a6e9b085b4f0@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: airoha: Do not store hfwd references in airoha_qdma structLorenzo Bianconi
Since hfwd descriptor and buffer queues are allocated via dmam_alloc_coherent() we do not need to store their references in airoha_qdma struct. This patch does not introduce any logical changes, just code clean-up. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250521-airopha-desc-sram-v3-2-a6e9b085b4f0@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26dt-bindings: net: airoha: Add EN7581 memory-region propertyLorenzo Bianconi
Introduce memory-region and memory-region-names properties for the ethernet node available on EN7581 SoC in order to reserve system memory for hw forwarding buffers queue used by the QDMA modules. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://patch.msgid.link/20250521-airopha-desc-sram-v3-1-a6e9b085b4f0@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge branch 'add-functions-for-txgbe-aml-devices'Paolo Abeni
Jiawen Wu says: ==================== Support phylink and link/gpio irqs for AML 25G/10G devices, and complete PTP and SRIOV. ==================== Link: https://patch.msgid.link/ Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: txgbe: Implement SRIOV for AML devicesJiawen Wu
Since .mac_link_up and .mac_link_down are changed for AML 25G/10G NICs, the SR-IOV related function should be invoked in these new functions, to bring VFs link up. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://patch.msgid.link/BA8B302B7AAB6EA6+20250521064402.22348-10-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: txgbe: Implement PTP for AML devicesJiawen Wu
Support PTP clock and 1PPS output signal for AML devices. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://patch.msgid.link/F2F6E5E8899D2C20+20250521064402.22348-9-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: txgbe: Restrict the use of mismatched FW versionsJiawen Wu
The new added mailbox commands require a new released firmware version. Otherwise, a lot of logs "Unknown FW command" would be printed. And the devices may not work properly. So add the test command in the probe function. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/18283F17BE0FA335+20250521064402.22348-8-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: txgbe: Correct the currect link settingsJiawen Wu
For AML 25G/10G devices, some of the information returned from phylink_ethtool_ksettings_get() is not correct, since there is a fixed-link mode. So add additional corrections. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/C94BF867617C544D+20250521064402.22348-7-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: txgbe: Support to handle GPIO IRQs for AML devicesJiawen Wu
The driver needs to handle GPIO interrupts to identify SFP module and configure PHY by sending mailbox messages to firmware. Since the SFP module needs to wait for ready to get information when it is inserted, workqueue is added to handle delayed tasks. And each SW-FW interaction takes time to wait, so they are processed in the workqueue instead of IRQ handler function. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/399624AF221E8E28+20250521064402.22348-6-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: txgbe: Implement PHYLINK for AML 25G/10G devicesJiawen Wu
There is a new PHY attached to AML 25G/10G NIC, which is different from SP 10G/1G NIC. But the PHY configuration is handed over to firmware, and also I2C is controlled by firmware. So the different PHYLINK fixed-link mode is added for these devices. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/987B973A5929CD48+20250521064402.22348-5-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: txgbe: Distinguish between 40G and 25G devicesJiawen Wu
For the following patches to support PHYLINK for AML 25G devices, separate MAC type wx_mac_aml40 to maintain the driver of 40G devices. Because 40G devices will complete support later, not now. And this patch makes the 25G devices use some PHYLINK interfaces, but it is not yet create PHYLINK and cannot be used on its own. It is just preparation for the next patches. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/592B1A6920867D0C+20250521064402.22348-4-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: wangxun: Use specific flag bit to simplify the codeJiawen Wu
Most of the different code that requires MAC type in the common library is due to NGBE only supports a few queues and pools, unlike TXGBE, which supports 128 queues and 64 pools. This difference accounts for most of the hardware configuration differences in the driver code. So add a flag bit "WX_FLAG_MULTI_64_FUNC" for them to clean-up the driver code. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/C731132E124D75E5+20250521064402.22348-3-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: txgbe: Remove specified SP typeJiawen Wu
Since AML devices are going to reuse some definitions, remove the "SP" qualifier from these definitions. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/8EF712EC14B8FF70+20250521064402.22348-2-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'vfs-6.16-rc1.writepage' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull final writepage conversion from Christian Brauner: "This converts vboxfs from ->writepage() to ->writepages(). This was the last user of the ->writepage() method. So remove ->writepage() completely and all references to it" * tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Remove aops->writepage mm: Remove swap_writepage() and shmem_writepage() ttm: Call shmem_writeout() from ttm_backup_backup_page() i915: Use writeback_iter() shmem: Add shmem_writeout() writeback: Remove writeback_use_writepage() migrate: Remove call to ->writepage vboxsf: Convert to writepages 9p: Add a migrate_folio method
2025-05-26net: dsa: microchip: Add SGMII port support to KSZ9477 switchTristram Ha
The KSZ9477 switch driver uses the XPCS driver to operate its SGMII port. However there are some hardware bugs in the KSZ9477 SGMII module so workarounds are needed. There was a proposal to update the XPCS driver to accommodate KSZ9477, but the new code is not generic enough to be used by other vendors. It is better to do all these workarounds inside the KSZ9477 driver instead of modifying the XPCS driver. There are 3 hardware issues. The first is the MII_ADVERTISE register needs to be write once after reset for the correct code word to be sent. The XPCS driver disables auto-negotiation first before configuring the SGMII/1000BASE-X mode and then enables it back. The KSZ9477 driver then writes the MII_ADVERTISE register before enabling auto-negotiation. In 1000BASE-X mode the MII_ADVERTISE register will be set, so KSZ9477 driver does not need to write it. The second issue is the MII_BMCR register needs to set the exact speed and duplex mode when running in SGMII mode. During link polling the KSZ9477 will check the speed and duplex mode are different from previous ones and update the MII_BMCR register accordingly. The last issue is 1000BASE-X mode does not work with auto-negotiation on. The cause is the local port hardware does not know the link is up and so network traffic is not forwarded. The workaround is to write 2 additional bits when 1000BASE-X mode is configured. Note the SGMII interrupt in the port cannot be masked. As that interrupt is not handled in the KSZ9477 driver the SGMII interrupt bit will not be set even when the XPCS driver sets it. Signed-off-by: Tristram Ha <tristram.ha@microchip.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250520230720.23425-1-Tristram.Ha@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'vfs-6.16-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one*() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions
2025-05-26net: usb: aqc111: fix error handling of usbnet read callsNikita Zhandarovich
Syzkaller, courtesy of syzbot, identified an error (see report [1]) in aqc111 driver, caused by incomplete sanitation of usb read calls' results. This problem is quite similar to the one fixed in commit 920a9fa27e78 ("net: asix: add proper error handling of usb read errors"). For instance, usbnet_read_cmd() may read fewer than 'size' bytes, even if the caller expected the full amount, and aqc111_read_cmd() will not check its result properly. As [1] shows, this may lead to MAC address in aqc111_bind() being only partly initialized, triggering KMSAN warnings. Fix the issue by verifying that the number of bytes read is as expected and not less. [1] Partial syzbot report: BUG: KMSAN: uninit-value in is_valid_ether_addr include/linux/etherdevice.h:208 [inline] BUG: KMSAN: uninit-value in usbnet_probe+0x2e57/0x4390 drivers/net/usb/usbnet.c:1830 is_valid_ether_addr include/linux/etherdevice.h:208 [inline] usbnet_probe+0x2e57/0x4390 drivers/net/usb/usbnet.c:1830 usb_probe_interface+0xd01/0x1310 drivers/usb/core/driver.c:396 call_driver_probe drivers/base/dd.c:-1 [inline] really_probe+0x4d1/0xd90 drivers/base/dd.c:658 __driver_probe_device+0x268/0x380 drivers/base/dd.c:800 ... Uninit was stored to memory at: dev_addr_mod+0xb0/0x550 net/core/dev_addr_lists.c:582 __dev_addr_set include/linux/netdevice.h:4874 [inline] eth_hw_addr_set include/linux/etherdevice.h:325 [inline] aqc111_bind+0x35f/0x1150 drivers/net/usb/aqc111.c:717 usbnet_probe+0xbe6/0x4390 drivers/net/usb/usbnet.c:1772 usb_probe_interface+0xd01/0x1310 drivers/usb/core/driver.c:396 ... Uninit was stored to memory at: ether_addr_copy include/linux/etherdevice.h:305 [inline] aqc111_read_perm_mac drivers/net/usb/aqc111.c:663 [inline] aqc111_bind+0x794/0x1150 drivers/net/usb/aqc111.c:713 usbnet_probe+0xbe6/0x4390 drivers/net/usb/usbnet.c:1772 usb_probe_interface+0xd01/0x1310 drivers/usb/core/driver.c:396 call_driver_probe drivers/base/dd.c:-1 [inline] ... Local variable buf.i created at: aqc111_read_perm_mac drivers/net/usb/aqc111.c:656 [inline] aqc111_bind+0x221/0x1150 drivers/net/usb/aqc111.c:713 usbnet_probe+0xbe6/0x4390 drivers/net/usb/usbnet.c:1772 Reported-by: syzbot+3b6b9ff7b80430020c7b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3b6b9ff7b80430020c7b Tested-by: syzbot+3b6b9ff7b80430020c7b@syzkaller.appspotmail.com Fixes: df2d59a2ab6c ("net: usb: aqc111: Add support for getting and setting of MAC address") Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Link: https://patch.msgid.link/20250520113240.2369438-1-n.zhandarovich@fintech.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26exfat: do not clear volume dirty flag during syncYuezhang Mo
xfstests generic/482 tests the file system consistency after each FUA operation. It fails when run on exfat. exFAT clears the volume dirty flag with a FUA operation during sync. Since s_lock is not held when data is being written to a file, sync can be executed at the same time. When data is being written to a file, the FAT chain is updated first, and then the file size is updated. If sync is executed between updating them, the length of the FAT chain may be inconsistent with the file size. To avoid the situation where the file system is inconsistent but the volume dirty flag is cleared, this commit moves the clearing of the volume dirty flag from exfat_fs_sync() to exfat_put_super(), so that the volume dirty flag is not cleared until unmounting. After the move, there is no additional action during sync, so exfat_fs_sync() can be deleted. Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-05-26exfat: fix double free in delayed_freeNamjae Jeon
The double free could happen in the following path. exfat_create_upcase_table() exfat_create_upcase_table() : return error exfat_free_upcase_table() : free ->vol_utbl exfat_load_default_upcase_table : return error exfat_kill_sb() delayed_free() exfat_free_upcase_table() <--------- double free This patch set ->vol_util as NULL after freeing it. Reported-by: Jianzhou Zhao <xnxc22xnxc22@qq.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-05-26Merge tag 'opp-updates-6.16' of ↵Rafael J. Wysocki
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm Merge operating performance points (OPP) updates for 6.16 from Viresh Kumar: "- OPP: Add dev_pm_opp_set_level() (Praveen Talari). - Introduce scope-based cleanup headers and mutex locking guards in OPP core (Viresh Kumar). - switch to use kmemdup_array() (Zhang Enpei)." * tag 'opp-updates-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm: OPP: switch to use kmemdup_array() OPP: Add dev_pm_opp_set_level() OPP: Use mutex locking guards OPP: Define and use scope-based cleanup helpers OPP: Use scope-based OF cleanup helpers OPP: Return opp_table from dev_pm_opp_get_opp_table_ref() OPP: Return opp from dev_pm_opp_get() OPP: Remove _get_opp_table_kref()
2025-05-26net: neigh: use kfree_skb_reason() in neigh_resolve_output() and ↵Qiu Yutan
neigh_connected_output() Replace kfree_skb() used in neigh_resolve_output() and neigh_connected_output() with kfree_skb_reason(). Following new skb drop reason is added: /* failed to fill the device hard header */ SKB_DROP_REASON_NEIGH_HH_FILLFAIL Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26selftests: ncdevmem: add tx test with multiple IOVsStanislav Fomichev
Use prime 3 for length to make offset slowly drift away. Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Acked-by: Mina Almasry <almasrymina@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26selftests: ncdevmem: make chunking optionalStanislav Fomichev
Add new -z argument to specify max IOV size. By default, use single large IOV. Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26net: devmem: support single IOV with sendmsgStanislav Fomichev
sendmsg() with a single iov becomes ITER_UBUF, sendmsg() with multiple iovs becomes ITER_IOVEC. iter_iov_len does not return correct value for UBUF, so teach to treat UBUF differently. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Fixes: bd61848900bf ("net: devmem: Implement TX path") Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Acked-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26x86/fpu: Fix irq_fpu_usable() to return false during CPU onliningEric Biggers
irq_fpu_usable() incorrectly returned true before the FPU is initialized. The x86 CPU onlining code can call sha256() to checksum AMD microcode images, before the FPU is initialized. Since sha256() recently gained a kernel-mode FPU optimized code path, a crash occurred in kernel_fpu_begin_mask() during hotplug CPU onlining. (The crash did not occur during boot-time CPU onlining, since the optimized sha256() code is not enabled until subsys_initcalls run.) Fix this by making irq_fpu_usable() return false before fpu__init_cpu() has run. To do this without adding any additional overhead to irq_fpu_usable(), replace the existing per-CPU bool in_kernel_fpu with kernel_fpu_allowed which tracks both initialization and usage rather than just usage. The initial state is false; FPU initialization sets it to true; kernel-mode FPU sections toggle it to false and then back to true; and CPU offlining restores it to the initial state of false. Fixes: 11d7956d526f ("crypto: x86/sha256 - implement library instead of shash") Reported-by: Ayush Jain <Ayush.Jain3@amd.com> Closes: https://lore.kernel.org/r/20250516112217.GBaCcf6Yoc6LkIIryP@fat_crate.local Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ayush Jain <Ayush.Jain3@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-25Linux 6.15Linus Torvalds
2025-05-25Disable FOP_DONTCACHE for now due to bugsLinus Torvalds
This is kind of last-minute, but Al Viro reported that the new FOP_DONTCACHE flag causes memory corruption due to use-after-free issues. This was triggered by commit 974c5e6139db ("xfs: flag as supporting FOP_DONTCACHE"), but that is not the underlying bug - it is just the first user of the flag. Vlastimil Babka suspects the underlying problem stems from the folio_end_writeback() logic introduced in commit fb7d3bc414939 ("mm/filemap: drop streaming/uncached pages when writeback completes"). The most straightforward fix would be to just revert the commit that exposed this, but Matthew Wilcox points out that other filesystems are also starting to enable the FOP_DONTCACHE logic, so this instead disables that bit globally for now. The fix will hopefully end up being trivial and we can just re-enable this logic after more testing, but until such a time we'll have to disable the new FOP_DONTCACHE flag. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/20250525083209.GS2023217@ZenIV/ Triggered-by: 974c5e6139db ("xfs: flag as supporting FOP_DONTCACHE") Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-05-25Merge tag 'mm-hotfixes-stable-2025-05-25-00-58' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "22 hotfixes. 13 are cc:stable and the remainder address post-6.14 issues or aren't considered necessary for -stable kernels. 19 are for MM" * tag 'mm-hotfixes-stable-2025-05-25-00-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) mailmap: add Jarkko's employer email address mm: fix copy_vma() error handling for hugetlb mappings memcg: always call cond_resched() after fn() mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios mm: vmalloc: only zero-init on vrealloc shrink mm: vmalloc: actually use the in-place vrealloc region alloc_tag: allocate percpu counters for module tags dynamically module: release codetag section when module load fails mm/cma: make detection of highmem_start more robust MAINTAINERS: add mm memory policy section MAINTAINERS: add mm ksm section kasan: avoid sleepable page allocation from atomic context highmem: add folio_test_partial_kmap() MAINTAINERS: add hung-task detector section taskstats: fix struct taskstats breaks backward compatibility since version 15 mm/truncate: fix out-of-bounds when doing a right-aligned split MAINTAINERS: add mm reclaim section MAINTAINERS: update page allocator section mm: fix VM_UFFD_MINOR == VM_SHADOW_STACK on USERFAULTFD=y && ARM64_GCS=y mm: mmap: map MAP_STACK to VM_NOHUGEPAGE only if THP is enabled ...
2025-05-25net: ethernet: mtk_eth_soc: Correct spellingSimon Horman
Correct spelling of platforms, various, and initial. As flagged by codespell. Signed-off-by: Simon Horman <horms@kernel.org> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-25net: dlink: Correct endian treatment of t_SROM dataSimon Horman
As it's name suggests, parse_eeprom() parses EEPROM data. This is done by reading data, 16 bits at a time as follows: for (i = 0; i < 128; i++) ((__le16 *) sromdata)[i] = cpu_to_le16(read_eeprom(np, i)); sromdata is at the same memory location as psrom. And the type of psrom is a pointer to struct t_SROM. As can be seen in the loop above, data is stored in sromdata, and thus psrom, as 16-bit little-endian values. However, the integer fields of t_SROM are host byte order. In the case of the led_mode field this results in a but which has been addressed by commit e7e5ae71831c ("net: dlink: Correct endianness handling of led_mode"). In the case of the remaining fields, which are updated by this patch, I do not believe this does not result in any bugs. But it does seem best to correctly annotate the endianness of integers. Flagged by Sparse as: .../dl2k.c:344:35: warning: restricted __le32 degrades to integer Compile tested only. No run-time change intended. Signed-off-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-25octeontx2-af: NPC: Clear Unicast rule on nixlf detachHariprasad Kelam
The AF driver assigns reserved MCAM entries (for unicast, broadcast, etc.) based on the NIXLF number. When a NIXLF is detached, these entries are disabled. For example, PF NIXLF -------------------- PF0 0 SDP-VF0 1 If the user unbinds both PF0 and SDP-VF0 interfaces and then binds them in reverse order PF NIXLF --------------------- SDP-VF0 0 PF0 1 In this scenario, the PF0 unicast entry is getting corrupted because the MCAM entry contains stale data (SDP-VF0 ucast data) This patch resolves the issue by clearing the unicast MCAM entry during NIXLF detach Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-25Merge branch 'locking/futex' into locking/core, to pick up pending futex changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-25mailmap: add Jarkko's employer email addressJarkko Sakkinen
Add the current employer email address to mailmap. Link: https://lkml.kernel.org/r/20250523121105.15850-1-jarkko@kernel.org Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Cc: Alexander Sverdlin <alexander.sverdlin@gmail.com> Cc: Antonio Quartulli <antonio@openvpn.net> Cc: Carlos Bilbao <carlos.bilbao@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-25mm: fix copy_vma() error handling for hugetlb mappingsRicardo Cañuelo Navarro
If, during a mremap() operation for a hugetlb-backed memory mapping, copy_vma() fails after the source vma has been duplicated and opened (ie. vma_link() fails), the error is handled by closing the new vma. This updates the hugetlbfs reservation counter of the reservation map which at this point is referenced by both the source vma and the new copy. As a result, once the new vma has been freed and copy_vma() returns, the reservation counter for the source vma will be incorrect. This patch addresses this corner case by clearing the hugetlb private page reservation reference for the new vma and decrementing the reference before closing the vma, so that vma_close() won't update the reservation counter. This is also what copy_vma_and_data() does with the source vma if copy_vma() succeeds, so a helper function has been added to do the fixup in both functions. The issue was reported by a private syzbot instance and can be reproduced using the C reproducer in [1]. It's also a possible duplicate of public syzbot report [2]. The WARNING report is: ============================================================ page_counter underflow: -1024 nr_pages=1024 WARNING: CPU: 0 PID: 3287 at mm/page_counter.c:61 page_counter_cancel+0xf6/0x120 Modules linked in: CPU: 0 UID: 0 PID: 3287 Comm: repro__WARNING_ Not tainted 6.15.0-rc7+ #54 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014 RIP: 0010:page_counter_cancel+0xf6/0x120 Code: ff 5b 41 5e 41 5f 5d c3 cc cc cc cc e8 f3 4f 8f ff c6 05 64 01 27 06 01 48 c7 c7 60 15 f8 85 48 89 de 4c 89 fa e8 2a a7 51 ff <0f> 0b e9 66 ff ff ff 44 89 f9 80 e1 07 38 c1 7c 9d 4c 81 RSP: 0018:ffffc900025df6a0 EFLAGS: 00010246 RAX: 2edfc409ebb44e00 RBX: fffffffffffffc00 RCX: ffff8880155f0000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000 RBP: dffffc0000000000 R08: ffffffff81c4a23c R09: 1ffff1100330482a R10: dffffc0000000000 R11: ffffed100330482b R12: 0000000000000000 R13: ffff888058a882c0 R14: ffff888058a882c0 R15: 0000000000000400 FS: 0000000000000000(0000) GS:ffff88808fc53000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004b33e0 CR3: 00000000076d6000 CR4: 00000000000006f0 Call Trace: <TASK> page_counter_uncharge+0x33/0x80 hugetlb_cgroup_uncharge_counter+0xcb/0x120 hugetlb_vm_op_close+0x579/0x960 ? __pfx_hugetlb_vm_op_close+0x10/0x10 remove_vma+0x88/0x130 exit_mmap+0x71e/0xe00 ? __pfx_exit_mmap+0x10/0x10 ? __mutex_unlock_slowpath+0x22e/0x7f0 ? __pfx_exit_aio+0x10/0x10 ? __up_read+0x256/0x690 ? uprobe_clear_state+0x274/0x290 ? mm_update_next_owner+0xa9/0x810 __mmput+0xc9/0x370 exit_mm+0x203/0x2f0 ? __pfx_exit_mm+0x10/0x10 ? taskstats_exit+0x32b/0xa60 do_exit+0x921/0x2740 ? do_raw_spin_lock+0x155/0x3b0 ? __pfx_do_exit+0x10/0x10 ? __pfx_do_raw_spin_lock+0x10/0x10 ? _raw_spin_lock_irq+0xc5/0x100 do_group_exit+0x20c/0x2c0 get_signal+0x168c/0x1720 ? __pfx_get_signal+0x10/0x10 ? schedule+0x165/0x360 arch_do_signal_or_restart+0x8e/0x7d0 ? __pfx_arch_do_signal_or_restart+0x10/0x10 ? __pfx___se_sys_futex+0x10/0x10 syscall_exit_to_user_mode+0xb8/0x2c0 do_syscall_64+0x75/0x120 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x422dcd Code: Unable to access opcode bytes at 0x422da3. RSP: 002b:00007ff266cdb208 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: 0000000000000001 RBX: 00007ff266cdbcdc RCX: 0000000000422dcd RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000004c7bec RBP: 00007ff266cdb220 R08: 203a6362696c6720 R09: 203a6362696c6720 R10: 0000200000c00000 R11: 0000000000000246 R12: ffffffffffffffd0 R13: 0000000000000002 R14: 00007ffe1cb5f520 R15: 00007ff266cbb000 </TASK> ============================================================ Link: https://lkml.kernel.org/r/20250523-warning_in_page_counter_cancel-v2-1-b6df1a8cfefd@igalia.com Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_cancel__repro.c [1] Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/ [2] Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Florent Revest <revest@google.com> Cc: Jann Horn <jannh@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>