summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2025-01-20Merge tag 'vfs-6.14-rc1.libfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs libfs updates from Christian Brauner: "This improves the stable directory offset behavior in various ways. Stable offsets are needed so that NFS can reliably read directories on filesystems such as tmpfs: - Improve the end-of-directory detection According to getdents(3), the d_off field in each returned directory entry points to the next entry in the directory. The d_off field in the last returned entry in the readdir buffer must contain a valid offset value, but if it points to an actual directory entry, then readdir/getdents can loop. Introduce a specific fixed offset value that is placed in the d_off field of the last entry in a directory. Some user space applications assume that the EOD offset value is larger than the offsets of real directory entries, so the largest valid offset value is reserved for this purpose. This new value is never allocated by simple_offset_add(). When ->iterate_dir() returns, getdents{64} inserts the ctx->pos value into the d_off field of the last valid entry in the readdir buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD offset value so the last entry is updated to point to the EOD marker. When trying to read the entry at the EOD offset, offset_readdir() terminates immediately. - Rely on d_children to iterate stable offset directories Instead of using the mtree to emit entries in the order of their offset values, use it only to map incoming ctx->pos to a starting entry. Then use the directory's d_children list, which is already maintained properly by the dcache, to find the next child to emit. - Narrow the range of directory offset values returned by simple_offset_add() to 3 .. (S32_MAX - 1) on all platforms. This means the allocation behavior is identical on 32-bit systems, 64-bit systems, and 32-bit user space on 64-bit kernels. The new range still permits over 2 billion concurrent entries per directory. - Return ENOSPC when the directory offset range is exhausted. Hitting this error is almost impossible though. - Remove the simple_offset_empty() helper" * tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: libfs: Use d_children list to iterate simple_offset directories libfs: Replace simple_offset end-of-directory detection Revert "libfs: fix infinite directory reads for offset dir" Revert "libfs: Add simple_offset_empty()" libfs: Return ENOSPC when the directory offset range is exhausted
2025-01-20Merge tag 'vfs-6.14-rc1.mount.v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - Add a mountinfo program to demonstrate statmount()/listmount() Add a new "mountinfo" sample userland program that demonstrates how to use statmount() and listmount() to get at the same info that /proc/pid/mountinfo provides - Remove pointless nospec.h include - Prepend statmount.mnt_opts string with security_sb_mnt_opts() Currently these mount options aren't accessible via statmount() - Add new mount namespaces to mount namespace rbtree outside of the namespace semaphore - Lockless mount namespace lookup Currently we take the read lock when looking for a mount namespace to list mounts in. We can make this lockless. The simple search case can just use a sequence counter to detect concurrent changes to the rbtree For walking the list of mount namespaces sequentially via nsfs we keep a separate rcu list as rb_prev() and rb_next() aren't usable safely with rcu. Currently there is no primitive for retrieving the previous list member. To do this we need a new deletion primitive that doesn't poison the prev pointer and a corresponding retrieval helper Since creating mount namespaces is a relatively rare event compared with querying mounts in a foreign mount namespace this is worth it. Once libmount and systemd pick up this mechanism to list mounts in foreign mount namespaces this will be used very frequently - Add extended selftests for lockless mount namespace iteration - Add a sample program to list all mounts on the system, i.e., in all mount namespaces - Improve mount namespace iteration performance Make finding the last or first mount to start iterating the mount namespace from an O(1) operation and add selftests for iterating the mount table starting from the first and last mount - Use an xarray for the old mount id While the ida does use the xarray internally we can use it explicitly which allows us to increment the unique mount id under the xa lock. This allows us to remove the atomic as we're now allocating both ids in one go - Use a shared header for vfs sample programs - Fix build warnings for new sample program to list all mounts * tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: samples/vfs: fix build warnings samples/vfs: use shared header samples/vfs/mountinfo: Use __u64 instead of uint64_t fs: remove useless lockdep assertion fs: use xarray for old mount id selftests: add listmount() iteration tests fs: cache first and last mount samples: add test-list-all-mounts selftests: remove unneeded include selftests: add tests for mntns iteration seltests: move nsfs into filesystems subfolder fs: simplify rwlock to spinlock fs: lockless mntns lookup for nsfs rculist: add list_bidir_{del,prev}_rcu() fs: lockless mntns rbtree lookup fs: add mount namespace to rbtree late fs: prepend statmount.mnt_opts string with security_sb_mnt_opts() mount: remove inlude/nospec.h include samples: add a mountinfo program to demonstrate statmount()/listmount()
2025-01-20Merge tag 'kernel-6.14-rc1.pid' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pid_max namespacing update from Christian Brauner: "The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd dicussions Linus proposed to bump pid_max by default. Based on this discussion systemd started bumping pid_max to 2^22. So all new systems now run with a very high pid_max limit with some distros having also backported that change. The decision to bump pid_max is obviously correct. It just doesn't make a lot of sense nowadays to enforce such a low pid number. There's sufficient tooling to make selecting specific processes without typing really large pid numbers available. In any case, there are workloads that have expections about how large pid numbers they accept. Either for historical reasons or architectural reasons. One concreate example is the 32-bit version of Android's bionic libc which requires pid numbers less than 65536. There are workloads where it is run in a 32-bit container on a 64-bit kernel. If the host has a pid_max value greater than 65535 the libc will abort thread creation because of size assumptions of pthread_mutex_t. That's a fairly specific use-case however, in general specific workloads that are moved into containers running on a host with a new kernel and a new systemd can run into issues with large pid_max values. Obviously making assumptions about the size of the allocated pid is suboptimal but we have userspace that does it. Of course, giving containers the ability to restrict the number of processes in their respective pid namespace indepent of the global limit through pid_max is something desirable in itself and comes in handy in general. Independent of motivating use-cases the existence of pid namespaces makes this also a good semantical extension and there have been prior proposals pushing in a similar direction. The trick here is to minimize the risk of regressions which I think is doable. The fact that pid namespaces are hierarchical will help us here. What we mostly care about is that when the host sets a low pid_max limit, say (crazy number) 100 that no descendant pid namespace can allocate a higher pid number in its namespace. Since pid allocation is hierarchial this can be ensured by checking each pid allocation against the pid namespace's pid_max limit. This means if the allocation in the descendant pid namespace succeeds, the ancestor pid namespace can reject it. If the ancestor pid namespace has a higher limit than the descendant pid namespace the descendant pid namespace will reject the pid allocation. The ancestor pid namespace will obviously not care about this. All in all this means pid_max continues to enforce a system wide limit on the number of processes but allows pid namespaces sufficient leeway in handling workloads with assumptions about pid values and allows containers to restrict the number of processes in a pid namespace through the pid_max interface" * tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tests/pid_namespace: add pid_max tests pid: allow pid_max to be set per pid namespace
2025-01-20Merge branches 'pm-sleep', 'pm-cpuidle' and 'pm-em'Rafael J. Wysocki
Merge updates related to system sleep, a cpuidle update and an Energy Model handling code update for 6.14-rc1: - Allow configuring the system suspend-resume (DPM) watchdog to warn earlier than panic (Douglas Anderson). - Implement devm_device_init_wakeup() helper and introduce a device- managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan). - Remove direct inclusions of 'pm_wakeup.h' which should be only included via 'device.h' (Wolfram Sang). - Clean up two comments in the core system-wide PM code (Rafael Wysocki, Randy Dunlap). - Add Clearwater Forest processor support to the intel_idle cpuidle driver (Artem Bityutskiy). - Move sched domains rebuild function from the schedutil cpufreq governor to the Energy Model handling code (Rafael Wysocki). * pm-sleep: PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq() PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic PM: sleep: convert comment from kernel-doc to plain comment PM: wakeup: implement devm_device_init_wakeup() helper PM: sleep: sysfs: don't include 'pm_wakeup.h' directly PM: sleep: autosleep: don't include 'pm_wakeup.h' directly PM: sleep: Update stale comment in device_resume() * pm-cpuidle: intel_idle: add Clearwater Forest SoC support * pm-em: PM: EM: Move sched domains rebuild function from schedutil to EM
2025-01-20Merge tag 'kernel-6.14-rc1.cred' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull cred refcount updates from Christian Brauner: "For the v6.13 cycle we switched overlayfs to a variant of override_creds() that doesn't take an extra reference. To this end the {override,revert}_creds_light() helpers were introduced. This generalizes the idea behind {override,revert}_creds_light() to the {override,revert}_creds() helpers. Afterwards overriding and reverting credentials is reference count free unless the caller explicitly takes a reference. All callers have been appropriately ported" * tag 'kernel-6.14-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) cred: fold get_new_cred_many() into get_cred_many() cred: remove unused get_new_cred() nfsd: avoid pointless cred reference count bump cachefiles: avoid pointless cred reference count bump dns_resolver: avoid pointless cred reference count bump trace: avoid pointless cred reference count bump cgroup: avoid pointless cred reference count bump acct: avoid pointless reference count bump io_uring: avoid pointless cred reference count bump smb: avoid pointless cred reference count bump cifs: avoid pointless cred reference count bump cifs: avoid pointless cred reference count bump ovl: avoid pointless cred reference count bump open: avoid pointless cred reference count bump nfsfh: avoid pointless cred reference count bump nfs/nfs4recover: avoid pointless cred reference count bump nfs/nfs4idmap: avoid pointless reference count bump nfs/localio: avoid pointless cred reference count bumps coredump: avoid pointless cred reference count bump binfmt_misc: avoid pointless cred reference count bump ...
2025-01-20Merge tag 'vfs-6.14-rc1.pidfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - Rework inode number allocation Recently we received a patchset that aims to enable file handle encoding and decoding via name_to_handle_at(2) and open_by_handle_at(2). A crucical step in the patch series is how to go from inode number to struct pid without leaking information into unprivileged contexts. The issue is that in order to find a struct pid the pid number in the initial pid namespace must be encoded into the file handle via name_to_handle_at(2). This can be used by containers using a separate pid namespace to learn what the pid number of a given process in the initial pid namespace is. While this is a weak information leak it could be used in various exploits and in general is an ugly wart in the design. To solve this problem a new way is needed to lookup a struct pid based on the inode number allocated for that struct pid. The other part is to remove the custom inode number allocation on 32bit systems that is also an ugly wart that should go away. Allocate unique identifiers for struct pid by simply incrementing a 64 bit counter and insert each struct pid into the rbtree so it can be looked up to decode file handles avoiding to leak actual pids across pid namespaces in file handles. On both 64 bit and 32 bit the same 64 bit identifier is used to lookup struct pid in the rbtree. On 64 bit the unique identifier for struct pid simply becomes the inode number. Comparing two pidfds continues to be as simple as comparing inode numbers. On 32 bit the 64 bit number assigned to struct pid is split into two 32 bit numbers. The lower 32 bits are used as the inode number and the upper 32 bits are used as the inode generation number. Whenever a wraparound happens on 32 bit the 64 bit number will be incremented by 2 so inode numbering starts at 2 again. When a wraparound happens on 32 bit multiple pidfds with the same inode number are likely to exist. This isn't a problem since before pidfs pidfds used the anonymous inode meaning all pidfds had the same inode number. On 32 bit sserspace can thus reconstruct the 64 bit identifier by retrieving both the inode number and the inode generation number to compare, or use file handles. This gives the same guarantees on both 32 bit and 64 bit. - Implement file handle support This is based on custom export operation methods which allows pidfs to implement permission checking and opening of pidfs file handles cleanly without hacking around in the core file handle code too much. - Support bind-mounts Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts for pidfds. This allows pidfds to be safely recovered and checked for process recycling. Instead of checking d_ops for both nsfs and pidfs we could in a follow-up patch add a flag argument to struct dentry_operations that functions similar to file_operations->fop_flags. * tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: add pidfd bind-mount tests pidfs: allow bind-mounts pidfs: lookup pid through rbtree selftests/pidfd: add pidfs file handle selftests pidfs: check for valid ioctl commands pidfs: implement file handle support exportfs: add permission method fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh() exportfs: add open method fhandle: simplify error handling pseudofs: add support for export_ops pidfs: support FS_IOC_GETVERSION pidfs: remove 32bit inode number handling pidfs: rework inode number allocation
2025-01-20Merge tag 'vfs-6.14-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support caching symlink lengths in inodes The size is stored in a new union utilizing the same space as i_devices, thus avoiding growing the struct or taking up any more space When utilized it dodges strlen() in vfs_readlink(), giving about 1.5% speed up when issuing readlink on /initrd.img on ext4 - Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP - Enable VBOXGUEST and VBOXSF_FS on ARM64 Now that VirtualBox is able to run as a host on arm64 (e.g. the Apple M3 processors) we can enable VBOXSF_FS (and in turn VBOXGUEST) for this architecture. Tested with various runs of bonnie++ and dbench on an Apple MacBook Pro with the latest Virtualbox 7.1.4 r165100 installed Cleanups: - Delay sysctl_nr_open check in expand_files() - Use kernel-doc includes in fiemap docbook - Use page->private instead of page->index in watch_queue - Use a consume fence in mnt_idmap() as it's heavily used in link_path_walk() - Replace magic number 7 with ARRAY_SIZE() in fc_log - Sort out a stale comment about races between fd alloc and dup2() - Fix return type of do_mount() from long to int - Various cosmetic cleanups for the lockref code Fixes: - Annotate spinning as unlikely() in __read_seqcount_begin The annotation already used to be there, but got lost in commit 52ac39e5db51 ("seqlock: seqcount_t: Implement all read APIs as statement expressions") - Fix proc_handler for sysctl_nr_open - Flush delayed work in delayed fput() - Fix grammar and spelling in propagate_umount() - Fix ESP not readable during coredump In /proc/PID/stat, there is the kstkesp field which is the stack pointer of a thread. While the thread is active, this field reads zero. But during a coredump, it should have a valid value However, at the moment, kstkesp is zero even during coredump - Don't wake up the writer if the pipe is still full - Fix unbalanced user_access_end() in select code" * tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) gfs2: use lockref_init for qd_lockref erofs: use lockref_init for pcl->lockref dcache: use lockref_init for d_lockref lockref: add a lockref_init helper lockref: drop superfluous externs lockref: use bool for false/true returns lockref: improve the lockref_get_not_zero description lockref: remove lockref_put_not_zero fs: Fix return type of do_mount() from long to int select: Fix unbalanced user_access_end() vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64 pipe_read: don't wake up the writer if the pipe is still full selftests: coredump: Add stackdump test fs/proc: do_task_stat: Fix ESP not readable during coredump fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag fs: sort out a stale comment about races between fd alloc and dup2 fs: Fix grammar and spelling in propagate_umount() fs: fc_log replace magic number 7 with ARRAY_SIZE() fs: use a consume fence in mnt_idmap() file: flush delayed work in delayed fput() ...
2025-01-20Merge tag 'vfs-6.14-rc1.netfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs netfs updates from Christian Brauner: "This contains read performance improvements and support for monolithic single-blob objects that have to be read/written as such (e.g. AFS directory contents). The implementation of the two parts is interwoven as each makes the other possible. - Read performance improvements The read performance improvements are intended to speed up some loss of performance detected in cifs and to a lesser extend in afs. The problem is that we queue too many work items during the collection of read results: each individual subrequest is collected by its own work item, and then they have to interact with each other when a series of subrequests don't exactly align with the pattern of folios that are being read by the overall request. Whilst the processing of the pages covered by individual subrequests as they complete potentially allows folios to be woken in parallel and with minimum delay, it can shuffle wakeups for sequential reads out of order - and that is the most common I/O pattern. The final assessment and cleanup of an operation is then held up until the last I/O completes - and for a synchronous sequential operation, this means the bouncing around of work items just adds latency. Two changes have been made to make this work: (1) All collection is now done in a single "work item" that works progressively through the subrequests as they complete (and also dispatches retries as necessary). (2) For readahead and AIO, this work item be done on a workqueue and can run in parallel with the ultimate consumer of the data; for synchronous direct or unbuffered reads, the collection is run in the application thread and not offloaded. Functions such as smb2_readv_callback() then just tell netfslib that the subrequest has terminated; netfslib does a minimal bit of processing on the spot - stat counting and tracing mostly - and then queues/wakes up the worker. This simplifies the logic as the collector just walks sequentially through the subrequests as they complete and walks through the folios, if buffered, unlocking them as it goes. It also keeps to a minimum the amount of latency injected into the filesystem's low-level I/O handling The way netfs supports filesystems using the deprecated PG_private_2 flag is changed: folios are flagged and added to a write request as they complete and that takes care of scheduling the writes to the cache. The originating read request can then just unlock the pages whatever happens. - Single-blob object support Single-blob objects are files for which the content of the file must be read from or written to the server in a single operation because reading them in parts may yield inconsistent results. AFS directories are an example of this as there exists the possibility that the contents are generated on the fly and would differ between reads or might change due to third party interference. Such objects will be written to and retrieved from the cache if one is present, though we allow/may need to propose multiple subrequests to do so. The important part is that read from/write to the *server* is monolithic. Single blob reading is, for the moment, fully synchronous and does result collection in the application thread and, also for the moment, the API is supplied the buffer in the form of a folio_queue chain rather than using the pagecache. - Related afs changes This series makes a number of changes to the kafs filesystem, primarily in the area of directory handling: - AFS's FetchData RPC reply processing is made partially asynchronous which allows the netfs_io_request's outstanding operation counter to be removed as part of reducing the collection to a single work item. - Directory and symlink reading are plumbed through netfslib using the single-blob object API and are now cacheable with fscache. This also allows the afs_read struct to be eliminated and netfs_io_subrequest to be used directly instead. - Directory and symlink content are now stored in a folio_queue buffer rather than in the pagecache. This means we don't require the RCU read lock and xarray iteration to access it, and folios won't randomly disappear under us because the VM wants them back. - The vnode operation lock is changed from a mutex struct to a private lock implementation. The problem is that the lock now needs to be dropped in a separate thread and mutexes don't permit that. - When a new directory or symlink is created, we now initialise it locally and mark it valid rather than downloading it (we know what it's likely to look like). - We now use the in-directory hashtable to reduce the number of entries we need to scan when doing a lookup. The edit routines have to maintain the hash chains. - Cancellation (e.g. by signal) of an async call after the rxrpc_call has been set up is now offloaded to the worker thread as there will be a notification from rxrpc upon completion. This avoids a double cleanup. - A "rolling buffer" implementation is created to abstract out the two separate folio_queue chaining implementations I had (one for read and one for write). - Functions are provided to create/extend a buffer in a folio_queue chain and tear it down again. This is used to handle AFS directories, but could also be used to create bounce buffers for content crypto and transport crypto. - The was_async argument is dropped from netfs_read_subreq_terminated() Instead we wake the read collection work item by either queuing it or waking up the app thread. - We don't need to use BH-excluding locks when communicating between the issuing thread and the collection thread as neither of them now run in BH context. - Also included are a number of new tracepoints; a split of the netfslib write collection code to put retrying into its own file (it gets more complicated with content encryption). - There are also some minor fixes AFS included, including fixing the AFS directory format struct layout, reducing some directory over-invalidation and making afs_mkdir() translate EEXIST to ENOTEMPY (which is not available on all systems the servers support). - Finally, there's a patch to try and detect entry into the folio unlock function with no folio_queue structs in the buffer (which isn't allowed in the cases that can get there). This is a debugging patch, but should be minimal overhead" * tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits) netfs: Report on NULL folioq in netfs_writeback_unlock_folios() afs: Add a tracepoint for afs_read_receive() afs: Locally initialise the contents of a new symlink on creation afs: Use the contained hashtable to search a directory afs: Make afs_mkdir() locally initialise a new directory's content netfs: Change the read result collector to only use one work item afs: Make {Y,}FS.FetchData an asynchronous operation afs: Fix cleanup of immediately failed async calls afs: Eliminate afs_read afs: Use netfslib for symlinks, allowing them to be cached afs: Use netfslib for directories afs: Make afs_init_request() get a key if not given a file netfs: Add support for caching single monolithic objects such as AFS dirs netfs: Add functions to build/clean a buffer in a folio_queue afs: Add more tracepoints to do with tracking validity cachefiles: Add auxiliary data trace cachefiles: Add some subrequest tracepoints netfs: Remove some extraneous directory invalidations afs: Fix directory format encoding struct afs: Fix EEXIST error returned from afs_rmdir() to be ENOTEMPTY ...
2025-01-20Merge branches 'acpi-battery', 'acpi-fan' and 'acpi-misc'Rafael J. Wysocki
Merge ACPI battery and fan drivers updates and miscellaneous ACPI chanages for 6.14: - Update messages printed by the ACPI battery driver to always refer to driver extensions as "hooks" to avoid confusion with similar functionality in the power supply subsystem in the future (Thomas Weißschuh). - Fix .probe() error path cleanup in the ACPI fan driver to avoid memory leaks (Joe Hattori). - Constify 'struct bin_attribute' in some places in the ACPI subsystem and mark it as __ro_after_init in one place to prevent binary blob attributes from being updated (Thomas Weißschuh) - Add empty stubs for several ACPI-related symbols so that they can be used when CONFIG_ACPI is unset and use them for removing unnecessary conditional compilation from the ipu-bridge driver (Ricardo Ribalda). * acpi-battery: ACPI: battery: Rename extensions to hook in messages * acpi-fan: ACPI: fan: cleanup resources in the error path of .probe() * acpi-misc: media: ipu-bridge: Remove unneeded conditional compilations ACPI: bus: implement acpi_device_hid when !ACPI ACPI: bus: implement for_each_acpi_consumer_dev when !ACPI ACPI: header: implement acpi_device_handle when !ACPI ACPI: bus: implement acpi_get_physical_device_location when !ACPI ACPI: bus: implement for_each_acpi_dev_match when !ACPI ACPI: bus: change the prototype for acpi_get_physical_device_location ACPI: sysfs: Constify 'struct bin_attribute' ACPI: BGRT: Constify 'struct bin_attribute' ACPI: BGRT: Mark bin_attribute as __ro_after_init
2025-01-20net: macsec: Add endianness annotations in salt structAles Nezbeda
This change resolves warning produced by sparse tool as currently there is a mismatch between normal generic type in salt and endian annotated type in macsec driver code. Endian annotated types should be used here. Sparse output: warning: restricted ssci_t degrades to integer warning: incorrect type in assignment (different base types) expected restricted ssci_t [usertype] ssci got unsigned int warning: restricted __be64 degrades to integer warning: incorrect type in assignment (different base types) expected restricted __be64 [usertype] pn got unsigned long long Signed-off-by: Ales Nezbeda <anezbeda@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20Merge tag 'opp-updates-6.14' of ↵Rafael J. Wysocki
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/vireshk/pm Merge OPP (Operating Performance Points) updates for 6.14 from Viresh Kumar: "- Minor cleanups / fixes (Dan Carpenter, Neil Armstrong, and Joe Hattori). - Implement dev_pm_opp_get_bw (Neil Armstrong). - Expose reference counting helpers (Viresh Kumar)." * tag 'opp-updates-6.14' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/vireshk/pm: PM / OPP: Add reference counting helpers for Rust implementation OPP: OF: Fix an OF node leak in _opp_add_static_v2() OPP: fix dev_pm_opp_find_bw_*() when bandwidth table not initialized OPP: add index check to assert to avoid buffer overflow in _read_freq() opp: core: Fix off by one in dev_pm_opp_get_bw() opp: core: implement dev_pm_opp_get_bw
2025-01-20net: sched: refine software bypass handling in tc_runXin Long
This patch addresses issues with filter counting in block (tcf_block), particularly for software bypass scenarios, by introducing a more accurate mechanism using useswcnt. Previously, filtercnt and skipswcnt were introduced by: Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter") filtercnt tracked all tp (tcf_proto) objects added to a block, and skipswcnt counted tp objects with the skipsw attribute set. The problem is: a single tp can contain multiple filters, some with skipsw and others without. The current implementation fails in the case: When the first filter in a tp has skipsw, both skipswcnt and filtercnt are incremented, then adding a second filter without skipsw to the same tp does not modify these counters because tp->counted is already set. This results in bypass software behavior based solely on skipswcnt equaling filtercnt, even when the block includes filters without skipsw. Consequently, filters without skipsw are inadvertently bypassed. To address this, the patch introduces useswcnt in block to explicitly count tp objects containing at least one filter without skipsw. Key changes include: Whenever a filter without skipsw is added, its tp is marked with usesw and counted in useswcnt. tc_run() now uses useswcnt to determine software bypass, eliminating reliance on filtercnt and skipswcnt. This refined approach prevents software bypass for blocks containing mixed filters, ensuring correct behavior in tc_run(). Additionally, as atomic operations on useswcnt ensure thread safety and tp->lock guards access to tp->usesw and tp->counted, the broader lock down_write(&block->cb_lock) is no longer required in tc_new_tfilter(), and this resolves a performance regression caused by the filter counting mechanism during parallel filter insertions. The improvement can be demonstrated using the following script: # cat insert_tc_rules.sh tc qdisc add dev ens1f0np0 ingress for i in $(seq 16); do taskset -c $i tc -b rules_$i.txt & done wait Each of rules_$i.txt files above includes 100000 tc filter rules to a mlx5 driver NIC ens1f0np0. Without this patch: # time sh insert_tc_rules.sh real 0m50.780s user 0m23.556s sys 4m13.032s With this patch: # time sh insert_tc_rules.sh real 0m17.718s user 0m7.807s sys 3m45.050s Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20PM / OPP: Add reference counting helpers for Rust implementationViresh Kumar
To ensure that resources such as OPP tables or OPP nodes are not freed while in use by the Rust implementation, it is necessary to increment their reference count from Rust code. This commit introduces a new helper function, dev_pm_opp_get_opp_table_ref(), to increment the reference count of an OPP table and declares the existing helper dev_pm_opp_get() in pm_opp.h. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2025-01-19Merge tag 'timers_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Borislav Petkov: - Reset hrtimers correctly when a CPU hotplug state traversal happens "half-ways" and leaves hrtimers not (re-)initialized properly - Annotate accesses to a timer group's ignore flag to prevent KCSAN from raising data_race warnings - Make sure timer group initialization is visible to timer tree walkers and avoid a hypothetical race - Fix another race between CPU hotplug and idle entry/exit where timers on a fully idle system are getting ignored - Fix a case where an ignored signal is still being handled which it shouldn't be * tag 'timers_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimers: Handle CPU state correctly on hotplug timers/migration: Annotate accesses to ignore flag timers/migration: Enforce group initialization visibility to tree walkers timers/migration: Fix another race between hotplug and idle entry/exit signal/posixtimers: Handle ignore/blocked sequences correctly
2025-01-19netfilter: flowtable: add CLOSING statePablo Neira Ayuso
tcp rst/fin packet triggers an immediate teardown of the flow which results in sending flows back to the classic forwarding path. This behaviour was introduced by: da5984e51063 ("netfilter: nf_flow_table: add support for sending flows back to the slow path") b6f27d322a0a ("netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen") whose goal is to expedite removal of flow entries from the hardware table. Before these patches, the flow was released after the flow entry timed out. However, this approach leads to packet races when restoring the conntrack state as well as late flow re-offload situations when the TCP connection is ending. This patch adds a new CLOSING state that is is entered when tcp rst/fin packet is seen. This allows for an early removal of the flow entry from the hardware table. But the flow entry still remains in software, so tcp packets to shut down the flow are not sent back to slow path. If syn packet is seen from this new CLOSING state, then this flow enters teardown state, ct state is set to TCP_CONNTRACK_CLOSE state and packet is sent to slow path, so this TCP reopen scenario can be handled by conntrack. TCP_CONNTRACK_CLOSE provides a small timeout that aims at quickly releasing this stale entry from the conntrack table. Moreover, skip hardware re-offload from flowtable software packet if the flow is in CLOSING state. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: conntrack: rework offload nf_conn timeout extension logicFlorian Westphal
Offload nf_conn entries may not see traffic for a very long time. To prevent incorrect 'ct is stale' checks during nf_conntrack table lookup, the gc worker extends the timeout nf_conn entries marked for offload to a large value. The existing logic suffers from a few problems. Garbage collection runs without locks, its unlikely but possible that @ct is removed right after the 'offload' bit test. In that case, the timeout of a new/reallocated nf_conn entry will be increased. Prevent this by obtaining a reference count on the ct object and re-check of the confirmed and offload bits. If those are not set, the ct is being removed, skip the timeout extension in this case. Parallel teardown is also problematic: cpu1 cpu2 gc_worker calls flow_offload_teardown() tests OFFLOAD bit, set clear OFFLOAD bit ct->timeout is repaired (e.g. set to timeout[UDP_CT_REPLIED]) nf_ct_offload_timeout() called expire value is fetched <INTERRUPT> -> NF_CT_DAY timeout for flow that isn't offloaded (and might not see any further packets). Use cmpxchg: if ct->timeout was repaired after the 2nd 'offload bit' test passed, then ct->timeout will only be updated of ct->timeout was not altered in between. As we already have a gc worker for flowtable entries, ct->timeout repair can be handled from the flowtable gc worker. This avoids having flowtable specific logic in the conntrack core and avoids checking entries that were never offloaded. This allows to remove the nf_ct_offload_timeout helper. Its safe to use in the add case, but not on teardown. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: conntrack: remove skb argument from nf_ct_refreshFlorian Westphal
Its not used (and could be NULL), so remove it. This allows to use nf_ct_refresh in places where we don't have an skb without having to double-check that skb == NULL would be safe. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: Tolerate chains with no remaining hooksPhil Sutter
Do not drop a netdev-family chain if the last interface it is registered for vanishes. Users dumping and storing the ruleset upon shutdown to restore it upon next boot may otherwise lose the chain and all contained rules. They will still lose the list of devices, a later patch will fix that. For now, this aligns the event handler's behaviour with that for flowtables. The controversal situation at netns exit should be no problem here: event handler will unregister the hooks, core nftables cleanup code will drop the chain itself. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: Store user-defined hook ifnamePhil Sutter
Prepare for hooks with NULL ops.dev pointer (due to non-existent device) and store the interface name and length as specified by the user upon creation. No functional change intended. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: fix set size with rbtree backendPablo Neira Ayuso
The existing rbtree implementation uses singleton elements to represent ranges, however, userspace provides a set size according to the number of ranges in the set. Adjust provided userspace set size to the number of singleton elements in the kernel by multiplying the range by two. Check if the no-match all-zero element is already in the set, in such case release one slot in the set size. Fixes: 0ed6389c483d ("netfilter: nf_tables: rename set implementations") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-18Merge tag 'for-net-next-2025-01-15' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - btusb: Add new VID/PID 13d3/3610 for MT7922 - btusb: Add new VID/PID 13d3/3628 for MT7925 - btusb: Add MT7921e device 13d3:3576 - btusb: Add RTL8851BE device 13d3:3600 - btusb: Add ID 0x2c7c:0x0130 for Qualcomm WCN785x - btusb: add sysfs attribute to control USB alt setting - qca: Expand firmware-name property - qca: Fix poor RF performance for WCN6855 - L2CAP: handle NULL sock pointer in l2cap_sock_alloc - Allow reset via sysfs - ISO: Allow BIG re-sync - dt-bindings: Utilize PMU abstraction for WCN6750 - MGMT: Mark LL Privacy as stable * tag 'for-net-next-2025-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (23 commits) Bluetooth: MGMT: Fix slab-use-after-free Read in mgmt_remove_adv_monitor_sync Bluetooth: qca: Fix poor RF performance for WCN6855 Bluetooth: Allow reset via sysfs Bluetooth: Get rid of cmd_timeout and use the reset callback Bluetooth: Remove the cmd timeout count in btusb Bluetooth: Use str_enable_disable-like helpers Bluetooth: btmtk: Remove resetting mt7921 before downloading the fw Bluetooth: L2CAP: handle NULL sock pointer in l2cap_sock_alloc Bluetooth: btusb: Add RTL8851BE device 13d3:3600 dt-bindings: bluetooth: Utilize PMU abstraction for WCN6750 Bluetooth: btusb: Add MT7921e device 13d3:3576 Bluetooth: btrtl: check for NULL in btrtl_setup_realtek() Bluetooth: btbcm: Fix NULL deref in btbcm_get_board_name() Bluetooth: qca: Expand firmware-name to load specific rampatch Bluetooth: qca: Update firmware-name to support board specific nvm dt-bindings: net: bluetooth: qca: Expand firmware-name property Bluetooth: btusb: Add new VID/PID 13d3/3628 for MT7925 Bluetooth: btusb: Add new VID/PID 13d3/3610 for MT7922 Bluetooth: btusb: add sysfs attribute to control USB alt setting Bluetooth: btusb: Add ID 0x2c7c:0x0130 for Qualcomm WCN785x ... ==================== Link: https://patch.msgid.link/20250117213203.3921910-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18Merge tag 'wireless-next-2025-01-17' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.14 Most likely the last "new features" pull request for v6.14 and this is a bigger one. Multi-Link Operation (MLO) work continues both in stack in drivers. Few new devices supported and usual fixes all over. Major changes: cfg80211 * Emergency Preparedness Communication Services (EPCS) station mode support mac80211 * an option to filter a sta from being flushed * some support for RX Operating Mode Indication (OMI) power saving * support for adding and removing station links for MLO iwlwifi * new device ids * rework firmware error handling and restart rtw88 * RTL8812A: RFE type 2 support * LED support rtw89 * variant info to support RTL8922AE-VS mt76 * mt7996: single wiphy multiband support (preparation for MLO) * mt7996: support for more variants * mt792x: P2P_DEVICE support * mt7921u: TP-Link TXE50UH support ath12k * enable MLO for QCN9274 (although it seems to be broken with dual band devices) * MLO radar detection support * debugfs: transmit buffer OFDMA, AST entry and puncture stats * tag 'wireless-next-2025-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (322 commits) wifi: brcmfmac: fix NULL pointer dereference in brcmf_txfinalize() wifi: rtw88: add RTW88_LEDS depends on LEDS_CLASS to Kconfig wifi: wilc1000: unregister wiphy only after netdev registration wifi: cfg80211: adjust allocation of colocated AP data wifi: mac80211: fix memory leak in ieee80211_mgd_assoc_ml_reconf() wifi: ath12k: fix key cache handling wifi: ath12k: Fix uninitialized variable access in ath12k_mac_allocate() function wifi: ath12k: Remove ath12k_get_num_hw() helper function wifi: ath12k: Refactor the ath12k_hw get helper function argument wifi: ath12k: Refactor ath12k_hw set helper function argument wifi: mt76: mt7996: add implicit beamforming support for mt7992 wifi: mt76: mt7996: fix beacon command during disabling wifi: mt76: mt7996: fix ldpc setting wifi: mt76: mt7996: fix definition of tx descriptor wifi: mt76: connac: adjust phy capabilities based on band constraints wifi: mt76: mt7996: fix incorrect indexing of MIB FW event wifi: mt76: mt7996: fix HE Phy capability wifi: mt76: mt7996: fix the capability of reception of EHT MU PPDU wifi: mt76: mt7996: add max mpdu len capability wifi: mt76: mt7921: avoid undesired changes of the preset regulatory domain ... ==================== Link: https://patch.msgid.link/20250117203529.72D45C4CEDD@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18net: phy: remove leftovers from switch to linkmode bitmapsHeiner Kallweit
We have some leftovers from the switch to linkmode bitmaps which - have never been used - are not used any longer - have no user outside phy_device.c So remove them. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/5493b96e-88bb-4230-a911-322659ec5167@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18Merge tag 'trace-v6.13-rc7-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: "Fix regression in GFP output in trace events It was reported that the GFP flags in trace events went from human readable to just their hex values: gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP to gfp_flags=0x140cca This was caused by a change that added the use of enums in calculating the GFP flags. As defines get translated into their values in the trace event format files, the user space tooling could easily convert the GFP flags into their symbols via the __print_flags() helper macro. The problem is that enums do not get converted, and the names of the enums show up in the format files and user space tooling cannot translate them. Add TRACE_DEFINE_ENUM() around the enums used for GFP flags which is the tracing infrastructure macro that informs the tracing subsystem what the values for enums and it can then expose that to user space" * tag 'trace-v6.13-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: gfp: Fix the GFP enum values shown for user space tracing tools
2025-01-17net: mscc: ocelot: add TX timestamping statisticsVladimir Oltean
Add an u64 hardware timestamping statistics structure for each ocelot port. Export a function from the common switch library for reporting them to ethtool. This is called by the ocelot switchdev front-end for now. Note that for the switchdev driver, we report the one-step PTP packets as unconfirmed, even though in principle, for some transmission mechanisms like FDMA, we may be able to confirm transmission and bump the "pkts" counter in ocelot_fdma_tx_cleanup() instead. I don't have access to hardware which uses the switchdev front-end, and I've kept the implementation simple. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250116104628.123555-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17net: dsa: implement get_ts_stats ethtool operation for user portsVladimir Oltean
Integrate with the standard infrastructure for reporting hardware packet timestamping statistics. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250116104628.123555-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17net: ethtool: ts: add separate counter for unconfirmed one-step TX timestampsVladimir Oltean
For packets with two-step timestamp requests, the hardware timestamp comes back to the driver through a confirmation mechanism of sorts, which allows the driver to confidently bump the successful "pkts" counter. For one-step PTP, the NIC is supposed to autonomously insert its hardware TX timestamp in the packet headers while simultaneously transmitting it. There may be a confirmation that this was done successfully, or there may not. None of the current drivers which implement ethtool_ops :: get_ts_stats() also support HWTSTAMP_TX_ONESTEP_SYNC or HWTSTAMP_TX_ONESTEP_SYNC, so it is a bit unclear which model to follow. But there are NICs, such as DSA, where there is no transmit confirmation at all. Here, it would be wrong / misleading to increment the successful "pkts" counter, because one-step PTP packets can be dropped on TX just like any other packets. So introduce a special counter which signifies "yes, an attempt was made, but we don't know whether it also exited the port or not". I expect that for one-step PTP packets where a confirmation is available, the "pkts" counter would be bumped. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250116104628.123555-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice: support FW Recovery Mode Konrad Knitter says: Enable update of card in FW Recovery Mode * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: support FW Recovery Mode devlink: add devl guard pldmfw: enable selected component update ==================== Link: https://patch.msgid.link/20250116212059.1254349-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17Merge tag 'soc-fixes-6.13-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC fixes from Arnd Bergmann: "Two last minute fixes: one build issue on TI soc drivers, and a regression in the renesas reset controller driver" * tag 'soc-fixes-6.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: soc: ti: pruss: Fix pruss APIs reset: rzg2l-usbphy-ctrl: Assign proper of node to the allocated device
2025-01-17tracing: gfp: Fix the GFP enum values shown for user space tracing toolsSteven Rostedt
Tracing tools like perf and trace-cmd read the /sys/kernel/tracing/events/*/*/format files to know how to parse the data and also how to print it. For the "print fmt" portion of that file, if anything uses an enum that is not exported to the tracing system, user space will not be able to parse it. The GFP flags use to be defines, and defines get translated in the print fmt sections. But now they are converted to use enums, which is not. The mm_page_alloc trace event format use to have: print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s", REC->pfn != -1UL ? (((struct page *)vmemmap_base) + (REC->pfn)) : ((void *)0), REC->pfn != -1UL ? REC->pfn : 0, REC->order, REC->migratetype, (REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {( unsigned long)(((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) | (( gfp_t)0x40000u) | (( gfp_t)0x80000u) | (( gfp_t)0x2000u)) & ~(( gfp_t)(0x400u|0x800u))) | (( gfp_t)0x400u)), "GFP_TRANSHUGE"}, {( unsigned long)((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) | (( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) ... Where the GFP values are shown and not their names. But after the GFP flags were converted to use enums, it has: print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s", REC->pfn != -1UL ? (vmemmap + (REC->pfn)) : ((void *)0), REC->pfn != -1UL ? REC->pfn : 0, REC->order, REC->migratetype, (REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {( unsigned long)(((((((( gfp_t)(((((1UL))) << (___GFP_DIRECT_RECLAIM_BIT))|((((1UL))) << (___GFP_KSWAPD_RECLAIM_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_IO_BIT))) | (( gfp_t)((((1UL))) << (___GFP_FS_BIT))) | (( gfp_t)((((1UL))) << (___GFP_HARDWALL_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_HIGHMEM_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_MOVABLE_BIT))) | (( gfp_t)0)) | (( gfp_t)((((1UL))) << (___GFP_COMP_BIT))) ... Where the enums names like ___GFP_KSWAPD_RECLAIM_BIT are shown and not their values. User space has no way to convert these names to their values and the output will fail to parse. What is shown is now: mm_page_alloc: page=0xffffffff981685f3 pfn=0x1d1ac1 order=0 migratetype=1 gfp_flags=0x140cca The TRACE_DEFINE_ENUM() macro was created to handle enums in the print fmt files. This causes them to be replaced at boot up with the numbers, so that user space tooling can parse it. By using this macro, the output is back to the human readable: mm_page_alloc: page=0xffffffff981685f3 pfn=0x122233 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Veronika Molnarova <vmolnaro@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250116214438.749504792@goodmis.org Reported-by: Michael Petlan <mpetlan@redhat.com> Closes: https://lore.kernel.org/all/87be5f7c-1a0-dad-daa0-54e342efaea7@redhat.com/ Fixes: 772dd0342727c ("mm: enumerate all gfp flags") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-17block: Add common atomic writes enable flagJohn Garry
Currently only stacked devices need to explicitly enable atomic writes by setting BLK_FEAT_ATOMIC_WRITES_STACKED flag. This does not work well for device mapper stacking devices, as there many sets of limits are stacked and what is the 'bottom' and 'top' device can swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set for many queue limits, which is messy. Generalize enabling atomic writes enabling by ensuring that all devices must explicitly set a flag - that includes NVMe, SCSI sd, and md raid. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-17PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq()Peng Fan
Add device-managed variant of dev_pm_set_wake_irq which automatically clear the wake irq on device destruction to simplify error handling and resource management in drivers. Signed-off-by: Peng Fan <peng.fan@nxp.com> Link: https://patch.msgid.link/20250103-wake_irq-v2-1-e3aeff5e9966@nxp.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-17regulator: Add support for power budgetKory Maincent
Introduce power budget management for the regulator device. Enable tracking of available power capacity by providing helpers to request and release power budget allocations. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250115-feature_regulator_pw_budget-v2-1-0a44b949e6bc@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-01-16net: phylink: add EEE managementRussell King (Oracle)
Add EEE management to phylink, making use of the phylib implementation. This will only be used where a MAC driver populates the methods and capabilities bitfield, otherwise we keep our old behaviour. Phylink will keep track of the EEE configuration, including the clock stop abilities at each end of the MAC to PHY link, programming the PHY appropriately and preserving the LPI configuration should the PHY go away. Phylink will call into the MAC driver when LPI needs to be enabled or disabled, with the requirement that the MAC have LPI disabled prior to the netdev being brought up (in other words, it will only call mac_disable_tx_lpi() if it has already called mac_enable_tx_lpi().) Support for phylink managed EEE is enabled by populating both tx_lpi MAC operations method pointers, and filling in both LPI interfaces and capabilities. If the methods are provided but the LPI interfaces or capabilities remain empty, this indicates to phylink that EEE is implemented by the driver but the hardware it is driving does not support EEE, and thus the ethtool set_eee() and get_eee() methods will return EOPNOTSUPP. No validation of the LPI timer value is performed by this patch. For interface modes which do not support LPI, we make no attempt to manipulate the phylib EEE advertisement, but instead refuse to activate LPI at the MAC, noting it at debug message level. We also restrict the advertisement and reported userspace support linkmode masks according to the lpi_capabilities provided to phylink by the MAC driver. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1tYADq-0014Pn-J1@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16net: phy: add support for querying PHY clock stop capabilityRussell King (Oracle)
Add support for querying whether the PHY allows the transmit xMII clock to be stopped while in LPI mode. This will be used by phylink to pass to the MAC driver so it can configure the generation of the xMII clock appropriately. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1tYADg-0014Pb-AJ@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16net: mdio: add definition for clock stop capable bitRussell King (Oracle)
Add a definition for the clock stop capable bit in the PCS MMD. This bit indicates whether the MAC is able to stop the transmit xMII clock while it is signalling LPI. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1tYADb-0014PV-6T@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16clk: bcm: rpi: Add disp clockMaxime Ripard
BCM2712 has an extra clock exposed by the firmware called DISP, and used by (at least) the HVS. Let's add it to the list of clocks to register in Linux. Acked-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com> Link: https://lore.kernel.org/r/20250116-bcm2712-clk-updates-v1-5-10bc92ffbf41@raspberrypi.com Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2025-01-16devlink: add devl guardKonrad Knitter
Add devl guard for scoped_guard(). Example usage: scoped_guard(devl, priv_to_devlink(pf)) { err = init_devlink(pf); if (err) return err; } Co-developed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Konrad Knitter <konrad.knitter@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-16pldmfw: enable selected component updateKonrad Knitter
This patch enables to update a selected component from PLDM image containing multiple components. Example usage: struct pldmfw; data.mode = PLDMFW_UPDATE_MODE_SINGLE_COMPONENT; data.compontent_identifier = DRIVER_FW_MGMT_COMPONENT_ID; Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Konrad Knitter <konrad.knitter@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc8). Conflicts: drivers/net/ethernet/realtek/r8169_main.c 1f691a1fc4be ("r8169: remove redundant hwmon support") 152d00a91396 ("r8169: simplify setting hwmon attribute visibility") https://lore.kernel.org/20250115122152.760b4e8d@canb.auug.org.au Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 152f4da05aee ("bnxt_en: add support for rx-copybreak ethtool command") f0aa6a37a3db ("eth: bnxt: always recalculate features after XDP clearing, fix null-deref") drivers/net/ethernet/intel/ice/ice_type.h 50327223a8bb ("ice: add lock to protect low latency interface") dc26548d729e ("ice: Fix quad registers read on E825") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16Merge tag 'net-6.13-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Notably this includes fixes for a few regressions spotted very recently. No known outstanding ones. Current release - regressions: - core: avoid CFI problems with sock priv helpers - xsk: bring back busy polling support - netpoll: ensure skb_pool list is always initialized Current release - new code bugs: - core: make page_pool_ref_netmem work with net iovs - ipv4: route: fix drop reason being overridden in ip_route_input_slow - udp: make rehash4 independent in udp_lib_rehash() Previous releases - regressions: - bpf: fix bpf_sk_select_reuseport() memory leak - openvswitch: fix lockup on tx to unregistering netdev with carrier - mptcp: be sure to send ack when mptcp-level window re-opens - eth: - bnxt: always recalculate features after XDP clearing, fix null-deref - mlx5: fix sub-function add port error handling - fec: handle page_pool_dev_alloc_pages error Previous releases - always broken: - vsock: some fixes due to transport de-assignment - eth: - ice: fix E825 initialization - mlx5e: fix inversion dependency warning while enabling IPsec tunnel - gtp: destroy device along with udp socket's netns dismantle. - xilinx: axienet: Fix IRQ coalescing packet count overflow" * tag 'net-6.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits) netdev: avoid CFI problems with sock priv helpers net/mlx5e: Always start IPsec sequence number from 1 net/mlx5e: Rely on reqid in IPsec tunnel mode net/mlx5e: Fix inversion dependency warning while enabling IPsec tunnel net/mlx5: Clear port select structure when fail to create net/mlx5: SF, Fix add port error handling net/mlx5: Fix a lockdep warning as part of the write combining test net/mlx5: Fix RDMA TX steering prio net: make page_pool_ref_netmem work with net iovs net: ethernet: xgbe: re-add aneg to supported features in PHY quirks net: pcs: xpcs: actively unset DW_VR_MII_DIG_CTRL1_2G5_EN for 1G SGMII net: pcs: xpcs: fix DW_VR_MII_DIG_CTRL1_2G5_EN bit being set for 1G SGMII w/o inband selftests: net: Adapt ethtool mq tests to fix in qdisc graft net: fec: handle page_pool_dev_alloc_pages error net: netpoll: ensure skb_pool list is always initialized net: xilinx: axienet: Fix IRQ coalescing packet count overflow nfp: bpf: prevent integer overflow in nfp_bpf_event_output() selftests: mptcp: avoid spurious errors on disconnect mptcp: fix spurious wake-up on under memory pressure mptcp: be sure to send ack when mptcp-level window re-opens ...
2025-01-16hrtimers: Handle CPU state correctly on hotplugKoichiro Den
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to CPUHP_ONLINE: Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set to 1 throughout. However, during a CPU unplug operation, the tick and the clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online state, for instance CFS incorrectly assumes that the hrtick is already active, and the chance of the clockevent device to transition to oneshot mode is also lost forever for the CPU, unless it goes back to a lower state than CPUHP_HRTIMERS_PREPARE once. This round-trip reveals another issue; cpu_base.online is not set to 1 after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer(). Aside of that, the bulk of the per CPU state is not reset either, which means there are dangling pointers in the worst case. Address this by adding a corresponding startup() callback, which resets the stale per CPU state and sets the online flag. [ tglx: Make the new callback unconditionally available, remove the online modification in the prepare() callback and clear the remaining state in the starting callback instead of the prepare callback ] Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
2025-01-16lockref: add a lockref_init helperChristoph Hellwig
Add a helper to initialize the lockdep, that is initialize the spinlock and set a value. Having to open code them isn't a big deal, but having an initializer feels right for a proper primitive. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250115094702.504610-6-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-16lockref: drop superfluous externsChristoph Hellwig
Drop the superfluous externs from the remaining prototypes in lockref.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250115094702.504610-5-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-16lockref: use bool for false/true returnsChristoph Hellwig
Replace int used as bool with the actual bool type for return values that can only be true or false. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250115094702.504610-4-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-16lockref: remove lockref_put_not_zeroChristoph Hellwig
lockref_put_not_zero is not used anywhere, and unless I'm missing something didn't end up being used used at all. Remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250115094702.504610-2-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-16fs: Fix return type of do_mount() from long to intSentaro Onizuka
Fix the return type of do_mount() function from long to int to match its ac tual behavior. The function only returns int values, and all callers, inclu ding those in fs/namespace.c and arch/alpha/kernel/osf_sys.c, already treat the return value as int. This change improves type consistency across the filesystem code and aligns the function signature with its existing impleme ntation and usage. Signed-off-by: Sentaro Onizuka <sentaro@amazon.com> Link: https://lore.kernel.org/r/20250113151400.55512-1-sentaro@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-15Merge branch 'net-reduce-rtnl-pressure-in-unregister_netdevice'Jakub Kicinski
Eric Dumazet says: ==================== net: reduce RTNL pressure in unregister_netdevice() One major source of RTNL contention resides in unregister_netdevice() Due to RCU protection of various network structures, and unregister_netdevice() being a synchronous function, it is calling potentially slow functions while holding RTNL. I think we can release RTNL in two points, so that three slow functions are called while RTNL can be used by other threads. v1: https://lore.kernel.org/netdev/20250107130906.098fc8d6@kernel.org/T/#m398c95f5778e1ff70938e079d3c4c43c050ad2a6 ==================== Link: https://patch.msgid.link/20250114205531.967841-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: expedite synchronize_net() for cleanup_net()Eric Dumazet
cleanup_net() is the single thread responsible for netns dismantles, and a serious bottleneck. Before we can get per-netns RTNL, make sure all synchronize_net() called from this thread are using rcu_synchronize_expedited(). v3: deal with CONFIG_NET_NS=n Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250114205531.967841-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: protect NAPI config fields with netdev_lock()Jakub Kicinski
Protect the following members of netdev and napi by netdev_lock: - defer_hard_irqs, - gro_flush_timeout, - irq_suspend_timeout. The first two are written via sysfs (which this patch switches to new lock), and netdev genl which holds both netdev and rtnl locks. irq_suspend_timeout is only written by netdev genl. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>