summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-01-20Merge tag 'vfs-6.14-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support caching symlink lengths in inodes The size is stored in a new union utilizing the same space as i_devices, thus avoiding growing the struct or taking up any more space When utilized it dodges strlen() in vfs_readlink(), giving about 1.5% speed up when issuing readlink on /initrd.img on ext4 - Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP - Enable VBOXGUEST and VBOXSF_FS on ARM64 Now that VirtualBox is able to run as a host on arm64 (e.g. the Apple M3 processors) we can enable VBOXSF_FS (and in turn VBOXGUEST) for this architecture. Tested with various runs of bonnie++ and dbench on an Apple MacBook Pro with the latest Virtualbox 7.1.4 r165100 installed Cleanups: - Delay sysctl_nr_open check in expand_files() - Use kernel-doc includes in fiemap docbook - Use page->private instead of page->index in watch_queue - Use a consume fence in mnt_idmap() as it's heavily used in link_path_walk() - Replace magic number 7 with ARRAY_SIZE() in fc_log - Sort out a stale comment about races between fd alloc and dup2() - Fix return type of do_mount() from long to int - Various cosmetic cleanups for the lockref code Fixes: - Annotate spinning as unlikely() in __read_seqcount_begin The annotation already used to be there, but got lost in commit 52ac39e5db51 ("seqlock: seqcount_t: Implement all read APIs as statement expressions") - Fix proc_handler for sysctl_nr_open - Flush delayed work in delayed fput() - Fix grammar and spelling in propagate_umount() - Fix ESP not readable during coredump In /proc/PID/stat, there is the kstkesp field which is the stack pointer of a thread. While the thread is active, this field reads zero. But during a coredump, it should have a valid value However, at the moment, kstkesp is zero even during coredump - Don't wake up the writer if the pipe is still full - Fix unbalanced user_access_end() in select code" * tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) gfs2: use lockref_init for qd_lockref erofs: use lockref_init for pcl->lockref dcache: use lockref_init for d_lockref lockref: add a lockref_init helper lockref: drop superfluous externs lockref: use bool for false/true returns lockref: improve the lockref_get_not_zero description lockref: remove lockref_put_not_zero fs: Fix return type of do_mount() from long to int select: Fix unbalanced user_access_end() vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64 pipe_read: don't wake up the writer if the pipe is still full selftests: coredump: Add stackdump test fs/proc: do_task_stat: Fix ESP not readable during coredump fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag fs: sort out a stale comment about races between fd alloc and dup2 fs: Fix grammar and spelling in propagate_umount() fs: fc_log replace magic number 7 with ARRAY_SIZE() fs: use a consume fence in mnt_idmap() file: flush delayed work in delayed fput() ...
2025-01-20Merge tag 'vfs-6.14-rc1.kcore' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull /proc/kcore updates from Christian Brauner: "The performance of /proc/kcore reads has been showing up as a bottleneck for the drgn debugger. drgn scripts often spend ~25% of their time in the kernel reading from /proc/kcore. A lot of this overhead comes from silly inefficiencies. This pull request contains fixes for the low-hanging fruit. The fixes are all fairly small and straightforward. The result is a 25% improvement in read latency in micro-benchmarks (from ~235 nanoseconds to ~175) and a 15% improvement in execution time for real-world drgn scripts: - Make /proc/kcore entry permanent - Avoid walking the list on every read - Use percpu_rw_semaphore for kclist_lock - Make Omar Sandoval the official maintainer for /proc/kcore" * tag 'vfs-6.14-rc1.kcore' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: MAINTAINERS: add me as /proc/kcore maintainer proc/kcore: use percpu_rw_semaphore for kclist_lock proc/kcore: don't walk list on every read proc/kcore: mark proc entry as permanent
2025-01-20Merge tag 'vfs-6.14-rc1.netfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs netfs updates from Christian Brauner: "This contains read performance improvements and support for monolithic single-blob objects that have to be read/written as such (e.g. AFS directory contents). The implementation of the two parts is interwoven as each makes the other possible. - Read performance improvements The read performance improvements are intended to speed up some loss of performance detected in cifs and to a lesser extend in afs. The problem is that we queue too many work items during the collection of read results: each individual subrequest is collected by its own work item, and then they have to interact with each other when a series of subrequests don't exactly align with the pattern of folios that are being read by the overall request. Whilst the processing of the pages covered by individual subrequests as they complete potentially allows folios to be woken in parallel and with minimum delay, it can shuffle wakeups for sequential reads out of order - and that is the most common I/O pattern. The final assessment and cleanup of an operation is then held up until the last I/O completes - and for a synchronous sequential operation, this means the bouncing around of work items just adds latency. Two changes have been made to make this work: (1) All collection is now done in a single "work item" that works progressively through the subrequests as they complete (and also dispatches retries as necessary). (2) For readahead and AIO, this work item be done on a workqueue and can run in parallel with the ultimate consumer of the data; for synchronous direct or unbuffered reads, the collection is run in the application thread and not offloaded. Functions such as smb2_readv_callback() then just tell netfslib that the subrequest has terminated; netfslib does a minimal bit of processing on the spot - stat counting and tracing mostly - and then queues/wakes up the worker. This simplifies the logic as the collector just walks sequentially through the subrequests as they complete and walks through the folios, if buffered, unlocking them as it goes. It also keeps to a minimum the amount of latency injected into the filesystem's low-level I/O handling The way netfs supports filesystems using the deprecated PG_private_2 flag is changed: folios are flagged and added to a write request as they complete and that takes care of scheduling the writes to the cache. The originating read request can then just unlock the pages whatever happens. - Single-blob object support Single-blob objects are files for which the content of the file must be read from or written to the server in a single operation because reading them in parts may yield inconsistent results. AFS directories are an example of this as there exists the possibility that the contents are generated on the fly and would differ between reads or might change due to third party interference. Such objects will be written to and retrieved from the cache if one is present, though we allow/may need to propose multiple subrequests to do so. The important part is that read from/write to the *server* is monolithic. Single blob reading is, for the moment, fully synchronous and does result collection in the application thread and, also for the moment, the API is supplied the buffer in the form of a folio_queue chain rather than using the pagecache. - Related afs changes This series makes a number of changes to the kafs filesystem, primarily in the area of directory handling: - AFS's FetchData RPC reply processing is made partially asynchronous which allows the netfs_io_request's outstanding operation counter to be removed as part of reducing the collection to a single work item. - Directory and symlink reading are plumbed through netfslib using the single-blob object API and are now cacheable with fscache. This also allows the afs_read struct to be eliminated and netfs_io_subrequest to be used directly instead. - Directory and symlink content are now stored in a folio_queue buffer rather than in the pagecache. This means we don't require the RCU read lock and xarray iteration to access it, and folios won't randomly disappear under us because the VM wants them back. - The vnode operation lock is changed from a mutex struct to a private lock implementation. The problem is that the lock now needs to be dropped in a separate thread and mutexes don't permit that. - When a new directory or symlink is created, we now initialise it locally and mark it valid rather than downloading it (we know what it's likely to look like). - We now use the in-directory hashtable to reduce the number of entries we need to scan when doing a lookup. The edit routines have to maintain the hash chains. - Cancellation (e.g. by signal) of an async call after the rxrpc_call has been set up is now offloaded to the worker thread as there will be a notification from rxrpc upon completion. This avoids a double cleanup. - A "rolling buffer" implementation is created to abstract out the two separate folio_queue chaining implementations I had (one for read and one for write). - Functions are provided to create/extend a buffer in a folio_queue chain and tear it down again. This is used to handle AFS directories, but could also be used to create bounce buffers for content crypto and transport crypto. - The was_async argument is dropped from netfs_read_subreq_terminated() Instead we wake the read collection work item by either queuing it or waking up the app thread. - We don't need to use BH-excluding locks when communicating between the issuing thread and the collection thread as neither of them now run in BH context. - Also included are a number of new tracepoints; a split of the netfslib write collection code to put retrying into its own file (it gets more complicated with content encryption). - There are also some minor fixes AFS included, including fixing the AFS directory format struct layout, reducing some directory over-invalidation and making afs_mkdir() translate EEXIST to ENOTEMPY (which is not available on all systems the servers support). - Finally, there's a patch to try and detect entry into the folio unlock function with no folio_queue structs in the buffer (which isn't allowed in the cases that can get there). This is a debugging patch, but should be minimal overhead" * tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits) netfs: Report on NULL folioq in netfs_writeback_unlock_folios() afs: Add a tracepoint for afs_read_receive() afs: Locally initialise the contents of a new symlink on creation afs: Use the contained hashtable to search a directory afs: Make afs_mkdir() locally initialise a new directory's content netfs: Change the read result collector to only use one work item afs: Make {Y,}FS.FetchData an asynchronous operation afs: Fix cleanup of immediately failed async calls afs: Eliminate afs_read afs: Use netfslib for symlinks, allowing them to be cached afs: Use netfslib for directories afs: Make afs_init_request() get a key if not given a file netfs: Add support for caching single monolithic objects such as AFS dirs netfs: Add functions to build/clean a buffer in a folio_queue afs: Add more tracepoints to do with tracking validity cachefiles: Add auxiliary data trace cachefiles: Add some subrequest tracepoints netfs: Remove some extraneous directory invalidations afs: Fix directory format encoding struct afs: Fix EEXIST error returned from afs_rmdir() to be ENOTEMPTY ...
2025-01-20x86: use cmov for user address maskingLinus Torvalds
This was a suggestion by David Laight, and while I was slightly worried that some micro-architecture would predict cmov like a conditional branch, there is little reason to actually believe any core would be that broken. Intel documents that their existing cores treat CMOVcc as a data dependency that will constrain speculation in their "Speculative Execution Side Channel Mitigations" whitepaper: "Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also be used to prevent bounds check bypass by constraining speculative execution on current family 6 processors (Intel® Core™, Intel® Atom™, Intel® Xeon® and Intel® Xeon Phi™ processors)" and while that leaves the future uarch issues open, that's certainly true of our traditional SBB usage too. Any core that predicts CMOV will be unusable for various crypto algorithms that need data-independent timing stability, so let's just treat CMOV as the safe choice that simplifies the address masking by avoiding an extra instruction and doesn't need a temporary register. Suggested-by: David Laight <David.Laight@aculab.com> Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-20x86: use proper 'clac' and 'stac' opcode namesLinus Torvalds
Back when we added SMAP support, all versions of binutils didn't necessarily understand the 'clac' and 'stac' instructions. So we implemented those instructions manually as ".byte" sequences. But we've since upgraded the minimum version of binutils to version 2.25, and that included proper support for the SMAP instructions, and there's no reason for us to use some line noise to express them any more. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-20Merge branch 'for-6.14/selftests-trivial' into for-linusPetr Mladek
2025-01-20Merge branch 'for-6.14-cpu_sync-fixup' into for-linusPetr Mladek
2025-01-20net: phy: realtek: HWMON support for standalone versions of RTL8221B and RTL8251Aleksander Jan Bajkowski
HWMON support has been added for the RTL8221/8251 PHYs integrated together with the MAC inside the RTL8125/8126 chips. This patch extends temperature reading support for standalone variants of the mentioned PHYs. I don't know whether the earlier revisions of the RTL8226 also have a built-in temperature sensor, so they have been skipped for now. Tested on RTL8221B-VB-CG. Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20tcp_cubic: fix incorrect HyStart round start detectionMahdi Arghavani
I noticed that HyStart incorrectly marks the start of rounds, leading to inaccurate measurements of ACK train lengths and resetting the `ca->sample_cnt` variable. This inaccuracy can impact HyStart's functionality in terminating exponential cwnd growth during Slow-Start, potentially degrading TCP performance. The issue arises because the changes introduced in commit 4e1fddc98d25 ("tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows") moved the caller of the `bictcp_hystart_reset` function inside the `hystart_update` function. This modification added an additional condition for triggering the caller, requiring that (tcp_snd_cwnd(tp) >= hystart_low_window) must also be satisfied before invoking `bictcp_hystart_reset`. This fix ensures that `bictcp_hystart_reset` is correctly called at the start of a new round, regardless of the congestion window size. This is achieved by moving the condition (tcp_snd_cwnd(tp) >= hystart_low_window) from before calling `bictcp_hystart_reset` to after it. I tested with a client and a server connected through two Linux software routers. In this setup, the minimum RTT was 150 ms, the bottleneck bandwidth was 50 Mbps, and the bottleneck buffer size was 1 BDP, calculated as (50M / 1514 / 8) * 0.150 = 619 packets. I conducted the test twice, transferring data from the server to the client for 1.5 seconds. Before the patch was applied, HYSTART-DELAY stopped the exponential growth of cwnd when cwnd = 516, and the bottleneck link was not yet saturated (516 < 619). After the patch was applied, HYSTART-ACK-TRAIN stopped the exponential growth of cwnd when cwnd = 632, and the bottleneck link was saturated (632 > 619). In this test, applying the patch resulted in 300 KB more data delivered. Fixes: 4e1fddc98d25 ("tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows") Signed-off-by: Mahdi Arghavani <ma.arghavani@yahoo.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Haibo Zhang <haibo.zhang@otago.ac.nz> Cc: David Eyers <david.eyers@otago.ac.nz> Cc: Abbas Arghavani <abbas.arghavani@mdu.se> Reviewed-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20net: macsec: Add endianness annotations in salt structAles Nezbeda
This change resolves warning produced by sparse tool as currently there is a mismatch between normal generic type in salt and endian annotated type in macsec driver code. Endian annotated types should be used here. Sparse output: warning: restricted ssci_t degrades to integer warning: incorrect type in assignment (different base types) expected restricted ssci_t [usertype] ssci got unsigned int warning: restricted __be64 degrades to integer warning: incorrect type in assignment (different base types) expected restricted __be64 [usertype] pn got unsigned long long Signed-off-by: Ales Nezbeda <anezbeda@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20tipc: re-order conditions in tipc_crypto_key_rcv()Dan Carpenter
On a 32bit system the "keylen + sizeof(struct tipc_aead_key)" math could have an integer wrapping issue. It doesn't matter because the "keylen" is checked on the next line, but just to make life easier for static analysis tools, let's re-order these conditions and avoid the integer overflow. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20net: phylink: always do a major config when attaching a SFP PHYRussell King (Oracle)
Background: https://lore.kernel.org/r/20250107123615.161095-1-ericwouds@gmail.com Since adding negotiation of in-band capabilities, it is no longer sufficient to just look at the MLO_AN_xxx mode and PHY interface to decide whether to do a major configuration, since the result now depends on the capabilities of the attaching PHY. Always trigger a major configuration in this case. Testing log: https://lore.kernel.org/r/f20c9744-3953-40e7-a9c9-5534b25d2e2a@gmail.com Reported-by: Eric Woudstra <ericwouds@gmail.com> Tested-by: Eric Woudstra <ericwouds@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20samples/vfs: fix build warningsChristian Brauner
Fix build warnings reported from linux-next. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lore.kernel.org/r/20250120192504.4a1965a0@canb.auug.org.au Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-20samples/vfs: use shared headerChristian Brauner
Share some infrastructure between sample programs and fix a build failure that was reported. Reported-by: Sasha Levin <sashal@kernel.org> Link: https://lore.kernel.org/r/Z42UkSXx0MS9qZ9w@lappy Link: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-511-g109a8e0fa9d6/testrun/26809210/suite/build/test/gcc-8-allyesconfig/log Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-20net: appletalk: Drop aarp_send_probe_phase1()谢致邦 (XIE Zhibang)
aarp_send_probe_phase1() used to work by calling ndo_do_ioctl of appletalk drivers ltpc or cops, but these two drivers have been removed since the following commits: commit 03dcb90dbf62 ("net: appletalk: remove Apple/Farallon LocalTalk PC support") commit 00f3696f7555 ("net: appletalk: remove cops support") Thus aarp_send_probe_phase1() no longer works, so drop it. (found by code inspection) Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20net: ethernet: ti: am65-cpsw: fix freeing IRQ in am65_cpsw_nuss_remove_tx_chns()Roger Quadros
When getting the IRQ we use k3_udma_glue_tx_get_irq() which returns negative error value on error. So not NULL check is not sufficient to deteremine if IRQ is valid. Check that IRQ is greater then zero to ensure it is valid. There is no issue at probe time but at runtime user can invoke .set_channels which results in the following call chain. am65_cpsw_set_channels() am65_cpsw_nuss_update_tx_rx_chns() am65_cpsw_nuss_remove_tx_chns() am65_cpsw_nuss_init_tx_chns() At this point if am65_cpsw_nuss_init_tx_chns() fails due to k3_udma_glue_tx_get_irq() then tx_chn->irq will be set to a negative value. Then, at subsequent .set_channels with higher channel count we will attempt to free an invalid IRQ in am65_cpsw_nuss_remove_tx_chns() leading to a kernel warning. The issue is present in the original commit that introduced this driver, although there, am65_cpsw_nuss_update_tx_rx_chns() existed as am65_cpsw_nuss_update_tx_chns(). Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver") Signed-off-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20dsa: Use str_enable_disable-like helpersKrzysztof Kozlowski
Replace ternary (condition ? "enable" : "disable") syntax with helpers from string_choices.h because: 1. Simple function call with one argument is easier to read. Ternary operator has three arguments and with wrapping might lead to quite long code. 2. Is slightly shorter thus also easier to read. 3. It brings uniformity in the text - same string. 4. Allows deduping by the linker, which results in a smaller binary file. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20net: sched: refine software bypass handling in tc_runXin Long
This patch addresses issues with filter counting in block (tcf_block), particularly for software bypass scenarios, by introducing a more accurate mechanism using useswcnt. Previously, filtercnt and skipswcnt were introduced by: Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter") filtercnt tracked all tp (tcf_proto) objects added to a block, and skipswcnt counted tp objects with the skipsw attribute set. The problem is: a single tp can contain multiple filters, some with skipsw and others without. The current implementation fails in the case: When the first filter in a tp has skipsw, both skipswcnt and filtercnt are incremented, then adding a second filter without skipsw to the same tp does not modify these counters because tp->counted is already set. This results in bypass software behavior based solely on skipswcnt equaling filtercnt, even when the block includes filters without skipsw. Consequently, filters without skipsw are inadvertently bypassed. To address this, the patch introduces useswcnt in block to explicitly count tp objects containing at least one filter without skipsw. Key changes include: Whenever a filter without skipsw is added, its tp is marked with usesw and counted in useswcnt. tc_run() now uses useswcnt to determine software bypass, eliminating reliance on filtercnt and skipswcnt. This refined approach prevents software bypass for blocks containing mixed filters, ensuring correct behavior in tc_run(). Additionally, as atomic operations on useswcnt ensure thread safety and tp->lock guards access to tp->usesw and tp->counted, the broader lock down_write(&block->cb_lock) is no longer required in tc_new_tfilter(), and this resolves a performance regression caused by the filter counting mechanism during parallel filter insertions. The improvement can be demonstrated using the following script: # cat insert_tc_rules.sh tc qdisc add dev ens1f0np0 ingress for i in $(seq 16); do taskset -c $i tc -b rules_$i.txt & done wait Each of rules_$i.txt files above includes 100000 tc filter rules to a mlx5 driver NIC ens1f0np0. Without this patch: # time sh insert_tc_rules.sh real 0m50.780s user 0m23.556s sys 4m13.032s With this patch: # time sh insert_tc_rules.sh real 0m17.718s user 0m7.807s sys 3m45.050s Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-19Merge branch 'vsnprintf'Linus Torvalds
This merges the vsnprintf internal cleanups I did, which were triggered by a combination of performance issues (see for example commit f9ed1f7c2e26: "genirq/proc: Use seq_put_decimal_ull_width() for decimal values") and discussion about tracing abusing the vsnprintf code in odd ways. The intent was to improve code generation, but also to possibly eventually expose the cleaned-up printf format decoding state machine. It certainly didn't get to the point where we'd want to expose the format decoding to external users, but it's an improvement over what we used to have. Several of the complex case statements have been simplified, or removed entirely to be replaced by simple table lookups. * branch 'vsnprintf': vsnprintf: fix the number base for non-numeric formats vsnprintf: fix up kerneldoc for argument name changes vsprintf: don't make the 'binary' version pack small integer arguments vsnprintf: collapse the number format state into one single state vsnprintf: mark the indirect width and precision cases unlikely vsnprintf: inline skip_atoi() again vsprintf: deal with format specifiers with a lookup table vsprintf: deal with format flags with a simple lookup table vsprintf: associate the format state with the format pointer vsprintf: fix calling convention for format_decode() vsprintf: avoid nested switch statement on same variable vsprintf: simplify number handling
2025-01-19Linux 6.13v6.13for-nextLinus Torvalds
2025-01-19Merge tag 'x86_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Mark serialize() noinstr so that it can be used from instrumentation- free code - Make sure FRED's RSP0 MSR is synchronized with its corresponding per-CPU value in order to avoid double faults in hotplug scenarios - Disable EXECMEM_ROX on x86 for now because it didn't receive proper x86 maintainers review, went in and broke a bunch of things * tag 'x86_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Make serialize() always_inline x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cache x86: Disable EXECMEM_ROX support
2025-01-19Merge tag 'timers_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Borislav Petkov: - Reset hrtimers correctly when a CPU hotplug state traversal happens "half-ways" and leaves hrtimers not (re-)initialized properly - Annotate accesses to a timer group's ignore flag to prevent KCSAN from raising data_race warnings - Make sure timer group initialization is visible to timer tree walkers and avoid a hypothetical race - Fix another race between CPU hotplug and idle entry/exit where timers on a fully idle system are getting ignored - Fix a case where an ignored signal is still being handled which it shouldn't be * tag 'timers_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimers: Handle CPU state correctly on hotplug timers/migration: Annotate accesses to ignore flag timers/migration: Enforce group initialization visibility to tree walkers timers/migration: Fix another race between hotplug and idle entry/exit signal/posixtimers: Handle ignore/blocked sequences correctly
2025-01-19Merge tag 'irq_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Fix an OF node leak in irqchip init's error handling path - Fix sunxi systems to wake up from suspend with an NMI by pressing the power button - Do not spuriously enable interrupts in gic-v3 in a nested interrupts-off section - Make sure gic-v3 handles properly a failure to enter a low power state * tag 'irq_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip: Plug a OF node reference leak in platform_irqchip_probe() irqchip/sunxi-nmi: Add missing SKIP_WAKE flag irqchip/gic-v3-its: Don't enable interrupts in its_irq_set_vcpu_affinity() irqchip/gic-v3: Handle CPU_PM_ENTER_FAILED correctly
2025-01-19Merge tag 'sched_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Do not adjust the weight of empty group entities and avoid scheduling artifacts - Avoid scheduling lag by computing lag properly and thus address an EEVDF entity placement issue * tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE sched/fair: Fix EEVDF entity placement bug causing scheduling lag
2025-01-19netfilter: flowtable: add CLOSING statePablo Neira Ayuso
tcp rst/fin packet triggers an immediate teardown of the flow which results in sending flows back to the classic forwarding path. This behaviour was introduced by: da5984e51063 ("netfilter: nf_flow_table: add support for sending flows back to the slow path") b6f27d322a0a ("netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen") whose goal is to expedite removal of flow entries from the hardware table. Before these patches, the flow was released after the flow entry timed out. However, this approach leads to packet races when restoring the conntrack state as well as late flow re-offload situations when the TCP connection is ending. This patch adds a new CLOSING state that is is entered when tcp rst/fin packet is seen. This allows for an early removal of the flow entry from the hardware table. But the flow entry still remains in software, so tcp packets to shut down the flow are not sent back to slow path. If syn packet is seen from this new CLOSING state, then this flow enters teardown state, ct state is set to TCP_CONNTRACK_CLOSE state and packet is sent to slow path, so this TCP reopen scenario can be handled by conntrack. TCP_CONNTRACK_CLOSE provides a small timeout that aims at quickly releasing this stale entry from the conntrack table. Moreover, skip hardware re-offload from flowtable software packet if the flow is in CLOSING state. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: flowtable: teardown flow if cached mtu is stalePablo Neira Ayuso
Tear down the flow entry in the unlikely case that the interface mtu changes, this gives the flow a chance to refresh the cached mtu, otherwise such refresh does not occur until flow entry expires. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: conntrack: rework offload nf_conn timeout extension logicFlorian Westphal
Offload nf_conn entries may not see traffic for a very long time. To prevent incorrect 'ct is stale' checks during nf_conntrack table lookup, the gc worker extends the timeout nf_conn entries marked for offload to a large value. The existing logic suffers from a few problems. Garbage collection runs without locks, its unlikely but possible that @ct is removed right after the 'offload' bit test. In that case, the timeout of a new/reallocated nf_conn entry will be increased. Prevent this by obtaining a reference count on the ct object and re-check of the confirmed and offload bits. If those are not set, the ct is being removed, skip the timeout extension in this case. Parallel teardown is also problematic: cpu1 cpu2 gc_worker calls flow_offload_teardown() tests OFFLOAD bit, set clear OFFLOAD bit ct->timeout is repaired (e.g. set to timeout[UDP_CT_REPLIED]) nf_ct_offload_timeout() called expire value is fetched <INTERRUPT> -> NF_CT_DAY timeout for flow that isn't offloaded (and might not see any further packets). Use cmpxchg: if ct->timeout was repaired after the 2nd 'offload bit' test passed, then ct->timeout will only be updated of ct->timeout was not altered in between. As we already have a gc worker for flowtable entries, ct->timeout repair can be handled from the flowtable gc worker. This avoids having flowtable specific logic in the conntrack core and avoids checking entries that were never offloaded. This allows to remove the nf_ct_offload_timeout helper. Its safe to use in the add case, but not on teardown. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: conntrack: remove skb argument from nf_ct_refreshFlorian Westphal
Its not used (and could be NULL), so remove it. This allows to use nf_ct_refresh in places where we don't have an skb without having to double-check that skb == NULL would be safe. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nft_flow_offload: update tcp state flags under lockFlorian Westphal
The conntrack entry is already public, there is a small chance that another CPU is handling a packet in reply direction and racing with the tcp state update. Move this under ct spinlock. This is done once, when ct is about to be offloaded, so this should not result in a noticeable performance hit. Fixes: 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpathFlorian Westphal
This state reset is racy, no locks are held here. Since commit 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp"), the window checks are disabled for normal data packets, but MAXACK flag is checked when validating TCP resets. Clear the flag so tcp reset validation checks are ignored. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: Simplify chain netdev notifierPhil Sutter
With conditional chain deletion gone, callback code simplifies: Instead of filling an nft_ctx object, just pass basechain to the per-chain function. Also plain list_for_each_entry() is safe now. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: Tolerate chains with no remaining hooksPhil Sutter
Do not drop a netdev-family chain if the last interface it is registered for vanishes. Users dumping and storing the ruleset upon shutdown to restore it upon next boot may otherwise lose the chain and all contained rules. They will still lose the list of devices, a later patch will fix that. For now, this aligns the event handler's behaviour with that for flowtables. The controversal situation at netns exit should be no problem here: event handler will unregister the hooks, core nftables cleanup code will drop the chain itself. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: Compare netdev hooks based on stored namePhil Sutter
The 1:1 relationship between nft_hook and nf_hook_ops is about to break, so choose the stored ifname to uniquely identify hooks. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: Use stored ifname in netdev hook dumpsPhil Sutter
The stored ifname and ops.dev->name may deviate after creation due to interface name changes. Prefer the more deterministic stored name in dumps which also helps avoiding inadvertent changes to stored ruleset dumps. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: Store user-defined hook ifnamePhil Sutter
Prepare for hooks with NULL ops.dev pointer (due to non-existent device) and store the interface name and length as specified by the user upon creation. No functional change intended. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: Flowtable hook's pf value never variesPhil Sutter
When checking for duplicate hooks in nft_register_flowtable_net_hooks(), comparing ops.pf value is pointless as it is always NFPROTO_NETDEV with flowtable hooks. Dropping the check leaves the search identical to the one in nft_hook_list_find() so call that function instead of open coding. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: br_netfilter: remove unused conditional and dead codeAntoine Tenart
The SKB_DROP_REASON_IP_INADDRERRORS drop reason is never returned from any function, as such it cannot be returned from the ip_route_input call tree. The 'reason != SKB_DROP_REASON_IP_INADDRERRORS' conditional is thus always true. Looking back at history, commit 50038bf38e65 ("net: ip: make ip_route_input() return drop reasons") changed the ip_route_input returned value check in br_nf_pre_routing_finish from -EHOSTUNREACH to SKB_DROP_REASON_IP_INADDRERRORS. It turns out -EHOSTUNREACH could not be returned either from the ip_route_input call tree and this since commit 251da4130115 ("ipv4: Cache ip_error() routes even when not forwarding."). Not a fix as this won't change the behavior. While at it use kfree_skb_reason. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19netfilter: nf_tables: fix set size with rbtree backendPablo Neira Ayuso
The existing rbtree implementation uses singleton elements to represent ranges, however, userspace provides a set size according to the number of ranges in the set. Adjust provided userspace set size to the number of singleton elements in the kernel by multiplying the range by two. Check if the no-match all-zero element is already in the set, in such case release one slot in the set size. Fixes: 0ed6389c483d ("netfilter: nf_tables: rename set implementations") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-19io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_inameAl Viro
Output of io_uring_show_fdinfo() has several problems: * racy use of ->d_iname * junk if the name is long - in that case it's not stored in ->d_iname at all * lack of quoting (names can contain newlines, etc. - or be equal to "<none>", for that matter). * lines for empty slots are pointless noise - we already have the total amount, so having just the non-empty ones would carry the same information. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-18Merge tag 'batadv-next-pullrequest-20250117' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Reorder includes for distributed-arp-table.c, by Sven Eckelmann - Fix translation table change handling, by Remi Pommarel (2 patches) - Map VID 0 to untagged TT VLAN, by Sven Eckelmann - Update MAINTAINERS/mailmap e-mail addresses, by the respective authors (4 patches) - netlink: reduce duplicate code by returning interfaces, by Linus Lüssing * tag 'batadv-next-pullrequest-20250117' of git://git.open-mesh.org/linux-merge: batman-adv: netlink: reduce duplicate code by returning interfaces MAINTAINERS: mailmap: add entries for Antonio Quartulli mailmap: add entries for Sven Eckelmann mailmap: add entries for Simon Wunderlich MAINTAINERS: update email address of Marek Linder batman-adv: Map VID 0 to untagged TT VLAN batman-adv: Don't keep redundant TT change events batman-adv: Remove atomic usage for tt.local_changes batman-adv: Reorder includes for distributed-arp-table.c batman-adv: Start new development cycle ==================== Link: https://patch.msgid.link/20250117123910.219278-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18Merge tag 'for-net-next-2025-01-15' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - btusb: Add new VID/PID 13d3/3610 for MT7922 - btusb: Add new VID/PID 13d3/3628 for MT7925 - btusb: Add MT7921e device 13d3:3576 - btusb: Add RTL8851BE device 13d3:3600 - btusb: Add ID 0x2c7c:0x0130 for Qualcomm WCN785x - btusb: add sysfs attribute to control USB alt setting - qca: Expand firmware-name property - qca: Fix poor RF performance for WCN6855 - L2CAP: handle NULL sock pointer in l2cap_sock_alloc - Allow reset via sysfs - ISO: Allow BIG re-sync - dt-bindings: Utilize PMU abstraction for WCN6750 - MGMT: Mark LL Privacy as stable * tag 'for-net-next-2025-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (23 commits) Bluetooth: MGMT: Fix slab-use-after-free Read in mgmt_remove_adv_monitor_sync Bluetooth: qca: Fix poor RF performance for WCN6855 Bluetooth: Allow reset via sysfs Bluetooth: Get rid of cmd_timeout and use the reset callback Bluetooth: Remove the cmd timeout count in btusb Bluetooth: Use str_enable_disable-like helpers Bluetooth: btmtk: Remove resetting mt7921 before downloading the fw Bluetooth: L2CAP: handle NULL sock pointer in l2cap_sock_alloc Bluetooth: btusb: Add RTL8851BE device 13d3:3600 dt-bindings: bluetooth: Utilize PMU abstraction for WCN6750 Bluetooth: btusb: Add MT7921e device 13d3:3576 Bluetooth: btrtl: check for NULL in btrtl_setup_realtek() Bluetooth: btbcm: Fix NULL deref in btbcm_get_board_name() Bluetooth: qca: Expand firmware-name to load specific rampatch Bluetooth: qca: Update firmware-name to support board specific nvm dt-bindings: net: bluetooth: qca: Expand firmware-name property Bluetooth: btusb: Add new VID/PID 13d3/3628 for MT7925 Bluetooth: btusb: Add new VID/PID 13d3/3610 for MT7922 Bluetooth: btusb: add sysfs attribute to control USB alt setting Bluetooth: btusb: Add ID 0x2c7c:0x0130 for Qualcomm WCN785x ... ==================== Link: https://patch.msgid.link/20250117213203.3921910-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18Merge tag 'wireless-next-2025-01-17' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.14 Most likely the last "new features" pull request for v6.14 and this is a bigger one. Multi-Link Operation (MLO) work continues both in stack in drivers. Few new devices supported and usual fixes all over. Major changes: cfg80211 * Emergency Preparedness Communication Services (EPCS) station mode support mac80211 * an option to filter a sta from being flushed * some support for RX Operating Mode Indication (OMI) power saving * support for adding and removing station links for MLO iwlwifi * new device ids * rework firmware error handling and restart rtw88 * RTL8812A: RFE type 2 support * LED support rtw89 * variant info to support RTL8922AE-VS mt76 * mt7996: single wiphy multiband support (preparation for MLO) * mt7996: support for more variants * mt792x: P2P_DEVICE support * mt7921u: TP-Link TXE50UH support ath12k * enable MLO for QCN9274 (although it seems to be broken with dual band devices) * MLO radar detection support * debugfs: transmit buffer OFDMA, AST entry and puncture stats * tag 'wireless-next-2025-01-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (322 commits) wifi: brcmfmac: fix NULL pointer dereference in brcmf_txfinalize() wifi: rtw88: add RTW88_LEDS depends on LEDS_CLASS to Kconfig wifi: wilc1000: unregister wiphy only after netdev registration wifi: cfg80211: adjust allocation of colocated AP data wifi: mac80211: fix memory leak in ieee80211_mgd_assoc_ml_reconf() wifi: ath12k: fix key cache handling wifi: ath12k: Fix uninitialized variable access in ath12k_mac_allocate() function wifi: ath12k: Remove ath12k_get_num_hw() helper function wifi: ath12k: Refactor the ath12k_hw get helper function argument wifi: ath12k: Refactor ath12k_hw set helper function argument wifi: mt76: mt7996: add implicit beamforming support for mt7992 wifi: mt76: mt7996: fix beacon command during disabling wifi: mt76: mt7996: fix ldpc setting wifi: mt76: mt7996: fix definition of tx descriptor wifi: mt76: connac: adjust phy capabilities based on band constraints wifi: mt76: mt7996: fix incorrect indexing of MIB FW event wifi: mt76: mt7996: fix HE Phy capability wifi: mt76: mt7996: fix the capability of reception of EHT MU PPDU wifi: mt76: mt7996: add max mpdu len capability wifi: mt76: mt7921: avoid undesired changes of the preset regulatory domain ... ==================== Link: https://patch.msgid.link/20250117203529.72D45C4CEDD@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18net: introduce netdev_napi_exit()Eric Dumazet
After 1b23cdbd2bbc ("net: protect netdev->napi_list with netdev_lock()") it makes sense to iterate through dev->napi_list while holding the device lock. Also call synchronize_net() at most one time. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250117232113.1612899-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18net: phy: remove leftovers from switch to linkmode bitmapsHeiner Kallweit
We have some leftovers from the switch to linkmode bitmaps which - have never been used - are not used any longer - have no user outside phy_device.c So remove them. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/5493b96e-88bb-4230-a911-322659ec5167@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18net: sched: Disallow replacing of child qdisc from one parent to anotherJamal Hadi Salim
Lion Ackermann was able to create a UAF which can be abused for privilege escalation with the following script Step 1. create root qdisc tc qdisc add dev lo root handle 1:0 drr step2. a class for packet aggregation do demonstrate uaf tc class add dev lo classid 1:1 drr step3. a class for nesting tc class add dev lo classid 1:2 drr step4. a class to graft qdisc to tc class add dev lo classid 1:3 drr step5. tc qdisc add dev lo parent 1:1 handle 2:0 plug limit 1024 step6. tc qdisc add dev lo parent 1:2 handle 3:0 drr step7. tc class add dev lo classid 3:1 drr step 8. tc qdisc add dev lo parent 3:1 handle 4:0 pfifo step 9. Display the class/qdisc layout tc class ls dev lo class drr 1:1 root leaf 2: quantum 64Kb class drr 1:2 root leaf 3: quantum 64Kb class drr 3:1 root leaf 4: quantum 64Kb tc qdisc ls qdisc drr 1: dev lo root refcnt 2 qdisc plug 2: dev lo parent 1:1 qdisc pfifo 4: dev lo parent 3:1 limit 1000p qdisc drr 3: dev lo parent 1:2 step10. trigger the bug <=== prevented by this patch tc qdisc replace dev lo parent 1:3 handle 4:0 step 11. Redisplay again the qdiscs/classes tc class ls dev lo class drr 1:1 root leaf 2: quantum 64Kb class drr 1:2 root leaf 3: quantum 64Kb class drr 1:3 root leaf 4: quantum 64Kb class drr 3:1 root leaf 4: quantum 64Kb tc qdisc ls qdisc drr 1: dev lo root refcnt 2 qdisc plug 2: dev lo parent 1:1 qdisc pfifo 4: dev lo parent 3:1 refcnt 2 limit 1000p qdisc drr 3: dev lo parent 1:2 Observe that a) parent for 4:0 does not change despite the replace request. There can only be one parent. b) refcount has gone up by two for 4:0 and c) both class 1:3 and 3:1 are pointing to it. Step 12. send one packet to plug echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10001)) step13. send one packet to the grafted fifo echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10003)) step14. lets trigger the uaf tc class delete dev lo classid 1:3 tc class delete dev lo classid 1:1 The semantics of "replace" is for a del/add _on the same node_ and not a delete from one node(3:1) and add to another node (1:3) as in step10. While we could "fix" with a more complex approach there could be consequences to expectations so the patch takes the preventive approach of "disallow such config". Joint work with Lion Ackermann <nnamrec@gmail.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250116013713.900000-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18net: avoid race between device unregistration and ethnl opsAntoine Tenart
The following trace can be seen if a device is being unregistered while its number of channels are being modified. DEBUG_LOCKS_WARN_ON(lock->magic != lock) WARNING: CPU: 3 PID: 3754 at kernel/locking/mutex.c:564 __mutex_lock+0xc8a/0x1120 CPU: 3 UID: 0 PID: 3754 Comm: ethtool Not tainted 6.13.0-rc6+ #771 RIP: 0010:__mutex_lock+0xc8a/0x1120 Call Trace: <TASK> ethtool_check_max_channel+0x1ea/0x880 ethnl_set_channels+0x3c3/0xb10 ethnl_default_set_doit+0x306/0x650 genl_family_rcv_msg_doit+0x1e3/0x2c0 genl_rcv_msg+0x432/0x6f0 netlink_rcv_skb+0x13d/0x3b0 genl_rcv+0x28/0x40 netlink_unicast+0x42e/0x720 netlink_sendmsg+0x765/0xc20 __sys_sendto+0x3ac/0x420 __x64_sys_sendto+0xe0/0x1c0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e This is because unregister_netdevice_many_notify might run before the rtnl lock section of ethnl operations, eg. set_channels in the above example. In this example the rss lock would be destroyed by the device unregistration path before being used again, but in general running ethnl operations while dismantle has started is not a good idea. Fix this by denying any operation on devices being unregistered. A check was already there in ethnl_ops_begin, but not wide enough. Note that the same issue cannot be seen on the ioctl version (__dev_ethtool) because the device reference is retrieved from within the rtnl lock section there. Once dismantle started, the net device is unlisted and no reference will be found. Fixes: dde91ccfa25f ("ethtool: do not perform operations on net devices being unregistered") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250116092159.50890-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18net: destroy dev->lock later in free_netdev()Eric Dumazet
syzbot complained that free_netdev() was calling netif_napi_del() after dev->lock mutex has been destroyed. This fires a warning for CONFIG_DEBUG_MUTEXES=y builds. Move mutex_destroy(&dev->lock) near the end of free_netdev(). [1] DEBUG_LOCKS_WARN_ON(lock->magic != lock) WARNING: CPU: 0 PID: 5971 at kernel/locking/mutex.c:564 __mutex_lock_common kernel/locking/mutex.c:564 [inline] WARNING: CPU: 0 PID: 5971 at kernel/locking/mutex.c:564 __mutex_lock+0xdac/0xee0 kernel/locking/mutex.c:735 Modules linked in: CPU: 0 UID: 0 PID: 5971 Comm: syz-executor Not tainted 6.13.0-rc7-syzkaller-01131-g8d20dcda404d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/27/2024 RIP: 0010:__mutex_lock_common kernel/locking/mutex.c:564 [inline] RIP: 0010:__mutex_lock+0xdac/0xee0 kernel/locking/mutex.c:735 Code: 0f b6 04 38 84 c0 0f 85 1a 01 00 00 83 3d 6f 40 4c 04 00 75 19 90 48 c7 c7 60 84 0a 8c 48 c7 c6 00 85 0a 8c e8 f5 dc 91 f5 90 <0f> 0b 90 90 90 e9 c7 f3 ff ff 90 0f 0b 90 e9 29 f8 ff ff 90 0f 0b RSP: 0018:ffffc90003317580 EFLAGS: 00010246 RAX: ee0f97edaf7b7d00 RBX: ffff8880299f8cb0 RCX: ffff8880323c9e00 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc90003317710 R08: ffffffff81602ac2 R09: 1ffff110170c519a R10: dffffc0000000000 R11: ffffed10170c519b R12: 0000000000000000 R13: 0000000000000000 R14: 1ffff92000662ec4 R15: dffffc0000000000 FS: 000055557a046500(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd581d46ff8 CR3: 000000006f870000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> netdev_lock include/linux/netdevice.h:2691 [inline] __netif_napi_del include/linux/netdevice.h:2829 [inline] netif_napi_del include/linux/netdevice.h:2848 [inline] free_netdev+0x2d9/0x610 net/core/dev.c:11621 netdev_run_todo+0xf21/0x10d0 net/core/dev.c:11189 nsim_destroy+0x3c3/0x620 drivers/net/netdevsim/netdev.c:1028 __nsim_dev_port_del+0x14b/0x1b0 drivers/net/netdevsim/dev.c:1428 nsim_dev_port_del_all drivers/net/netdevsim/dev.c:1440 [inline] nsim_dev_reload_destroy+0x28a/0x490 drivers/net/netdevsim/dev.c:1661 nsim_drv_remove+0x58/0x160 drivers/net/netdevsim/dev.c:1676 device_remove drivers/base/dd.c:567 [inline] Fixes: 1b23cdbd2bbc ("net: protect netdev->napi_list with netdev_lock()") Reported-by: syzbot+85ff1051228a04613a32@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/678add43.050a0220.303755.0016.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250117224626.1427577-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18eth: bnxt: fix string truncation warning in FW versionJakub Kicinski
W=1 builds with gcc 14.2.1 report: drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:4193:32: error: ‘%s’ directive output may be truncated writing up to 31 bytes into a region of size 27 [-Werror=format-truncation=] 4193 | "/pkg %s", buf); It's upset that we let buf be full length but then we use 5 characters for "/pkg ". The builds is also clear with clang version 19.1.5 now. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250117183726.1481524-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18mptcp: sysctl: add syn_retrans_before_tcp_fallbackMatthieu Baerts (NGI0)
The number of SYN + MPC retransmissions before falling back to TCP was fixed to 2. This is certainly a good default value, but having a fixed number can be a problem in some environments. The current behaviour means that if all packets are dropped, there will be: - The initial SYN + MPC - 2 retransmissions with MPC - The next ones will be without MPTCP. So typically ~3 seconds before falling back to TCP. In some networks where some temporally blackholes are unfortunately frequent, or when a client tries to initiate connections while the network is not ready yet, this can cause new connections not to have MPTCP connections. In such environments, it is now possible to increase the number of SYN retransmissions with MPTCP options to make sure MPTCP is used. Interesting values are: - 0: the first retransmission will be done without MPTCP options: quite aggressive, but also a higher risk of detecting false-positive MPTCP blackholes. - >= 128: all SYN retransmissions will keep the MPTCP options: back to the < 6.12 behaviour. The default behaviour is not changed here. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250117-net-next-mptcp-syn_retrans_before_tcp_fallback-v1-1-ab4b187099b0@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18Merge branch 'fix-race-conditions-in-ndo_get_stats64'Jakub Kicinski
Shinas Rasheed says: ==================== Fix race conditions in ndo_get_stats64 Fix race conditions in ndo_get_stats64 by storing tx/rx stats locally and not availing per queue resources which could be torn down during interface stop. Also remove stats fetch from firmware which is currently unnecessary ==================== Link: https://patch.msgid.link/20250117094653.2588578-1-srasheed@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>