summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2024-11-01tracing: Introduce tracepoint extended structureMathieu Desnoyers
Shrink the struct tracepoint size from 80 bytes to 72 bytes on x86-64 by moving the (typically NULL) regfunc/unregfunc pointers to an extended structure. Tested-by: Jordan Rife <jrife@google.com> Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Jordan Rife <jrife@google.com> Cc: linux-trace-kernel@vger.kernel.org Link: https://lore.kernel.org/20241031152056.744137-2-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-01tracing: Remove TRACE_FLAG_IRQS_NOSUPPORTSebastian Andrzej Siewior
It was possible to enable tracing with no IRQ tracing support. The tracing infrastructure would then record TRACE_FLAG_IRQS_NOSUPPORT as the only tracing flag and show an 'X' in the output. The last user of this feature was PPC32 which managed to implement it during PowerPC merge in 2009. Since then, it was unused and the PPC32 dependency was finally removed in commit 0ea5ee035133a ("tracing: Remove PPC32 wart from config TRACING_SUPPORT"). Since the PowerPC merge the code behind !CONFIG_TRACE_IRQFLAGS_SUPPORT with TRACING enabled can no longer be selected used and the 'X' is not displayed or recorded. Remove the CONFIG_TRACE_IRQFLAGS_SUPPORT from the tracing code. Remove TRACE_FLAG_IRQS_NOSUPPORT. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241022110112.XJI8I9T2@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-01Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The important one is a change to the way in which we handle protection keys around signal delivery so that we're more closely aligned with the x86 behaviour, however there is also a revert of the previous fix to disable software tag-based KASAN with GCC, since a workaround materialised shortly afterwards. I'd love to say we're done with 6.12, but we're aware of some longstanding fpsimd register corruption issues that we're almost at the bottom of resolving. Summary: - Fix handling of POR_EL0 during signal delivery so that pushing the signal context doesn't fail based on the pkey configuration of the interrupted context and align our user-visible behaviour with that of x86. - Fix a bogus pointer being passed to the CPU hotplug code from the Arm SDEI driver. - Re-enable software tag-based KASAN with GCC by using an alternative implementation of '__no_sanitize_address'" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: signal: Improve POR_EL0 handling to avoid uaccess failures firmware: arm_sdei: Fix the input parameter of cpuhp_remove_state() Revert "kasan: Disable Software Tag-Based KASAN with GCC" kasan: Fix Software Tag-Based KASAN with GCC
2024-11-01Merge tag 'vfs-6.12-rc6.iomap' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull iomap fixes from Christian Brauner: "Fixes for iomap to prevent data corruption bugs in the fallocate unshare range implementation of fsdax and a small cleanup to turn iomap_want_unshare_iter() into an inline function" * tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iomap: turn iomap_want_unshare_iter into an inline function fsdax: dax_unshare_iter needs to copy entire blocks fsdax: remove zeroing code from dax_unshare_iter iomap: share iomap_unshare_iter predicate code with fsdax xfs: don't allocate COW extents when unsharing a hole
2024-11-01Merge tag 'vfs-6.12-rc6.fixes' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull filesystem fixes from Christian Brauner: "VFS: - Fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP=y is set - Add a get_tree_bdev_flags() helper that allows to modify e.g., whether errors are logged into the filesystem context during superblock creation. This is used by erofs to fix a userspace regression where an error is currently logged when its used on a regular file which is an new allowed mode in erofs. netfs: - Fix the sysfs debug path in the documentation. - Fix iov_iter_get_pages*() for folio queues by skipping the page extracation if we're at the end of a folio. afs: - Fix moving subdirectories to different parent directory. autofs: - Fix handling of AUTOFS_DEV_IOCTL_TIMEOUT_CMD ioctl in validate_dev_ioctl(). The actual ioctl number, not the ioctl command needs to be checked for autofs" * tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP autofs: fix thinko in validate_dev_ioctl() iov_iter: Fix iov_iter_get_pages*() for folio_queue afs: Fix missing subdir edit when renamed between parent dirs doc: correcting the debug path for cachefiles erofs: use get_tree_bdev_flags() to avoid misleading messages fs/super.c: introduce get_tree_bdev_flags()
2024-11-01iio: backend: extend featuresAngelo Dureghello
Extend backend features with new calls needed later on this patchset from axi version of ad3552r. The follwoing calls are added: iio_backend_ddr_enable() enable ddr bus transfer iio_backend_ddr_disable() disable ddr bus transfer iio_backend_data_stream_enable() enable data stream over bus interface iio_backend_data_stream_disable() disable data stream over bus interface iio_backend_data_transfer_addr() define the target register address where the DAC sample will be written. Reviewed-by: Nuno Sa <nuno.sa@analog.com> Signed-off-by: Angelo Dureghello <adureghello@baylibre.com> Reviewed-by: David Lechner <dlechner@baylibre.com> Link: https://patch.msgid.link/20241028-wip-bl-ad3552r-axi-v0-iio-testing-v9-3-f6960b4f9719@kernel-space.org Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-10-31dql: annotate data-races around dql->last_obj_cntEric Dumazet
dql->last_obj_cnt is read/written from different contexts, without any lock synchronization. Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20241029191425.2519085-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc6). Conflicts: drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c cbe84e9ad5e2 ("wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd") 188a1bf89432 ("wifi: mac80211: re-order assigning channel in activate links") https://lore.kernel.org/all/20241028123621.7bbb131b@canb.auug.org.au/ net/mac80211/cfg.c c4382d5ca1af ("wifi: mac80211: update the right link for tx power") 8dd0498983ee ("wifi: mac80211: Fix setting txpower with emulate_chanctx") drivers/net/ethernet/intel/ice/ice_ptp_hw.h 6e58c3310622 ("ice: fix crash on probe for DPLL enabled E810 LOM") e4291b64e118 ("ice: Align E810T GPIO to other products") ebb2693f8fbd ("ice: Read SDP section from NVM for pin definitions") ac532f4f4251 ("ice: Cleanup unused declarations") https://lore.kernel.org/all/20241030120524.1ee1af18@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-31Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to force a checkpoint when the program's jump history becomes too long (Eduard Zingerman) - Add several fixes to the BPF bits iterator addressing issues like memory leaks and overflow problems (Hou Tao) - Fix an out-of-bounds write in trie_get_next_key (Byeonguk Jeong) - Fix BPF test infra's LIVE_FRAME frame update after a page has been recycled (Toke Høiland-Jørgensen) - Fix BPF verifier and undo the 40-bytes extra stack space for bpf_fastcall patterns due to various bugs (Eduard Zingerman) - Fix a BPF sockmap race condition which could trigger a NULL pointer dereference in sock_map_link_update_prog (Cong Wang) - Fix tcp_bpf_recvmsg_parser to retrieve seq_copied from tcp_sk under the socket lock (Jiayuan Chen) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled selftests/bpf: Add three test cases for bits_iter bpf: Use __u64 to save the bits in bits iterator bpf: Check the validity of nr_words in bpf_iter_bits_new() bpf: Add bpf_mem_alloc_check_size() helper bpf: Free dynamically allocated bits in bpf_iter_bits_destroy() bpf: disallow 40-bytes extra stack for bpf_fastcall patterns selftests/bpf: Add test for trie_get_next_key() bpf: Fix out-of-bounds write in trie_get_next_key() selftests/bpf: Test with a very short loop bpf: Force checkpoint when jmp history is too long bpf: fix filed access without lock sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
2024-10-31i3c: master: Extend address status bit to 4 and add I3C_ADDR_SLOT_EXT_DESIREDFrank Li
Extend the address status bit to 4 and introduce the I3C_ADDR_SLOT_EXT_DESIRED macro to indicate that a device prefers a specific address. This is generally set by the 'assigned-address' in the device tree source (dts) file. ┌────┬─────────────┬───┬─────────┬───┐ │S/Sr│ 7'h7E RnW=0 │ACK│ ENTDAA │ T ├────┐ └────┴─────────────┴───┴─────────┴───┘ │ ┌─────────────────────────────────────────┘ │ ┌──┬─────────────┬───┬─────────────────┬────────────────┬───┬─────────┐ └─►│Sr│7'h7E RnW=1 │ACK│48bit UID BCR DCR│Assign 7bit Addr│PAR│ ACK/NACK│ └──┴─────────────┴───┴─────────────────┴────────────────┴───┴─────────┘ Some master controllers (such as HCI) need to prepare the entire above transaction before sending it out to the I3C bus. This means that a 7-bit dynamic address needs to be allocated before knowing the target device's UID information. However, some I3C targets may request specific addresses (called as "init_dyn_addr"), which is typically specified by the DT-'s assigned-address property. Lower addresses having higher IBI priority. If it is available, i3c_bus_get_free_addr() preferably return a free address that is not in the list of desired addresses (called as "init_dyn_addr"). This allows the device with the "init_dyn_addr" to switch to its "init_dyn_addr" when it hot-joins the I3C bus. Otherwise, if the "init_dyn_addr" is already in use by another I3C device, the target device will not be able to switch to its desired address. If the previous step fails, fallback returning one of the remaining unassigned address, regardless of its state in the desired list. Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/r/20241021-i3c_dts_assign-v8-2-4098b8bde01e@nxp.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2024-10-31i3c: master: Replace hard code 2 with macro I3C_ADDR_SLOT_STATUS_BITSFrank Li
Replace the hardcoded value 2, which indicates 2 bits for I3C address status, with the predefined macro I3C_ADDR_SLOT_STATUS_BITS. Improve maintainability and extensibility of the code. Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/r/20241021-i3c_dts_assign-v8-1-4098b8bde01e@nxp.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2024-10-31block: remove bio_add_zone_append_pageChristoph Hellwig
This is only used by the nvmet zns passthrough code, which can trivially just use bio_add_pc_page and do the sanity check for the max zone append limit itself. All future zoned file systems should follow the btrfs lead and let the upper layers fill up bios unlimited by hardware constraints and split them to the limits in the I/O submission handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31mfd: mt6397: Add initial support for MT6328Yassine Oudjana
The MT6328 PMIC is commonly used with the MT6735 SoC. Add initial support for this PMIC. Signed-off-by: Yassine Oudjana <y.oudjana@protonmail.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20241018081050.23592-5-y.oudjana@protonmail.com Signed-off-by: Lee Jones <lee@kernel.org>
2024-10-31mfd: axp20x: Add support for AXP323Andre Przywara
The X-Powers AXP323 is a very close sibling of the AXP313A. The only difference seems to be the ability to dual-phase the first two DC/DC converter, which adds another register. Add the required boilerplate to introduce a new PMIC to the AXP MFD driver. Where possible, this just maps into the existing structs defined for the AXP313A, only deviating where needed. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Chen-Yu Tsai <wens@csie.org> Link: https://lore.kernel.org/r/20241007001408.27249-5-andre.przywara@arm.com Signed-off-by: Lee Jones <lee@kernel.org>
2024-10-31mfd: axp20x: Ensure relationship between IDs and model namesAndre Przywara
At the moment there is an implicit relationship between the AXP model IDs and the order of the strings in the axp20x_model_names[] array. This is fragile, and makes adding IDs in the middle error prone. Make this relationship official by changing the ID type to the actual enum used, and using indexed initialisers for the string list. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Chen-Yu Tsai <wens@csie.org> Link: https://lore.kernel.org/r/20241007001408.27249-3-andre.przywara@arm.com Signed-off-by: Lee Jones <lee@kernel.org>
2024-10-31Merge tag 'ath-next-20241030' of ↵Kalle Valo
git://git.kernel.org/pub/scm/linux/kernel/git/ath/ath ath.git patches for v6.13 This development cycle featured phase 1 of patches to ath12k to support the new 802.11be MLO feature, along with other ath12k feature patches. In older drivers, support for some additional devices were added. And there was the usual set of bug fixes and cleanups across most drivers. Per-driver highlights: ath12k * Switch to using wiphy_lock() and remove ar->conf_mutex * Convert struct ath12k_sta::update_wk to use struct wiphy_work * Add phase 1 of 802.11be MLO support * Add firmware coredump collection support * Add debugfs support for a multitude of statistics * Fix host representation of multiple hal_rx structs * Fix use-after-free in ath12k_dp_cc_cleanup() * Skip Rx TID cleanup for self peer * Fix warning and crash when unloading in a VM * Convert CE interrupt handling from tasklet to BH workqueue * Fix A-MSDU indication in monitor mode ath11k * Fix double free issue during SRNG deinit * Enable firmware diagnostic events for WCN6750 * Fix CE offset address calculation for WCN6750 during SSR * Fix stack frame size warning in ath11k_vif_wow_set_wakeups() * Document the inputs for ath11k on WCN6855 ath10k * Fix multiple stack frame size warnings * Fix invalid VHT parameters in supported_vht_mcs_rate_nss* structs * Avoid NULL pointer error during SDIO remove ath5k * Add support for Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A
2024-10-31clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYINGFrederic Weisbecker
The way the clockevent devices are finally stopped while a CPU is offlining is currently chaotic. The layout being by order: 1) tick_sched_timer_dying() stops the tick and the underlying clockevent but only for oneshot case. The periodic tick and its related clockevent still runs. 2) tick_broadcast_offline() detaches and stops the per-cpu oneshot broadcast and append it to the released list. 3) Some individual clockevent drivers stop the clockevents (a second time if the tick is oneshot) 4) Once the CPU is dead, a control CPU remotely detaches and stops (a 3rd time if oneshot mode) the CPU clockevent and adds it to the released list. 5) The released list containing the broadcast device released on step 2) and the remotely detached clockevent from step 4) are unregistered. These random events can be factorized if the current clockevent is detached and stopped by the dying CPU at the generic layer, that is from the dying CPU: a) Stop the tick b) Stop/detach the underlying per-cpu oneshot broadcast clockevent c) Stop/detach the underlying clockevent d) Release / unregister the clockevents from b) and c) e) Release / unregister the remaining clockevents from the dying CPU. This part could be performed by the dying CPU This way the drivers and the tick layer don't need to care about clockevent operations during cpuhotplug down. This also unifies the tick behaviour on offline CPUs between oneshot and periodic modes, avoiding offline ticks altogether for sanity. Adopt the simplification. [ tglx: Remove the WARN_ON() in clockevents_register_device() as that is called from an upcoming CPU before the CPU is marked online ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241029125451.54574-3-frederic@kernel.org
2024-10-30mm: allow set/clear page_type againYu Zhao
Some page flags (page->flags) were converted to page types (page->page_types). A recent example is PG_hugetlb. From the exclusive writer's perspective, e.g., a thread doing __folio_set_hugetlb(), there is a difference between the page flag and type APIs: the former allows the same non-atomic operation to be repeated whereas the latter does not. For example, calling __folio_set_hugetlb() twice triggers VM_BUG_ON_FOLIO(), since the second call expects the type (PG_hugetlb) not to be set previously. Using add_hugetlb_folio() as an example, it calls __folio_set_hugetlb() in the following error-handling path. And when that happens, it triggers the aforementioned VM_BUG_ON_FOLIO(). if (folio_test_hugetlb(folio)) { rc = hugetlb_vmemmap_restore_folio(h, folio); if (rc) { spin_lock_irq(&hugetlb_lock); add_hugetlb_folio(h, folio, false); ... It is possible to make hugeTLB comply with the new requirements from the page type API. However, a straightforward fix would be to just allow the same page type to be set or cleared again inside the API, to avoid any changes to its callers. Link: https://lkml.kernel.org/r/20241020042212.296781-1-yuzhao@google.com Fixes: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType") Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-30mm, swap: avoid over reclaim of full clustersKairui Song
When running low on usable slots, cluster allocator will try to reclaim the full clusters aggressively to reclaim HAS_CACHE slots. This guarantees that as long as there are any usable slots, HAS_CACHE or not, the swap device will be usable and workload won't go OOM early. Before the cluster allocator, swap allocator fails easily if device is filled up with reclaimable HAS_CACHE slots. Which can be easily reproduced with following simple program: #include <stdio.h> #include <string.h> #include <linux/mman.h> #include <sys/mman.h> #define SIZE 8192UL * 1024UL * 1024UL int main(int argc, char **argv) { long tmp; char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); memset(p, 0, SIZE); madvise(p, SIZE, MADV_PAGEOUT); for (unsigned long i = 0; i < SIZE; ++i) tmp += p[i]; getchar(); /* Pause */ return 0; } Setup an 8G non ramdisk swap, the first run of the program will swapout 8G ram successfully. But run same program again after the first run paused, the second run can't swapout all 8G memory as now half of the swap device is pinned by HAS_CACHE. There was a random scan in the old allocator that may reclaim part of the HAS_CACHE by luck, but it's unreliable. The new allocator's added reclaim of full clusters when device is low on usable slots. But when multiple CPUs are seeing the device is low on usable slots at the same time, they ran into a thundering herd problem. This is an observable problem on large machine with mass parallel workload, as full cluster reclaim is slower on large swap device and higher number of CPUs will also make things worse. Testing using a 128G ZRAM on a 48c96t system. When the swap device is very close to full (eg. 124G / 128G), running build linux kernel with make -j96 in a 1G memory cgroup will hung (not a softlockup though) spinning in full cluster reclaim for about ~5min before go OOM. To solve this, split the full reclaim into two parts: - Instead of do a synchronous aggressively reclaim when device is low, do only one aggressively reclaim when device is strictly full with a kworker. This still ensures in worst case the device won't be unusable because of HAS_CACHE slots. - To avoid allocation (especially higher order) suffer from HAS_CACHE filling up clusters and kworker not responsive enough, do one synchronous scan every time the free list is drained, and only scan one cluster. This is kind of similar to the random reclaim before, keeps the full clusters rotated and has a minimal latency. This should provide a fair reclaim strategy suitable for most workloads. Link: https://lkml.kernel.org/r/20241022175512.10398-1-ryncsn@gmail.com Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim") Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-30mm/codetag: fix null pointer check logic for ref and tagHao Ge
When we compile and load lib/slub_kunit.c,it will cause a panic. The root cause is that __kmalloc_cache_noprof was directly called instead of kmem_cache_alloc,which resulted in no alloc_tag being allocated.This caused current->alloc_tag to be null,leading to a null pointer dereference in alloc_tag_ref_set. Despite the fact that my colleague Pei Xiao will later fix the code in slub_kunit.c,we still need fix null pointer check logic for ref and tag to avoid panic caused by a null pointer dereference. Here is the log for the panic: [ 74.779373][ T2158] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 [ 74.780130][ T2158] Mem abort info: [ 74.780406][ T2158] ESR = 0x0000000096000004 [ 74.780756][ T2158] EC = 0x25: DABT (current EL), IL = 32 bits [ 74.781225][ T2158] SET = 0, FnV = 0 [ 74.781529][ T2158] EA = 0, S1PTW = 0 [ 74.781836][ T2158] FSC = 0x04: level 0 translation fault [ 74.782288][ T2158] Data abort info: [ 74.782577][ T2158] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 74.783068][ T2158] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 74.783533][ T2158] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 74.784010][ T2158] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000105f34000 [ 74.784586][ T2158] [0000000000000020] pgd=0000000000000000, p4d=0000000000000000 [ 74.785293][ T2158] Internal error: Oops: 0000000096000004 [#1] SMP [ 74.785805][ T2158] Modules linked in: slub_kunit kunit ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle 4 [ 74.790661][ T2158] CPU: 0 UID: 0 PID: 2158 Comm: kunit_try_catch Kdump: loaded Tainted: G W N 6.12.0-rc3+ #2 [ 74.791535][ T2158] Tainted: [W]=WARN, [N]=TEST [ 74.791889][ T2158] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 74.792479][ T2158] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 74.793101][ T2158] pc : alloc_tagging_slab_alloc_hook+0x120/0x270 [ 74.793607][ T2158] lr : alloc_tagging_slab_alloc_hook+0x120/0x270 [ 74.794095][ T2158] sp : ffff800084d33cd0 [ 74.794418][ T2158] x29: ffff800084d33cd0 x28: 0000000000000000 x27: 0000000000000000 [ 74.795095][ T2158] x26: 0000000000000000 x25: 0000000000000012 x24: ffff80007b30e314 [ 74.795822][ T2158] x23: ffff000390ff6f10 x22: 0000000000000000 x21: 0000000000000088 [ 74.796555][ T2158] x20: ffff000390285840 x19: fffffd7fc3ef7830 x18: ffffffffffffffff [ 74.797283][ T2158] x17: ffff8000800e63b4 x16: ffff80007b33afc4 x15: ffff800081654c00 [ 74.798011][ T2158] x14: 0000000000000000 x13: 205d383531325420 x12: 5b5d383734363537 [ 74.798744][ T2158] x11: ffff800084d337e0 x10: 000000000000005d x9 : 00000000ffffffd0 [ 74.799476][ T2158] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008219d188 x6 : c0000000ffff7fff [ 74.800206][ T2158] x5 : ffff0003fdbc9208 x4 : ffff800081edd188 x3 : 0000000000000001 [ 74.800932][ T2158] x2 : 0beaa6dee1ac5a00 x1 : 0beaa6dee1ac5a00 x0 : ffff80037c2cb000 [ 74.801656][ T2158] Call trace: [ 74.801954][ T2158] alloc_tagging_slab_alloc_hook+0x120/0x270 [ 74.802494][ T2158] __kmalloc_cache_noprof+0x148/0x33c [ 74.802976][ T2158] test_kmalloc_redzone_access+0x4c/0x104 [slub_kunit] [ 74.803607][ T2158] kunit_try_run_case+0x70/0x17c [kunit] [ 74.804124][ T2158] kunit_generic_run_threadfn_adapter+0x2c/0x4c [kunit] [ 74.804768][ T2158] kthread+0x10c/0x118 [ 74.805141][ T2158] ret_from_fork+0x10/0x20 [ 74.805540][ T2158] Code: b9400a80 11000400 b9000a80 97ffd858 (f94012d3) [ 74.806176][ T2158] SMP: stopping secondary CPUs [ 74.808130][ T2158] Starting crashdump kernel... Link: https://lkml.kernel.org/r/20241020070819.307944-1-hao.ge@linux.dev Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: Hao Ge <gehao@kylinos.cn> Acked-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-30s390/time: Add clocksource id to TOD clockSven Schnelle
To allow specifying the clock source in the upcoming PtP driver, add a clocksource ID to the s390 TOD clock. Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Link: https://patch.msgid.link/20241023065601.449586-2-svens@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-30uprobes: SRCU-protect uretprobe lifetime (with timeout)Andrii Nakryiko
Avoid taking refcount on uprobe in prepare_uretprobe(), instead take uretprobe-specific SRCU lock and keep it active as kernel transfers control back to user space. Given we can't rely on user space returning from traced function within reasonable time period, we need to make sure not to keep SRCU lock active for too long, though. To that effect, we employ a timer callback which is meant to terminate SRCU lock region after predefined timeout (currently set to 100ms), and instead transfer underlying struct uprobe's lifetime protection to refcounting. This fallback to less scalable refcounting after 100ms is a fine tradeoff from uretprobe's scalability and performance perspective, because uretprobing *long running* user functions inherently doesn't run into scalability issues (there is just not enough frequency to cause noticeable issues with either performance or scalability). The overall trick is in ensuring synchronization between current thread and timer's callback fired on some other thread. To cope with that with minimal logic complications, we add hprobe wrapper which is used to contain all the synchronization related issues behind a small number of basic helpers: hprobe_expire() for "downgrading" uprobe from SRCU-protected state to refcounted state, and a hprobe_consume() and hprobe_finalize() pair of single-use consuming helpers. Other than that, whatever current thread's logic is there stays the same, as timer thread cannot modify return_instance state (or add new/remove old return_instances). It only takes care of SRCU unlock and uprobe refcounting, which is hidden from the higher-level uretprobe handling logic. We use atomic xchg() in hprobe_consume(), which is called from performance critical handle_uretprobe_chain() function run in the current context. When uncontended, this xchg() doesn't seem to hurt performance as there are no other competing CPUs fighting for the same cache line. We also mark struct return_instance as ____cacheline_aligned to ensure no false sharing can happen. Another technical moment. We need to make sure that the list of return instances can be safely traversed under RCU from timer callback, so we delay return_instance freeing with kfree_rcu() and make sure that list modifications use RCU-aware operations. Also, given SRCU lock survives transition from kernel to user space and back we need to use lower-level __srcu_read_lock() and __srcu_read_unlock() to avoid lockdep complaining. Just to give an impression of a kind of performance improvements this change brings, below are benchmarking results with and without these SRCU changes, assuming other uprobe optimizations (mainly RCU Tasks Trace for entry uprobes, lockless RB-tree lookup, and lockless VMA to uprobe lookup) are left intact: WITHOUT SRCU for uretprobes =========================== uretprobe-nop ( 1 cpus): 2.197 ± 0.002M/s ( 2.197M/s/cpu) uretprobe-nop ( 2 cpus): 3.325 ± 0.001M/s ( 1.662M/s/cpu) uretprobe-nop ( 3 cpus): 4.129 ± 0.002M/s ( 1.376M/s/cpu) uretprobe-nop ( 4 cpus): 6.180 ± 0.003M/s ( 1.545M/s/cpu) uretprobe-nop ( 8 cpus): 7.323 ± 0.005M/s ( 0.915M/s/cpu) uretprobe-nop (16 cpus): 6.943 ± 0.005M/s ( 0.434M/s/cpu) uretprobe-nop (32 cpus): 5.931 ± 0.014M/s ( 0.185M/s/cpu) uretprobe-nop (64 cpus): 5.145 ± 0.003M/s ( 0.080M/s/cpu) uretprobe-nop (80 cpus): 4.925 ± 0.005M/s ( 0.062M/s/cpu) WITH SRCU for uretprobes ======================== uretprobe-nop ( 1 cpus): 1.968 ± 0.001M/s ( 1.968M/s/cpu) uretprobe-nop ( 2 cpus): 3.739 ± 0.003M/s ( 1.869M/s/cpu) uretprobe-nop ( 3 cpus): 5.616 ± 0.003M/s ( 1.872M/s/cpu) uretprobe-nop ( 4 cpus): 7.286 ± 0.002M/s ( 1.822M/s/cpu) uretprobe-nop ( 8 cpus): 13.657 ± 0.007M/s ( 1.707M/s/cpu) uretprobe-nop (32 cpus): 45.305 ± 0.066M/s ( 1.416M/s/cpu) uretprobe-nop (64 cpus): 42.390 ± 0.922M/s ( 0.662M/s/cpu) uretprobe-nop (80 cpus): 47.554 ± 2.411M/s ( 0.594M/s/cpu) Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241024044159.3156646-3-andrii@kernel.org
2024-10-30perf/x86/rapl: Clean up cpumask and hotplugKan Liang
The rapl pmu is die scope, which is supported by the generic perf_event subsystem now. Set the scope for the rapl PMU and remove all the cpumask and hotplug codes. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Oliver Sang <oliver.sang@intel.com> Tested-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com> Link: https://lore.kernel.org/r/20241010142604.770192-2-kan.liang@linux.intel.com
2024-10-30KVM: Protect vCPU's "last run PID" with rwlock, not RCUSean Christopherson
To avoid jitter on KVM_RUN due to synchronize_rcu(), use a rwlock instead of RCU to protect vcpu->pid, a.k.a. the pid of the task last used to a vCPU. When userspace is doing M:N scheduling of tasks to vCPUs, e.g. to run SEV migration helper vCPUs during post-copy, the synchronize_rcu() needed to change the PID associated with the vCPU can stall for hundreds of milliseconds, which is problematic for latency sensitive post-copy operations. In the directed yield path, do not acquire the lock if it's contended, i.e. if the associated PID is changing, as that means the vCPU's task is already running. Reported-by: Steve Rutherford <srutherford@google.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240802200136.329973-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30x86/uaccess: Avoid barrier_nospec() in 64-bit copy_from_user()Linus Torvalds
The barrier_nospec() in 64-bit copy_from_user() is slow. Instead use pointer masking to force the user pointer to all 1's for an invalid address. The kernel test robot reports a 2.6% improvement in the per_thread_ops benchmark [1]. This is a variation on a patch originally by Josh Poimboeuf [2]. Link: https://lore.kernel.org/202410281344.d02c72a2-oliver.sang@intel.com [1] Link: https://lore.kernel.org/5b887fe4c580214900e21f6c61095adf9a142735.1730166635.git.jpoimboe@kernel.org [2] Tested-and-reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-30PCI: Make pcim_iounmap_region() a public functionPhilipp Stanner
The function pcim_iounmap_regions() is problematic because it uses a bitmask mechanism to release / iounmap multiple BARs at once. It, thus, prevents getting rid of the problematic iomap table mechanism which was deprecated in commit e354bb84a4c1 ("PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()"). pcim_iounmap_region() does not have that problem. Make it public as the successor of pcim_iounmap_regions(). Link: https://lore.kernel.org/r/20241016094911.24818-3-pstanner@redhat.com Signed-off-by: Philipp Stanner <pstanner@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2024-10-30PCI: Remove pcim_iomap_regions_request_all()Philipp Stanner
pcim_iomap_regions_request_all() have been deprecated in commit e354bb84a4c1 ("PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()"). All users of this function have been ported to other interfaces by now. Remove pcim_iomap_regions_request_all(). Link: https://lore.kernel.org/r/20241030112743.104395-11-pstanner@redhat.com Signed-off-by: Philipp Stanner <pstanner@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
2024-10-30PCI: Make pcim_request_all_regions() a public functionPhilipp Stanner
In order to remove the deprecated function pcim_iomap_regions_request_all(), a few drivers need an interface to request all BARs a PCI device offers. Make pcim_request_all_regions() a public interface. Link: https://lore.kernel.org/r/20241030112743.104395-2-pstanner@redhat.com Signed-off-by: Philipp Stanner <pstanner@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-10-30bpf: Add bpf_mem_alloc_check_size() helperHou Tao
Introduce bpf_mem_alloc_check_size() to check whether the allocation size exceeds the limitation for the kmalloc-equivalent allocator. The upper limit for percpu allocation is LLIST_NODE_SZ bytes larger than non-percpu allocation, so a percpu argument is added to the helper. The helper will be used in the following patch to check whether the size parameter passed to bpf_mem_alloc() is too big. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-30jiffies: Define secs_to_jiffies()Easwar Hariharan
secs_to_jiffies() is defined in hci_event.c and cannot be reused by other call sites. Hoist it into the core code to allow conversion of the ~1150 usages of msecs_to_jiffies() that either: - use a multiplier value of 1000 or equivalently MSEC_PER_SEC, or - have timeouts that are denominated in seconds (i.e. end in 000) It's implemented as a macro to allow usage in static initializers. This will also allow conversion of yet more sites that use (sec * HZ) directly, and improve their readability. Suggested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Link: https://lore.kernel.org/all/20241030-open-coded-timeouts-v3-1-9ba123facf88@linux.microsoft.com
2024-10-30pmdomain: Merge branch fixes into nextUlf Hansson
Merge the pmdomain fixes for v6.12-rc[n] into the next branch, to allow them to get tested together with the new changes that are targeted for v6.13. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-10-30pmdomain: core: Add GENPD_FLAG_DEV_NAME_FW flagSibi Sankar
Introduce GENPD_FLAG_DEV_NAME_FW flag which instructs genpd to generate an unique device name using ida. It is aimed to be used by genpd providers which derive their names directly from FW making them susceptible to debugfs node creation failures. Reported-by: Johan Hovold <johan+linaro@kernel.org> Closes: https://lore.kernel.org/lkml/ZoQjAWse2YxwyRJv@hovoldconsulting.com/ Fixes: 718072ceb211 ("PM: domains: create debugfs nodes when adding power domains") Suggested-by: Ulf Hansson <ulf.hansson@linaro.org> Suggested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Sibi Sankar <quic_sibis@quicinc.com> Cc: stable@vger.kernel.org Message-ID: <20241030125512.2884761-5-quic_sibis@quicinc.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-10-30blk-integrity: remove seed for user mapped buffersKeith Busch
The seed is only used for kernel generation and verification. That doesn't happen for user buffers, so passing the seed around doesn't accomplish anything. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20241016201309.1090320-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30iommu: Make bus_iommu_probe() staticRobin Murphy
With the last external caller of bus_iommu_probe() now gone, make it internal as it really should be. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: H. Nikolaus Schaller <hns@goldelico.com> Reviewed-by: Kevin Hilman <khilman@baylibre.com> Tested-by: Kevin Hilman <khilman@baylibre.com> Tested-by: Beleswar Padhi <b-padhi@ti.com> Link: https://lore.kernel.org/r/a7511a034a27259aff4e14d80a861d3c40fbff1e.1730136799.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-10-30Merge branch 'work.fdtable' into vfs.fileChristian Brauner
Bring in the fdtable changes for this cycle. Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-30fs: port files to file_refChristian Brauner
Port files to rely on file_ref reference to improve scaling and gain overflow protection. - We continue to WARN during get_file() in case a file that is already marked dead is revived as get_file() is only valid if the caller already holds a reference to the file. This hasn't changed just the check changes. - The semantics for epoll and ttm's dmabuf usage have changed. Both epoll and ttm synchronize with __fput() to prevent the underlying file from beeing freed. (1) epoll Explaining epoll is straightforward using a simple diagram. Essentially, the mutex of the epoll instance needs to be taken in both __fput() and around epi_fget() preventing the file from being freed while it is polled or preventing the file from being resurrected. CPU1 CPU2 fput(file) -> __fput(file) -> eventpoll_release(file) -> eventpoll_release_file(file) mutex_lock(&ep->mtx) epi_item_poll() -> epi_fget() -> file_ref_get(file) mutex_unlock(&ep->mtx) mutex_lock(&ep->mtx); __ep_remove() mutex_unlock(&ep->mtx); -> kmem_cache_free(file) (2) ttm dmabuf This explanation is a bit more involved. A regular dmabuf file stashed the dmabuf in file->private_data and the file in dmabuf->file: file->private_data = dmabuf; dmabuf->file = file; The generic release method of a dmabuf file handles file specific things: f_op->release::dma_buf_file_release() while the generic dentry release method of a dmabuf handles dmabuf freeing including driver specific things: dentry->d_release::dma_buf_release() During ttm dmabuf initialization in ttm_object_device_init() the ttm driver copies the provided struct dma_buf_ops into a private location: struct ttm_object_device { spinlock_t object_lock; struct dma_buf_ops ops; void (*dmabuf_release)(struct dma_buf *dma_buf); struct idr idr; }; ttm_object_device_init(const struct dma_buf_ops *ops) { // copy original dma_buf_ops in private location tdev->ops = *ops; // stash the release method of the original struct dma_buf_ops tdev->dmabuf_release = tdev->ops.release; // override the release method in the copy of the struct dma_buf_ops // with ttm's own dmabuf release method tdev->ops.release = ttm_prime_dmabuf_release; } When a new dmabuf is created the struct dma_buf_ops with the overriden release method set to ttm_prime_dmabuf_release is passed in exp_info.ops: DEFINE_DMA_BUF_EXPORT_INFO(exp_info); exp_info.ops = &tdev->ops; exp_info.size = prime->size; exp_info.flags = flags; exp_info.priv = prime; The call to dma_buf_export() then sets mutex_lock_interruptible(&prime->mutex); dma_buf = dma_buf_export(&exp_info) { dmabuf->ops = exp_info->ops; } mutex_unlock(&prime->mutex); which creates a new dmabuf file and then install a file descriptor to it in the callers file descriptor table: ret = dma_buf_fd(dma_buf, flags); When that dmabuf file is closed we now get: fput(file) -> __fput(file) -> f_op->release::dma_buf_file_release() -> dput() -> d_op->d_release::dma_buf_release() -> dmabuf->ops->release::ttm_prime_dmabuf_release() mutex_lock(&prime->mutex); if (prime->dma_buf == dma_buf) prime->dma_buf = NULL; mutex_unlock(&prime->mutex); Where we can see that prime->dma_buf is set to NULL. So when we have the following diagram: CPU1 CPU2 fput(file) -> __fput(file) -> f_op->release::dma_buf_file_release() -> dput() -> d_op->d_release::dma_buf_release() -> dmabuf->ops->release::ttm_prime_dmabuf_release() ttm_prime_handle_to_fd() mutex_lock_interruptible(&prime->mutex) dma_buf = prime->dma_buf dma_buf && get_dma_buf_unless_doomed(dma_buf) -> file_ref_get(dma_buf->file) mutex_unlock(&prime->mutex); mutex_lock(&prime->mutex); if (prime->dma_buf == dma_buf) prime->dma_buf = NULL; mutex_unlock(&prime->mutex); -> kmem_cache_free(file) The logic of the mechanism is the same as for epoll: sync with __fput() preventing the file from being freed. Here the synchronization happens through the ttm instance's prime->mutex. Basically, the lifetime of the dma_buf and the file are tighly coupled. Both (1) and (2) used to call atomic_inc_not_zero() to check whether the file has already been marked dead and then refuse to revive it. This is only safe because both (1) and (2) sync with __fput() and thus prevent kmem_cache_free() on the file being called and thus prevent the file from being immediately recycled due to SLAB_TYPESAFE_BY_RCU. Both (1) and (2) have been ported from atomic_inc_not_zero() to file_ref_get(). That means a file that is already in the process of being marked as FILE_REF_DEAD: file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FIlE_REF_NOREF) atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) can be revived again: CPU1 CPU2 file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FIlE_REF_NOREF) file_ref_get() // Brings reference back to FILE_REF_ONEREF atomic_long_add_negative() atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) This is fine and inherent to the file_ref_get()/file_ref_put() semantics. For both (1) and (2) this is safe because __fput() is prevented from making progress if file_ref_get() fails due to the aforementioned synchronization mechanisms. Two cases need to be considered that affect both (1) epoll and (2) ttm dmabuf: (i) fput()'s file_ref_put() and marks the file as FILE_REF_NOREF but before that fput() can mark the file as FILE_REF_DEAD someone manages to sneak in a file_ref_get() and brings the refcount back from FILE_REF_NOREF to FILE_REF_ONEREF. In that case the original fput() doesn't call __fput(). For epoll the poll will finish and for ttm dmabuf the file can be used again. For ttm dambuf this is actually an advantage because it avoids immediately allocating a new dmabuf object. CPU1 CPU2 file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FIlE_REF_NOREF) file_ref_get() // Brings reference back to FILE_REF_ONEREF atomic_long_add_negative() atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) (ii) fput()'s file_ref_put() marks the file FILE_REF_NOREF and also suceeds in actually marking it FILE_REF_DEAD and then calls into __fput() to free the file. When either (1) or (2) call file_ref_get() they fail as atomic_long_add_negative() will return true. At the same time, both (1) and (2) all file_ref_get() under mutexes that __fput() must also acquire preventing kmem_cache_free() from freeing the file. So while this might be treated as a change in semantics for (1) and (2) it really isn't. It if should end up causing issues this can be fixed by adding a helper that does something like: long cnt = atomic_long_read(&ref->refcnt); do { if (cnt < 0) return false; } while (!atomic_long_try_cmpxchg(&ref->refcnt, &cnt, cnt + 1)); return true; which would block FILE_REF_NOREF to FILE_REF_ONEREF transitions. - Jann correctly pointed out that kmem_cache_zalloc() cannot be used anymore once files have been ported to file_ref_t. The kmem_cache_zalloc() call will memset() the whole struct file to zero when it is reallocated. This will also set file->f_ref to zero which mens that a concurrent file_ref_get() can return true: CPU1 CPU2 __get_file_rcu() rcu_dereference_raw() close() [frees file] alloc_empty_file() kmem_cache_zalloc() [reallocates same file] memset(..., 0, ...) file_ref_get() [increments 0->1, returns true] init_file() file_ref_init(..., 1) [sets to 0] rcu_dereference_raw() fput() file_ref_put() [decrements 0->FILE_REF_NOREF, frees file] [UAF] causing a concurrent __get_file_rcu() call to acquire a reference to the file that is about to be reallocated and immediately freeing it on realizing that it has been recycled. This causes a UAF for the task that reallocated/recycled the file. This is prevented by switching from kmem_cache_zalloc() to kmem_cache_alloc() and initializing the fields manually. With file->f_ref initialized last. Note that a memset() also isn't guaranteed to atomically update an unsigned long so it's theoretically possible to see torn and therefore bogus counter values. Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-29Merge tag 'wireless-next-2024-10-25' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.13 The first -next "new features" pull request for v6.13. This is a big one as we have not been able to send one earlier. We have also some patches affecting other subsystems: in staging we deleted the rtl8192e driver and in debugfs added a new interface to save struct file_operations memory; both were acked by GregKH. Because of the lib80211/libipw move there were quite a lot of conflicts and to solve those we decided to merge net-next into wireless-next. Major changes: cfg80211/mac80211 * stop exporting wext symbols * new mac80211 op to indicate that a new interface is to be added * support radio separation of multi-band devices Wireless Extensions * move wext spy implementation to libiw * remove iw_public_data from struct net_device brcmfmac * optional LPO clock support ipw2x00 * move remaining lib80211 code into libiw wilc1000 * WILC3000 support rtw89 * RTL8852BE and RTL8852BE-VT BT-coexistence improvements * tag 'wireless-next-2024-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (126 commits) mac80211: Remove NOP call to ieee80211_hw_config wifi: iwlwifi: work around -Wenum-compare-conditional warning wifi: mac80211: re-order assigning channel in activate links wifi: mac80211: convert debugfs files to short fops debugfs: add small file operations for most files wifi: mac80211: remove misleading j_0 construction parts wifi: mac80211_hwsim: use hrtimer_active() wifi: mac80211: refactor BW limitation check for CSA parsing wifi: mac80211: filter on monitor interfaces based on configured channel wifi: mac80211: refactor ieee80211_rx_monitor wifi: mac80211: add support for the monitor SKIP_TX flag wifi: cfg80211: add monitor SKIP_TX flag wifi: mac80211: add flag to opt out of virtual monitor support wifi: cfg80211: pass net_device to .set_monitor_channel wifi: mac80211: remove status->ampdu_delimiter_crc wifi: cfg80211: report per wiphy radio antenna mask wifi: mac80211: use vif radio mask to limit creating chanctx wifi: mac80211: use vif radio mask to limit ibss scan frequencies wifi: cfg80211: add option for vif allowed radios wifi: iwlwifi: allow IWL_FW_CHECK() with just a string ... ==================== Link: https://patch.msgid.link/20241025170705.5F6B2C4CEC3@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29of/fdt: add dt_phys arg to early_init_dt_scan and early_init_dt_verifyUsama Arif
__pa() is only intended to be used for linear map addresses and using it for initial_boot_params which is in fixmap for arm64 will give an incorrect value. Hence save the physical address when it is known at boot time when calling early_init_dt_scan for arm64 and use it at kexec time instead of converting the virtual address using __pa(). Note that arm64 doesn't need the FDT region reserved in the DT as the kernel explicitly reserves the passed in FDT. Therefore, only a debug warning is fixed with this change. Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Fixes: ac10be5cdbfa ("arm64: Use common of_kexec_alloc_and_setup_fdt()") Link: https://lore.kernel.org/r/20241023171426.452688-1-usamaarif642@gmail.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2024-10-29io_uring/rsrc: move struct io_fixed_file to rsrc.h headerJens Axboe
There's no need for this internal structure to be visible, move it to the private rsrc.h header instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29io_uring: add support for fixed wait regionsJens Axboe
Generally applications have 1 or a few waits of waiting, yet they pass in a struct io_uring_getevents_arg every time. This needs to get copied and, in turn, the timeout value needs to get copied. Rather than do this for every invocation, allow the application to register a fixed set of wait regions that can simply be indexed when asking the kernel to wait on events. At ring setup time, the application can register a number of these wait regions and initialize region/index 0 upfront: struct io_uring_reg_wait *reg; reg = io_uring_setup_reg_wait(ring, nr_regions, &ret); /* set timeout and mark as set, sigmask/sigmask_sz as needed */ reg->ts.tv_sec = 0; reg->ts.tv_nsec = 100000; reg->flags = IORING_REG_WAIT_TS; where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(*reg). The above initializes index 0, but 63 other regions can be initialized, if needed. Now, instead of doing: struct __kernel_timespec timeout = { .tv_nsec = 100000, }; io_uring_submit_and_wait_timeout(ring, &cqe, nr, &t, NULL); to wait for events for each submit_and_wait, or just wait, operation, it can just reference the above region at offset 0 and do: io_uring_submit_and_wait_reg(ring, &cqe, nr, 0); to achieve the same goal of waiting 100usec without needing to copy both struct io_uring_getevents_arg (24b) and struct __kernel_timeout (16b) for each invocation. Struct io_uring_reg_wait looks as follows: struct io_uring_reg_wait { struct __kernel_timespec ts; __u32 min_wait_usec; __u32 flags; __u64 sigmask; __u32 sigmask_sz; __u32 pad[3]; __u64 pad2[2]; }; embedding the timeout itself in the region, rather than passing it as a pointer as well. Note that the signal mask is still passed as a pointer, both for compatability reasons, but also because there doesn't seem to be a lot of high frequency waits scenarios that involve setting and resetting the signal mask for each wait. The application is free to modify any region before a wait call, or it can use keep multiple regions with different settings to avoid needing to modify the same one for wait calls. Up to a page size of regions is mapped by default, allowing PAGE_SIZE / 64 available regions for use. The registered region must fit within a page. On a 4kb page size system, that allows for 64 wait regions if a full page is used, as the size of struct io_uring_reg_wait is 64b. The region registered must be aligned to io_uring_reg_wait in size. It's valid to register less than 64 entries. In network performance testing with zero-copy, this reduced the time spent waiting on the TX side from 3.12% to 0.3% and the RX side from 4.4% to 0.3%. Wait regions are fixed for the lifetime of the ring - once registered, they are persistent until the ring is torn down. The regions support minimum wait timeout as well as the regular waits. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29io_uring/register: add IORING_REGISTER_RESIZE_RINGSJens Axboe
Once a ring has been created, the size of the CQ and SQ rings are fixed. Usually this isn't a problem on the SQ ring side, as it merely controls the available number of requests that can be submitted in a single system call, and there's rarely a need to change that. For the CQ ring, it's a different story. For most efficient use of io_uring, it's important that the CQ ring never overflows. This means that applications must size it for the worst case scenario, which can be wasteful. Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize the existing rings. It takes a struct io_uring_params argument, the same one which is used to setup the ring initially, and resizes rings according to the sizes given. Certain properties are always inherited from the original ring setup, like SQE128/CQE32 and other setup options. The implementation only allows flag associated with how the CQ ring is sized and clamped. Existing unconsumed SQE and CQE entries are copied as part of the process. If either the SQ or CQ resized destination ring cannot hold the entries already present in the source rings, then the operation is failed with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new submissions, and the internal mapping holds the completion lock as well across moving CQ ring state. To prevent races between mmap and ring resizing, add a mutex that's solely used to serialize ring resize and mmap. mmap_sem can't be used here, as as fork'ed process may be doing mmaps on the ring as well. The ctx->resize_lock is held across mmap operations, and the resize will grab it before swapping out the already mapped new data. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29io_uring: kill 'imu' from struct io_kiocbJens Axboe
It's no longer being used, remove it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29io_uring: clean up cqe trace pointsPavel Begunkov
We have too many helpers posting CQEs, instead of tracing completion events before filling in a CQE and thus having to pass all the data, set the CQE first, pass it to the tracing helper and let it extract everything it needs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b83c1ca9ee5aed2df0f3bb743bf5ed699cce4c86.1729267437.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29io_uring/poll: get rid of per-hashtable bucket locksJens Axboe
Any access to the table is protected by ctx->uring_lock now anyway, the per-bucket locking doesn't buy us anything. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29io_uring/poll: get rid of unlocked cancel hashJens Axboe
io_uring maintains two hash lists of inflight requests: 1) ctx->cancel_table_locked. This is used when the caller has the ctx->uring_lock held already. This is only an issue side parameter, as removal or task_work will always have it held. 2) ctx->cancel_table. This is used when the issuer does NOT have the ctx->uring_lock held, and relies on the table spinlocks for access. However, it's pretty trivial to simply grab the lock in the one spot where we care about it, for insertion. With that, we can kill the unlocked table (and get rid of the _locked postfix for the other one). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29block: add a bdev_limits helperChristoph Hellwig
Add a helper to get the queue_limits from the bdev without having to poke into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29regmap: Merge up fixes from v6.12-rc3Mark Brown
For the benefit of CI.
2024-10-29platform/x86: wmi: Introduce to_wmi_driver()Armin Wolf
Introduce to_wmi_driver() as a replacement for dev_to_wdrv() so WMI drivers can use this support macro instead of having to duplicate its functionality. Signed-off-by: Armin Wolf <W_Armin@gmx.de> Link: https://lore.kernel.org/r/20241026193803.8802-3-W_Armin@gmx.de Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-10-29platform/x86: wmi: Replace dev_to_wdev() with to_wmi_device()Armin Wolf
Replace dev_to_wdev() with to_wmi_device() to stop duplicating functionality. Also switch to_wmi_device() to use container_of_const() so const values are handled correctly. Signed-off-by: Armin Wolf <W_Armin@gmx.de> Link: https://lore.kernel.org/r/20241026193803.8802-2-W_Armin@gmx.de Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-10-29iommu: Restore iommu_flush_iotlb_all()Joerg Roedel
This patch restores the iommu_flush_iotlb_all() function. Commit 69e5a17511f6 ("iommu: Remove useless flush from iommu_create_device_direct_mappings()") claims it removed the last call-site, except it did not. There is still at least one caller in drivers/gpu/drm/msm/msm_iommu.c so keep the function around until all call-sites are updated. Cc: Jason Gunthorpe <jgg@ziepe.ca> Fixes: 69e5a17511f6 ("iommu: Remove useless flush from iommu_create_device_direct_mappings()") Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>