Age | Commit message | Author
2025-02-28scsi: mpi3mr: Mark device strings as nonstringKees Cook
In preparation for memtostr*() checking that its source is marked as nonstring, annotate the device strings accordingly. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Signed-off-by: Kees Cook <kees@kernel.org>
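The annotation described above can be illustrated with a small userspace sketch. It assumes a compiler that understands the `nonstring` attribute (it falls back to no annotation otherwise). The struct, the field name and `mem_to_str_sketch()` are hypothetical stand-ins, not the driver's data structures or the kernel's memtostr() implementation.

```c
#include <stddef.h>
#include <string.h>

/* Use the nonstring attribute only where the compiler knows it. */
#if defined(__has_attribute)
# if __has_attribute(nonstring)
#  define NONSTRING_SKETCH __attribute__((nonstring))
# endif
#endif
#ifndef NONSTRING_SKETCH
# define NONSTRING_SKETCH
#endif

/* Hypothetical device record: fixed-width field with no NUL terminator. */
struct fw_inquiry_sketch {
	char vendor_id[8] NONSTRING_SKETCH;
};

/*
 * Hypothetical stand-in for a memtostr()-style helper: copy a fixed-width
 * byte array into a larger buffer and always NUL-terminate the result.
 */
static void mem_to_str_sketch(char *dst, size_t dst_len,
			      const char *src, size_t src_len)
{
	size_t n;

	if (dst_len == 0)
		return;
	n = src_len < dst_len - 1 ? src_len : dst_len - 1;
	memcpy(dst, src, n);
	dst[n] = '\0';
}

/* Fill the field with exactly 8 bytes, then convert it to a C string. */
static const char *vendor_to_str_sketch(void)
{
	static char out[16];
	struct fw_inquiry_sketch inq;

	memcpy(inq.vendor_id, "ACMEDISK", sizeof(inq.vendor_id));
	mem_to_str_sketch(out, sizeof(out), inq.vendor_id, sizeof(inq.vendor_id));
	return out;
}
```

The point of the annotation is to tell the compiler that the array is deliberately not NUL-terminated, so only length-bounded helpers should read it.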
2025-02-28scsi: mptfusion: Mark device strings as nonstringKees Cook
In preparation for memtostr*() checking that its source is marked as nonstring, annotate the device strings accordingly. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-28fortify: Move FORTIFY_SOURCE under 'Kernel hardening options'Mel Gorman
FORTIFY_SOURCE is a hardening option both at build and runtime. Move it under 'Kernel hardening options'. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20250123221115.19722-5-mgorman@techsingularity.net Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-28mm: security: Check early if HARDENED_USERCOPY is enabledMel Gorman
HARDENED_USERCOPY is checked within a function so even if disabled, the function overhead still exists. Move the static check inline. This is at best a micro-optimisation and any difference in performance was within noise but it is relatively consistent with the init_on_* implementations. Suggested-by: Kees Cook <kees@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20250123221115.19722-4-mgorman@techsingularity.net Signed-off-by: Kees Cook <kees@kernel.org>
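A minimal userspace sketch of the idea of moving the enabled check inline, so that the out-of-line function is never called when the feature is off. A plain `bool` stands in for the kernel's static key, and every name here is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a static key; the kernel would use a jump label instead. */
static bool usercopy_checks_enabled_sketch;

/* Counts how often the out-of-line checker actually ran. */
static unsigned int slow_path_calls_sketch;

/* Out-of-line slow path: pretend to validate the object bounds. */
static void slow_check_object_size_sketch(const void *ptr, size_t n)
{
	(void)ptr;
	(void)n;
	slow_path_calls_sketch++;
}

/* Inline wrapper: when checks are off, no function call is made at all. */
static inline void check_object_size_sketch(const void *ptr, size_t n)
{
	if (usercopy_checks_enabled_sketch)
		slow_check_object_size_sketch(ptr, n);
}

/* Run one check with the given setting and report total slow-path calls. */
static unsigned int run_check_sketch(bool enabled)
{
	char buf[16] = { 0 };

	usercopy_checks_enabled_sketch = enabled;
	check_object_size_sketch(buf, sizeof(buf));
	return slow_path_calls_sketch;
}
```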
2025-02-28mm: security: Allow default HARDENED_USERCOPY to be set at compile timeMel Gorman
HARDENED_USERCOPY defaults to on if enabled at compile time. Allow hardened_usercopy= default to be set at compile time similar to init_on_alloc= and init_on_free=. The intent is that hardening options that can be disabled at runtime can set their default at build time. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20250123221115.19722-3-mgorman@techsingularity.net Signed-off-by: Kees Cook <kees@kernel.org>
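The pattern of a build-time default that a boot parameter can override might look roughly like this in userspace. The macro name and the accepted strings ("on"/"off") are assumptions for illustration; the kernel's actual option name and parser are not shown in this log.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical build-time knob; a real kernel would derive this from Kconfig. */
#ifndef USERCOPY_DEFAULT_ON_SKETCH
#define USERCOPY_DEFAULT_ON_SKETCH 1
#endif

/*
 * Resolve the effective setting: start from the build-time default and let
 * an optional boot-style argument ("on" or "off") override it. Unknown or
 * missing values keep the default.
 */
static bool resolve_usercopy_sketch(const char *arg)
{
	bool enabled = USERCOPY_DEFAULT_ON_SKETCH;

	if (arg == NULL)
		return enabled;
	if (strcmp(arg, "on") == 0)
		enabled = true;
	else if (strcmp(arg, "off") == 0)
		enabled = false;
	return enabled;
}
```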
2025-02-28mm: security: Move hardened usercopy under 'Kernel hardening options'Mel Gorman
There is a submenu for 'Kernel hardening options' under "Security". Move HARDENED_USERCOPY under the hardening options as it is clearly related. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20250123221115.19722-2-mgorman@techsingularity.net Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-28uaccess: Introduce ucopysize.hKees Cook
The object size sanity checking macros that uaccess.h and uio.h use have been living in thread_info.h for historical reasons. Needing to use jump labels for these checks, however, introduces a header include loop under certain conditions. The dependencies for the object checking macros are very limited, but they are used by separate header files, so introduce a new header that can be used directly by uaccess.h and uio.h. As a result, this also means thread_info.h (which is rather large) can be removed from those headers. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502281153.TG2XK5SI-lkp@intel.com/ Signed-off-by: Kees Cook <kees@kernel.org>
2025-02-28MAINTAINERS: add rust bindings entry for bitmap APIYury Norov [NVIDIA]
This entry enumerates bitmap and related APIs listed in BITMAP API entry that rust requires but cannot use directly (i.e. inlined functions and macros).

The "Rust kernel policy" (https://rust-for-linux.com/rust-kernel-policy) document describes the special status of rust support: "Exceptionally, for Rust, a subsystem may allow to temporarily break Rust code."

Accordingly, the following policy applies to all interfaces under the BITMAP API entry that are used in rust codebase, including those not listed explicitly here.

Bitmap developers do their best to keep the API stable. When API or user-visible behavior needs to be changed such that it breaks rust, bitmap and rust developers collaborate as follows:
 - bitmap developers don't consider rust bindings as a blocker for the API change;
 - bindings maintainer (me) makes sure that kernel build doesn't break with CONFIG_RUST=y. This implies fixes in the binding layer, but not in rust codebase;
 - rust developers adopt new version of API in their codebase and remove unused bindings timely.

CC: Danilo Krummrich <dakr@redhat.com> CC: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
2025-02-28rust: Add cpumask helpersViresh Kumar
In order to prepare for adding Rust abstractions for cpumask, add the required helpers for inline cpumask functions that cannot be called by rust code directly. Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
2025-02-28Merge tag 'block-6.14-20250228' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe:
 - Fix plugging for native zone writes
 - Fix segment limit settings for != 4K page size archs
 - Fix for slab names overflowing

* tag 'block-6.14-20250228' of git://git.kernel.dk/linux:
  block: fix 'kmem_cache of name 'bio-108' already exists'
  block: Remove zone write plugs when handling native zone append writes
  block: make segment size limit workable for > 4K PAGE_SIZE
2025-02-28uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"I Hsin Cheng
This patch reverts commit c32ee3d9abd2 ("bitops: avoid integer overflow in GENMASK(_ULL)"). Reverting this commit shrinks the generated code by over 1KB. Originally the commit claimed that clang would emit warnings using the implementation at that time.

The patch was applied and tested against numerous compilers, including gcc-13, gcc-12, gcc-11 cross-compiler, clang-17, clang-18 and clang-19. Various warning levels were set (-W=0, -W=1, -W=2) and CONFIG_WERROR disabled to complete the compilation. The results show that no compilation errors or warnings were generated due to the patch.

The results of code size reduction are summarized in the following tables. The code size changes for clang are all zero across different versions, so they're not listed.

For NR_CPUS=64 on x86_64.
----------------------------------------------
|           |  gcc-13  |  gcc-12  |  gcc-11  |
----------------------------------------------
|    old    | 22438085 | 22453915 | 22302033 |
----------------------------------------------
|    new    | 22436816 | 22452913 | 22300826 |
----------------------------------------------
| new - old |  -1269   |  -1002   |  -1207   |
----------------------------------------------

For NR_CPUS=1024 on x86_64.
----------------------------------------------
|           |  gcc-13  |  gcc-12  |  gcc-11  |
----------------------------------------------
|    old    | 22493682 | 22509812 | 22357661 |
----------------------------------------------
|    new    | 22493230 | 22509487 | 22357250 |
----------------------------------------------
| new - old |   -452   |   -325   |   -411   |
----------------------------------------------

For arm64 architecture, gcc cross-compiler was used and QEMU was utilized to execute a VM for a CPU-heavy workload to ensure no side effects and that functionalities remained correct. The test even demonstrated a positive result in terms of code size reduction:
 * Before: 31660668
 * After: 31658724
 * Difference (After - Before): -1944

An analysis of multiple functions compiled with gcc-13 on x86_64 was performed. In summary, the patch eliminates one negation in almost every use case. However, negative effects may occur in some cases, such as the generation of an additional "mov" instruction or increased register usage. The use of "~_UL(0) << (l)" may even result in the allocation of "%r*" registers instead of "%e*" registers (which are 32-bit registers) because the compiler cannot assume that the higher bits are zero.

Yury: We limit GENMASK() usage with the const_true(l > h) condition, and most users just call it with constant parameters. For those, the actual implementation of the macro doesn't matter, and since it triggered clang warnings back then, it was reasonable to work around the warnings on the kernel side. Now that some find_bit() functions call GENMASK() with runtime parameters (although the const_true() condition holds), this ended up hurting the generated code, as I Hsin discovered. This is especially bad because it hurts the small_const_nbits() optimization, where people are most concerned about generated code quality. So, revert it to the original version for good.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Signed-off-by: Yury Norov <yury.norov@gmail.com>
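To make the two macro forms concrete, here is a userspace sketch comparing a shift-based mask with a subtraction-based one. The macro bodies are my reading of the two variants (the shift form matches the "~_UL(0) << (l)" expression quoted above); they are written against a fixed 64-bit width and are not copied from the kernel headers.

```c
#include <stdbool.h>

#define BITS_SKETCH 64

/* Shift form: set bits l..63, then clear everything above bit h. */
#define GENMASK_SHIFT_SKETCH(h, l) \
	((~0ULL << (l)) & (~0ULL >> (BITS_SKETCH - 1 - (h))))

/* Subtract form: build the low edge as ~0 - (1 << l) + 1 instead of shifting. */
#define GENMASK_SUB_SKETCH(h, l) \
	(((~0ULL) - (1ULL << (l)) + 1) & (~0ULL >> (BITS_SKETCH - 1 - (h))))

/* Check that both forms produce the same mask for every valid (h, l) pair. */
static bool genmask_forms_agree_sketch(void)
{
	unsigned int h, l;

	for (h = 0; h < BITS_SKETCH; h++)
		for (l = 0; l <= h; l++)
			if (GENMASK_SHIFT_SKETCH(h, l) != GENMASK_SUB_SKETCH(h, l))
				return false;
	return true;
}
```

Both forms yield identical values, so the choice only affects compiler diagnostics and the code generated for runtime arguments, which is what the size tables above measure.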
2025-02-28KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQsSean Christopherson
Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle debugctl bits from IRQ context, e.g. when enabling/disabling events via smp_call_function_single(). Taking the snapshot (long) before IRQs are disabled could result in KVM effectively clobbering DEBUGCTL due to using a stale snapshot. Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabledSean Christopherson
Manually load the guest's DEBUGCTL prior to VMRUN (and restore the host's value on #VMEXIT) if it diverges from the host's value and LBR virtualization is disabled, as hardware only context switches DEBUGCTL if LBR virtualization is fully enabled.

Running the guest with the host's value has likely been mildly problematic for quite some time, e.g. it will result in undesirable behavior if BTF diverges (with the caveat that KVM now suppresses guest BTF due to lack of support). But the bug became fatal with the introduction of Bus Lock Trap ("Detect" in kernel parlance) support for AMD (commit 408eb7417a92 ("x86/bus_lock: Add support for AMD")), as a bus lock in the guest will trigger an unexpected #DB.

Note, suppressing the bus lock #DB, i.e. simply resuming the guest without injecting a #DB, is not an option. It wouldn't address the general issue with DEBUGCTL, e.g. for things like BTF, and there are other guest-visible side effects if BusLockTrap is left enabled. If BusLockTrap is disabled, then DR6.BLD is reserved-to-1; any attempts to clear it by software are ignored. But if BusLockTrap is enabled, software can clear DR6.BLD:

  Software enables bus lock trap by setting DebugCtl MSR[BLCKDB] (bit 2) to 1. When bus lock trap is enabled, ... The processor indicates that this #DB was caused by a bus lock by clearing DR6[BLD] (bit 11). DR6[11] previously had been defined to be always 1.

and clearing DR6.BLD is "sticky" in that it's not set (i.e. lowered) by other #DBs:

  All other #DB exceptions leave DR6[BLD] unmodified

E.g. leaving BusLockTrap enabled can confuse a legacy guest that writes '0' to reset DR6.
Reported-by: rangemachine@gmail.com Reported-by: whanos@sergal.fun Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219787 Closes: https://lore.kernel.org/all/bug-219787-28872@https.bugzilla.kernel.org%2F Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: x86: Snapshot the host's DEBUGCTL in common x86Sean Christopherson
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in common x86, so that SVM can also use the snapshot. Opportunistically change the field to a u64. While bits 63:32 are reserved on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Suppress DEBUGCTL.BTF on AMDSean Christopherson
Mark BTF as reserved in DEBUGCTL on AMD, as KVM doesn't actually support BTF, and fully enabling BTF virtualization is non-trivial due to interactions with the emulator, guest_debug, #DB interception, nested SVM, etc. Don't inject #GP if the guest attempts to set BTF, as there's no way to communicate lack of support to the guest, and instead suppress the flag and treat the WRMSR as (partially) unsupported. In short, make KVM behave the same on AMD and Intel (VMX already squashes BTF). Note, due to other bugs in KVM's handling of DEBUGCTL, the only way BTF has "worked" in any capacity is if the guest simultaneously enables LBRs. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective valueSean Christopherson
Drop bits 5:2 from the guest's effective DEBUGCTL value, as AMD changed the architectural behavior of the bits and broke backwards compatibility. On CPUs without BusLockTrap (or at least, in APMs from before ~2023), bits 5:2 controlled the behavior of external pins:

  Performance-Monitoring/Breakpoint Pin-Control (PBi)—Bits 5:2, read/write. Software uses these bits to control the type of information reported by the four external performance-monitoring/breakpoint pins on the processor. When a PBi bit is cleared to 0, the corresponding external pin (BPi) reports performance-monitor information. When a PBi bit is set to 1, the corresponding external pin (BPi) reports breakpoint information.

With the introduction of BusLockTrap, presumably to be compatible with Intel CPUs, AMD redefined bit 2 to be BLCKDB:

  Bus Lock #DB Trap (BLCKDB)—Bit 2, read/write. Software sets this bit to enable generation of a #DB trap following successful execution of a bus lock when CPL is > 0.

and redefined bits 5:3 (and bit 6) as "6:3 Reserved MBZ".

Ideally, KVM would treat bits 5:2 as reserved. Defer that change to a feature cleanup to avoid breaking existing guests in LTS kernels. For now, drop the bits to retain backwards compatibility (of a sort).

Note, dropping bits 5:2 is still a guest-visible change, e.g. if the guest is enabling LBRs *and* the legacy PBi bits, then the state of the PBi bits is visible to the guest, whereas now the guest will always see '0'.

Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
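As a rough illustration of "dropping" bits 5:2, the masking step could be sketched as follows. The macro and function names are hypothetical and this is not KVM's code; it only shows the bit arithmetic on a 64-bit MSR value.

```c
#include <stdint.h>

/* Bits 5:2 of DEBUGCTL, the legacy PBi field described above. */
#define DEBUGCTL_PB_MASK_SKETCH (UINT64_C(0xF) << 2)

/* Return the value the guest would effectively run with: bits 5:2 cleared. */
static uint64_t guest_effective_debugctl_sketch(uint64_t guest_val)
{
	return guest_val & ~DEBUGCTL_PB_MASK_SKETCH;
}
```

For example, a guest write of 0x3F (bits 5:0 set) would leave only bits 1:0 in the effective value under this sketch.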
2025-02-28KVM: selftests: Assert that STI blocking isn't set after event injectionSean Christopherson
Add an L1 (guest) assert to the nested exceptions test to verify that KVM doesn't put VMRUN in an STI shadow (AMD CPUs bleed the shadow into the guest's int_state if a #VMEXIT occurs before VMRUN fully completes). Add a similar assert to the VMX side as well, because why not. Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250224165442.2338294-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadowSean Christopherson
Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff() so that VMRUN is not executed in an STI shadow. AMD CPUs have a quirk (some would say "bug"), where the STI shadow bleeds into the guest's intr_state field if a #VMEXIT occurs during injection of an event, i.e. if the VMRUN doesn't complete before the subsequent #VMEXIT. The spurious "interrupts masked" state is relatively benign, as it only occurs during event injection and is transient. Because KVM is already injecting an event, the guest can't be in HLT, and if KVM is querying IRQ blocking for injection, then KVM would need to force an immediate exit anyways since injecting multiple events is impossible. However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the spurious STI shadow is visible to L1 when running a nested VM, which can trip sanity checks, e.g. in VMware's VMM. Hoist the STI+CLI all the way to C code, as the aforementioned calls to guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already possible. I.e. if there's kernel code that is confused by running with RFLAGS.IF=1, then it's already a problem. In practice, since GIF=0 also blocks NMIs, the only change in exposure to non-KVM code (relative to surrounding VMRUN with STI+CLI) is exception handling code, and except for the kvm_rebooting=1 case, all exceptions in the core VM-Enter/VM-Exit path are fatal. Use the "raw" variants to enable/disable IRQs to avoid tracing in the "no instrumentation" code; the guest state helpers also take care of tracing IRQ state. Opportunistically document why KVM needs to do STI in the first place.
Reported-by: Doug Covelli <doug.covelli@broadcom.com> Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com Fixes: f14eec0a3203 ("KVM: SVM: move more vmentry code to assembly") Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28Merge tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe: "Just a single fix headed for stable, ensuring that msg_control is properly saved in compat mode as well" * tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linux: io_uring/net: save msg_control for compat
2025-02-28Merge tag 'efi-fixes-for-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efiLinus Torvalds
Pull EFI fixes from Ard Biesheuvel:

 "Another couple of EFI fixes for v6.14.

  Only James's patch stands out, as it implements a workaround for odd behavior in fwupd in user space, which creates EFI variables by touching a file in efivarfs, clearing the immutable bit (which gets set automatically for $reasons) and then opening it again for writing, none of which is really necessary.

  The fwupd author and LVFS maintainer is already rolling out a fix for this on the fwupd side, and suggested that the workaround in this PR could be backed out again during the next cycle.

  (There is a semantic mismatch in efivarfs where some essential variable attributes are stored in the first 4 bytes of the file, and so zero length files cannot exist, as they cannot be written back to the underlying variable store. So now, they are dropped once the last reference is released.)

  Summary:
   - Fix CPER error record parsing bugs
   - Fix a couple of efivarfs issues that were introduced in the merge window
   - Fix an issue in the early remapping code of the MOKvar table"

* tag 'efi-fixes-for-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi/mokvar-table: Avoid repeated map/unmap of the same page
  efi: Don't map the entire mokvar table to determine its size
  efivarfs: allow creation of zero length files
  efivarfs: Defer PM notifier registration until .fill_super
  efi/cper: Fix cper_arm_ctx_info alignment
  efi/cper: Fix cper_ia_proc_ctx alignment
2025-02-28x86/mm: Reduce header dependencies in <asm/set_memory.h>Kevin Brodsky
Commit: 03b122da74b2 ("x86/sgx: Hook arch_memory_failure() into mainline code") ... added <linux/mm.h> to <asm/set_memory.h> to provide some helpers. However the following commit: b3fdf9398a16 ("x86/mce: relocate set{clear}_mce_nospec() functions") ... moved the inline definitions someplace else, and now <asm/set_memory.h> just declares a bunch of mostly self-contained functions. No need for the whole <linux/mm.h> inclusion to declare functions; just remove that include. This helps avoid circular dependency headaches (e.g. if <linux/mm.h> ends up including <linux/set_memory.h>). This change requires a couple of include fixups not to break the build: * <asm/smp.h>: including <asm/thread_info.h> directly relies on <linux/thread_info.h> having already been included, because the former needs the BAD_STACK/NOT_STACK constants defined in the latter. This is no longer the case when <asm/smp.h> is included from some driver file - just include <linux/thread_info.h> to stay out of trouble. * sev-guest.c relies on <asm/set_memory.h> including <linux/mm.h>, so we just need to make that include explicit. [ mingo: Cleaned up the changelog ] Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20241212080904.2089632-3-kevin.brodsky@arm.com
2025-02-28x86/mm: Remove unused __set_memory_prot()Kevin Brodsky
__set_memory_prot() is unused since: 5c11f00b09c1 ("x86: remove memory hotplug support on X86_32") Let's remove it. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20241212080904.2089632-2-kevin.brodsky@arm.com
2025-02-28Merge tag 'i2c-host-fixes-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-currentWolfram Sang
i2c-host-fixes for v6.14-rc5
 - npcm fixes interrupt initialization sequence.
 - ls2x fixes frequency setting.
 - amd-asf re-enables interrupts properly at irq handler's exit.
2025-02-28gpiolib: Fix Oops in gpiod_direction_input_nonotify()Dan Carpenter
The gpiod_direction_input_nonotify() function is supposed to return zero if the direction for the pin is input. But instead it accidentally returns GPIO_LINE_DIRECTION_IN (1) which will be cast into an ERR_PTR() in gpiochip_request_own_desc(). The callers dereference it and it leads to a crash. I changed gpiod_direction_output_raw_commit() just for consistency but returning GPIO_LINE_DIRECTION_OUT (0) is fine. Cc: stable@vger.kernel.org Fixes: 9d846b1aebbe ("gpiolib: check the return value of gpio_chip::get_direction()") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/254f3925-3015-4c9d-aac5-bb9b4b2cd2c5@stanley.mountain [Bartosz: moved the variable declarations to the top of the functions] Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
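Why a stray positive return value leads to a crash can be shown with a userspace sketch of the error-pointer convention. These helpers only mimic the general idea behind ERR_PTR()/IS_ERR() (errors are small negative numbers encoded at the top of the address space); the names and the 4095 limit are restated here as assumptions, not copied from kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode an error code in a pointer value, in the style of ERR_PTR(). */
static void *err_ptr_sketch(long error)
{
	return (void *)(intptr_t)error;
}

/* Only values in the range -4095..-1 are recognised as errors. */
static bool is_err_sketch(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO_SKETCH);
}
```

With this convention, `err_ptr_sketch(-22)` is recognised as an error, but `err_ptr_sketch(1)` is not: it looks like a valid pointer with address 0x1, so a caller that trusts the check goes on to dereference it.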
2025-02-28block: fix 'kmem_cache of name 'bio-108' already exists'Ming Lei
Device mapper biosets often have a large bio_slab size, which can exceed 1000. An 8-byte buffer can then no longer hold the slab name, which triggers the kmem_cache allocation warning 'kmem_cache of name 'bio-108' already exists'. Fix the warning by extending bio_slab->name to 12 bytes, which also fixes the output of /proc/slabinfo. Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
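The name collision can be reproduced in userspace with a short sketch. It assumes the name is produced with an snprintf()-style call using a "bio-<size>" pattern; the sizes 1080 and 1088 are made-up values chosen only to show the truncation.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Format a slab name; snprintf() silently truncates if buf is too small. */
static void format_slab_name_sketch(char *buf, size_t len, unsigned int size)
{
	snprintf(buf, len, "bio-%u", size);
}

/* Report whether two sizes end up with the same name for a given buffer length. */
static bool names_collide_sketch(size_t len, unsigned int a, unsigned int b)
{
	char name_a[16] = { 0 };
	char name_b[16] = { 0 };

	if (len > sizeof(name_a))
		len = sizeof(name_a);
	format_slab_name_sketch(name_a, len, a);
	format_slab_name_sketch(name_b, len, b);
	return strcmp(name_a, name_b) == 0;
}
```

With an 8-byte buffer, both sizes are cut down to the 7 characters "bio-108" plus the terminating NUL, so the second cache creation sees a duplicate name. With 12 bytes, the names stay distinct.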
2025-02-28iommu/vt-d: Fix suspicious RCU usageLu Baolu
Commit <d74169ceb0d2> ("iommu/vt-d: Allocate DMAR fault interrupts locally") moved the call to enable_drhd_fault_handling() to a code path that does not hold any lock while traversing the drhd list. Fix it by ensuring the dmar_global_lock lock is held when traversing the drhd list.

Without this fix, the following warning is triggered:

 =============================
 WARNING: suspicious RCU usage
 6.14.0-rc3 #55 Not tainted
 -----------------------------
 drivers/iommu/intel/dmar.c:2046 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 1, debug_locks = 1
 2 locks held by cpuhp/1/23:
 #0: ffffffff84a67c50 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0
 #1: ffffffff84a6a380 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0

 stack backtrace:
 CPU: 1 UID: 0 PID: 23 Comm: cpuhp/1 Not tainted 6.14.0-rc3 #55
 Call Trace:
  <TASK>
  dump_stack_lvl+0xb7/0xd0
  lockdep_rcu_suspicious+0x159/0x1f0
  ? __pfx_enable_drhd_fault_handling+0x10/0x10
  enable_drhd_fault_handling+0x151/0x180
  cpuhp_invoke_callback+0x1df/0x990
  cpuhp_thread_fun+0x1ea/0x2c0
  smpboot_thread_fn+0x1f5/0x2e0
  ? __pfx_smpboot_thread_fn+0x10/0x10
  kthread+0x12a/0x2d0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x4a/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>

Holding the lock in enable_drhd_fault_handling() triggers a lockdep splat about a possible deadlock between dmar_global_lock and cpu_hotplug_lock. This is avoided by not holding dmar_global_lock when calling iommu_device_register(), which initiates the device probe process.
Fixes: d74169ceb0d2 ("iommu/vt-d: Allocate DMAR fault interrupts locally") Reported-and-tested-by: Ido Schimmel <idosch@nvidia.com> Closes: https://lore.kernel.org/linux-iommu/Zx9OwdLIc_VoQ0-a@shredder.mtl.com/ Tested-by: Breno Leitao <leitao@debian.org> Cc: stable@vger.kernel.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20250218022422.2315082-1-baolu.lu@linux.intel.com Tested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-02-28iommu/vt-d: Remove device comparison in context_setup_pass_through_cbJerry Snitselaar
Remove the device comparison check in context_setup_pass_through_cb. pci_for_each_dma_alias already makes a decision on whether the callback function should be called for a device. With the check in place it will fail to create context entries for aliases as it walks up to the root bus. Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain") Closes: https://lore.kernel.org/linux-iommu/82499eb6-00b7-4f83-879a-e97b4144f576@linux.intel.com/ Cc: stable@vger.kernel.org Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com> Link: https://lore.kernel.org/r/20250224180316.140123-1-jsnitsel@redhat.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-02-28iommu/amd: Preserve default DTE fields when updating Host Page Table RootAlejandro Jimenez
When updating the page table root field on the DTE, avoid overwriting any bits that are already set. The earlier call to make_clear_dte() writes default values that all DTEs must have set (currently DTE[V]), and those must be preserved. Currently this doesn't cause problems since the page table root update is the first field that is set after make_clear_dte() is called, and DTE_FLAG_V is set again later along with the permission bits (IR/IW). Remove this redundant assignment too. Fixes: fd5dff9de4be ("iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers") Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20250106191413.3107140-1-alejandro.j.jimenez@oracle.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
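The difference between overwriting a word and OR-ing a field into it is the core of this fix, and can be sketched in a few lines. The bit layout (valid flag in bit 0) and all names below are hypothetical simplifications; the real DTE is a 256-bit structure with its own helpers.

```c
#include <stdint.h>

/* Hypothetical layout for illustration: bit 0 is the valid flag. */
#define DTE_FLAG_V_SKETCH UINT64_C(1)

/* Stand-in for make_clear_dte(): returns a word holding only the defaults. */
static uint64_t make_clear_dte_sketch(void)
{
	return DTE_FLAG_V_SKETCH;
}

/* Plain assignment: the defaults written earlier are lost. */
static uint64_t set_root_overwrite_sketch(uint64_t dte, uint64_t root_bits)
{
	(void)dte;
	return root_bits;
}

/* OR in the new field: bits that were already set survive. */
static uint64_t set_root_preserve_sketch(uint64_t dte, uint64_t root_bits)
{
	return dte | root_bits;
}
```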
2025-02-28x86/cpufeatures: Rename X86_CMPXCHG64 to X86_CX8H. Peter Anvin (Intel)
Replace X86_CMPXCHG64 with X86_CX8, as CX8 is the name of the CPUID flag, which makes it consistent with X86_FEATURE_CX8 defined in <asm/cpufeatures.h>. No functional change intended. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250228082338.73859-2-xin@zytor.com
2025-02-28Merge patch series "Remove accesses to page->index from ceph"Christian Brauner
Remove page->index access from ceph. * patches from https://lore.kernel.org/r/20250217185119.430193-1-willy@infradead.org: fs: Remove page_mkwrite_check_truncate() ceph: Pass a folio to ceph_allocate_page_array() ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array() ceph: Remove uses of page from ceph_process_folio_batch() ceph: Convert ceph_check_page_before_write() to use a folio ceph: Convert writepage_nounlock() to write_folio_nounlock() ceph: Convert ceph_readdir_cache_control to store a folio ceph: Convert ceph_find_incompatible() to take a folio ceph: Use a folio in ceph_page_mkwrite() ceph: Remove ceph_writepage() Link: https://lore.kernel.org/r/20250217185119.430193-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28fs: Remove page_mkwrite_check_truncate()Matthew Wilcox (Oracle)
All callers of this function have now been converted to use folio_mkwrite_check_truncate(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250221204421.3590340-1-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Pass a folio to ceph_allocate_page_array()Matthew Wilcox (Oracle)
Remove two accesses to page->index. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-10-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array()Matthew Wilcox (Oracle)
Shorten the name of this internal function by dropping the 'ceph_' prefix and pass in a folio instead of a page. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-9-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Remove uses of page from ceph_process_folio_batch()Matthew Wilcox (Oracle)
Remove uses of page->index and deprecated page APIs. Saves a lot of hidden calls to compound_head(). Also convert is_page_index_contiguous() to is_folio_index_contiguous() and make its arguments const. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-8-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert ceph_check_page_before_write() to use a folioMatthew Wilcox (Oracle)
Remove the conversion back to a struct page and just use the folio passed in. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-7-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert writepage_nounlock() to write_folio_nounlock()Matthew Wilcox (Oracle)
Remove references to page->index, page->mapping, thp_size(), page_offset() and other page APIs in favour of their more efficient folio replacements. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-6-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert ceph_readdir_cache_control to store a folioMatthew Wilcox (Oracle)
Pass a folio around instead of a page. This removes an access to page->index and a few hidden calls to compound_head(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-5-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Convert ceph_find_incompatible() to take a folioMatthew Wilcox (Oracle)
Both callers already have the folio. Pass it in and use it throughout. Removes some hidden calls to compound_head() and a reference to page->mapping. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-4-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Use a folio in ceph_page_mkwrite()Matthew Wilcox (Oracle)
Convert the passed page to a folio and use it throughout ceph_page_mkwrite(). Removes the last call to page_mkwrite_check_truncate(), the last call to offset_in_thp() and one of the last calls to thp_size(). Saves a few calls to compound_head(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-3-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: Remove ceph_writepage()Matthew Wilcox (Oracle)
Ceph already has a writepages operation which is preferred over writepage in all situations except for page migration. By adding a migrate_folio operation, there will be no situations in which ->writepage should be called. filemap_migrate_folio() is an appropriate operation to use because the ceph data stored in folio->private does not contain any reference to the memory address of the folio. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250217185119.430193-2-willy@infradead.org Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28Merge patch series "ceph: fix generic/421 test failure"Christian Brauner
Viacheslav Dubeyko <slava@dubeyko.com> says:

From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

The generic/421 test fails to finish because of the following issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are several issues here:
(1) ceph_kill_sb() doesn't wait for the flushing of all dirty folios/pages to end because of the racy nature of mdsc->stopping_blockers. As a result, mdsc->stopping becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) ceph_inc_osd_stopping_blocker(fsc->mdsc) fails to increment mdsc->stopping_blockers. As a result, already locked folios/pages are never unlocked and the logic tries to lock the same page a second time.
(3) The folio_batch of dirty pages found by filemap_get_folios_tag() is not processed properly. This is why some dirty pages are never processed and dirty folios/pages remain after unmount.

This patchset refactors the ceph_writepages_start() method and fixes the issues by:
(1) introducing a dirty_folios counter and a flush_end_wq wait queue in struct ceph_mds_client;
(2) incrementing the dirty_folios counter in ceph_dirty_folio();
(3) decrementing the dirty_folios counter in writepages_finish() and waking up all waiters on the queue if the counter is less than or equal to zero;
(4) adding logic to ceph_kill_sb() that checks the dirty_folios counter and waits while it is greater than zero;
(5) adding a ceph_inc_osd_stopping_blocker() call at the beginning of ceph_writepages_start() and a ceph_dec_osd_stopping_blocker() call at its end, to resolve the racy nature of mdsc->stopping_blockers.

sudo ./check generic/421
FSTYP -- ceph
PLATFORM -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb 3 20:30:08 UTC 2025
MKFS_OPTIONS -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch
generic/421 7s ... 4s
Ran: generic/421
Passed all 1 tests

* patches from https://lore.kernel.org/r/20250205000249.123054-1-slava@dubeyko.com:
ceph: fix generic/421 test failure
ceph: introduce ceph_submit_write() method
ceph: introduce ceph_process_folio_batch() method
ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring

Link: https://lore.kernel.org/r/20250205000249.123054-1-slava@dubeyko.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: fix generic/421 test failureViacheslav Dubeyko
The generic/421 test fails to finish because of the following issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are several issues here:
(1) ceph_kill_sb() doesn't wait for the flushing of all dirty folios/pages to end because of the racy nature of mdsc->stopping_blockers. As a result, mdsc->stopping becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) ceph_inc_osd_stopping_blocker(fsc->mdsc) fails to increment mdsc->stopping_blockers. As a result, already locked folios/pages are never unlocked and the logic tries to lock the same page a second time.
(3) The folio_batch of dirty pages found by filemap_get_folios_tag() is not processed properly. This is why some dirty pages are never processed and dirty folios/pages remain after unmount.

This patch fixes the issues by:
(1) introducing a dirty_folios counter and a flush_end_wq wait queue in struct ceph_mds_client;
(2) incrementing the dirty_folios counter in ceph_dirty_folio();
(3) decrementing the dirty_folios counter in writepages_finish() and waking up all waiters on the queue if the counter is less than or equal to zero;
(4) adding logic to ceph_kill_sb() that checks the dirty_folios counter and waits while it is greater than zero;
(5) adding a ceph_inc_osd_stopping_blocker() call at the beginning of ceph_writepages_start() and a ceph_dec_osd_stopping_blocker() call at its end, to resolve the racy nature of mdsc->stopping_blockers.

sudo ./check generic/421
FSTYP -- ceph
PLATFORM -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb 3 20:30:08 UTC 2025
MKFS_OPTIONS -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch
generic/421 7s ... 4s
Ran: generic/421
Passed all 1 tests

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Link: https://lore.kernel.org/r/20250205000249.123054-5-slava@dubeyko.com Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
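The counter protocol in steps (1) to (4) above can be sketched in userspace C. This is only a rough model under stated assumptions: `struct flush_tracker` and the `tracker_*` helpers are hypothetical names, C11 atomics stand in for whatever counter type the kernel patch uses, and the flush_end_wq wait queue is reduced to a boolean "wake the waiters" result rather than a real sleep/wake mechanism.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dirty_folios counter that the patch adds
 * to struct ceph_mds_client; the wait queue itself is not modelled. */
struct flush_tracker {
	atomic_long dirty_folios;	/* folios dirtied but not yet written back */
};

static void tracker_init(struct flush_tracker *t)
{
	atomic_init(&t->dirty_folios, 0);
}

/* Step (2): a folio became dirty, so the counter goes up. */
static void tracker_folio_dirtied(struct flush_tracker *t)
{
	atomic_fetch_add(&t->dirty_folios, 1);
}

/* Step (3): writeback of nr folios finished. Returns true when the counter
 * has dropped to zero or below, i.e. when waiters should be woken. */
static bool tracker_writeback_finished(struct flush_tracker *t, long nr)
{
	long left = atomic_fetch_sub(&t->dirty_folios, nr) - nr;

	return left <= 0;
}

/* Step (4): the unmount path must keep waiting while this returns true. */
static bool tracker_must_wait(struct flush_tracker *t)
{
	return atomic_load(&t->dirty_folios) > 0;
}
```

In this model an unmount path would loop on `tracker_must_wait()` and sleep until a `tracker_writeback_finished()` call reports that the last dirty folio has been written back.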
2025-02-28ceph: introduce ceph_submit_write() methodViacheslav Dubeyko
The final responsibility of ceph_writepages_start() is to submit write requests for processed dirty folios/pages. ceph_submit_write() gathers all of this logic into one method.

The generic/421 test fails to finish because of the following issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are two problems here:

    if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
        rc = -EIO;
        goto release_folios;
    }

(1) ceph_kill_sb() doesn't wait for the flushing of all dirty folios/pages to end because of the racy nature of mdsc->stopping_blockers. As a result, mdsc->stopping becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) ceph_inc_osd_stopping_blocker(fsc->mdsc) fails to increment mdsc->stopping_blockers. As a result, already locked folios/pages are never unlocked and the logic tries to lock the same page a second time.

This patch implements the ceph_submit_write() refactoring and also solves the second issue.
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Link: https://lore.kernel.org/r/20250205000249.123054-4-slava@dubeyko.com Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28ceph: introduce ceph_process_folio_batch() methodViacheslav Dubeyko
The first step of the ceph_writepages_start() logic is finding the dirty memory folios and processing them. This patch introduces the ceph_process_folio_batch() method, which moves this logic into a dedicated method.

ceph_writepages_start() has this logic:

    if (ceph_wbc.locked_pages == 0)
        lock_page(page); /* first page */
    else if (!trylock_page(page))
        break;

    <skipped>

    if (folio_test_writeback(folio) ||
        folio_test_private_2(folio) /* [DEPRECATED] */) {
        if (wbc->sync_mode == WB_SYNC_NONE) {
            doutc(cl, "%p under writeback\n", folio);
            folio_unlock(folio);
            continue;
        }
        doutc(cl, "waiting on writeback %p\n", folio);
        folio_wait_writeback(folio);
        folio_wait_private_2(folio); /* [DEPRECATED] */
    }

The problem is that the folio/page is locked first and is only marked as under writeback by set_page_writeback(page) later, before the write request is submitted. The folio/page is unlocked by writepages_finish() after the write request finishes. This means the folio_test_writeback() check and the folio_wait_writeback() call never take effect, because the page is locked and cannot be locked again until the write request completes. However, trylock_page() is used for the majority of folios/pages. As a result, multiple threads can try to lock the same folios/pages multiple times even if they are already under writeback, which makes this logic more compute intensive than necessary.

This patch changes this logic:

    if (folio_test_writeback(folio) ||
        folio_test_private_2(folio) /* [DEPRECATED] */) {
        if (wbc->sync_mode == WB_SYNC_NONE) {
            doutc(cl, "%p under writeback\n", folio);
            folio_unlock(folio);
            continue;
        }
        doutc(cl, "waiting on writeback %p\n", folio);
        folio_wait_writeback(folio);
        folio_wait_private_2(folio); /* [DEPRECATED] */
    }

    if (ceph_wbc.locked_pages == 0)
        lock_page(page); /* first page */
    else if (!trylock_page(page))
        break;

With this ordering, the writeback state of folios/pages should no longer be ignored.
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Link: https://lore.kernel.org/r/20250205000249.123054-3-slava@dubeyko.com Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
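The reordered check described in this commit message can be illustrated with a small decision function. It is a simplified sketch, not the kernel code: `struct folio_plan`, `plan_folio()` and `enum sync_mode` are hypothetical names, and the function only reports which steps would be taken instead of touching real folios or locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for WB_SYNC_NONE and WB_SYNC_ALL. */
enum sync_mode { SYNC_NONE, SYNC_ALL };

/* What the writeback loop would do with one folio. */
struct folio_plan {
	bool skip;		/* leave the folio for a later pass */
	bool wait_writeback;	/* wait for in-flight writeback before locking */
	bool blocking_lock;	/* true: blocking lock; false: trylock, end the batch on failure */
};

/* Check the writeback state first and decide about locking afterwards,
 * mirroring the order of the changed logic quoted in the commit message. */
static struct folio_plan plan_folio(bool under_writeback, enum sync_mode mode,
				    unsigned int locked_pages)
{
	struct folio_plan plan = { false, false, false };

	if (under_writeback) {
		if (mode == SYNC_NONE) {
			plan.skip = true;	/* do not lock a folio that is being written */
			return plan;
		}
		plan.wait_writeback = true;
	}

	/* Only the first folio of a batch uses a blocking lock. */
	plan.blocking_lock = (locked_pages == 0);
	return plan;
}
```

For example, `plan_folio(true, SYNC_NONE, 3)` reports that the folio is skipped without any lock attempt, which is the saving the commit message describes for folios already under writeback.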
2025-02-28ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoringViacheslav Dubeyko
The ceph_writepages_start() method is unreasonably large and its complex logic makes it hard to understand. The current state of the method's logic makes bug fixing a really hard task. This patch extends struct ceph_writeback_ctl with the goal of making the ceph_writepages_start() method more compact and easier to understand by means of deep refactoring. Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Link: https://lore.kernel.org/r/20250205000249.123054-2-slava@dubeyko.com Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-28x86/cpu: Enable modifying CPU bug flags with '{clear,set}cpuid='Brendan Jackman
Sometimes it can be very useful to run CPU vulnerability mitigations on systems where they aren't known to mitigate any real-world vulnerabilities. This can be handy for mundane reasons like debugging HW-agnostic logic on whatever machine is to hand, but also for research reasons: while some mitigations are focused on individual vulns and uarches, others are fairly general, and it's strategically useful to have an idea how they'd perform on systems where they aren't currently needed. As evidence for this being useful, a flag specifically for Retbleed was added in: 5c9a92dec323 ("x86/bugs: Add retbleed=force"). Since CPU bugs are tracked using the same basic mechanism as features, and there are already parameters for manipulating them by hand, extend that mechanism to support bugs as well as capabilities. With this patch and setcpuid=srso, a QEMU guest running on an Intel host will boot with Safe-RET enabled. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-3-7dc71bce742a@google.com
2025-02-28x86/cpu: Add the 'setcpuid=' boot parameterBrendan Jackman
In preparation for adding support to inject fake CPU bugs at boot-time, add a general facility to force enablement of CPU flags. The flag taints the kernel and the documentation attempts to be clear that this is highly unsuitable for uses outside of kernel development and platform experimentation. The new arg is parsed just like clearcpuid, but instead of leading to setup_clear_cpu_cap() it leads to setup_force_cpu_cap(). I've tested this by booting a nested QEMU guest on an Intel host, which with setcpuid=svm will claim that it supports AMD virtualization. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-2-7dc71bce742a@google.com
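The shared parsing path for the two parameters can be pictured with a userspace sketch. Everything here is an assumption made for illustration: the argument is taken to be a comma-separated list of flag names, the `known_flags` table and its bit numbers are invented, and `apply_cpuid_arg()` is a hypothetical helper that sets or clears bits in a plain bitmask instead of calling setup_force_cpu_cap() or setup_clear_cpu_cap().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct flag_name {
	const char *name;
	unsigned int bit;
};

/* Hypothetical flag table; the bit numbers are illustrative only. */
static const struct flag_name known_flags[] = {
	{ "svm", 0 },
	{ "srso", 1 },
	{ "avx", 2 },
};

/* Return the bit for a flag name of the given length, or -1 if unknown. */
static int lookup_flag(const char *name, size_t len)
{
	for (size_t i = 0; i < sizeof(known_flags) / sizeof(known_flags[0]); i++) {
		if (strlen(known_flags[i].name) == len &&
		    strncmp(known_flags[i].name, name, len) == 0)
			return (int)known_flags[i].bit;
	}
	return -1;
}

/* Walk a comma-separated list and set (set == true) or clear each named
 * bit in *caps. Returns the number of names that were not recognised. */
static int apply_cpuid_arg(const char *arg, bool set, unsigned long *caps)
{
	int unknown = 0;

	while (*arg) {
		size_t len = strcspn(arg, ",");
		int bit = lookup_flag(arg, len);

		if (bit < 0)
			unknown++;
		else if (set)
			*caps |= 1UL << bit;
		else
			*caps &= ~(1UL << bit);

		arg += len;
		if (*arg == ',')
			arg++;	/* step over the separator */
	}
	return unknown;
}
```

Using one routine with a `set` switch reflects the commit's description that the new argument is parsed just like clearcpuid and differs only in which operation it leads to.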
2025-02-28x86/cpu: Create helper function to parse the 'clearcpuid=' boot parameterBrendan Jackman
This is in preparation for a later commit that will reuse this code, to make review convenient. Factor out a helper function which does the full handling for this arg including printing info to the console. No functional change intended. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-1-7dc71bce742a@google.com
2025-02-28ALSA: hda: Fix speakers on ASUS EXPERTBOOK P5405CSA 1.0Daniel Bárta
After some digging around I found that this laptop has Cirrus's smart amplifiers connected to the SPI bus (spi1-CSC3551:00-cs35l41-hda). To get them correctly detected and working I had to modify patch_realtek.c with the ASUS EXPERTBOOK P5405CSA 1.0 SystemID (0x1043, 0x1f63) and add the corresponding hda_quirk (ALC245_FIXUP_CS35L41_SPI_2). Signed-off-by: Daniel Bárta <daniel.barta@trustlab.cz> Link: https://patch.msgid.link/20250227161256.18061-2-daniel.barta@trustlab.cz Signed-off-by: Takashi Iwai <tiwai@suse.de>
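The idea of matching a SystemID against a quirk table can be shown with a small self-contained model. This is a sketch, not the contents of patch_realtek.c: `struct quirk`, `enum fixup_id`, `quirk_table` and `find_fixup()` are hypothetical names, and only the (0x1043, 0x1f63) pair comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixup identifiers; FIXUP_CS35L41_SPI_2 stands in for the
 * ALC245_FIXUP_CS35L41_SPI_2 quirk named in the commit message. */
enum fixup_id { FIXUP_NONE = 0, FIXUP_CS35L41_SPI_2 };

struct quirk {
	uint16_t subvendor;	/* first half of the SystemID */
	uint16_t subdevice;	/* second half of the SystemID */
	const char *name;
	enum fixup_id fixup;
};

/* One-entry table modelled on the ID pair quoted in the commit message. */
static const struct quirk quirk_table[] = {
	{ 0x1043, 0x1f63, "ASUS EXPERTBOOK P5405CSA 1.0", FIXUP_CS35L41_SPI_2 },
};

/* Return the fixup registered for a SystemID, or FIXUP_NONE if absent. */
static enum fixup_id find_fixup(uint16_t subvendor, uint16_t subdevice)
{
	for (size_t i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
		if (quirk_table[i].subvendor == subvendor &&
		    quirk_table[i].subdevice == subdevice)
			return quirk_table[i].fixup;
	}
	return FIXUP_NONE;
}
```

A machine whose SystemID is missing from the table gets `FIXUP_NONE`, which in this model corresponds to the amplifiers not being set up before the quirk entry was added.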
2025-02-28ALSA: hda/realtek: Fix Asus Z13 2025 audioAntheas Kapenekakis
Use the basic quirk for this type of amplifier. Sound now works in speakers, headphones, and microphone, whereas none worked before. Tested-by: Kyle Gospodnetich <me@kylegospodneti.ch> Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev> Link: https://patch.msgid.link/20250227175107.33432-3-lkml@antheas.dev Signed-off-by: Takashi Iwai <tiwai@suse.de>