summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-11-07arm64: smccc: Remove broken support for SMCCCv1.3 SVE discard hintMark Rutland
SMCCCv1.3 added a hint bit which callers can set in an SMCCC function ID (AKA "FID") to indicate that it is acceptable for the SMCCC implementation to discard SVE and/or SME state over a specific SMCCC call. The kernel support for using this hint is broken and SMCCC calls may clobber the SVE and/or SME state of arbitrary tasks, though FPSIMD state is unaffected. The kernel support is intended to use the hint when there is no SVE or SME state to save, and to do this it checks whether TIF_FOREIGN_FPSTATE is set or TIF_SVE is clear in assembly code: | ldr <flags>, [<current_task>, #TSK_TI_FLAGS] | tbnz <flags>, #TIF_FOREIGN_FPSTATE, 1f // Any live FP state? | tbnz <flags>, #TIF_SVE, 2f // Does that state include SVE? | | 1: orr <fid>, <fid>, ARM_SMCCC_1_3_SVE_HINT | 2: | << SMCCC call using FID >> This is not safe as-is: (1) SMCCC calls can be made in a preemptible context and preemption can result in TIF_FOREIGN_FPSTATE being set or cleared at arbitrary points in time. Thus checking for TIF_FOREIGN_FPSTATE provides no guarantee. (2) TIF_FOREIGN_FPSTATE only indicates that the live FP/SVE/SME state in the CPU does not belong to the current task, and does not indicate that clobbering this state is acceptable. When the live CPU state is clobbered it is necessary to update fpsimd_last_state.st to ensure that a subsequent context switch will reload FP/SVE/SME state from memory rather than consuming the clobbered state. This and the SMCCC call itself must happen in a critical section with preemption disabled to avoid races. (3) Live SVE/SME state can exist with TIF_SVE clear (e.g. with only TIF_SME set), and checking TIF_SVE alone is insufficient. Remove the broken support for the SMCCCv1.3 SVE saving hint. This is effectively a revert of commits: * cfa7ff959a78 ("arm64: smccc: Support SMCCC v1.3 SVE register saving hint") * a7c3acca5380 ("arm64: smccc: Save lr before calling __arm_smccc_sve_check()") ... leaving behind the ARM_SMCCC_VERSION_1_3 and ARM_SMCCC_1_3_SVE_HINT definitions, since these are simply definitions from the SMCCC specification, and the latter is used in KVM via ARM_SMCCC_CALL_HINTS. If we want to bring this back in future, we'll probably want to handle this logic in C where we can use all the usual FPSIMD/SVE/SME helper functions, and that'll likely require some rework of the SMCCC code and/or its callers. Fixes: cfa7ff959a78 ("arm64: smccc: Support SMCCC v1.3 SVE register saving hint") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241106160448.2712997-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-11-07x86/sev: Cleanup vc_handle_msr()Borislav Petkov (AMD)
Carve out the MSR_SVSM_CAA into a helper with the suggestion that upcoming future users should do the same. Rename that silly exit_info_1 into what it actually means in this function - whether the MSR access is a read or a write. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20241106172647.GAZyum1zngPDwyD2IJ@fat_crate.local
2024-11-07s390/pci: Add header guards and includes to internal headersNiklas Schnelle
While pci_iov.h has include guards it doesn't include the necessary header for struct zpci_dev, pci_bus.h on the other hand lacks even basic include guards. This isn't only fragile and breaks convention but also confuses tooling such as clang-analyzer. Add both include guards and the necessary includes. Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07s390/uvdevice: Fix and slightly improve kernel-doc commentHeiko Carstens
Fix incorrect kernel-doc comment style, add missing return statement, fix incorrect parameter name, and add some additional consistency across all kernel-doc comments. Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07s390/uvdevice: Support longer secret listsSteffen Eiden
Enable the list IOCTL to provide lists longer than one page (85 entries). The list IOCTL now accepts any argument length in page granularity. It fills the argument up to this length with entries until the list ends. User space unaware of this enhancement will still receive one page of data and an uv_rc 0x0100. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20241104153609.1361388-1-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241104153609.1361388-1-seiden@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07s390/sparsemem: Provide phys_to_target_node() with CONFIG_NUMAHeiko Carstens
Add a trival phys_to_target_node() implementation which always returns 0 if CONFIG_NUMA is enabled, since the s390 NUMA implementation only supports node 0. This is similar to memory_add_physaddr_to_nid() in order to avoid runtime warnings. Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07s390/configs: Enable CONFIG_VIRTIO_MEMHeiko Carstens
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07Merge branch 'virtio-mem' into featuresHeiko Carstens
David Hildenbrand says: ==================== virtio-mem: s390 support Let's finally add s390 support for virtio-mem; my last RFC was sent 4 years ago, and a lot changed in the meantime. The latest QEMU series is available at [1], which contains some more details and a usage example on s390 (last patch). There is not too much in here: The biggest part is querying a new diag(500) STORAGE_LIMIT hypercall to obtain the proper "max_physmem_end". The last three patches are not strictly required but certainly nice-to-have. Note that -- in contrast to standby memory -- virtio-mem memory must be configured to be automatically onlined as soon as hotplugged. The easiest approach is using the "memhp_default_state=" kernel parameter or by using proper udev rules. More details can be found at [2]. I have reviving+upstreaming a systemd service to handle configuring that on my todo list, but for some reason I keep getting distracted ... I tested various things, including: * Various memory hotplug/hotunplug combinations * Device hotplug/hotunplug * /proc/iomem output * reboot * kexec * kdump: make sure we properly enter the "kdump mode" in the virtio-mem driver kdump support for virtio-mem memory on s390 will be sent out separately. v2 -> v3 * "s390/kdump: make is_kdump_kernel() consistently return "true" in kdump environments only" -> Sent out separately [3] * "s390/physmem_info: query diag500(STORAGE LIMIT) to support QEMU/KVM memory devices" -> No query function for diag500 for now. -> Update comment above setup_ident_map_size(). -> Optimize/rewrite diag500_storage_limit() [Heiko] -> Change handling in detect_physmem_online_ranges [Alexander] -> Improve documentation. * "s390/sparsemem: provide memory_add_physaddr_to_nid() with CONFIG_NUMA" -> Added after testing on systems with CONFIG_NUMA=y v1 -> v2: * Document the new diag500 subfunction * Use "s390" instead of "s390x" consistently [1] https://lkml.kernel.org/r/20241008105455.2302628-1-david@redhat.com [2] https://virtio-mem.gitlab.io/user-guide/user-guide-linux.html [3] https://lkml.kernel.org/r/20241023090651.1115507-1-david@redhat.com ==================== Link: https://lore.kernel.org/r/20241025141453.1210600-1-david@redhat.com/ Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07s390/sparsemem: Provide memory_add_physaddr_to_nid() with CONFIG_NUMADavid Hildenbrand
virtio-mem uses memory_add_physaddr_to_nid() to determine the NID to use for memory it adds. We currently fallback to the dummy implementation in mm/numa.c with CONFIG_NUMA, which will end up triggering an undesired pr_info_once(): Unknown online node for memory at 0x100000000, assuming node 0 On s390, we map all cpus and memory to node 0, so let's add a simple memory_add_physaddr_to_nid() implementation that does exactly that, but without complaining. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-8-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07s390/sparsemem: Reduce section size to 128 MiBDavid Hildenbrand
Ever since commit 421c175c4d609 ("[S390] Add support for memory hot-add.") we've been using a section size of 256 MiB on s390 and 32 MiB on s390. Before that, we were using a section size of 32 MiB on both architectures. Now that we have a new mechanism to expose additional memory to a VM -- virtio-mem -- reduce the section size to 128 MiB to allow for more flexibility and reduce the metadata overhead when dealing with hot(un)plug granularity smaller than 256 MiB. 128 MiB has been used by x86-64 since the very beginning. arm64 with 4k base pages switched to 128 MiB as well: it's just big enough on these architectures to allows for using a huge page (2 MiB) in the vmemmap in sane setups with sizeof(struct page) == 64 bytes and a huge page mapping in the direct mapping, while still allowing for small hot(un)plug granularity. For s390, we could even switch to a 64 MiB section size, as our huge page size is 1 MiB: but the smaller the section size, the more sections we'll have to manage especially on bigger machines. Making it consistent with x86-64 and arm64 feels like the right thing for now. Note that the smallest memory hot(un)plug granularity is also limited by the memory block size, determined by extracting the memory increment size from SCLP. Under QEMU/KVM, implementing virtio-mem, we expose 0; therefore, we'll end up with a memory block size of 128 MiB with a 128 MiB section size. Tested-by: Mario Casquero <mcasquer@redhat.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-7-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07lib/Kconfig.debug: Default STRICT_DEVMEM to "y" on s390David Hildenbrand
virtio-mem currently depends on !DEVMEM | STRICT_DEVMEM. Let's default STRICT_DEVMEM to "y" just like we do for arm64 and x86. There could be ways in the future to filter access to virtio-mem device memory even without STRICT_DEVMEM, but for now let's just keep it simple. Tested-by: Mario Casquero <mcasquer@redhat.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-6-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07virtio-mem: s390 supportDavid Hildenbrand
Now that s390 code is prepared for memory devices that reside above the maximum storage increment exposed through SCLP, everything is in place to unlock virtio-mem support. As virtio-mem in Linux currently supports logically onlining/offlining memory in pageblock granularity, we have an effective hot(un)plug granularity of 1 MiB on s390. As virito-mem adds/removes individual Linux memory blocks (256MB), we will currently never use gigantic pages in the identity mapping. It is worth noting that neither storage keys nor storage attributes (e.g., data / nodat) are touched when onlining memory blocks, which is good because we are not supposed to touch these parts for unplugged device blocks that are logically offline in Linux. We will currently never initialize storage keys for virtio-mem memory -- IOW, storage_key_init_range() is never called. It could be added in the future when plugging device blocks. But as that function essentially does nothing without modifying the code (changing PAGE_DEFAULT_ACC), that's just fine for now. kexec should work as intended and just like on other architectures that support virtio-mem: we will never place kexec binaries on virtio-mem memory, and never indicate virtio-mem memory to the 2nd kernel. The device driver in the 2nd kernel can simply reset the device -- turning all memory unplugged, to then start plugging memory and adding them to Linux, without causing trouble because the memory is already used elsewhere. The special s390 kdump mode, whereby the 2nd kernel creates the ELF core header, won't currently dump virtio-mem memory. The virtio-mem driver has a special kdump mode, from where we can detect memory ranges to dump. Based on this, support for dumping virtio-mem memory can be added in the future fairly easily. Acked-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-5-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07s390/physmem_info: Query diag500(STORAGE LIMIT) to support QEMU/KVM memory ↵David Hildenbrand
devices To support memory devices under QEMU/KVM, such as virtio-mem, we have to prepare our kernel virtual address space accordingly and have to know the highest possible physical memory address we might see later: the storage limit. The good old SCLP interface is not suitable for this use case. In particular, memory owned by memory devices has no relationship to storage increments, it is always detected using the device driver, and unaware OSes (no driver) must never try making use of that memory. Consequently this memory is located outside of the "maximum storage increment"-indicated memory range. Let's use our new diag500 STORAGE_LIMIT subcode to query this storage limit that can exceed the "maximum storage increment", and use the existing interfaces (i.e., SCLP) to obtain information about the initial memory that is not owned+managed by memory devices. If a hypervisor does not support such memory devices, the address exposed through diag500 STORAGE_LIMIT will correspond to the maximum storage increment exposed through SCLP. To teach kdump on s390 to include memory owned by memory devices, there will be ways to query the relevant memory ranges from the device via a driver running in special kdump mode (like virtio-mem already implements to filter /proc/vmcore access so we don't end up reading from unplugged device blocks). Update setup_ident_map_size(), to clarify that there can be more than just online and standby memory. Tested-by: Mario Casquero <mcasquer@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-4-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07Documentation: s390-diag.rst: Document diag500(STORAGE LIMIT) subfunctionDavid Hildenbrand
Let's document our new diag500 subfunction that can be implemented by userspace. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-3-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07Documentation: s390-diag.rst: Make diag500 a generic KVM hypercallDavid Hildenbrand
Let's make it a generic KVM hypercall, allowing other subfunctions to be more independent of virtio. While at it, document that unsupported/unimplemented subfunctions result in a SPECIFICATION exception. This is a preparation for documenting a new subfunction. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-2-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07s390/kvm: Mask extra bits from program interrupt codeClaudio Imbrenda
The program interrupt code has some extra bits that are sometimes set by hardware for various reasons; those bits should be ignored when the program interrupt number is needed for interrupt handling. Fixes: 05066cafa925 ("s390/mm/fault: Handle guest-related program interrupts in KVM") Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241031120316.25462-1-imbrenda@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07net: stmmac: Fix unbalanced IRQ wake disable warning on single irq caseNícolas F. R. A. Prado
Commit a23aa0404218 ("net: stmmac: ethtool: Fixed calltrace caused by unbalanced disable_irq_wake calls") introduced checks to prevent unbalanced enable and disable IRQ wake calls. However it only initialized the auxiliary variable on one of the paths, stmmac_request_irq_multi_msi(), missing the other, stmmac_request_irq_single(). Add the same initialization on stmmac_request_irq_single() to prevent "Unbalanced IRQ <x> wake disable" warnings from being printed the first time disable_irq_wake() is called on platforms that run on that code path. Fixes: a23aa0404218 ("net: stmmac: ethtool: Fixed calltrace caused by unbalanced disable_irq_wake calls") Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241101-stmmac-unbalanced-wake-single-fix-v1-1-5952524c97f0@collabora.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-06net: vertexcom: mse102x: Fix possible double free of TX skbStefan Wahren
The scope of the TX skb is wider than just mse102x_tx_frame_spi(), so in case the TX skb room needs to be expanded, we should free the the temporary skb instead of the original skb. Otherwise the original TX skb pointer would be freed again in mse102x_tx_work(), which leads to crashes: Internal error: Oops: 0000000096000004 [#2] PREEMPT SMP CPU: 0 PID: 712 Comm: kworker/0:1 Tainted: G D 6.6.23 Hardware name: chargebyte Charge SOM DC-ONE (DT) Workqueue: events mse102x_tx_work [mse102x] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : skb_release_data+0xb8/0x1d8 lr : skb_release_data+0x1ac/0x1d8 sp : ffff8000819a3cc0 x29: ffff8000819a3cc0 x28: ffff0000046daa60 x27: ffff0000057f2dc0 x26: ffff000005386c00 x25: 0000000000000002 x24: 00000000ffffffff x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000057f2e50 x20: 0000000000000006 x19: 0000000000000000 x18: ffff00003fdacfcc x17: e69ad452d0c49def x16: 84a005feff870102 x15: 0000000000000000 x14: 000000000000024a x13: 0000000000000002 x12: 0000000000000000 x11: 0000000000000400 x10: 0000000000000930 x9 : ffff00003fd913e8 x8 : fffffc00001bc008 x7 : 0000000000000000 x6 : 0000000000000008 x5 : ffff00003fd91340 x4 : 0000000000000000 x3 : 0000000000000009 x2 : 00000000fffffffe x1 : 0000000000000000 x0 : 0000000000000000 Call trace: skb_release_data+0xb8/0x1d8 kfree_skb_reason+0x48/0xb0 mse102x_tx_work+0x164/0x35c [mse102x] process_one_work+0x138/0x260 worker_thread+0x32c/0x438 kthread+0x118/0x11c ret_from_fork+0x10/0x20 Code: aa1303e0 97fffab6 72001c1f 54000141 (f9400660) Cc: stable@vger.kernel.org Fixes: 2f207cbf0dd4 ("net: vertexcom: Add MSE102x SPI support") Signed-off-by: Stefan Wahren <wahrenst@gmx.net> Link: https://patch.msgid.link/20241105163101.33216-1-wahrenst@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07btrfs: fix the length of reserved qgroup to freeHaisu Wang
The dealloc flag may be cleared and the extent won't reach the disk in cow_file_range when errors path. The reserved qgroup space is freed in commit 30479f31d44d ("btrfs: fix qgroup reserve leaks in cow_file_range"). However, the length of untouched region to free needs to be adjusted with the correct remaining region size. Fixes: 30479f31d44d ("btrfs: fix qgroup reserve leaks in cow_file_range") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Haisu Wang <haisuwang@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07btrfs: reinitialize delayed ref list after deleting it from the listFilipe Manana
At insert_delayed_ref() if we need to update the action of an existing ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's ref_add_list using list_del(), which leaves the ref's add_list member not reinitialized, as list_del() sets the next and prev members of the list to LIST_POISON1 and LIST_POISON2, respectively. If later we end up calling drop_delayed_ref() against the ref, which can happen during merging or when destroying delayed refs due to a transaction abort, we can trigger a crash since at drop_delayed_ref() we call list_empty() against the ref's add_list, which returns false since the list was not reinitialized after the list_del() and as a consequence we call list_del() again at drop_delayed_ref(). This results in an invalid list access since the next and prev members are set to poison pointers, resulting in a splat if CONFIG_LIST_HARDENED and CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences otherwise. So fix this by deleting from the list with list_del_init() instead. Fixes: 1d57ee941692 ("btrfs: improve delayed refs iterations") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07btrfs: fix per-subvolume RO/RW flags with new mount APIQu Wenruo
[BUG] With util-linux 2.40.2, the 'mount' utility is already utilizing the new mount API. e.g: # strace mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/ ... fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0 fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(3, FSMOUNT_CLOEXEC, 0) = 4 mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0 move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0 But this leads to a new problem, that per-subvolume RO/RW mount no longer works, if the initial mount is RO: # mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test # mount -o rw,subvol=subv2 /dev/test/scratch1 /mnt/scratch # mount | grep mnt /dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1) /dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2) # touch /mnt/scratch/foobar touch: cannot touch '/mnt/scratch/foobar': Read-only file system This is a common use cases on distros. [CAUSE] We have a workaround for remount to handle the RO->RW change, but if the mount is using the new mount API, we do not do that, and rely on the mount tool NOT to set the ro flag. But that's not how the mount tool is doing for the new API: fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0 fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 <<<< Setting RO flag for super block fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(3, FSMOUNT_CLOEXEC, 0) = 4 mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0 move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0 This means we will set the super block RO at the first mount. Later RW mount will not try to reconfigure the fs to RW because the mount tool is already using the new API. This totally breaks the per-subvolume RO/RW mount behavior. [FIX] Do not skip the reconfiguration even if using the new API. The old comments are just expecting any mount tool to properly skip the RO flag set even if we specify "ro", which is not the reality. Update the comments regarding the backward compatibility on the kernel level so it works with old and new mount utilities. CC: stable@vger.kernel.org # 6.8+ Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07irqchip/gic-v3: Force propagation of the active state with a read-backMarc Zyngier
Christoffer reports that on some implementations, writing to GICR_ISACTIVER0 (and similar GICD registers) can race badly with a guest issuing a deactivation of that interrupt via the system register interface. There are multiple reasons to this: - this uses an early write-acknoledgement memory type (nGnRE), meaning that the write may only have made it as far as some interconnect by the time the store is considered "done" - the GIC itself is allowed to buffer the write until it decides to take it into account (as long as it is in finite time) The effects are that the activation may not have taken effect by the time the kernel enters the guest, forcing an immediate exit, or that a guest deactivation occurs before the interrupt is active, doing nothing. In order to guarantee that the write to the ISACTIVER register has taken effect, read back from it, forcing the interconnect to propagate the write, and the GIC to process the write before returning the read. Reported-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Christoffer Dall <christoffer.dall@arm.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241106084418.3794612-1-maz@kernel.org
2024-11-06Merge tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "These are mostly fixes that came up during the nfs bakeathon the other week. Stable Fixes: - Fix KMSAN warning in decode_getfattr_attrs() Other Bugfixes: - Handle -ENOTCONN in xs_tcp_setup_socked() - NFSv3: only use NFS timeout for MOUNT when protocols are compatible - Fix attribute delegation behavior on exclusive create and a/mtime changes - Fix localio to cope with racing nfs_local_probe() - Avoid i_lock contention in fs_clear_invalid_mapping()" * tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: nfs: avoid i_lock contention in nfs_clear_invalid_mapping nfs_common: fix localio to cope with racing nfs_local_probe() NFS: Further fixes to attribute delegation a/mtime changes NFS: Fix attribute delegation behaviour on exclusive create nfs: Fix KMSAN warning in decode_getfattr_attrs() NFSv3: only use NFS timeout for MOUNT when protocols are compatible sunrpc: handle -ENOTCONN in xs_tcp_setup_socket()
2024-11-06media: dvbdev: fix the logic when DVB_DYNAMIC_MINORS is not setMauro Carvalho Chehab
When CONFIG_DVB_DYNAMIC_MINORS, ret is not initialized, and a semaphore is left at the wrong state, in case of errors. Make the code simpler and avoid mistakes by having just one error check logic used weather DVB_DYNAMIC_MINORS is used or not. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202410201717.ULWWdJv8-lkp@intel.com/ Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/9e067488d8935b8cf00959764a1fa5de85d65725.1730926254.git.mchehab+huawei@kernel.org
2024-11-06io_uring: avoid normal tw intermediate fallbackPavel Begunkov
When a DEFER_TASKRUN io_uring is terminating it requeues deferred task work items as normal tw, which can further fallback to kthread execution. Avoid this extra step and always push them to the fallback kthread. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d1cd472cec2230c66bd1c8d412a5833f0af75384.1730772720.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: add static napi tracking strategyOlivier Langlois
Add the static napi tracking strategy. That allows the user to manually manage the napi ids list for busy polling, and eliminate the overhead of dynamically updating the list from the fast path. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/96943de14968c35a5c599352259ad98f3c0770ba.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: clean up __io_napi_do_busy_loopOlivier Langlois
__io_napi_do_busy_loop now requires to have loop_end in its parameters. This makes the code cleaner and also has the benefit of removing a branch since the only caller not passing NULL for loop_end_arg is also setting the value conditionally. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/d5b9bb91b1a08fff50525e1c18d7b4709b9ca100.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: Use lock guardsOlivier Langlois
Convert napi locks to use the shiny new Scope-Based Resource Management machinery. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/2680ca47ee183cfdb89d1a40c84d349edeb620ab.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: improve __io_napi_addOlivier Langlois
1. move the sock->sk pointer validity test outside the function to avoid the function call overhead and to make the function more more reusable 2. change its name to __io_napi_add_id to be more precise about it is doing 3. return an error code to report errors Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/d637fa3b437d753c0f4e44ff6a7b5bf2c2611270.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: fix io_napi_entry RCU accessesOlivier Langlois
correct 3 RCU structures modifications that were not using the RCU functions to make their update. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/9f53b5169afa8c7bf3665a0b19dc2f7061173530.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/napi: protect concurrent io_napi_entry timeout accessesOlivier Langlois
io_napi_entry timeout value can be updated while accessed from the poll functions. Its concurrent accesses are wrapped with READ_ONCE()/WRITE_ONCE() macros to avoid incorrect compiler optimizations. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/3de3087563cf98f75266fd9f85fdba063a8720db.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: prevent speculating sq_array indexingPavel Begunkov
The SQ index array consists of user provided indexes, which io_uring then uses to index the SQ, and so it's susceptible to speculation. For all other queues io_uring tracks heads and tails in kernel, and they shouldn't need any special care. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c6c7a25962924a55869e317e4fdb682dfdc6b279.1730687889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: move struct io_kiocb from task_struct to io_uring_taskJens Axboe
Rather than store the task_struct itself in struct io_kiocb, store the io_uring specific task_struct. The life times are the same in terms of io_uring, and this avoids doing some dereferences through the task_struct. For the hot path of putting local task references, we can deref req->tctx instead, which we'll need anyway in that function regardless of whether it's local or remote references. This is mostly straight forward, except the original task PF_EXITING check needs a bit of tweaking. task_work is _always_ run from the originating task, except in the fallback case, where it's run from a kernel thread. Replace the potentially racy (in case of fallback work) checks for req->task->flags with current->flags. It's either the still the original task, in which case PF_EXITING will be sane, or it has PF_KTHREAD set, in which case it's fallback work. Both cases should prevent moving forward with the given request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: remove task ref helpersJens Axboe
They are only used right where they are defined, just open-code them inside io_put_task(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring: move cancelations to be io_uring_task basedJens Axboe
Right now the task_struct pointer is used as the key to match a task, but in preparation for some io_kiocb changes, move it to using struct io_uring_task instead. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/rsrc: split io_kiocb node type assignmentsJens Axboe
Currently the io_rsrc_node assignment in io_kiocb is an array of two pointers, as two nodes may be assigned to a request - one file node, and one buffer node. However, the buffer node can co-exist with the provided buffers, as currently it's not supported to use both provided and registered buffers at the same time. This crucially brings struct io_kiocb down to 4 cache lines again, as before it spilled into the 5th cacheline. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06io_uring/rsrc: encode node type and ctx togetherJens Axboe
Rather than keep the type field separate rom ctx, use the fact that we can encode up to 4 types of nodes in the LSB of the ctx pointer. Doesn't reclaim any space right now on 64-bit archs, but it leaves a full int for future use. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06ASoC: SOF: amd: Fix for incorrect DMA ch status register offsetVenkata Prasad Potturu
DMA ch status register offset change in acp7.0 platform Incorrect DMA channel status register offset check lead to firmware boot failure. [ 14.432497] snd_sof_amd_acp70 0000:c4:00.5: ------------[ DSP dump start ]------------ [ 14.432533] snd_sof_amd_acp70 0000:c4:00.5: Firmware boot failure due to timeout [ 14.432549] snd_sof_amd_acp70 0000:c4:00.5: fw_state: SOF_FW_BOOT_IN_PROGRESS (3) [ 14.432610] snd_sof_amd_acp70 0000:c4:00.5: invalid header size 0x71c41000. FW oops is bogus [ 14.432626] snd_sof_amd_acp70 0000:c4:00.5: unexpected fault 0x71c40000 trace 0x71c40000 [ 14.432642] snd_sof_amd_acp70 0000:c4:00.5: ------------[ DSP dump end ]------------ [ 14.432657] snd_sof_amd_acp70 0000:c4:00.5: error: failed to boot DSP firmware -5 [ 14.432672] snd_sof_amd_acp70 0000:c4:00.5: fw_state change: 3 -> 4 [ 14.433260] dmic-codec dmic-codec: ASoC: Unregistered DAI 'dmic-hifi' [ 14.433319] snd_sof_amd_acp70 0000:c4:00.5: fw_state change: 4 -> 0 [ 14.433358] snd_sof_amd_acp70 0000:c4:00.5: error: sof_probe_work failed err: -5 Update correct register offset for DMA ch status register. Fixes: 490be7ba2a01 ("ASoC: SOF: amd: add support for acp7.0 based platform") Signed-off-by: Venkata Prasad Potturu <venkataprasad.potturu@amd.com> Link: https://patch.msgid.link/20241106142658.1240929-1-venkataprasad.potturu@amd.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-11-06ACPI: processor: Move arch_init_invariance_cppc() call laterMario Limonciello
arch_init_invariance_cppc() is called at the end of acpi_cppc_processor_probe() in order to configure frequency invariance based upon the values from _CPC. This however doesn't work on AMD CPPC shared memory designs that have AMD preferred cores enabled because _CPC needs to be analyzed from all cores to judge if preferred cores are enabled. This issue manifests to users as a warning since commit 21fb59ab4b97 ("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"): ``` Could not retrieve highest performance (-19) ``` However the warning isn't the cause of this, it was actually commit 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") which exposed the issue. To fix this problem, change arch_init_invariance_cppc() into a new weak symbol that is called at the end of acpi_processor_driver_init(). Each architecture that supports it can declare the symbol to override the weak one. Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the architectures using the generic arch_topology.c code. Fixes: 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org [ rjw: Changelog edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-11-06Merge tag 'keys-next-6.12-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull keys fixes from Jarkko Sakkinen: "A couple of fixes for keys and trusted keys" * tag 'keys-next-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: KEYS: trusted: dcp: fix NULL dereference in AEAD crypto operation security/keys: fix slab-out-of-bounds in key_task_permission
2024-11-06bpf: Add sk_is_inet and IS_ICSK check in tls_sw_has_ctx_tx/rxZijian Zhang
As the introduction of the support for vsock and unix sockets in sockmap, tls_sw_has_ctx_tx/rx cannot presume the socket passed in must be IS_ICSK. vsock and af_unix sockets have vsock_sock and unix_sock instead of inet_connection_sock. For these sockets, tls_get_ctx may return an invalid pointer and cause page fault in function tls_sw_ctx_rx. BUG: unable to handle page fault for address: 0000000000040030 Workqueue: vsock-loopback vsock_loopback_work RIP: 0010:sk_psock_strp_data_ready+0x23/0x60 Call Trace: ? __die+0x81/0xc3 ? no_context+0x194/0x350 ? do_page_fault+0x30/0x110 ? async_page_fault+0x3e/0x50 ? sk_psock_strp_data_ready+0x23/0x60 virtio_transport_recv_pkt+0x750/0x800 ? update_load_avg+0x7e/0x620 vsock_loopback_work+0xd0/0x100 process_one_work+0x1a7/0x360 worker_thread+0x30/0x390 ? create_worker+0x1a0/0x1a0 kthread+0x112/0x130 ? __kthread_cancel_work+0x40/0x40 ret_from_fork+0x1f/0x40 v2: - Add IS_ICSK check v3: - Update the commits in Fixes Fixes: 634f1a7110b4 ("vsock: support sockmap") Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20241106003742.399240-1-zijianzhang@bytedance.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-11-06Merge tag 'tracefs-v6.12-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracefs fixes from Steven Rostedt: "Fix tracefs mount options. Commit 78ff64081949 ("vfs: Convert tracefs to use the new mount API") broke the gid setting when set by fstab or other mount utility. It is ignored when it is set. Fix the code so that it recognises the option again and will honor the settings on mount at boot up. Update the internal documentation and create a selftest to make sure it doesn't break again in the future" * tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/selftests: Add tracefs mount options test tracing: Document tracefs gid mount option tracing: Fix tracefs mount options
2024-11-06Merge tag 'platform-drivers-x86-v6.12-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver fixes from Hans de Goede: - AMD PMF: Add new hardware id - AMD PMC: Fix crash when loaded with enable_stb=1 on devices without STB - Dell: Add Alienware hwid for Alienware systems with Dell WMI interface - thinkpad_acpi: Quirk to fix wrong fan speed readings on L480 - New hotkey mappings for Dell and Lenovo laptops * tag 'platform-drivers-x86-v6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: platform/x86: thinkpad_acpi: Fix for ThinkPad's with ECFW showing incorrect fan speed platform/x86: ideapad-laptop: add missing Ideapad Pro 5 fn keys platform/x86: dell-wmi-base: Handle META key Lock/Unlock events platform/x86: dell-smbios-base: Extends support to Alienware products platform/x86/amd/pmc: Detect when STB is not available platform/x86/amd/pmf: Add SMU metrics table support for 1Ah family 60h model
2024-11-06xattr: remove redundant check on variable errColin Ian King
Curretly in function generic_listxattr the for_each_xattr_handler loop checks err and will return out of the function if err is non-zero. It's impossible for err to be non-zero at the end of the function where err is checked again for a non-zero value. The final non-zero check is therefore redundant and can be removed. Also move the declaration of err into the loop. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06fs/xattr: add *at family syscallsChristian Göttsche
Add the four syscalls setxattrat(), getxattrat(), listxattrat() and removexattrat(). Those can be used to operate on extended attributes, especially security related ones, either relative to a pinned directory or on a file descriptor without read access, avoiding a /proc/<pid>/fd/<fd> detour, requiring a mounted procfs. One use case will be setfiles(8) setting SELinux file contexts ("security.selinux") without race conditions and without a file descriptor opened with read access requiring SELinux read permission. Use the do_{name}at() pattern from fs/open.c. Pass the value of the extended attribute, its length, and for setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added struct xattr_args to not exceed six syscall arguments and not merging the AT_* and XATTR_* flags. [AV: fixes by Christian Brauner folded in, the entire thing rebased on top of {filename,file}_...xattr() primitives, treatment of empty pathnames regularized. As the result, AT_EMPTY_PATH+NULL handling is cheap, so f...(2) can use it] Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Link: https://lore.kernel.org/r/20240426162042.191916-1-cgoettsche@seltendoof.de Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <brauner@kernel.org> CC: x86@kernel.org CC: linux-alpha@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: linux-arm-kernel@lists.infradead.org CC: linux-ia64@vger.kernel.org CC: linux-m68k@lists.linux-m68k.org CC: linux-mips@vger.kernel.org CC: linux-parisc@vger.kernel.org CC: linuxppc-dev@lists.ozlabs.org CC: linux-s390@vger.kernel.org CC: linux-sh@vger.kernel.org CC: sparclinux@vger.kernel.org CC: linux-fsdevel@vger.kernel.org CC: audit@vger.kernel.org CC: linux-arch@vger.kernel.org CC: linux-api@vger.kernel.org CC: linux-security-module@vger.kernel.org CC: selinux@vger.kernel.org [brauner: slight tweaks] Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06new helpers: file_removexattr(), filename_removexattr()Al Viro
switch path_removexattrat() and fremovexattr(2) to those Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06new helpers: file_listxattr(), filename_listxattr()Al Viro
switch path_listxattr() and flistxattr(2) to those Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06replace do_getxattr() with saner helpers.Al Viro
similar to do_setxattr() in the previous commit... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06replace do_setxattr() with saner helpers.Al Viro
io_uring setxattr logics duplicates stuff from fs/xattr.c; provide saner helpers (filename_setxattr() and file_setxattr() resp.) and use them. NB: putname(ERR_PTR()) is a no-op Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06new helper: import_xattr_name()Al Viro
common logics for marshalling xattr names. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>