summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-03-10PCI: cadence: Clear the ARI Capability Next Function Number of the last functionJasko-EXT Wojciech
Next Function Number field in ARI Capability Register for last function must be zero by default as per the PCIe specification, indicating there is no next higher number function but that's not happening in our case, so this patch clears the Next Function Number field for last function used. [kwilczynski: white spaces update for one define] Link: https://lore.kernel.org/linux-pci/20231202085015.3048516-1-s-vadapalli@ti.com Signed-off-by: Jasko-EXT Wojciech <wojciech.jasko-EXT@continental-corporation.com> Signed-off-by: Achal Verma <a-verma1@ti.com> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Vignesh Raghavendra <vigneshr@ti.com>
2024-03-10PCI: dwc: Strengthen the MSI address allocation logicAjay Agarwal
There can be platforms that do not use/have 32-bit DMA addresses. The current implementation of 32-bit IOVA allocation can fail for such platforms, eventually leading to the probe failure. Try to allocate a 32-bit msi_data. If this allocation fails, attempt a 64-bit address allocation. Please note that if the 64-bit MSI address is allocated, then the EPs supporting 32-bit MSI address only will not work. Link: https://lore.kernel.org/linux-pci/20240221153840.1789979-1-ajayagarwal@google.com Tested-by: Will McVicker <willmcvicker@google.com> Signed-off-by: Ajay Agarwal <ajayagarwal@google.com> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reviewed-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Will McVicker <willmcvicker@google.com>
2024-03-10PCI: brcmstb: Fix broken brcm_pcie_mdio_write() pollingJonathan Bell
The MDIO_WT_DONE() macro tests bit 31, which is always 0 (== done) as readw_poll_timeout_atomic() does a 16-bit read. Replace with the readl variant. [kwilczynski: commit log] Fixes: ca5dcc76314d ("PCI: brcmstb: Replace status loops with read_poll_timeout_atomic()") Link: https://lore.kernel.org/linux-pci/20240217133722.14391-1-wahrenst@gmx.net Signed-off-by: Jonathan Bell <jonathan@raspberrypi.com> Signed-off-by: Stefan Wahren <wahrenst@gmx.net> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
2024-03-10PCI: qcom: Add X1E80100 PCIe supportAbel Vesa
Add the compatible and the driver data for X1E80100 PCIe controller. There are 5 controller instances found on this platform, out of which 2 are Gen3 with speeds of up to 8.0GT/s, while the other 3 are Gen4 with speeds of up to 16GT/s. The version of the controller is 1.38.0 for all instances, but they are compatible with 1.9.0 config. The max link width is x8 for one controller, x4 for two of others and x2 for the two left. [kwilczynski: commit log] Link: https://lore.kernel.org/linux-pci/20240301-x1e80100-pci-v4-2-7ab7e281d647@linaro.org Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
2024-03-10dt-bindings: PCI: qcom: Document the X1E80100 PCIe ControllerAbel Vesa
Add dedicated schema for the PCIe controllers found on X1E80100. Link: https://lore.kernel.org/linux-pci/20240301-x1e80100-pci-v4-1-7ab7e281d647@linaro.org Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
2024-03-10PCI: qcom: Enable BDF to SID translation properlyManivannan Sadhasivam
Qcom SoCs making use of ARM SMMU require BDF to SID translation table in the driver to properly map the SID for the PCIe devices based on their BDF identifier. This is currently achieved with the help of qcom_pcie_config_sid_1_9_0() function for SoCs supporting the 1_9_0 config. But With newer Qcom SoCs starting from SM8450, BDF to SID translation is set to bypass mode by default in hardware. Due to this, the translation table that is set in the qcom_pcie_config_sid_1_9_0() is essentially unused and the default SID is used for all endpoints in SoCs starting from SM8450. This is a security concern and also warrants swapping the DeviceID in DT while using the GIC ITS to handle MSIs from endpoints. The swapping is currently done like below in DT when using GIC ITS: /* * MSIs for BDF (1:0.0) only works with Device ID 0x5980. * Hence, the IDs are swapped. */ msi-map = <0x0 &gic_its 0x5981 0x1>, <0x100 &gic_its 0x5980 0x1>; Here, swapping of the DeviceIDs ensure that the endpoint with BDF (1:0.0) gets the DeviceID 0x5980 which is associated with the default SID as per the iommu mapping in DT. So MSIs were delivered with IDs swapped so far. But this also means the Root Port (0:0.0) won't receive any MSIs (for PME, AER etc...) So let's fix these issues by clearing the BDF to SID bypass mode for all SoCs making use of the 1_9_0 config. This allows the PCIe devices to use the correct SID, thus avoiding the DeviceID swapping hack in DT and also achieving the isolation between devices. Fixes: 4c9398822106 ("PCI: qcom: Add support for configuring BDF to SID mapping for SM8250") Link: https://lore.kernel.org/linux-pci/20240307-pci-bdf-sid-fix-v1-1-9423a7e2d63c@linaro.org Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Cc: stable@vger.kernel.org # 5.11
2024-03-10tracing: Use .flush() call to wake up readersSteven Rostedt (Google)
The .release() function does not get called until all readers of a file descriptor are finished. If a thread is blocked on reading a file descriptor in ring_buffer_wait(), and another thread closes the file descriptor, it will not wake up the other thread as ring_buffer_wake_waiters() is called by .release(), and that will not get called until the .read() is finished. The issue originally showed up in trace-cmd, but the readers are actually other processes with their own file descriptors. So calling close() would wake up the other tasks because they are blocked on another descriptor then the one that was closed(). But there's other wake ups that solve that issue. When a thread is blocked on a read, it can still hang even when another thread closed its descriptor. This is what the .flush() callback is for. Have the .flush() wake up the readers. Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linke li <lilinke99@qq.com> Cc: Rabin Vincent <rabin@rab.in> Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-10ring-buffer: Fix resetting of shortest_fullSteven Rostedt (Google)
The "shortest_full" variable is used to keep track of the waiter that is waiting for the smallest amount on the ring buffer before being woken up. When a tasks waits on the ring buffer, it passes in a "full" value that is a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to 100% full buffer. As all waiters are on the same wait queue, the wake up happens for the waiter with the smallest percentage. The problem is that the smallest_full on the cpu_buffer that stores the smallest amount doesn't get reset when all the waiters are woken up. It does get reset when the ring buffer is reset (echo > /sys/kernel/tracing/trace). This means that tasks may be woken up more often then when they want to be. Instead, have the shortest_full field get reset just before waking up all the tasks. If the tasks wait again, they will update the shortest_full before sleeping. Also add locking around setting of shortest_full in the poll logic, and change "work" to "rbwork" to match the variable name for rb_irq_work structures that are used in other places. Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linke li <lilinke99@qq.com> Cc: Rabin Vincent <rabin@rab.in> Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-10Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not writable from userspace, so there would be no way to write to a read-only guest_memfd). - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely for development and testing. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD dirty logging test that caused false passes. x86 fixes: - Fix missing marking of a guest page as dirty when emulating an atomic access. - Check for mmu_notifier invalidation events before faulting in the pfn, and before acquiring mmu_lock, to avoid unnecessary work and lock contention with preemptible kernels (including CONFIG_PREEMPT_DYNAMIC in non-preemptible mode). - Disable AMD DebugSwap by default, it breaks VMSA signing and will be re-enabled with a better VM creation API in 6.10. - Do the cache flush of converted pages in svm_register_enc_region() before dropping kvm->lock, to avoid a race with unregistering of the same region and the consequent use-after-free issue" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: SEV: disable SEV-ES DebugSwap by default KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region() KVM: selftests: Add a testcase to verify GUEST_MEMFD and READONLY are exclusive KVM: selftests: Create GUEST_MEMFD for relevant invalid flags testcases KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY KVM: x86: Mark target gfn of emulated atomic instruction as dirty
2024-03-10ring-buffer: Fix waking up ring buffer readersSteven Rostedt (Google)
A task can wait on a ring buffer for when it fills up to a specific watermark. The writer will check the minimum watermark that waiters are waiting for and if the ring buffer is past that, it will wake up all the waiters. The waiters are in a wait loop, and will first check if a signal is pending and then check if the ring buffer is at the desired level where it should break out of the loop. If a file that uses a ring buffer closes, and there's threads waiting on the ring buffer, it needs to wake up those threads. To do this, a "wait_index" was used. Before entering the wait loop, the waiter will read the wait_index. On wakeup, it will check if the wait_index is different than when it entered the loop, and will exit the loop if it is. The waker will only need to update the wait_index before waking up the waiters. This had a couple of bugs. One trivial one and one broken by design. The trivial bug was that the waiter checked the wait_index after the schedule() call. It had to be checked between the prepare_to_wait() and the schedule() which it was not. The main bug is that the first check to set the default wait_index will always be outside the prepare_to_wait() and the schedule(). That's because the ring_buffer_wait() doesn't have enough context to know if it should break out of the loop. The loop itself is not needed, because all the callers to the ring_buffer_wait() also has their own loop, as the callers have a better sense of what the context is to decide whether to break out of the loop or not. Just have the ring_buffer_wait() block once, and if it gets woken up, exit the function and let the callers decide what to do next. Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linke li <lilinke99@qq.com> Cc: Rabin Vincent <rabin@rab.in> Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-10erofs: support compressed inodes over fscacheJingbo Xu
Since fscache can utilize iov_iter to write dest buffers, bio_vec can be used in this way too. To simplify this, pseudo bios are prepared and bio_vec will be filled with bio_add_page(). And a common .bi_end_io will be called directly to handle I/O completions. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240308094159.40547-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: make iov_iter describe target buffers over fscacheJingbo Xu
So far the fscache mode supports uncompressed data only, and the data read from fscache is put directly into the target page cache. As the support for compressed data in fscache mode is going to be introduced, rework the fscache internals so that the following compressed part could make the raw data read from fscache be directed to the target buffer it wants, decompress the raw data, and finally fill the page cache with the decompressed data. As the first step, a new structure, i.e. erofs_fscache_io (io), is introduced to describe a generic read request from the fscache, while the caller can specify the target buffer it wants in the iov_iter structure (io->iter). Besides, the caller can also specify its completion callback and private data through erofs_fscache_io, which will be called to make further handling, e.g. unlocking the page cache for uncompressed data or decompressing the read raw data, when the read request from the fscache completes. Now erofs_fscache_read_io_async() serves as a generic interface for reading raw data from fscache for both compressed and uncompressed data. The erofs_fscache_rq structure is kept to describe a request to fill the page cache in the specified range. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240308094159.40547-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: fix lockdep false positives on initializing erofs_pseudo_mntBaokun Li
Lockdep reported the following issue when mounting erofs with a domain_id: ============================================ WARNING: possible recursive locking detected 6.8.0-rc7-xfstests #521 Not tainted -------------------------------------------- mount/396 is trying to acquire lock: ffff907a8aaaa0e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 but task is already holding lock: ffff907a8aaa90e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&type->s_umount_key#50/1); lock(&type->s_umount_key#50/1); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by mount/396: #0: ffff907a8aaa90e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, at: alloc_super+0xe3/0x3d0 #1: ffffffffc00e6f28 (erofs_domain_list_lock){+.+.}-{3:3}, at: erofs_fscache_register_fs+0x3d/0x270 [erofs] stack backtrace: CPU: 1 PID: 396 Comm: mount Not tainted 6.8.0-rc7-xfstests #521 Call Trace: <TASK> dump_stack_lvl+0x64/0xb0 validate_chain+0x5c4/0xa00 __lock_acquire+0x6a9/0xd50 lock_acquire+0xcd/0x2b0 down_write_nested+0x45/0xd0 alloc_super+0xe3/0x3d0 sget_fc+0x62/0x2f0 vfs_get_super+0x21/0x90 vfs_get_tree+0x2c/0xf0 fc_mount+0x12/0x40 vfs_kern_mount.part.0+0x75/0x90 kern_mount+0x24/0x40 erofs_fscache_register_fs+0x1ef/0x270 [erofs] erofs_fc_fill_super+0x213/0x380 [erofs] This is because the file_system_type of both erofs and the pseudo-mount point of domain_id is erofs_fs_type, so two successive calls to alloc_super() are considered to be using the same lock and trigger the warning above. Therefore add a nodev file_system_type called erofs_anon_fs_type in fscache.c to silence this complaint. Because kern_mount() takes a pointer to struct file_system_type, not its (string) name. So we don't need to call register_filesystem(). In addition, call init_pseudo() in erofs_anon_init_fs_context() as suggested by Al Viro, so that we can remove erofs_fc_fill_pseudo_super(), erofs_fc_anon_get_tree(), and erofs_anon_context_ops. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: a9849560c55e ("erofs: introduce a pseudo mnt to manage shared cookies") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-and-tested-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20240307101018.2021925-1-libaokun1@huawei.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-03-10erofs: refine managed cache operations to foliosGao Xiang
Convert erofs_try_to_free_all_cached_pages() and z_erofs_cache_release_folio(). Besides, erofs_page_is_managed() is moved to zdata.c and renamed as erofs_folio_is_managed(). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-6-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_submissionqueue_endio() to foliosGao Xiang
Use bio_for_each_folio() to iterate over each folio in the bio and there is no large folios for now. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-5-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_fill_bio_vec() to foliosGao Xiang
Introduce a folio member to `struct z_erofs_bvec` and convert most of z_erofs_fill_bio_vec() to folios, which is still straight-forward. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-4-hsiangkao@linux.alibaba.com
2024-03-10erofs: get rid of `justfound` debugging tagGao Xiang
`justfound` is introduced to identify cached folios that are just added to compressed bvecs so that more checks can be applied in the I/O submission path. EROFS is quite now stable compared to the codebase at that stage. `justfound` becomes a burden for upcoming features. Drop it. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-3-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_do_read_page() to foliosGao Xiang
It is a straight-forward conversion. Besides, it's renamed as z_erofs_scan_folio(). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-2-hsiangkao@linux.alibaba.com
2024-03-10erofs: convert z_erofs_onlinepage_.* to foliosGao Xiang
Online folios are locked file-backed folios which will eventually keep decoded (e.g. decompressed) data of each inode for end users to utilize. It may belong to a few pclusters and contain other data (e.g. compressed data for inplace I/Os) temporarily in a time-sharing manner to reduce memory footprints for low-ended storage devices with high latencies under heary I/O pressure. Apart from folio_end_read() usage, it's a straight-forward conversion. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240305091448.1384242-1-hsiangkao@linux.alibaba.com
2024-03-10watchdog: intel-mid_wdt: Get platform data via dev_get_platdata()Andy Shevchenko
Access to platform data via dev_get_platdata() getter to make code cleaner. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240305165306.1366823-4-andriy.shevchenko@linux.intel.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2024-03-10watchdog: intel-mid_wdt: Don't use "proxy" headersAndy Shevchenko
Update header inclusions to follow IWYU (Include What You Use) principle. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240305165306.1366823-3-andriy.shevchenko@linux.intel.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2024-03-10watchdog: intel-mid_wdt: Remove unused intel-mid.hAndy Shevchenko
intel-mid.h is providing some core parts of the South Complex PM, which are usually are not used by individual drivers. In particular, this driver doesn't use it, so simply remove the unused header. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240305165306.1366823-2-andriy.shevchenko@linux.intel.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2024-03-10openrisc: Use asm-generic's version of fix_to_virt() & virt_to_fix()Dawei Li
Openrisc's implementation of fix_to_virt() & virt_to_fix() share same functionality with ones of asm generic. Plus, generic version of fix_to_virt() can trap invalid index at compile time. Thus, Replace the arch-specific implementations with asm generic's ones. Signed-off-by: Dawei Li <set_pte_at@outlook.com> Signed-off-by: Stafford Horne <shorne@gmail.com>
2024-03-10openrisc: Call setup_memory() earlier in the init sequenceOreoluwa Babatunde
The unflatten_and_copy_device_tree() function contains a call to memblock_alloc(). This means that memblock is allocating memory before any of the reserved memory regions are set aside in the setup_memory() function which calls early_init_fdt_scan_reserved_mem(). Therefore, there is a possibility for memblock to allocate from any of the reserved memory regions. Hence, move the call to setup_memory() to be earlier in the init sequence so that the reserved memory regions are set aside before any allocations are done using memblock. Signed-off-by: Oreoluwa Babatunde <quic_obabatun@quicinc.com> Signed-off-by: Stafford Horne <shorne@gmail.com>
2024-03-10pcmcia: cs: make pcmcia_socket_class constantRicardo B. Marliere
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the pcmcia_socket_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2024-03-10dt-bindings: pinctrl: qcom: update compatible name for match with driverTengfei Fan
Use compatible name "qcom,sm4450-tlmm" instead of "qcom,sm4450-pinctrl" to match the compatible name in sm4450 pinctrl driver. Fixes: 7bf8b78f86db ("dt-bindings: pinctrl: qcom: Add SM4450 pinctrl") Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Tengfei Fan <quic_tengfan@quicinc.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240129092512.23602-2-quic_tengfan@quicinc.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2024-03-09exec: Simplify remove_arg_zero() error pathKees Cook
We don't need the "out" label any more, so remove "ret" and return directly on error. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Kees Cook <keescook@chromium.org> --- Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: linux-mm@kvack.org Cc: linux-fsdevel@vger.kernel.org
2024-03-09pstore/zone: Don't clear memory twiceChristophe JAILLET
There is no need to call memset(..., 0, ...) on memory allocated by kcalloc(). It is already zeroed. Remove the redundant call. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/fa2597400051c18c6ca11187b0e4b906729991b2.1709972649.git.christophe.jaillet@wanadoo.fr Signed-off-by: Kees Cook <keescook@chromium.org>
2024-03-09NFSD: Clean up nfsd4_encode_replay()Chuck Lever
Replace open-coded encoding logic with the use of conventional XDR utility functions. Add a tracepoint to make replays observable in field troubleshooting situations. The WARN_ON is removed. A stack trace is of little use, as there is only one call site for nfsd4_encode_replay(), and a buffer length shortage here is unlikely. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-09Merge tag 'i2c-for-6.8-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "Two patches from Heiner for the i801 are targeting muxes discovered while working on some other features. Essentially, there is a reordering when adding optional slaves and proper cleanup upon registering a mux device. Christophe fixes the exit path in the wmt driver that was leaving the clocks hanging, and the last fix from Tommy avoids false error reports in IRQ" * tag 'i2c-for-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: aspeed: Fix the dummy irq expected print i2c: wmt: Fix an error handling path in wmt_i2c_probe() i2c: i801: Avoid potential double call to gpiod_remove_lookup_table i2c: i801: Fix using mux_pdev before it's set
2024-03-09Merge tag 'firewire-fixes-6.8-final' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 Pull firewire fix from Takashi Sakamoto: "A fix to suppress a warning about unreleased IRQ for 1394 OHCI hardware when disabling MSI. In Linux kernel v6.5, a PCI driver for 1394 OHCI hardware was optimized into the managed device resources. Edmund Raile points out that the change brings the warning about unreleased IRQ at the call of pci_disable_msi(), since the API expects that the relevant IRQ has already been released in advance. As long as the API is called in .remove callback of PCI device operation, it is prohibited to maintain the IRQ as the part of managed device resource. As a workaround, the IRQ is explicitly released at .remove callback, before the call of pci_disable_msi(). pci_disable_msi() is legacy API nowadays in PCI MSI implementation. I have a plan to replace it with the modern API in the development for the future version of Linux kernel. So at present I keep them as is" * tag 'firewire-fixes-6.8-final' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: firewire: ohci: prevent leak of left-over IRQ on unbind
2024-03-09Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of ↵Paolo Bonzini
https://github.com/kvm-x86/linux into HEAD KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-09SEV: disable SEV-ES DebugSwap by defaultPaolo Bonzini
The DebugSwap feature of SEV-ES provides a way for confidential guests to use data breakpoints. However, because the status of the DebugSwap feature is recorded in the VMSA, enabling it by default invalidates the attestation signatures. In 6.10 we will introduce a new API to create SEV VMs that will allow enabling DebugSwap based on what the user tells KVM to do. Contextually, we will change the legacy KVM_SEV_ES_INIT API to never enable DebugSwap. For compatibility with kernels that pre-date the introduction of DebugSwap, as well as with those where KVM_SEV_ES_INIT will never enable it, do not enable the feature by default. If anybody wants to use it, for now they can enable the sev_es_debug_swap_enabled module parameter, but this will result in a warning. Fixes: d1f85fbe836e ("KVM: SEV: Enable data breakpoints in SEV-ES") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-03-09Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of ↵Paolo Bonzini
https://github.com/kvm-x86/linux into HEAD KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-09Merge tag 'kvm-x86-fixes-6.8-2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 fixes for 6.8, round 2: - When emulating an atomic access, mark the gfn as dirty in the memslot to fix a bug where KVM could fail to mark the slot as dirty during live migration, ultimately resulting in guest data corruption due to a dirty page not being re-copied from the source to the target. - Check for mmu_notifier invalidation events before faulting in the pfn, and before acquiring mmu_lock, to avoid unnecessary work and lock contention. Contending mmu_lock is especially problematic on preemptible kernels, as KVM may yield mmu_lock in response to the contention, which severely degrades overall performance due to vCPUs making it difficult for the task that triggered invalidation to make forward progress. Note, due to another kernel bug, this fix isn't limited to preemtible kernels, as any kernel built with CONFIG_PREEMPT_DYNAMIC=y will yield contended rwlocks and spinlocks. https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com
2024-03-09arm64, bpf: Use bpf_prog_pack for arm64 bpf trampolinePuranjay Mohan
We used bpf_prog_pack to aggregate bpf programs into huge page to relieve the iTLB pressure on the system. This was merged for ARM64[1] We can apply it to bpf trampoline as well. This would increase the preformance of fentry and struct_ops programs. [1] https://lore.kernel.org/bpf/20240228141824.119877-1-puranjay12@gmail.com/ Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Reviewed-by: Pu Lehui <pulehui@huawei.com> Message-ID: <20240304202803.31400-1-puranjay12@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-09block: partitions: only define function mac_fix_string for CONFIG_PPC_PMACColin Ian King
The helper function mac_fix_string is only required with CONFIG_PPC_PMAC, add #if CONFIG_PPC_PMAC and #endif around the function. Cleans up clang scan build warning: block/partitions/mac.c:23:20: warning: unused function 'mac_fix_string' [-Wunused-function] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20240308133921.2058227-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-09io_uring: Fix sqpoll utilization check racing with dying sqpollGabriel Krisman Bertazi
Commit 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads"), currently in Jens for-next branch, peeks at io_sq_data->thread to report utilization statistics. But, If io_uring_show_fdinfo races with sqpoll terminating, even though we hold the ctx lock, sqd->thread might be NULL and we hit the Oops below. Note that we could technically just protect the getrusage() call and the sq total/work time calculations. But showing some sq information (pid/cpu) and not other information (utilization) is more confusing than not reporting anything, IMO. So let's hide it all if we happen to race with a dying sqpoll. This can be triggered consistently in my vm setup running sqpoll-cancel-hang.t in a loop. BUG: kernel NULL pointer dereference, address: 00000000000007b0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17206e-dirty #69 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022 RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? srso_alias_return_thunk+0x5/0xfbef5 ? do_user_addr_fault+0x174/0x7c0 ? srso_alias_return_thunk+0x5/0xfbef5 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? getrusage+0x21/0x3e0 ? seq_printf+0x4e/0x70 io_uring_show_fdinfo+0x9db/0xa10 ? srso_alias_return_thunk+0x5/0xfbef5 ? vsnprintf+0x101/0x4d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_vprintf+0x34/0x50 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_printf+0x4e/0x70 ? seq_show+0x16b/0x1d0 ? __pfx_io_uring_show_fdinfo+0x10/0x10 seq_show+0x16b/0x1d0 seq_read_iter+0xd7/0x440 seq_read+0x102/0x140 vfs_read+0xae/0x320 ? srso_alias_return_thunk+0x5/0xfbef5 ? __do_sys_newfstat+0x35/0x60 ksys_read+0xa5/0xe0 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7f786ec1db4d Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006 RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60 R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628 </TASK> Modules linked in: CR2: 00000000000007b0 ---[ end trace 0000000000000000 ]--- RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-09NFS: trace the uniquifier of fscacheChen Hanxiao
Trace the mount option fsc=xxx. Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: Read unlock folio on nfs_page_create_from_folio() errorBenjamin Coddington
The netfs conversion lost a folio_unlock() for the case where nfs_page_create_from_folio() returns an error (usually -ENOMEM). Restore it. Reported-by: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> # 6.4+ Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Acked-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: remove unused variable nfs_rpcstatTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: fix UAF in direct writesJosef Bacik
In production we have been hitting the following warning consistently ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0 Workqueue: nfsiod nfs_direct_write_schedule_work [nfs] RIP: 0010:refcount_warn_saturate+0x9c/0xe0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x9f/0x130 ? refcount_warn_saturate+0x9c/0xe0 ? report_bug+0xcc/0x150 ? handle_bug+0x3d/0x70 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0x9c/0xe0 nfs_direct_write_schedule_work+0x237/0x250 [nfs] process_one_work+0x12f/0x4a0 worker_thread+0x14e/0x3b0 ? ZSTD_getCParams_internal+0x220/0x220 kthread+0xdc/0x120 ? __btf_name_valid+0xa0/0xa0 ret_from_fork+0x1f/0x30 This is because we're completing the nfs_direct_request twice in a row. The source of this is when we have our commit requests to submit, we process them and send them off, and then in the completion path for the commit requests we have if (nfs_commit_end(cinfo.mds)) nfs_direct_write_complete(dreq); However since we're submitting asynchronous requests we sometimes have one that completes before we submit the next one, so we end up calling complete on the nfs_direct_request twice. The only other place we use nfs_generic_commit_list() is in __nfs_commit_inode, which wraps this call in a nfs_commit_begin(); nfs_commit_end(); Which is a common pattern for this style of completion handling, one that is also repeated in the direct code with get_dreq()/put_dreq() calls around where we process events as well as in the completion paths. Fix this by using the same pattern for the commit requests. Before with my 200 node rocksdb stress running this warning would pop every 10ish minutes. With my patch the stress test has been running for several hours without popping. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: properly protect nfs_direct_req fieldsJosef Bacik
We protect accesses to the nfs_direct_req fields with the dreq->lock ever where except nfs_direct_commit_complete. This isn't a huge deal, but it does lead to confusion, and we could potentially end up setting NFS_ODIRECT_RESCHED_WRITES in one thread where we've had an error in another. Clean this up to properly protect ->error and ->flags in the commit completion path. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: enable nconnect for RDMATrond Myklebust
It appears that in certain cases, RDMA capable transports can benefit from the ability to establish multiple connections to increase their throughput. This patch therefore enables the use of the "nconnect" mount option for those use cases. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFSv4: nfs4_do_open() is incorrectly triggering state recoveryTrond Myklebust
We're seeing spurious calls to nfs4_schedule_stateid_recovery() from nfs4_do_open() in situations where there is no trigger coming from the server. In theory the code path being triggered is supposed to notice that state recovery happened while we were processing the open call result from the server, before the open stateid is published. However in the years since that code was added, we've also added the 'session draining' mechanism, which ensures that the state recovery will wait until all the session slots have been returned. In nfs4_do_open() the session slot is only returned on exit of the function, so we don't need the legacy mechanism. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: avoid infinite loop in pnfs_update_layout.NeilBrown
If pnfsd_update_layout() is called on a file for which recovery has failed it will enter a tight infinite loop. NFS_LAYOUT_INVALID_STID will be set, nfs4_select_rw_stateid() will return -EIO, and nfs4_schedule_stateid_recovery() will do nothing, so nfs4_client_recover_expired_lease() will not wait. So the code will loop indefinitely. Break the loop by testing the validity of the open stateid at the top of the loop. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: remove sync_mode test from nfs_writepage_locked()NeilBrown
nfs_writepage_locked() is only called from nfs_wb_folio() (since Commit 12fc0a963128 ("nfs: Remove writepage")) so ->sync_mode is always WB_SYNC_ALL. This means the test for WB_SYNC_NONE is dead code and can be removed. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFSv4.1/pnfs: fix NFS with TLS in pnfsOlga Kornievskaia
Currently, even though xprtsec=tls is specified and used for operations to MDS, any operations that go to DS travel over unencrypted connection. Or additionally, if more than 1 DS can serve the data, then trunked connections are also done unencrypted. IN GETDEVINCEINFO, we get an entry for the DS which carries a protocol type (which is TCP), then nfs4_set_ds_client() gets called with TCP instead of TCP with TLS. Currently, each trunked connection is created and uses clp->cl_hostname value which if TLS is used would get passed up in the handshake upcall, but instead we need to pass in the appropriate trunked address value. Fixes: c8407f2e560c ("NFS: Add an "xprtsec=" NFS mount option") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09NFS: Fix an off by one in root_nfs_cat()Christophe JAILLET
The intent is to check if 'dest' is truncated or not. So, >= should be used instead of >, because strlcat() returns the length of 'dest' and 'src' excluding the trailing NULL. Fixes: 56463e50d1fc ("NFS: Use super.c for NFSROOT mount option parsing") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: make the rpc_stat per net namespaceJosef Bacik
Now that we're exposing the rpc stats on a per-network namespace basis, move this struct into struct nfs_net and use that to make sure only the per-network namespace stats are exposed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>