git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2024-09-25	tracepoint: Support iterating tracepoints in a loading module	Masami Hiramatsu (Google)
	Add for_each_tracepoint_in_module() function to iterate tracepoints in a module. This API is needed for handling tracepoints in a loading module from tracepoint_module_notifier callback function. This also update for_each_module_tracepoint() to pass the module to callback function so that it can find module easily. Link: https://lore.kernel.org/all/172397778740.286558.15781131277732977643.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25	tracepoint: Support iterating over tracepoints on modules	Masami Hiramatsu (Google)
	Add for_each_module_tracepoint() for iterating over tracepoints on modules. This is similar to the for_each_kernel_tracepoint() but only for the tracepoints on modules (not including kernel built-in tracepoints). Link: https://lore.kernel.org/all/172397777800.286558.14554748203446214056.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25	x86/pvh: Add 64bit relocation page tables	Jason Andryuk
	The PVH entry point is 32bit. For a 64bit kernel, the entry point must switch to 64bit mode, which requires a set of page tables. In the past, PVH used init_top_pgt. This works fine when the kernel is loaded at LOAD_PHYSICAL_ADDR, as the page tables are prebuilt for this address. If the kernel is loaded at a different address, they need to be adjusted. __startup_64() adjusts the prebuilt page tables for the physical load address, but it is 64bit code. The 32bit PVH entry code can't call it to adjust the page tables, so it can't readily be re-used. 64bit PVH entry needs page tables set up for identity map, the kernel high map and the direct map. pvh_start_xen() enters identity mapped. Inside xen_prepare_pvh(), it jumps through a pv_ops function pointer into the highmap. The direct map is used for __va() on the initramfs and other guest physical addresses. Add a dedicated set of prebuild page tables for PVH entry. They are adjusted in assembly before loading. Add XEN_ELFNOTE_PHYS32_RELOC to indicate support for relocation along with the kernel's loading constraints. The maximum load address, KERNEL_IMAGE_SIZE - 1, is determined by a single pvh_level2_ident_pgt page. It could be larger with more pages. Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20240823193630.2583107-6-jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25	x86/kernel: Move page table macros to header	Jason Andryuk
	The PVH entry point will need an additional set of prebuild page tables. Move the macros and defines to pgtable_64.h, so they can be re-used. Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Message-ID: <20240823193630.2583107-5-jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25	tomoyo: fallback to realpath if symlink's pathname does not exist	Tetsuo Handa
	Alfred Agrell found that TOMOYO cannot handle execveat(AT_EMPTY_PATH) inside chroot environment where /dev and /proc are not mounted, for commit 51f39a1f0cea ("syscalls: implement execveat() system call") missed that TOMOYO tries to canonicalize argv[0] when the filename fed to the executed program as argv[0] is supplied using potentially nonexistent pathname. Since "/dev/fd/<fd>" already lost symlink information used for obtaining that <fd>, it is too late to reconstruct symlink's pathname. Although <filename> part of "/dev/fd/<fd>/<filename>" might not be canonicalized, TOMOYO cannot use tomoyo_realpath_nofollow() when /dev or /proc is not mounted. Therefore, fallback to tomoyo_realpath_from_path() when tomoyo_realpath_nofollow() failed. Reported-by: Alfred Agrell <blubban@gmail.com> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1082001 Fixes: 51f39a1f0cea ("syscalls: implement execveat() system call") Cc: stable@vger.kernel.org # v3.19+ Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
2024-09-25	x86/pvh: Set phys_base when calling xen_prepare_pvh()	Jason Andryuk
	phys_base needs to be set for __pa() to work in xen_pvh_init() when finding the hypercall page. Set it before calling into xen_prepare_pvh(), which calls xen_pvh_init(). Clear it afterward to avoid __startup_64() adding to it and creating an incorrect value. Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20240823193630.2583107-4-jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25	x86/pvh: Make PVH entrypoint PIC for x86-64	Jason Andryuk
	The PVH entrypoint is 32bit non-PIC code running the uncompressed vmlinux at its load address CONFIG_PHYSICAL_START - default 0x1000000 (16MB). The kernel is loaded at that physical address inside the VM by the VMM software (Xen/QEMU). When running a Xen PVH Dom0, the host reserved addresses are mapped 1-1 into the PVH container. There exist system firmwares (Coreboot/EDK2) with reserved memory at 16MB. This creates a conflict where the PVH kernel cannot be loaded at that address. Modify the PVH entrypoint to be position-indepedent to allow flexibility in load address. Only the 64bit entry path is converted. A 32bit kernel is not PIC, so calling into other parts of the kernel, like xen_prepare_pvh() and mk_pgtable_32(), don't work properly when relocated. This makes the code PIC, but the page tables need to be updated as well to handle running from the kernel high map. The UNWIND_HINT_END_OF_STACK is to silence: vmlinux.o: warning: objtool: pvh_start_xen+0x7f: unreachable instruction after the lret into 64bit code. Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20240823193630.2583107-3-jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25	xen: sync elfnote.h from xen tree	Jason Andryuk
	Sync Xen's elfnote.h header from xen.git to pull in the XEN_ELFNOTE_PHYS32_RELOC define. xen commit dfc9fab00378 ("x86/PVH: Support relocatable dom0 kernels") This is a copy except for the removal of the emacs editor config at the end of the file. Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20240823193630.2583107-2-jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25	drm/v3d: Expose Super Pages capability	Maíra Canal
	Add a new V3D parameter to expose the support of Super Pages to userspace. The userspace might want to know this information to apply optimizations that are specific to kernels with Super Pages enabled. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-12-mcanal@igalia.com
2024-09-25	drm/v3d: Add modparam for turning off Big/Super Pages	Maíra Canal
	Add a modparam for turning off Big/Super Pages to make sure that if an user doesn't want Big/Super Pages enabled, it can disabled it by setting the modparam to false. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-11-mcanal@igalia.com
2024-09-25	drm/v3d: Use gemfs/THP in BO creation if available	Maíra Canal
	Although Big/Super Pages could appear naturally, it would be quite hard to have 1MB or 64KB allocated contiguously naturally. Therefore, we can force the creation of large pages allocated contiguously by using a mountpoint with "huge=within_size" enabled. Therefore, as V3D has a mountpoint with "huge=within_size" (if user has THP enabled), use this mountpoint for BO creation if available. This will allow us to create large pages allocated contiguously and make use of Big/Super Pages. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-10-mcanal@igalia.com
2024-09-25	drm/v3d: Support Big/Super Pages when writing out PTEs	Maíra Canal
	The V3D MMU also supports 64KB and 1MB pages, called big and super pages, respectively. In order to set a 64KB page or 1MB page in the MMU, we need to make sure that page table entries for all 4KB pages within a big/super page must be correctly configured. In order to create a big/super page, we need a contiguous memory region. That's why we use a separate mountpoint with THP enabled. In order to place the page table entries in the MMU, we iterate over the 16 4KB pages (for big pages) or 256 4KB pages (for super pages) and insert the PTE. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-9-mcanal@igalia.com
2024-09-25	drm/v3d: Reduce the alignment of the node allocation	Maíra Canal
	Currently, we are using an alignment of 128 kB to insert a node, which ends up wasting memory as we perform plenty of small BOs allocations (<= 4 kB). We require that allocations are aligned to 128Kb so for any allocation smaller than that, we are wasting the difference. This implies that we cannot effectively use the whole 4 GB address space available for the GPU in the RPi 4. Currently, we can allocate up to 32000 BOs of 4 kB (~140 MB) and 3000 BOs of 400 kB (~1,3 GB). This can be quite limiting for applications that have a high memory requirement, such as vkoverhead [1]. By reducing the page alignment to 4 kB, we can allocate up to 1000000 BOs of 4 kB (~4 GB) and 10000 BOs of 400 kB (~4 GB). Moreover, by performing benchmarks, we were able to attest that reducing the page alignment to 4 kB can provide a general performance improvement in OpenGL applications (e.g. glmark2). Therefore, this patch reduces the alignment of the node allocation to 4 kB, which will allow RPi users to explore the whole 4GB virtual address space provided by the hardware. Also, this patch allow users to fully run vkoverhead in the RPi 4/5, solving the issue reported in [1]. [1] https://github.com/zmike/vkoverhead/issues/14 Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-8-mcanal@igalia.com
2024-09-25	drm/gem: Create shmem GEM object in a given mountpoint	Maíra Canal
	Create a function `drm_gem_shmem_create_with_mnt()`, similar to `drm_gem_shmem_create()`, that has a mountpoint as a argument. This function will create a shmem GEM object in a given tmpfs mountpoint. This function will be useful for drivers that have a special mountpoint with flags enabled. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-7-mcanal@igalia.com
2024-09-25	drm/v3d: Introduce gemfs	Maíra Canal
	Create a separate "tmpfs" kernel mount for V3D. This will allow us to move away from the shmemfs `shm_mnt` and gives the flexibility to do things like set our own mount options. Here, the interest is to use "huge=", which should allow us to enable the use of THP for our shmem-backed objects. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-6-mcanal@igalia.com
2024-09-25	drm/gem: Create a drm_gem_object_init_with_mnt() function	Maíra Canal
	For some applications, such as applications that uses huge pages, we might want to have a different mountpoint, for which we pass mount flags that better match our usecase. Therefore, create a new function `drm_gem_object_init_with_mnt()` that allow us to define the tmpfs mountpoint where the GEM object will be created. If this parameter is NULL, then we fallback to `shmem_file_setup()`. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-5-mcanal@igalia.com
2024-09-25	drm/v3d: Fix return if scheduler initialization fails	Maíra Canal
	If the scheduler initialization fails, GEM initialization must fail as well. Therefore, if `v3d_sched_init()` fails, free the DMA memory allocated and return the error value in `v3d_gem_init()`. Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-4-mcanal@igalia.com
2024-09-25	drm/v3d: Flush the MMU before we supply more memory to the binner	Maíra Canal
	We must ensure that the MMU is flushed before we supply more memory to the binner, otherwise we might end up with invalid MMU accesses by the GPU. Fixes: 57692c94dcbe ("drm/v3d: Introduce a new DRM driver for Broadcom V3D V3.x+") Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-3-mcanal@igalia.com
2024-09-25	drm/v3d: Address race-condition in MMU flush	Maíra Canal
	We must first flush the MMU cache and then, flush the TLB, not the other way around. Currently, we can see a race condition between the MMU cache and the TLB when running multiple rendering processes at the same time. This is evidenced by MMU errors triggered by the IRQ. Fix the MMU flush order by flushing the MMU cache and then the TLB. Also, in order to address the race condition, wait for the MMU cache flush to finish before starting the TLB flush. Fixes: 57692c94dcbe ("drm/v3d: Introduce a new DRM driver for Broadcom V3D V3.x+") Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240923141348.2422499-2-mcanal@igalia.com
2024-09-25	kprobes: Remove obsoleted declaration for init_test_probes	Gaosheng Cui
	The init_test_probes() have been removed since commit e44e81c5b90f ("kprobes: convert tests to kunit"), and now it is useless, so remove it. Link: https://lore.kernel.org/all/20240826032552.4016314-1-cuigaosheng1@huawei.com/ Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25	uprobes: turn trace_uprobe's nhit counter to be per-CPU one	Andrii Nakryiko
	trace_uprobe->nhit counter is not incremented atomically, so its value is questionable in when uprobe is hit on multiple CPUs simultaneously. Also, doing this shared counter increment across many CPUs causes heavy cache line bouncing, limiting uprobe/uretprobe performance scaling with number of CPUs. Solve both problems by making this a per-CPU counter. Link: https://lore.kernel.org/all/20240813203409.3985398-1-andrii@kernel.org/ Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25	vsock/virtio: avoid queuing packets when intermediate queue is empty	Luigi Leonardi
	When the driver needs to send new packets to the device, it always queues the new sk_buffs into an intermediate queue (send_pkt_queue) and schedules a worker (send_pkt_work) to then queue them into the virtqueue exposed to the device. This increases the chance of batching, but also introduces a lot of latency into the communication. So we can optimize this path by adding a fast path to be taken when there is no element in the intermediate queue, there is space available in the virtqueue, and no other process that is sending packets (tx_lock held). The following benchmarks were run to check improvements in latency and throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz and L1 guest running on QEMU/KVM with vhost process and all vCPUs pinned individually to pCPUs. - Latency Tool: Fio version 3.37-56 Mode: pingpong (h-g-h) Test runs: 50 Runtime-per-test: 50s Type: SOCK_STREAM In the following fio benchmark (pingpong mode) the host sends a payload to the guest and waits for the same payload back. fio process pinned both inside the host and the guest system. Before: Linux 6.9.8 Payload 64B: 1st perc. overall 99th perc. Before 12.91 16.78 42.24 us After 9.77 13.57 39.17 us Payload 512B: 1st perc. overall 99th perc. Before 13.35 17.35 41.52 us After 10.25 14.11 39.58 us Payload 4K: 1st perc. overall 99th perc. Before 14.71 19.87 41.52 us After 10.51 14.96 40.81 us - Throughput Tool: iperf-vsock The size represents the buffer length (-l) to read/write P represents the number of parallel streams P=1 4K 64K 128K Before 6.87 29.3 29.5 Gb/s After 10.5 39.4 39.9 Gb/s P=2 4K 64K 128K Before 10.5 32.8 33.2 Gb/s After 17.8 47.7 48.5 Gb/s P=4 4K 64K 128K Before 12.7 33.6 34.2 Gb/s After 16.9 48.1 50.5 Gb/s The performance improvement is related to this optimization, I used a ebpf kretprobe on virtio_transport_send_skb to check that each packet was sent directly to the virtqueue Co-developed-by: Marco Pinna <marco.pinn95@gmail.com> Signed-off-by: Marco Pinna <marco.pinn95@gmail.com> Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com> Message-Id: <20240730-pinna-v4-2-5c9179164db5@outlook.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2024-09-25	vsock/virtio: refactor virtio_transport_send_pkt_work	Marco Pinna
	Preliminary patch to introduce an optimization to the enqueue system. All the code used to enqueue a packet into the virtqueue is removed from virtio_transport_send_pkt_work() and moved to the new virtio_transport_send_skb() function. Co-developed-by: Luigi Leonardi <luigi.leonardi@outlook.com> Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com> Signed-off-by: Marco Pinna <marco.pinn95@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20240730-pinna-v4-1-5c9179164db5@outlook.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	fw_cfg: Constify struct kobj_type	Hongbo Li
	This 'struct kobj_type' is not modified. It is only used in kobject_init_and_add() which takes a 'const struct kobj_type *ktype' parameter. Constifying this structure and moving it to a read-only section, and this can increase over all security. ``` [Before] text data bss dec hex filename 5974 1008 96 7078 1ba6 drivers/firmware/qemu_fw_cfg.o [After] text data bss dec hex filename 6038 944 96 7078 1ba6 drivers/firmware/qemu_fw_cfg.o ``` Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Message-Id: <20240904011743.2010319-1-lihongbo22@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	vdpa/mlx5: Postpone MR deletion	Dragos Tatulea
	Currently, when a new MR is set up, the old MR is deleted. MR deletion is about 30-40% the time of MR creation. As deleting the old MR is not important for the process of setting up the new MR, this operation can be postponed. This series adds a workqueue that does MR garbage collection at a later point. If the MR lock is taken, the handler will back off and reschedule. The exception during shutdown: then the handler must not postpone the work. Note that this is only a speculative optimization: if there is some mapping operation that is triggered while the garbage collector handler has the lock taken, this operation it will have to wait for the handler to finish. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-9-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	vdpa/mlx5: Introduce init/destroy for MR resources	Dragos Tatulea
	There's currently not a lot of action happening during the init/destroy of MR resources. But more will be added in the upcoming patches. As the mr mutex lock init/destroy has been moved to these new functions, the lifetime has now shifted away from mlx5_vdpa_alloc_resources() / mlx5_vdpa_free_resources() into these new functions. However, the lifetime at the outer scope remains the same: mlx5_vdpa_dev_add() / mlx5_vdpa_dev_free() Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-8-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	vdpa/mlx5: Rename mr_mtx -> lock	Dragos Tatulea
	Now that the mr resources have their own namespace in the struct, give the lock a clearer name. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-7-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	vdpa/mlx5: Extract mr members in own resource struct	Dragos Tatulea
	Group all mapping related resources into their own structure. Upcoming patches will add more members in this new structure. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-6-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	vdpa/mlx5: Rename function	Dragos Tatulea
	A followup patch will use this name for something else. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-5-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	vdpa/mlx5: Delete direct MKEYs in parallel	Dragos Tatulea
	Use the async interface to issue MTT MKEY deletion. This makes destroy_user_mr() on average 8x times faster. This number is also dependent on the size of the MR being deleted. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-4-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	vdpa/mlx5: Create direct MKEYs in parallel	Dragos Tatulea
	Use the async interface to issue MTT MKEY creation. Extra care is taken at the allocation of FW input commands due to the MTT tables having variable sizes depending on MR. The indirect MKEY is still created synchronously at the end as the direct MKEYs need to be filled in. This makes create_user_mr() 3-5x faster, depending on the size of the MR. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-3-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	MAINTAINERS: add virtio-vsock driver in the VIRTIO CORE section	Stefano Garzarella
	The virtio-vsock driver is already under VM SOCKETS (AF_VSOCK), managed pricipally with the net tree, and VIRTIO AND VHOST VSOCK DRIVER. However, changes that only affect the virtio part usually go with Michael's tree, so let's also put the driver in the VIRTIO CORE section to have its maintainers in CC for changes to the virtio-vsock driver. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20240829143757.85844-1-sgarzare@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-09-25	virtio_fs: add sysfs entries for queue information	Max Gurtovoy
	Introduce sysfs entries to provide visibility to the multiple queues used by the Virtio FS device. This enhancement allows users to query information about these queues. Specifically, add two sysfs entries: 1. Queue name: Provides the name of each queue (e.g. hiprio/requests.8). 2. CPU list: Shows the list of CPUs that can process requests for each queue. The CPU list feature is inspired by similar functionality in the block MQ layer, which provides analogous sysfs entries for block devices. These new sysfs entries will improve observability and aid in debugging and performance tuning of Virtio FS devices. Reviewed-by: Idan Zach <izach@nvidia.com> Reviewed-by: Shai Malin <smalin@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <20240825130716.9506-2-mgurtovoy@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25	virtio_fs: introduce virtio_fs_put_locked helper	Max Gurtovoy
	Introduce a new helper function virtio_fs_put_locked to encapsulate the common pattern of releasing a virtio_fs reference while holding a lock. The existing virtio_fs_put helper will be used to release a virtio_fs reference while not holding a lock. Also add an assertion in case the lock is not taken when it should. Reviewed-by: Idan Zach <izach@nvidia.com> Reviewed-by: Shai Malin <smalin@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <20240825130716.9506-1-mgurtovoy@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-09-25	vdpa: Remove unused declarations	Yue Haibing
	There is no caller and implementation in tree. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Message-Id: <20240819140930.122019-1-yuehaibing@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Zhu Lingshan <lingshan.zhu@kernel.org> Reviewed-by: Shannon Nelson <<a href="mailto:shannon.nelson@amd.com" target="_blank">shannon.nelson@amd.com</a>><br> Reviewed-by: Zhu Lingshan <lingshan.zhu@kernel.org>
2024-09-25	vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command	Dragos Tatulea
	change_num_qps() is still suspending/resuming VQs one by one. This change switches to parallel suspend/resume. When increasing the number of queues the flow has changed a bit for simplicity: the setup_vq() function will always be called before resume_vqs(). If the VQ is initialized, setup_vq() will exit early. If the VQ is not initialized, setup_vq() will create it and resume_vqs() will resume it. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-11-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	vdpa/mlx5: Small improvement for change_num_qps()	Dragos Tatulea
	change_num_qps() has a lot of multiplications by 2 to convert the number of VQ pairs to number of VQs. This patch simplifies the code by doing the VQP -> VQ count conversion at the beginning in a variable. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-10-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	vdpa/mlx5: Keep notifiers during suspend but ignore	Dragos Tatulea
	Unregistering notifiers is a costly operation. Instead of removing the notifiers during device suspend and adding them back at resume, simply ignore the call when the device is suspended. At resume time call queue_link_work() to make sure that the device state is propagated in case there were changes. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device suspend time is reduced from ~13 ms to ~2.5 ms. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-9-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	vdpa/mlx5: Parallelize device resume	Dragos Tatulea
	Currently device resume works on vqs serially. Building up on previous changes that converted vq operations to the async api, this patch parallelizes the device resume. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device resume time is reduced from ~16 ms to ~4.5 ms. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-8-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	vdpa/mlx5: Parallelize device suspend	Dragos Tatulea
	Currently device suspend works on vqs serially. Building up on previous changes that converted vq operations to the async api, this patch parallelizes the device suspend: 1) Suspend all active vqs parallel. 2) Query suspended vqs in parallel. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device suspend time is reduced from ~37 ms to ~13 ms. A later patch will remove the link unregister operation which will make it even faster. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-7-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	vdpa/mlx5: Use async API for vq modify commands	Dragos Tatulea
	Switch firmware vq modify command to be issued via the async API to allow future parallelization. The new refactored function applies the modify on a range of vqs and waits for their execution to complete. For now the command is still used in a serial fashion. A later patch will switch to modifying multiple vqs in parallel. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-6-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	vdpa/mlx5: Use async API for vq query command	Dragos Tatulea
	Switch firmware vq query command to be issued via the async API to allow future parallelization. For now the command is still serial but the infrastructure is there to issue commands in parallel, including ratelimiting the number of issued async commands to firmware. A later patch will switch to issuing more commands at a time. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-5-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	vdpa/mlx5: Introduce async fw command wrapper	Dragos Tatulea
	Introduce a new function mlx5_vdpa_exec_async_cmds() which wraps the mlx5_core async firmware command API in a way that will be used to parallelize certain operation in this driver. The wrapper deals with the case when mlx5_cmd_exec_cb() returns EBUSY due to the command being throttled. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-4-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	vdpa/mlx5: Introduce error logging function	Dragos Tatulea
	mlx5_vdpa_err() was missing. This patch adds it and uses it in the necessary places. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-3-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	net/mlx5: Support throttled commands from async API	Dragos Tatulea
	Currently, commands that qualify as throttled can't be used via the async API. That's due to the fact that the throttle semaphore can sleep but the async API can't. This patch allows throttling in the async API by using the tentative variant of the semaphore and upon failure (semaphore at 0) returns EBUSY to signal to the caller that they need to wait for the completion of previously issued commands. Furthermore, make sure that the semaphore is released in the callback. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Cc: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-2-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25	xen/pciback: fix cast to restricted pci_ers_result_t and pci_power_t	Min-Hua Chen
	This patch fix the following sparse warning by applying __force cast to pci_ers_result_t and pci_power_t. drivers/xen/xen-pciback/pci_stub.c:760:16: sparse: warning: cast to restricted pci_ers_result_t drivers/xen/xen-pciback/conf_space_capability.c:125:22: sparse: warning: cast to restricted pci_power_t No functional changes intended. Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20240917233653.61630-1-minhuadotchen@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25	Merge tag 'nvme-6.12-2024-09-25' of git://git.infradead.org/nvme into ↵	Jens Axboe
	for-6.12/block Pull NVMe fixes from Keith: "nvme fixes for Linux 6.12 - Multipath fixes (Hannes) - Sysfs attribute list NULL terminate fix (Shin'ichiro) - Remove problematic read-back (Keith)" * tag 'nvme-6.12-2024-09-25' of git://git.infradead.org/nvme: nvme: remove CC register read-back during enabling nvme: null terminate nvme_tls_attrs nvme-multipath: avoid hang on inaccessible namespaces nvme-multipath: system fails to create generic nvme device
2024-09-25	Revert "driver core: don't always lock parent in shutdown"	Greg Kroah-Hartman
	This reverts commit ba6353748e71bd1d7e422fec2b5c2e2dfc2e3bd9. The series is being reverted before -rc1 as there are still reports of lockups on shutdown, so it's not quite ready for "prime time." Reported-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/ZvMkkhyJrohaajuk@skv.local Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Laurence Oberman <loberman@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-25	Revert "driver core: separate function to shutdown one device"	Greg Kroah-Hartman
	This reverts commit 95dc7565253a8564911190ebd1e4ffceb4de208a. The series is being reverted before -rc1 as there are still reports of lockups on shutdown, so it's not quite ready for "prime time." Reported-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/ZvMkkhyJrohaajuk@skv.local Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Laurence Oberman <loberman@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-25	Revert "driver core: shut down devices asynchronously"	Greg Kroah-Hartman
	This reverts commit 8064952c65045f05ee2671fe437770e50c151776. The series is being reverted before -rc1 as there are still reports of lockups on shutdown, so it's not quite ready for "prime time." Reported-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/ZvMkkhyJrohaajuk@skv.local Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Laurence Oberman <loberman@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>