summaryrefslogtreecommitdiff
path: root/drivers/virtio
AgeCommit message (Collapse)Author
2024-04-08virtio: store owner from modules with register_virtio_driver()Krzysztof Kozlowski
Modules registering driver with register_virtio_driver() might forget to set .owner field. i2c-virtio.c for example has it missing. The field is used by some other kernel parts for reference counting (try_module_get()), so it is expected that drivers will set it. Solve the problem by moving this task away from the drivers to the core virtio code, just like we did for platform_driver in commit 9447057eaff8 ("platform_device: use a macro instead of platform_driver_register"). Fixes: 3cfc88380413 ("i2c: virtio: add a virtio i2c frontend driver") Cc: "Jie Deng" <jie.deng@intel.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Message-Id: <20240331-module-owner-virtio-v2-1-98f04bfaf46a@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19virtio: packed: fix unmap leak for indirect desc tableXuan Zhuo
When use_dma_api and premapped are true, then the do_unmap is false. Because the do_unmap is false, vring_unmap_extra_packed is not called by detach_buf_packed. if (unlikely(vq->do_unmap)) { curr = id; for (i = 0; i < state->num; i++) { vring_unmap_extra_packed(vq, &vq->packed.desc_extra[curr]); curr = vq->packed.desc_extra[curr].next; } } So the indirect desc table is not unmapped. This causes the unmap leak. So here, we check vq->use_dma_api instead. Synchronously, dma info is updated based on use_dma_api judgment This bug does not occur, because no driver use the premapped with indirect. Fixes: b319940f83c2 ("virtio_ring: skip unmap for premapped") Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20240223071833.26095-1-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19virtio: make virtio_bus constRicardo B. Marliere
Now that the driver core can properly handle constant struct bus_type, move the virtio_bus variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Message-Id: <20240204-bus_cleanup-virtio-v1-1-3bcb2212aaa0@marliere.net> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Jason Wang <jasowang@redhat.com>
2024-03-19virtio_vdpa: create vqs with the actual sizeZhu Lingshan
The size of a virtqueue is a per vq configuration, this commit allows virtio_vdpa to create virtqueues with the actual size of a specific vq size that supported by the backend device. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Message-Id: <20240202163905.8834-9-lingshan.zhu@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-19virtio: reenable config if freezing device failedDavid Hildenbrand
Currently, we don't reenable the config if freezing the device failed. For example, virtio-mem currently doesn't support suspend+resume, and trying to freeze the device will always fail. Afterwards, the device will no longer respond to resize requests, because it won't get notified about config changes. Let's fix this by re-enabling the config if freezing fails. Fixes: 22b7050a024d ("virtio: defer config changed notifications") Cc: <stable@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240213135425.795001-1-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-01-18Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: - vdpa/mlx5: support for resumable vqs - virtio_scsi: mq_poll support - 3virtio_pmem: support SHMEM_REGION - virtio_balloon: stay awake while adjusting balloon - virtio: support for no-reset virtio PCI PM - Fixes, cleanups * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vdpa/mlx5: Add mkey leak detection vdpa/mlx5: Introduce reference counting to mrs vdpa/mlx5: Use vq suspend/resume during .set_map vdpa/mlx5: Mark vq state for modification in hw vq vdpa/mlx5: Mark vq addrs for modification in hw vq vdpa/mlx5: Introduce per vq and device resume vdpa/mlx5: Allow modifying multiple vq fields in one modify command vdpa/mlx5: Expose resumable vq capability vdpa: Block vq property changes in DRIVER_OK vdpa: Track device suspended state scsi: virtio_scsi: Add mq_poll support virtio_pmem: support feature SHMEM_REGION virtio_balloon: stay awake while adjusting balloon vdpa: Remove usage of the deprecated ida_simple_xx() API virtio: Add support for no-reset virtio PCI PM virtio_net: fix missing dma unmap for resize vhost-vdpa: account iommu allocations vdpa: Fix an error handling path in eni_vdpa_probe()
2024-01-18Merge tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Add debugfs support, initially used for reporting device migration state (Longfang Liu) - Fixes and support for migration dirty tracking across multiple IOVA regions in the pds-vfio-pci driver (Brett Creeley) - Improved IOMMU allocation accounting visibility (Pasha Tatashin) - Virtio infrastructure and a new virtio-vfio-pci variant driver, which provides emulation of a legacy virtio interfaces on modern virtio hardware for virtio-net VF devices where the PF driver exposes support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV VF to provide driver ABI compatibility to legacy devices (Yishai Hadas & Feng Liu) - Migration fixes for the hisi-acc-vfio-pci variant driver (Shameer Kolothum) - Kconfig dependency fix for new virtio-vfio-pci variant driver (Arnd Bergmann) * tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio: (22 commits) vfio/virtio: fix virtio-pci dependency hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume vfio/virtio: Declare virtiovf_pci_aer_reset_done() static vfio/virtio: Introduce a vfio driver over virtio devices vfio/pci: Expose vfio_pci_core_iowrite/read##size() vfio/pci: Expose vfio_pci_core_setup_barmap() virtio-pci: Introduce APIs to execute legacy IO admin commands virtio-pci: Initialize the supported admin commands virtio-pci: Introduce admin commands virtio-pci: Introduce admin command sending function virtio-pci: Introduce admin virtqueue virtio: Define feature bit for administration virtqueue vfio/type1: account iommu allocations vfio/pds: Add multi-region support vfio/pds: Move seq/ack bitmaps into region struct vfio/pds: Pass region info to relevant functions vfio/pds: Move and rename region specific info vfio/pds: Only use a single SGL for both seq and ack vfio/pds: Fix calculations in pds_vfio_dirty_sync MAINTAINERS: Add vfio debugfs interface doc link ...
2024-01-10virtio_balloon: stay awake while adjusting balloonDavid Stevens
A virtio_balloon's parent device may be configured so that a configuration change interrupt is a wakeup event. Extend the processing of such a wakeup event until the balloon finishes inflating or deflating by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note that these calls are no-ops if the parent device doesn't support wakeup events or if the wakeup events are not enabled. This change allows the guest to use system power states such as s2idle without running the risk the virtio_balloon's cooperative memory management becoming unresponsive to the host's requests. Tested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: David Stevens <stevensd@chromium.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-Id: <20240110021925.1137333-1-stevensd@google.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-08mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDERKirill A. Shutemov
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-27virtio: Add support for no-reset virtio PCI PMDavid Stevens
If a virtio_pci_device supports native PCI power management and has the No_Soft_Reset bit set, then skip resetting and reinitializing the device when suspending and restoring the device. This allows system-wide low power states like s2idle to be used in systems with stateful virtio devices that can't simply be re-initialized (e.g. virtio-fs). Signed-off-by: David Stevens <stevensd@chromium.org> Message-Id: <20231208070754.3132339-1-stevensd@chromium.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2023-12-19virtio-pci: Introduce APIs to execute legacy IO admin commandsYishai Hadas
Introduce APIs to execute legacy IO admin commands. It includes: io_legacy_read/write for both common and the device configuration, io_legacy_notify_info. In addition, exposing an API to check whether the legacy IO commands are supported. (i.e. virtio_pci_admin_has_legacy_io()). Those APIs will be used by the next patches from this series. Note: Unlike modern drivers which support hardware virtio devices, legacy drivers assume software-based devices: e.g. they don't use proper memory barriers on ARM, use big endian on PPC, etc. X86 drivers are mostly ok though, more or less by chance. For now, only support legacy IO on X86. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20231219093247.170936-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19virtio-pci: Initialize the supported admin commandsYishai Hadas
Initialize the supported admin commands upon activating the admin queue. The supported commands are saved as part of the admin queue context. Next patches in this series will expose APIs to use them. Reviewed-by: Feng Liu <feliu@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20231219093247.170936-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19virtio-pci: Introduce admin command sending functionFeng Liu
Add support for sending admin command through admin virtqueue interface. Abort any inflight admin commands once device reset completes. Activate admin queue when device becomes ready; deactivate on device reset. To comply to the below specification statement [1], the admin virtqueue is activated for upper layer users only after setting DRIVER_OK status. [1] The driver MUST NOT send any buffer available notifications to the device before setting DRIVER_OK. Signed-off-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20231219093247.170936-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19virtio-pci: Introduce admin virtqueueFeng Liu
Introduce support for the admin virtqueue. By negotiating VIRTIO_F_ADMIN_VQ feature, driver detects capability and creates one administration virtqueue. Administration virtqueue implementation in virtio pci generic layer, enables multiple types of upper layer drivers such as vfio, net, blk to utilize it. Signed-off-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20231219093247.170936-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04virtio_ring: fix syncs DMA memory with different directionXuan Zhuo
Now the APIs virtqueue_dma_sync_single_range_for_{cpu,device} ignore the parameter 'dir', that is a mistake. [ 6.101666] ------------[ cut here ]------------ [ 6.102079] DMA-API: virtio-pci 0000:00:04.0: device driver syncs DMA memory with different direction [device address=0x00000000ae010000] [size=32752 bytes] [mapped with DMA_FROM_DEVICE] [synced with DMA_BIDIRECTIONAL] [ 6.103630] WARNING: CPU: 6 PID: 0 at kernel/dma/debug.c:1125 check_sync+0x53e/0x6c0 [ 6.107420] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G E 6.6.0+ #290 [ 6.108030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 6.108936] RIP: 0010:check_sync+0x53e/0x6c0 [ 6.109289] Code: 24 10 e8 f5 d9 74 00 4c 8b 4c 24 10 4c 8b 44 24 18 48 8b 4c 24 20 48 89 c6 41 56 4c 89 ea 48 c7 c7 b0 f1 50 82 e8 32 fc f3 ff <0f> 0b 48 c7 c7 48 4b 4a 82 e8 74 d9 fc ff 8b 73 4c 48 8d 7b 50 31 [ 6.110750] RSP: 0018:ffffc90000180cd8 EFLAGS: 00010092 [ 6.111178] RAX: 00000000000000ce RBX: ffff888100aa5900 RCX: 0000000000000000 [ 6.111744] RDX: 0000000000000104 RSI: ffffffff824c3208 RDI: 00000000ffffffff [ 6.112316] RBP: ffffc90000180d40 R08: 0000000000000000 R09: 00000000fffeffff [ 6.112893] R10: ffffc90000180b98 R11: ffffffff82f63308 R12: ffffffff83d5af00 [ 6.113460] R13: ffff888100998200 R14: ffffffff824a4b5f R15: 0000000000000286 [ 6.114027] FS: 0000000000000000(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 [ 6.114665] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6.115128] CR2: 00007f10f1e03030 CR3: 0000000108272004 CR4: 0000000000770ee0 [ 6.115701] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6.116272] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6.116842] PKRU: 55555554 [ 6.117069] Call Trace: [ 6.117275] <IRQ> [ 6.117452] ? __warn+0x84/0x140 [ 6.117727] ? check_sync+0x53e/0x6c0 [ 6.118034] ? __report_bug+0xea/0x100 [ 6.118353] ? check_sync+0x53e/0x6c0 [ 6.118653] ? report_bug+0x41/0xc0 [ 6.118944] ? handle_bug+0x3c/0x70 [ 6.119237] ? exc_invalid_op+0x18/0x70 [ 6.119551] ? asm_exc_invalid_op+0x1a/0x20 [ 6.119900] ? check_sync+0x53e/0x6c0 [ 6.120199] ? check_sync+0x53e/0x6c0 [ 6.120499] debug_dma_sync_single_for_cpu+0x5c/0x70 [ 6.120906] ? dma_sync_single_for_cpu+0xb7/0x100 [ 6.121291] virtnet_rq_unmap+0x158/0x170 [virtio_net] [ 6.121716] virtnet_receive+0x196/0x220 [virtio_net] [ 6.122135] virtnet_poll+0x48/0x1b0 [virtio_net] [ 6.122524] __napi_poll+0x29/0x1b0 [ 6.123083] net_rx_action+0x282/0x360 [ 6.123612] __do_softirq+0xf3/0x2fb [ 6.124138] __irq_exit_rcu+0x8e/0xf0 [ 6.124663] common_interrupt+0xbc/0xe0 [ 6.125202] </IRQ> We need to enable CONFIG_DMA_API_DEBUG and work with need sync mode(such as swiotlb) to reproduce this warn. Fixes: 8bd2f71054bd ("virtio_ring: introduce dma sync api for virtqueue") Reported-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com> Closes: https://lore.kernel.org/all/f37cb55a-6fc8-4e21-8789-46d468325eea@linux.intel.com/ Suggested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20231201033303.25141-1-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
2023-11-16Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio fixes from Michael Tsirkin: "Bugfixes all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost-vdpa: fix use after free in vhost_vdpa_probe() virtio_pci: Switch away from deprecated irq_set_affinity_hint riscv, qemu_fw_cfg: Add support for RISC-V architecture vdpa_sim_blk: allocate the buffer zeroed virtio_pci: move structure to a header
2023-11-05Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "vhost,virtio,vdpa: features, fixes, cleanups. vdpa/mlx5: - VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK - new maintainer vdpa: - support for vq descriptor mappings - decouple reset of iotlb mapping from device reset and fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits) vdpa_sim: implement .reset_map support vdpa/mlx5: implement .reset_map driver op vhost-vdpa: clean iotlb map during reset for older userspace vdpa: introduce .compat_reset operation callback vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vhost-vdpa: reset vendor specific mapping to initial state in .release vdpa: introduce .reset_map operation callback virtio_pci: add check for common cfg size virtio-blk: fix implicit overflow on virtio_max_dma_size virtio_pci: add build offset check for the new common cfg items virtio: add definition of VIRTIO_F_NOTIF_CONFIG_DATA feature bit vduse: make vduse_class constant vhost-scsi: Spelling s/preceeding/preceding/g virtio: kdoc for struct virtio_pci_modern_device vdpa: Update sysfs ABI documentation MAINTAINERS: Add myself as mlx5_vdpa driver virtio-balloon: correct the comment of virtballoon_migratepage() mlx5_vdpa: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK vdpa/mlx5: Update cvq iotlb mapping on ASID change vdpa/mlx5: Make iotlb helper functions more generic ...
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-01virtio_pci: Switch away from deprecated irq_set_affinity_hintJakub Sitnicki
Since commit 65c7cdedeb30 ("genirq: Provide new interfaces for affinity hints") irq_set_affinity_hint is being phased out. Switch to new interfaces for setting and applying irq affinity hints. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Message-Id: <20231025145319.380775-1-jakub@cloudflare.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
2023-11-01virtio_pci: move structure to a headerMichael S. Tsirkin
These are guest/host interfaces, so they belong in the header where e.g. qemu will know to find them. Note: we added a new structure as opposed to extending existing one because someone might be relying on the size of the existing structure staying unchanged. Add a warning to avoid using sizeof. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
2023-11-01vhost-vdpa: clean iotlb map during reset for older userspaceSi-Wei Liu
Using .compat_reset op from the previous patch, the buggy .reset behaviour can be kept as-is on older userspace apps, which don't ack the IOTLB_PERSIST backend feature. As this compatibility quirk is limited to those drivers that used to be buggy in the past, it won't affect change the behaviour or affect ABI on the setups with API compliant driver. The separation of .compat_reset from the regular .reset allows vhost-vdpa able to know which driver had broken behaviour before, so it can apply the corresponding compatibility quirk to the individual driver whenever needed. Compared to overloading the existing .reset with flags, .compat_reset won't cause any extra burden to the implementation of every compliant driver. [mst: squashed in two fixup commits] Message-Id: <1697880319-4937-6-git-send-email-si-wei.liu@oracle.com> Message-Id: <1698102863-21122-1-git-send-email-si-wei.liu@oracle.com> Reported-by: Dragos Tatulea <dtatulea@nvidia.com> Tested-by: Dragos Tatulea <dtatulea@nvidia.com> Message-Id: <1698275594-19204-1-git-send-email-si-wei.liu@oracle.com> Reported-by: Lei Yang <leiyang@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2023-11-01virtio_pci: add check for common cfg sizeXuan Zhuo
Some buggy devices, the common cfg size may not match the features. This patch checks the common cfg size for the features(VIRTIO_F_NOTIF_CONFIG_DATA, VIRTIO_F_RING_RESET). When the common cfg size does not match the corresponding feature, we fail the probe and print error message. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20231019034902.7346-1-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-11-01virtio_pci: add build offset check for the new common cfg itemsXuan Zhuo
Add checks to the check_offsets(void) for queue_notify_data and queue_reset. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20231010031120.81272-5-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-11-01virtio-balloon: correct the comment of virtballoon_migratepage()Xueshi Hu
After commit 68f2736a8583 ("mm: Convert all PageMovable users to movable_operations"), the execution path has been changed to move_to_new_folio movable_operations->migrate_page balloon_page_migrate balloon_page_migrate->balloon_page_migrate balloon_page_migrate Correct the outdated comment. Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com> Message-Id: <20230813140709.835536-1-xueshi.hu@smartx.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
2023-10-18virtio_pci: fix the common cfg map sizeXuan Zhuo
The function vp_modern_map_capability() takes the size parameter, which corresponds to the size of virtio_pci_common_cfg. As a result, this indicates the size of memory area to map. Now the size is the size of virtio_pci_common_cfg, but some feature(such as the _F_RING_RESET) needs the virtio_pci_modern_common_cfg, so this commit changes the size to the size of virtio_pci_modern_common_cfg. Cc: stable@vger.kernel.org Fixes: 0b50cece0b78 ("virtio_pci: introduce helper to get/set queue reset") Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20231010031120.81272-3-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2023-10-18virtio_balloon: Fix endless deflation and inflation on arm64Gavin Shan
The deflation request to the target, which isn't unaligned to the guest page size causes endless deflation and inflation actions. For example, we receive the flooding QMP events for the changes on memory balloon's size after a deflation request to the unaligned target is sent for the ARM64 guest, where we have 64KB base page size. /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host -cpu host \ -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \ -m 1024M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=512M \ -object memory-backend-ram,id=mem1,size=512M \ -numa node,nodeid=0,memdev=mem0,cpus=0-3 \ -numa node,nodeid=1,memdev=mem1,cpus=4-7 \ : \ -device virtio-balloon-pci,id=balloon0,bus=pcie.10 { "execute" : "balloon", "arguments": { "value" : 1073672192 } } {"return": {}} {"timestamp": {"seconds": 1693272173, "microseconds": 88667}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272174, "microseconds": 89704}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272175, "microseconds": 90819}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272176, "microseconds": 91961}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272177, "microseconds": 93040}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}} {"timestamp": {"seconds": 1693272178, "microseconds": 94117}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}} {"timestamp": {"seconds": 1693272179, "microseconds": 95337}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272180, "microseconds": 96615}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}} {"timestamp": {"seconds": 1693272181, "microseconds": 97626}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272182, "microseconds": 98693}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}} {"timestamp": {"seconds": 1693272183, "microseconds": 99698}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272184, "microseconds": 100727}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272185, "microseconds": 90430}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} {"timestamp": {"seconds": 1693272186, "microseconds": 102999}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073676288}} : <The similar QMP events repeat> Fix it by aligning the target up to the guest page size, 64KB in this specific case. With this applied, no flooding QMP events are observed and the memory balloon's size can be stablizied to 0x3ffe0000 soon after the deflation request is sent. { "execute" : "balloon", "arguments": { "value" : 1073672192 } } {"return": {}} {"timestamp": {"seconds": 1693273328, "microseconds": 793075}, \ "event": "BALLOON_CHANGE", "data": {"actual": 1073610752}} { "execute" : "query-balloon" } {"return": {"actual": 1073610752}} Cc: stable@vger.kernel.org Signed-off-by: Gavin Shan <gshan@redhat.com> Tested-by: Zhenyu Zhang <zhenyzha@redhat.com> Message-Id: <20230831011007.1032822-1-gshan@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com>
2023-10-18virtio-mmio: fix memory leak of vm_devMaximilian Heyne
With the recent removal of vm_dev from devres its memory is only freed via the callback virtio_mmio_release_dev. However, this only takes effect after device_add is called by register_virtio_device. Until then it's an unmanaged resource and must be explicitly freed on error exit. This bug was discovered and resolved using Coverity Static Analysis Security Testing (SAST) by Synopsys, Inc. Cc: stable@vger.kernel.org Fixes: 55c91fedd03d ("virtio-mmio: don't break lifecycle of vm_dev") Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20230911090328.40538-1-mheyne@amazon.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2023-10-04virtio_balloon: dynamically allocate the virtio-balloon shrinkerQi Zheng
In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the virtio-balloon shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct virtio_balloon. Link: https://lkml.kernel.org/r/20230911094444.68966-29-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-03virtio_ring: fix avail_wrap_counter in virtqueue_add_packedYuan Yao
In current packed virtqueue implementation, the avail_wrap_counter won't flip, in the case when the driver supplies a descriptor chain with a length equals to the queue size; total_sg == vq->packed.vring.num. Let’s assume the following situation: vq->packed.vring.num=4 vq->packed.next_avail_idx: 1 vq->packed.avail_wrap_counter: 0 Then the driver adds a descriptor chain containing 4 descriptors. We expect the following result with avail_wrap_counter flipped: vq->packed.next_avail_idx: 1 vq->packed.avail_wrap_counter: 1 But, the current implementation gives the following result: vq->packed.next_avail_idx: 1 vq->packed.avail_wrap_counter: 0 To reproduce the bug, you can set a packed queue size as small as possible, so that the driver is more likely to provide a descriptor chain with a length equal to the packed queue size. For example, in qemu run following commands: sudo qemu-system-x86_64 \ -enable-kvm \ -nographic \ -kernel "path/to/kernel_image" \ -m 1G \ -drive file="path/to/rootfs",if=none,id=disk \ -device virtio-blk,drive=disk \ -drive file="path/to/disk_image",if=none,id=rwdisk \ -device virtio-blk,drive=rwdisk,packed=on,queue-size=4,\ indirect_desc=off \ -append "console=ttyS0 root=/dev/vda rw init=/bin/bash" Inside the VM, create a directory and mount the rwdisk device on it. The rwdisk will hang and mount operation will not complete. This commit fixes the wrap counter error by flipping the packed.avail_wrap_counter, when start of descriptor chain equals to the end of descriptor chain (head == i). Fixes: 1ce9e6055fa0 ("virtio_ring: introduce packed ring support") Signed-off-by: Yuan Yao <yuanyaogoog@chromium.org> Message-Id: <20230808051110.3492693-1-yuanyaogoog@chromium.org> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_vdpa: build affinity masks conditionallyJason Wang
We try to build affinity mask via create_affinity_masks() unconditionally which may lead several issues: - the affinity mask is not used for parent without affinity support (only VDUSE support the affinity now) - the logic of create_affinity_masks() might not work for devices other than block. For example it's not rare in the networking device where the number of queues could exceed the number of CPUs. Such case breaks the current affinity logic which is based on group_cpus_evenly() who assumes the number of CPUs are not less than the number of groups. This can trigger a warning[1]: if (ret >= 0) WARN_ON(nr_present + nr_others < numgrps); Fixing this by only build the affinity masks only when - Driver passes affinity descriptor, driver like virtio-blk can make sure to limit the number of queues when it exceeds the number of CPUs - Parent support affinity setting config ops This help to avoid the warning. More optimizations could be done on top. [1] [ 682.146655] WARNING: CPU: 6 PID: 1550 at lib/group_cpus.c:400 group_cpus_evenly+0x1aa/0x1c0 [ 682.146668] CPU: 6 PID: 1550 Comm: vdpa Not tainted 6.5.0-rc5jason+ #79 [ 682.146671] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 682.146673] RIP: 0010:group_cpus_evenly+0x1aa/0x1c0 [ 682.146676] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc e8 1b c4 74 ff 48 89 ef e8 13 ac 98 ff 4c 89 e7 45 31 e4 e8 08 ac 98 ff eb c2 <0f> 0b eb b6 e8 fd 05 c3 00 45 31 e4 eb e5 cc cc cc cc cc cc cc cc [ 682.146679] RSP: 0018:ffffc9000215f498 EFLAGS: 00010293 [ 682.146682] RAX: 000000000001f1e0 RBX: 0000000000000041 RCX: 0000000000000000 [ 682.146684] RDX: ffff888109922058 RSI: 0000000000000041 RDI: 0000000000000030 [ 682.146686] RBP: ffff888109922058 R08: ffffc9000215f498 R09: ffffc9000215f4a0 [ 682.146687] R10: 00000000000198d0 R11: 0000000000000030 R12: ffff888107e02800 [ 682.146689] R13: 0000000000000030 R14: 0000000000000030 R15: 0000000000000041 [ 682.146692] FS: 00007fef52315740(0000) GS:ffff888237380000(0000) knlGS:0000000000000000 [ 682.146695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 682.146696] CR2: 00007fef52509000 CR3: 0000000110dbc004 CR4: 0000000000370ee0 [ 682.146698] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 682.146700] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 682.146701] Call Trace: [ 682.146703] <TASK> [ 682.146705] ? __warn+0x7b/0x130 [ 682.146709] ? group_cpus_evenly+0x1aa/0x1c0 [ 682.146712] ? report_bug+0x1c8/0x1e0 [ 682.146717] ? handle_bug+0x3c/0x70 [ 682.146721] ? exc_invalid_op+0x14/0x70 [ 682.146723] ? asm_exc_invalid_op+0x16/0x20 [ 682.146727] ? group_cpus_evenly+0x1aa/0x1c0 [ 682.146729] ? group_cpus_evenly+0x15c/0x1c0 [ 682.146731] create_affinity_masks+0xaf/0x1a0 [ 682.146735] virtio_vdpa_find_vqs+0x83/0x1d0 [ 682.146738] ? __pfx_default_calc_sets+0x10/0x10 [ 682.146742] virtnet_find_vqs+0x1f0/0x370 [ 682.146747] virtnet_probe+0x501/0xcd0 [ 682.146749] ? vp_modern_get_status+0x12/0x20 [ 682.146751] ? get_cap_addr.isra.0+0x10/0xc0 [ 682.146754] virtio_dev_probe+0x1af/0x260 [ 682.146759] really_probe+0x1a5/0x410 Fixes: 3dad56823b53 ("virtio-vdpa: Support interrupt affinity spreading mechanism") Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230811091539.1359865-1-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: introduce dma sync api for virtqueueXuan Zhuo
These API has been introduced: * virtqueue_dma_need_sync * virtqueue_dma_sync_single_range_for_cpu * virtqueue_dma_sync_single_range_for_device These APIs can be used together with the premapped mechanism to sync the DMA address. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20230810123057.43407-12-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: introduce dma map api for virtqueueXuan Zhuo
Added virtqueue_dma_map_api* to map DMA addresses for virtual memory in advance. The purpose is to keep memory mapped across multiple add/get buf operations. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20230810123057.43407-11-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: introduce virtqueue_reset()Xuan Zhuo
Introduce virtqueue_reset() to release all buffer inside vq. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230810123057.43407-10-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: separate the logic of reset/enable from virtqueue_resizeXuan Zhuo
The subsequent reset function will reuse these logic. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230810123057.43407-9-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: correct the expression of the description of virtqueue_resize()Xuan Zhuo
Modify the "useless" to a more accurate "unused". Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230810123057.43407-8-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: skip unmap for premappedXuan Zhuo
Now we add a case where we skip dma unmap, the vq->premapped is true. We can't just rely on use_dma_api to determine whether to skip the dma operation. For convenience, I introduced the "do_unmap". By default, it is the same as use_dma_api. If the driver is configured with premapped, then do_unmap is false. So as long as do_unmap is false, for addr of desc, we should skip dma unmap operation. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20230810123057.43407-7-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: introduce virtqueue_dma_dev()Xuan Zhuo
Added virtqueue_dma_dev() to get DMA device for virtio. Then the caller can do dma operation in advance. The purpose is to keep memory mapped across multiple add/get buf operations. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230810123057.43407-6-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: support add premapped bufXuan Zhuo
If the vq is the premapped mode, use the sg_dma_address() directly. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20230810123057.43407-5-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: introduce virtqueue_set_dma_premapped()Xuan Zhuo
This helper allows the driver change the dma mode to premapped mode. Under the premapped mode, the virtio core do not do dma mapping internally. This just work when the use_dma_api is true. If the use_dma_api is false, the dma options is not through the DMA APIs, that is not the standard way of the linux kernel. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20230810123057.43407-4-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: put mapping error check in vring_map_one_sgXuan Zhuo
This patch put the dma addr error check in vring_map_one_sg(). The benefits of doing this: 1. reduce one judgment of vq->use_dma_api. 2. make vring_map_one_sg more simple, without calling vring_mapping_error to check the return value. simplifies subsequent code Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230810123057.43407-3-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-09-03virtio_ring: check use_dma_api before unmap desc for indirectXuan Zhuo
Inside detach_buf_split(), if use_dma_api is false, vring_unmap_one_split_indirect will be called many times, but actually nothing is done. So this patch check use_dma_api firstly. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230810123057.43407-2-xuanzhuo@linux.alibaba.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-08-10virtio-mem: check if the config changed before fake offlining memoryDavid Hildenbrand
If we repeatedly fail to fake offline memory to unplug it, we won't be sending any unplug requests to the device. However, we only check if the config changed when sending such (un)plug requests. We could end up trying for a long time to unplug memory, even though the config changed already and we're not supposed to unplug memory anymore. For example, the hypervisor might detect a low-memory situation while unplugging memory and decide to replug some memory. Continuing trying to unplug memory in that case can be problematic. So let's check on a more regular basis. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20230713145551.2824980-5-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-08-10virtio-mem: keep retrying on offline_and_remove_memory() errors in Sub Block ↵David Hildenbrand
Mode (SBM) In case offline_and_remove_memory() fails in SBM, we leave a completely unplugged Linux memory block stick around until we try plugging memory again. We won't try removing that memory block again. offline_and_remove_memory() may, for example, fail if we're racing with another alloc_contig_range() user, if allocating temporary memory fails, or if some memory notifier rejected the offlining request. Let's handle that case better, by simple retrying to offline and remove such memory. Tested using CONFIG_MEMORY_NOTIFIER_ERROR_INJECT. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20230713145551.2824980-4-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-08-10virtio-mem: convert most offline_and_remove_memory() errors to -EBUSYDavid Hildenbrand
Just like we do with alloc_contig_range(), let's convert all unknown errors to -EBUSY, but WARN so we can look into the issue. For example, offline_pages() could fail with -EINTR, which would be unexpected in our case. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20230713145551.2824980-3-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-08-10virtio-mem: remove unsafe unplug in Big Block Mode (BBM)David Hildenbrand
When "unsafe unplug" is enabled, we don't fake-offline all memory ahead of actual memory offlining using alloc_contig_range(). Instead, we rely on offline_pages() to also perform actual page migration, which might fail or take a very long time. In that case, it's possible to easily run into endless loops that cannot be aborted anymore (as offlining is triggered by a workqueue then): For example, a single (accidentally) permanently unmovable page in ZONE_MOVABLE results in an endless loop. For ZONE_NORMAL, races between isolating the pageblock (and checking for unmovable pages) and concurrent page allocation are possible and similarly result in endless loops. The idea of the unsafe unplug mode was to make it possible to more reliably unplug large memory blocks. However, (a) we really should be tackling that differently, by extending the alloc_contig_range()-based mechanism; and (b) this mode is not the default and as far as I know, it's unused either way. So let's simply get rid of it. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20230713145551.2824980-2-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-08-10virtio-vdpa: Fix cpumask memory leak in virtio_vdpa_find_vqs()Gal Pressman
Free the cpumask allocated by create_affinity_masks() before returning from the function. Fixes: 3dad56823b53 ("virtio-vdpa: Support interrupt affinity spreading mechanism") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Message-Id: <20230726191036.14324-1-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
2023-08-10virtio-pci: Fix legacy device flag setting error in probeFeng Liu
The 'is_legacy' flag is used to differentiate between legacy vs modern device. Currently, it is based on the value of vp_dev->ldev.ioaddr. However, due to the shared memory of the union between struct virtio_pci_legacy_device and struct virtio_pci_modern_device, when virtio_pci_modern_probe modifies the content of struct virtio_pci_modern_device, it affects the content of struct virtio_pci_legacy_device, and ldev.ioaddr is no longer zero, causing the 'is_legacy' flag to be set as true. To resolve issue, when legacy device is probed, mark 'is_legacy' as true, when modern device is probed, keep 'is_legacy' as false. Fixes: 4f0fc22534e3 ("virtio_pci: Optimize virtio_pci_device structure size") Signed-off-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Message-Id: <20230719154550.79536-1-feliu@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com>
2023-08-10virtio-mmio: don't break lifecycle of vm_devWolfram Sang
vm_dev has a separate lifecycle because it has a 'struct device' embedded. Thus, having a release callback for it is correct. Allocating the vm_dev struct with devres totally breaks this protection, though. Instead of waiting for the vm_dev release callback, the memory is freed when the platform_device is removed. Resulting in a use-after-free when finally the callback is to be called. To easily see the problem, compile the kernel with CONFIG_DEBUG_KOBJECT_RELEASE and unbind with sysfs. The fix is easy, don't use devres in this case. Found during my research about object lifetime problems. Fixes: 7eb781b1bbb7 ("virtio_mmio: add cleanup for virtio_mmio_probe") Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Message-Id: <20230629120526.7184-1-wsa+renesas@sang-engineering.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-06-27virtio: allow caller to override device DMA mask in vp_modernShannon Nelson
To add a bit of vendor flexibility with various virtio based devices, allow the caller to specify a different DMA mask. This adds a dma_mask field to struct virtio_pci_modern_device. If defined by the driver, this mask will be used in a call to dma_set_mask_and_coherent() instead of the traditional DMA_BIT_MASK(64). This allows limiting the DMA space on vendor devices with address limitations. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230519215632.12343-3-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>