summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-10-07Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
2022-10-07kunit: rename base KUNIT_ASSERTION macro to _KUNIT_FAILEDDaniel Latypov
Context: Currently this macro's name, KUNIT_ASSERTION conflicts with the name of an enum whose values are {KUNIT_EXPECTATION, KUNIT_ASSERTION}. It's hard to think of a better name for the enum, so rename this macro. It's also a bit strange that the macro might do nothing depending on the boolean argument `pass`. Why not have callers check themselves? This patch: Moves the pass/fail checking into the callers of KUNIT_ASSERTION, so now we only call it when the check has failed. Then we rename the macro the _KUNIT_FAILED() to reflect the new semantics. Signed-off-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-07kunit: remove format func from struct kunit_assert, get it to 0 bytesDaniel Latypov
Each calll to a KUNIT_EXPECT_*() macro creates a local variable which contains a struct kunit_assert. Normally, we'd hope the compiler would be able to optimize this away, but we've seen cases where it hasn't, see https://groups.google.com/g/kunit-dev/c/i3fZXgvBrfA/m/GbrMNej2BAAJ. In changes like commit 21957f90b28f ("kunit: split out part of kunit_assert into a static const"), we've moved more and more parts out of struct kunit_assert and its children types (kunit_binary_assert). This patch removes the final field and gets us to: sizeof(struct kunit_assert) == 0 sizeof(struct kunit_binary_assert) == 24 (on UML x86_64). This also reduces the amount of macro plumbing going on at the cost of passing in one more arg to the base KUNIT_ASSERTION macro and kunit_do_failed_assertion(). Signed-off-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-07kunit: tool: Don't download risc-v opensbi firmware with wgetDavid Gow
When running a RISC-V test kernel under QEMU, we need an OpenSBI BIOS file. In the original QEMU support patchset, kunit_tool would optionally download this file from GitHub if it didn't exist, using wget. These days, it can usually be found in the distro's qemu-system-riscv package, and is located in /usr/share/qemu on all the distros I tried (Debian, Arch, OpenSUSE). Use this file, and thereby don't do any downloading in kunit_tool. In addition, we used to shell out to whatever 'wget' was in the path, which could have potentially been used to trick the developer into running another binary. By not using wget at all, we nicely sidestep this issue. Cc: Xu Panda <xu.panda@zte.com.cn> Fixes: 87c9c1631788 ("kunit: tool: add support for QEMU") Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: David Gow <davidgow@google.com> Tested-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-07kunit: make kunit_kfree(NULL) a no-op to match kfree()Daniel Latypov
The real kfree() function will silently return when given a NULL. So a user might reasonably think they can write the following code: char *buffer = NULL; if (param->use_buffer) buffer = kunit_kzalloc(test, 10, GFP_KERNEL); ... kunit_kfree(test, buffer); As-is, kunit_kfree() will mark the test as FAILED when buffer is NULL. (And in earlier times, it would segfault). Let's match the semantics of kfree(). Suggested-by: David Gow <davidgow@google.com> Signed-off-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: David Gow <davidgow@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-07kunit: make kunit_kfree() not segfault on invalid inputsDaniel Latypov
kunit_kfree() can only work on data ("resources") allocated by KUnit. Currently for code like this, > void *ptr = kmalloc(4, GFP_KERNEL); > kunit_kfree(test, ptr); kunit_kfree() will segfault. It'll try and look up the kunit_resource associated with `ptr` and get a NULL back, but it won't check for this. This means we also segfault if you double-free. Change kunit_kfree() so it'll notice these invalid pointers and respond by failing the test. Implementation: kunit_destroy_resource() does what kunit_kfree() does, but is more generic and returns -ENOENT when it can't find the resource. Sadly, unlike just letting it crash, this means we don't get a stack trace. But kunit_kfree() is so infrequently used it shouldn't be hard to track down the bad callsite anyways. After this change, the above code gives: > # example_simple_test: EXPECTATION FAILED at lib/kunit/test.c:702 > kunit_kfree: 00000000626ec200 already freed or not allocated by kunit Signed-off-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: David Gow <davidgow@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-07kunit: make kunit_kfree() only work on pointers from kunit_malloc() and friendsDaniel Latypov
kunit_kfree() exists to clean up allocations from kunit_kmalloc() and friends early instead of waiting for this to happen automatically at the end of the test. But it can be used on *anything* registered with the kunit resource API. E.g. the last 2 statements are equivalent: struct kunit_resource *res = something(); kfree(res->data); kunit_put_resource(res); The problem is that there could be multiple resources that point to the same `data`. E.g. you can have a named resource acting as a pseudo-global variable in a test. If you point it to data allocated with kunit_kmalloc(), then calling `kunit_kfree(ptr)` has the chance to delete either the named resource or to kfree `ptr`. Which one it does depends on the order the resources are registered as kunit_kfree() will delete resources in LIFO order. So this patch restricts kunit_kfree() to only working on resources created by kunit_kmalloc(). Calling it is therefore guaranteed to free the memory, not do anything else. Note: kunit_resource_instance_match() wasn't used outside of KUnit, so it should be safe to remove from the public interface. It's also generally dangerous, as shown above, and shouldn't be used. Signed-off-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: David Gow <davidgow@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-07kunit: drop test pointer in string_stream_fragmentDaniel Latypov
We already store the `struct kunit *test` in the string_stream object itself, so we need don't need to store a copy of this pointer in every fragment in the stream. Drop it, getting string_stream_fragment down the bare minimum: a list_head and the `char *` with the actual fragment. Signed-off-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: David Gow <davidgow@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-07kunit: string-stream: Simplify resource useDavid Gow
Currently, KUnit's string streams are themselves "KUnit resources". This is redundant since the stream itself is already allocated with kunit_kzalloc() and will thus be freed automatically at the end of the test. string-stream is only used internally within KUnit, and isn't using the extra features that resources provide like reference counting, being able to locate them dynamically as "test-local variables", etc. Indeed, the resource's refcount is never incremented when the pointer is returned. The fact that it's always manually destroyed is more evidence that the reference counting is unused. Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-07Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - Add supported for more directly managed task_work running. This is beneficial for real world applications that end up issuing lots of system calls as part of handling work. Normal task_work will always execute as we transition in and out of the kernel, even for "unrelated" system calls. It's more efficient to defer the handling of io_uring's deferred work until the application wants it to be run, generally in batches. As part of ongoing work to write an io_uring network backend for Thrift, this has been shown to greatly improve performance. (Dylan) - Add IOPOLL support for passthrough (Kanchan) - Improvements and fixes to the send zero-copy support (Pavel) - Partial IO handling fixes (Pavel) - CQE ordering fixes around CQ ring overflow (Pavel) - Support sendto() for non-zc as well (Pavel) - Support sendmsg for zerocopy (Pavel) - Networking iov_iter fix (Stefan) - Misc fixes and cleanups (Pavel, me) * tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...
2022-10-07Merge tag 'fs-for_v6.1-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, udf, reiserfs, and quota updates from Jan Kara: - Fix for udf to make splicing work again - More disk format sanity checks for ext2 to avoid crashes found by syzbot - More quota disk format checks to avoid crashes found by fuzzing - Reiserfs & isofs cleanups * tag 'fs-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Add more checking after reading from quota file quota: Replace all block number checking with helper function quota: Check next/prev free block number after reading from quota file ext2: Use kvmalloc() for group descriptor array ext2: Add sanity checks for group and filesystem size udf: Support splicing to file isofs: delete unnecessary checks before brelse() fs/reiserfs: replace ternary operator with min() and min_t()
2022-10-07Merge tag 'fsnotify-for_v6.1-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "Two cleanups for fsnotify code" * tag 'fsnotify-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: Remove obsoleted fanotify_event_has_path() fsnotify: remove unused declaration
2022-10-07Merge tag '6.1-rc-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull ksmbd updates from Steve French: - RDMA (smbdirect) fixes - fixes for SMB3.1.1 POSIX Extensions (especially for id mapping) - various casemapping fixes for mount and lookup - UID mapping fixes - fix confusing error message - protocol negotiation fixes, including NTLMSSP fix - two encryption fixes - directory listing fix - some cleanup fixes * tag '6.1-rc-ksmbd-fixes' of git://git.samba.org/ksmbd: (24 commits) ksmbd: validate share name from share config response ksmbd: call ib_drain_qp when disconnected ksmbd: make utf-8 file name comparison work in __caseless_lookup() ksmbd: Fix user namespace mapping ksmbd: hide socket error message when ipv6 config is disable ksmbd: reduce server smbdirect max send/receive segment sizes ksmbd: decrease the number of SMB3 smbdirect server SGEs ksmbd: Fix wrong return value and message length check in smb2_ioctl() ksmbd: set NTLMSSP_NEGOTIATE_SEAL flag to challenge blob ksmbd: fix encryption failure issue for session logoff response ksmbd: fix endless loop when encryption for response fails ksmbd: fill sids in SMB_FIND_FILE_POSIX_INFO response ksmbd: set file permission mode to match Samba server posix extension behavior ksmbd: change security id to the one samba used for posix extension ksmbd: update documentation ksmbd: casefold utf-8 share names and fix ascii lowercase conversion ksmbd: port to vfs{g,u}id_t and associated helpers ksmbd: fix incorrect handling of iterate_dir MAINTAINERS: remove Hyunchul Lee from ksmbd maintainers MAINTAINERS: Add Tom Talpey as ksmbd reviewer ...
2022-10-07Merge tag 'nand/for-6.1' into mtd/nextMiquel Raynal
Raw NAND core changes: * Replace of_gpio_named_count() by gpiod_count() - Remove misguided comment of nand_get_device() - bbt: Use the bitmap API to allocate bitmaps Raw NAND controller drivers changes: * Meson: - Stop supporting legacy clocks - Refine resource getting in probe - Convert bindings to yaml - Fix clock handling and update the bindings accordingly - Fix bit map use in meson_nfc_ecc_correct() * bcm47xx: - Fix spelling typo in comment * STM32 FMC2: - Switch to using devm_fwnode_gpiod_get() - Fix dma_map_sg error check * Cadence: - Remove an unneeded result variable * Marvell: - Fix error handle regarding dma_map_sg * Orion: - Use devm_clk_get_optional() * Cafe: - Use correct function name in comment block * Atmel: - Unmap streaming DMA mappings * Arasan: - Stop using 0 as NULL pointer * GPMI: - Fix typo 'the the' in comment * BRCM: - Add individual glue driver selection - Move Kconfig to driver folder * FSL: Fix none ECC mode * Intel: - Use devm_platform_ioremap_resource_byname() - Remove unused clk_rate member from struct ebu_nand - Remove unused nand_pa member from ebu_nand_cs - Don't re-define NAND_DATA_IFACE_CHECK_ONLY - Remove undocumented compatible string - Fix compatible string in the bindings - Read the chip-select line from the correct OF node - Fix maximum chip select value in the bindings Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2022-10-07xen/virtio: use dom0 as default backend for CONFIG_XEN_VIRTIO_FORCE_GRANTJuergen Gross
With CONFIG_XEN_VIRTIO_FORCE_GRANT set the default backend domid to 0, enabling to use xen_grant_dma_ops for those devices. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2022-10-07xen/virtio: restructure xen grant dma setupJuergen Gross
In order to prepare supporting other means than device tree for setting up virtio devices under Xen, restructure the functions xen_is_grant_dma_device() and xen_grant_setup_dma_ops() a little bit. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 only Acked-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2022-10-07vfio: Make the group FD disassociate from the iommu_groupJason Gunthorpe
Allow the vfio_group struct to exist with a NULL iommu_group pointer. When the pointer is NULL the vfio_group users promise not to touch the iommu_group. This allows a driver to be hot unplugged while userspace is keeping the group FD open. Remove all the code waiting for the group FD to close. This fixes a userspace regression where we learned that virtnodedevd leaves a group FD open even though the /dev/ node for it has been deleted and all the drivers for it unplugged. Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group") Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-10-07vfio: Hold a reference to the iommu_group in kvm for SPAPRJason Gunthorpe
SPAPR exists completely outside the normal iommu driver framework, the groups it creates are fake and are only created to enable VFIO's uAPI. Thus, it does not need to follow the iommu core rule that the iommu_group will only be touched while a driver is attached. Carry a group reference into KVM and have KVM directly manage the lifetime of this object independently of VFIO. This means KVM no longer relies on the vfio group file being valid to maintain the group reference. Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-10-07vfio: Add vfio_file_is_group()Jason Gunthorpe
This replaces uses of vfio_file_iommu_group() which were only detecting if the file is a VFIO file with no interest in the actual group. The only remaning user of vfio_file_iommu_group() is in KVM for the SPAPR stuff. It passes the iommu_group into the arch code through kvm for some reason. Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-10-07virtio_blk: add SECURE ERASE command supportAlvaro Karsz
Support for the VIRTIO_BLK_F_SECURE_ERASE VirtIO feature. A device that offers this feature can receive VIRTIO_BLK_T_SECURE_ERASE commands. A device which supports this feature has the following fields in the virtio config: - max_secure_erase_sectors - max_secure_erase_seg - secure_erase_sector_alignment max_secure_erase_sectors and secure_erase_sector_alignment are expressed in 512-byte units. Every secure erase command has the following fields: - sectors: The starting offset in 512-byte units. - num_sectors: The number of sectors. Signed-off-by: Alvaro Karsz <alvaro.karsz@solid-run.com> Message-Id: <20220921082729.2516779-1-alvaro.karsz@solid-run.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-10-07vp_vdpa: support feature provisioningJason Wang
This patch allows the device features to be provisioned via netlink. This is done by: 1) validating the provisioned features to be a subset of the parent features. 2) clearing the features that is not wanted by the userspace For example: # vdpa mgmtdev show pci/0000:02:00.0: supported_classes net max_supported_vqs 3 dev_features CSUM GUEST_CSUM CTRL_GUEST_OFFLOADS MAC GUEST_TSO4 GUEST_TSO6 GUEST_ECN GUEST_UFO HOST_TSO4 HOST_TSO6 HOST_ECN HOST_UFO MRG_RXBUF STATUS CTRL_VQ CTRL_RX CTRL_VLAN CTRL_RX_EXTRA GUEST_ANNOUNCE CTRL_MAC_ADDR RING_INDIRECT_DESC RING_EVENT_IDX VERSION_1 ACCESS_PLATFORM 1) provision vDPA device with all features that are supported by the virtio-pci # vdpa dev add name dev1 mgmtdev pci/0000:02:00.0 # vdpa dev config show dev1: mac 52:54:00:12:34:56 link up link_announce false mtu 65535 negotiated_features CSUM GUEST_CSUM CTRL_GUEST_OFFLOADS MAC GUEST_TSO4 GUEST_TSO6 GUEST_ECN GUEST_UFO HOST_TSO4 HOST_TSO6 HOST_ECN HOST_UFO MRG_RXBUF STATUS CTRL_VQ CTRL_RX CTRL_VLAN GUEST_ANNOUNCE CTRL_MAC_ADDR RING_INDIRECT_DESC RING_EVENT_IDX VERSION_1 ACCESS_PLATFORM 2) provision vDPA device with a subset of the features # vdpa dev add name dev1 mgmtdev pci/0000:02:00.0 device_features 0x300020000 # dev1: mac 52:54:00:12:34:56 link up link_announce false mtu 65535 negotiated_features CTRL_VQ VERSION_1 ACCESS_PLATFORM Reviewed-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20220927074810.28627-4-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-10-07vdpa_sim_net: support feature provisioningJason Wang
This patch implements features provisioning for vdpa_sim_net. 1) validating the provisioned features to be a subset of the parent features. 2) clearing the features that is not wanted by the userspace For example: vdpasim_net: supported_classes net max_supported_vqs 3 dev_features MTU MAC CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT VERSION_1 ACCESS_PLATFORM 1) provision vDPA device with all features that are supported by the net simulator dev1: mac 00:00:00:00:00:00 link up link_announce false mtu 1500 negotiated_features MTU MAC CTRL_VQ CTRL_MAC_ADDR VERSION_1 ACCESS_PLATFORM 2) provision vDPA device with a subset of the features dev1: mac 00:00:00:00:00:00 link up link_announce false mtu 1500 negotiated_features CTRL_VQ VERSION_1 ACCESS_PLATFORM Reviewed-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20220927074810.28627-3-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2022-10-07vdpa: device feature provisioningJason Wang
This patch allows the device features to be provisioned through netlink. A new attribute is introduced to allow the userspace to pass a 64bit device features during device adding. This provides several advantages: - Allow to provision a subset of the features to ease the cross vendor live migration. - Better debug-ability for vDPA framework and parent. Reviewed-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20220927074810.28627-2-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-10-07virtio-net: use mtu size as buffer length for big packetsGavin Li
Currently add_recvbuf_big() allocates MAX_SKB_FRAGS segments for big packets even when GUEST_* offloads are not present on the device. However, if guest GSO is not supported, it would be sufficient to allocate segments to cover just up the MTU size and no further. Allocating the maximum amount of segments results in a large waste of buffer space in the queue, which limits the number of packets that can be buffered and can result in reduced performance. Therefore, if guest GSO is not supported, use the MTU to calculate the optimal amount of segments required. Below is the iperf TCP test results over a Mellanox NIC, using vDPA for 1 VQ, queue size 1024, before and after the change, with the iperf server running over the virtio-net interface. MTU(Bytes)/Bandwidth (Gbit/s) Before After 1500 22.5 22.4 9000 12.8 25.9 And result of queue size 256. MTU(Bytes)/Bandwidth (Gbit/s) Before After 9000 2.15 11.9 With this patch no degradation is observed with multiple below tests and feature bit combinations. Results are summarized below for q depth of 1024. Interface MTU is 1500 if MTU feature is disabled. MTU is set to 9000 in other tests. Features/ Bandwidth (Gbit/s) Before After mtu off 20.1 20.2 mtu/indirect on 17.4 17.3 mtu/indirect/packed on 17.2 17.2 Signed-off-by: Gavin Li <gavinl@nvidia.com> Reviewed-by: Gavi Teitz <gavi@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Message-Id: <20220914144911.56422-3-gavinl@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2022-10-07virtio-net: introduce and use helper function for guest gso support checksGavin Li
Probe routine is already several hundred lines. Use helper function for guest gso support check. Signed-off-by: Gavin Li <gavinl@nvidia.com> Reviewed-by: Gavi Teitz <gavi@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Message-Id: <20220914144911.56422-2-gavinl@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-10-07virtio: drop vp_legacy_set_queue_sizeMichael S. Tsirkin
There's actually no way to set queue size on legacy virtio pci. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <20220815220447.155860-1-mst@redhat.com>
2022-10-07virtio_ring: make vring_alloc_queue_packed prettierDeming Wang
Add some spaces to vring_alloc_queue(make it look prettier). Signed-off-by: Deming Wang <wangdeming@inspur.com> Message-Id: <20220926183306.4535-1-wangdeming@inspur.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-10-07virtio_ring: split: Operators use unified styleDeming Wang
The operators of vring_alloc_queue_split should use the unified style.Add space for the '|' ,make it be looked more pretty. Signed-off-by: Deming Wang <wangdeming@inspur.com> Message-Id: <20220926022202.1516-1-wangdeming@inspur.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-10-07vhost: add __init/__exit annotations to module init/exit funcsXiu Jianfeng
Add missing __init/__exit annotations to module init/exit funcs. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Message-Id: <20220917083803.21521-1-xiujianfeng@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-10-07net/9p: clarify trans_fd parse_opt failure handlingLi Zhong
This parse_opts will set invalid opts.rfd/wfd in case of failure which we already check, but it is not clear for readers that parse_opts error are handled in p9_fd_create: clarify this by explicitely checking the return value. Link: https://lkml.kernel.org/r/20220921210921.1654735-1-floridsleeves@gmail.com Signed-off-by: Li Zhong <floridsleeves@gmail.com> [Dominique: reworded commit message to clarify this is NOOP] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-10-07net/9p: add __init/__exit annotations to module init/exit funcsXiu Jianfeng
xen transport was missing annotations Link: https://lkml.kernel.org/r/20220909103546.73015-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-10-07net/9p: use a dedicated spinlock for trans_fdDominique Martinet
Shamelessly copying the explanation from Tetsuo Handa's suggested patch[1] (slightly reworded): syzbot is reporting inconsistent lock state in p9_req_put()[2], for p9_tag_remove() from p9_req_put() from IRQ context is using spin_lock_irqsave() on "struct p9_client"->lock but trans_fd (not from IRQ context) is using spin_lock(). Since the locks actually protect different things in client.c and in trans_fd.c, just replace trans_fd.c's lock by a new one specific to the transport (client.c's protect the idr for fid/tag allocations, while trans_fd.c's protects its own req list and request status field that acts as the transport's state machine) Link: https://lore.kernel.org/r/20220904112928.1308799-1-asmadeus@codewreck.org Link: https://lkml.kernel.org/r/2470e028-9b05-2013-7198-1fdad071d999@I-love.SAKURA.ne.jp [1] Link: https://syzkaller.appspot.com/bug?extid=2f20b523930c32c160cc [2] Reported-by: syzbot <syzbot+2f20b523930c32c160cc@syzkaller.appspotmail.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-10-07KVM: PPC: Book3S HV: Fix stack frame regs markerNicholas Piggin
The hard-coded marker is out of date now, fix it using the nice define. Fixes: 17773afdcd15 ("powerpc/64: use 32-bit immediate for STACK_FRAME_REGS_MARKER") Reported-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221006143345.129077-1-npiggin@gmail.com
2022-10-07watchdog: twl4030_wdt: add missing mod_devicetable.h includeDmitry Torokhov
The driver is using of_device_id and therefore needs to include mod_devicetable.h header. We used to get this definition indirectly via inclusion of matrix_keypad.h from twl.h, but we are cleaning up matrix_keypad.h from unnecessary includes. Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20220927154611.3330871-1-dmitry.torokhov@gmail.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
2022-10-07xen/pcifront: move xenstore config scanning into sub-functionJuergen Gross
pcifront_try_connect() and pcifront_attach_devices() share a large chunk of duplicated code for reading the config information from Xenstore, which only differs regarding calling pcifront_rescan_root() or pcifront_scan_root(). Put that code into a new sub-function. It is fine to always call pcifront_rescan_root() from that common function, as it will fallback to pcifront_scan_root() if the domain/bus combination isn't known yet (and pcifront_scan_root() should never be called for an already known domain/bus combination anyway). In order to avoid duplicate messages for the fallback case move the check for domain/bus not known to the beginning of pcifront_rescan_root(). While at it fix the error reporting in case the root-xx node had the wrong format. As the return value of pcifront_try_connect() and pcifront_attach_devices() are not used anywhere make those functions return void. As an additional bonus this removes the dubious return of -EFAULT in case of an unexpected driver state. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2022-10-06riscv: enable THP_SWAP for RV64Jisheng Zhang
I have a Sipeed Lichee RV dock board which only has 512MB DDR, so memory optimizations such as swap on zram are helpful. As is seen in commit d0637c505f8a ("arm64: enable THP_SWAP for arm64") and commit bd4c82c22c367e ("mm, THP, swap: delay splitting THP after swapped out"), THP_SWAP can improve the swap throughput significantly. Enable THP_SWAP for RV64, testing the micro-benchmark which is introduced by commit d0637c505f8a ("arm64: enable THP_SWAP for arm64") shows below numbers on the Lichee RV dock board: swp out bandwidth w/o patch: 66908 bytes/ms (mean of 10 tests) swp out bandwidth w/ patch: 322638 bytes/ms (mean of 10 tests) Improved by 382%! Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20220829145742.3139-1-jszhang@kernel.org/ Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-10-06RISC-V: Print SSTC in canonical orderPalmer Dabbelt
This got out of order during a merge conflict, fix it by putting the entries in the correct order. Fixes: 7ab52f75a9cf ("RISC-V: Add Sstc extension support") Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220920204518.10988-1-palmer@rivosinc.com/ Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-10-07Revert "drm/sched: Use parent fence instead of finished"Dave Airlie
This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86. This is causing instability on Linus' desktop, and I'm seeing oops with VK CTS runs. netconsole got me the following oops: [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088 [ 1234.778782] #PF: supervisor read access in kernel mode [ 1234.778787] #PF: error_code(0x0000) - not-present page [ 1234.778791] PGD 0 P4D 0 [ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2 [ 1234.778809] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020 [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 00 f0 [ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087 [ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018 [ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000 [ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808 [ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00 [ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458 [ 1234.778912] FS: 00007f35e7008580(0000) GS:ffff95428ebc0000(0000) knlGS:0000000000000000 [ 1234.778916] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0 [ 1234.778924] Call Trace: [ 1234.778981] <IRQ> [ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0 [ 1234.778999] dma_fence_signal+0x2c/0x50 [ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu] [ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu] [ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu] [ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu] [ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu] [ 1234.779940] __handle_irq_event_percpu+0x46/0x190 [ 1234.779946] handle_irq_event+0x34/0x70 [ 1234.779949] handle_edge_irq+0x9f/0x240 [ 1234.779954] __common_interrupt+0x66/0x100 [ 1234.779960] common_interrupt+0xa0/0xc0 [ 1234.779965] </IRQ> [ 1234.779968] <TASK> [ 1234.779971] asm_common_interrupt+0x22/0x40 [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110 [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 83 ea [ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202 Revert it for now and figure it out later. Signed-off-by: Dave Airlie <airlied@redhat.com>
2022-10-079p/trans_fd: always use O_NONBLOCK read/writeTetsuo Handa
syzbot is reporting hung task at p9_fd_close() [1], for p9_mux_poll_stop() from p9_conn_destroy() from p9_fd_close() is failing to interrupt already started kernel_read() from p9_fd_read() from p9_read_work() and/or kernel_write() from p9_fd_write() from p9_write_work() requests. Since p9_socket_open() sets O_NONBLOCK flag, p9_mux_poll_stop() does not need to interrupt kernel_read()/kernel_write(). However, since p9_fd_open() does not set O_NONBLOCK flag, but pipe blocks unless signal is pending, p9_mux_poll_stop() needs to interrupt kernel_read()/kernel_write() when the file descriptor refers to a pipe. In other words, pipe file descriptor needs to be handled as if socket file descriptor. We somehow need to interrupt kernel_read()/kernel_write() on pipes. A minimal change, which this patch is doing, is to set O_NONBLOCK flag from p9_fd_open(), for O_NONBLOCK flag does not affect reading/writing of regular files. But this approach changes O_NONBLOCK flag on userspace- supplied file descriptors (which might break userspace programs), and O_NONBLOCK flag could be changed by userspace. It would be possible to set O_NONBLOCK flag every time p9_fd_read()/p9_fd_write() is invoked, but still remains small race window for clearing O_NONBLOCK flag. If we don't want to manipulate O_NONBLOCK flag, we might be able to surround kernel_read()/kernel_write() with set_thread_flag(TIF_SIGPENDING) and recalc_sigpending(). Since p9_read_work()/p9_write_work() works are processed by kernel threads which process global system_wq workqueue, signals could not be delivered from remote threads when p9_mux_poll_stop() from p9_conn_destroy() from p9_fd_close() is called. Therefore, calling set_thread_flag(TIF_SIGPENDING)/recalc_sigpending() every time would be needed if we count on signals for making kernel_read()/kernel_write() non-blocking. Link: https://lkml.kernel.org/r/345de429-a88b-7097-d177-adecf9fed342@I-love.SAKURA.ne.jp Link: https://syzkaller.appspot.com/bug?extid=8b41a1365f1106fd0f33 [1] Reported-by: syzbot <syzbot+8b41a1365f1106fd0f33@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+8b41a1365f1106fd0f33@syzkaller.appspotmail.com> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> [Dominique: add comment at Christian's suggestion] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-10-06Merge tag 'iomap-6.1-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap updates from Darrick Wong: "It's pretty quiet this time around -- a UAF bugfix and a new tracepoint so we can watch file writeback: - Fix a UAF bug when recording writeback mapping errors - Add a tracepoint so that we can monitor writeback mappings" * tag 'iomap-6.1-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: add a tracepoint for mappings returned by map_blocks iomap: iomap: fix memory corruption when recording errors during writeback
2022-10-06Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The first two changes involve files outside of fs/ext4: - submit_bh() can never return an error, so change it to return void, and remove the unused checks from its callers - fix I_DIRTY_TIME handling so it will be set even if the inode already has I_DIRTY_INODE Performance: - Always enable i_version counter (as btrfs and xfs already do). Remove some uneeded i_version bumps to avoid unnecessary nfs cache invalidations - Wake up journal waiters in FIFO order, to avoid some journal users from not getting a journal handle for an unfairly long time - In ext4_write_begin() allocate any necessary buffer heads before starting the journal handle - Don't try to prefetch the block allocation bitmaps for a read-only file system Bug Fixes: - Fix a number of fast commit bugs, including resources leaks and out of bound references in various error handling paths and/or if the fast commit log is corrupted - Avoid stopping the online resize early when expanding a file system which is less than 16TiB to a size greater than 16TiB - Fix apparent metadata corruption caused by a race with a metadata buffer head getting migrated while it was trying to be read - Mark the lazy initialization thread freezable to prevent suspend failures - Other miscellaneous bug fixes Cleanups: - Break up the incredibly long ext4_full_super() function by refactoring to move code into more understandable, smaller functions - Remove the deprecated (and ignored) noacl and nouser_attr mount option - Factor out some common code in fast commit handling - Other miscellaneous cleanups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (53 commits) ext4: fix potential out of bound read in ext4_fc_replay_scan() ext4: factor out ext4_fc_get_tl() ext4: introduce EXT4_FC_TAG_BASE_LEN helper ext4: factor out ext4_free_ext_path() ext4: remove unnecessary drop path references in mext_check_coverage() ext4: update 'state->fc_regions_size' after successful memory allocation ext4: fix potential memory leak in ext4_fc_record_regions() ext4: fix potential memory leak in ext4_fc_record_modified_inode() ext4: remove redundant checking in ext4_ioctl_checkpoint jbd2: add miss release buffer head in fc_do_one_pass() ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts() ext4: remove useless local variable 'blocksize' ext4: unify the ext4 super block loading operation ext4: factor out ext4_journal_data_mode_check() ext4: factor out ext4_load_and_init_journal() ext4: factor out ext4_group_desc_init() and ext4_group_desc_free() ext4: factor out ext4_geometry_check() ext4: factor out ext4_check_feature_compatibility() ext4: factor out ext4_init_metadata_csum() ext4: factor out ext4_encoding_init() ...
2022-10-06Merge tag 'affs-for-6.1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull affs update from David Sterba: "One minor update for AFFS, switching away from strlcpy" * tag 'affs-for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: move from strlcpy with unused retval to strscpy
2022-10-06Merge tag 'for-6.1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "There's a bunch of performance improvements, most notably the FIEMAP speedup, the new block group tree to speed up mount on large filesystems, more io_uring integration, some sysfs exports and the usual fixes and core updates. Summary: Performance: - outstanding FIEMAP speed improvement - algorithmic change how extents are enumerated leads to orders of magnitude speed boost (uncached and cached) - extent sharing check speedup (2.2x uncached, 3x cached) - add more cancellation points, allowing to interrupt seeking in files with large number of extents - more efficient hole and data seeking (4x uncached, 1.3x cached) - sample results: 256M, 32K extents: 4s -> 29ms (~150x) 512M, 64K extents: 30s -> 59ms (~550x) 1G, 128K extents: 225s -> 120ms (~1800x) - improved inode logging, especially for directories (on dbench workload throughput +25%, max latency -21%) - improved buffered IO, remove redundant extent state tracking, lowering memory consumption and avoiding rb tree traversal - add sysfs tunable to let qgroup temporarily skip exact accounting when deleting snapshot, leading to a speedup but requiring a rescan after that, will be used by snapper - support io_uring and buffered writes, until now it was just for direct IO, with the no-wait semantics implemented in the buffered write path it now works and leads to speed improvement in IOPS (2x), throughput (2.2x), latency (depends, 2x to 150x) - small performance improvements when dropping and searching for extent maps as well as when flushing delalloc in COW mode (throughput +5MB/s) User visible changes: - new incompatible feature block-group-tree adding a dedicated tree for tracking block groups, this allows a much faster load during mount and avoids seeking unlike when it's scattered in the extent tree items - this reduces mount time for many-terabyte sized filesystems - conversion tool will be provided so existing filesystem can also be updated in place - to reduce test matrix and feature combinations requires no-holes and free-space-tree (mkfs defaults since 5.15) - improved reporting of super block corruption detected by scrub - scrub also tries to repair super block and does not wait until next commit - discard stats and tunables are exported in sysfs (/sys/fs/btrfs/FSID/discard) - qgroup status is exported in sysfs (/sys/sys/fs/btrfs/FSID/qgroups/) - verify that super block was not modified when thawing filesystem Fixes: - FIEMAP fixes - fix extent sharing status, does not depend on the cached status where merged - flush delalloc so compressed extents are reported correctly - fix alignment of VMA for memory mapped files on THP - send: fix failures when processing inodes with no links (orphan files and directories) - fix race between quota enable and quota rescan ioctl - handle more corner cases for read-only compat feature verification - fix missed extent on fsync after dropping extent maps Core: - lockdep annotations to validate various transactions states and state transitions - preliminary support for fs-verity in send - more effective memory use in scrub for subpage where sector is smaller than page - block group caching progress logic has been removed, load is now synchronous - simplify end IO callbacks and bio handling, use chained bios instead of own tracking - add no-wait semantics to several functions (tree search, nocow, flushing, buffered write - cleanups and refactoring MM changes: - export balance_dirty_pages_ratelimited_flags" * tag 'for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (177 commits) btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer btrfs: drop extent map range more efficiently btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: remove unnecessary next extent map search btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary extent map initializations btrfs: remove the refcount warning/check at free_extent_map() btrfs: add helper to replace extent map range with a new extent map btrfs: move open coded extent map tree deletion out of inode eviction btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: fix missed extent on fsync after dropping extent maps btrfs: remove stale prototype of btrfs_write_inode btrfs: enable nowait async buffered writes btrfs: assert nowait mode is not used for some btree search functions btrfs: make btrfs_buffered_write nowait compatible btrfs: plumb NOWAIT through the write path btrfs: make lock_and_cleanup_extent_if_need nowait compatible ...
2022-10-06Merge tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs constification updates from Al Viro: "whack-a-mole: constifying struct path *" * tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ecryptfs: constify path spufs: constify path nd_jump_link(): constify path audit_init_parent(): constify path __io_setxattr(): constify path do_proc_readlink(): constify path overlayfs: constify path fs/notify: constify path may_linkat(): constify path do_sys_name_to_handle(): constify path ->getprocattr(): attribute name is const char *, TYVM...
2022-10-06Merge tag 'pull-tomoyo' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc tomoyo changes from Al Viro: "A couple of assorted tomoyo patches" * tag 'pull-tomoyo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: tomoyo: struct path it might get from LSM callers won't have NULL dentry or mnt tomoyo: use vsnprintf() properly
2022-10-06Merge tag 'pull-file_inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull file_inode() updates from Al Vrio: "whack-a-mole: cropped up open-coded file_inode() uses..." * tag 'pull-file_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: orangefs: use ->f_mapping _nfs42_proc_copy(): use ->f_mapping instead of file_inode()->i_mapping dma_buf: no need to bother with file_inode()->i_mapping nfs_finish_open(): don't open-code file_inode() bprm_fill_uid(): don't open-code file_inode() sgx: use ->f_mapping... exfat_iterate(): don't open-code file_inode(file) ibmvmc: don't open-code file_inode()
2022-10-06Merge tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs file updates from Al Viro: "struct file-related stuff" * tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: dma_buf_getfile(): don't bother with ->f_flags reassignments Change calling conventions for filldir_t locks: fix TOCTOU race when granting write lease
2022-10-06Merge tag 'pull-d_path' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs d_path updates from Al Viro. * tag 'pull-d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: d_path.c: typo fix... dynamic_dname(): drop unused dentry argument
2022-10-06Merge tag 'pull-inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs inode update from Al Viro: "Saner inode_init_always(), also fixing a nilfs problem" * tag 'pull-inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: fix UAF/GPF bug in nilfs_mdt_destroy
2022-10-06Merge tag 'v6.0' into rdma.git for-nextJason Gunthorpe
Trvial merge conflicts against rdma.git for-rc resolved matching linux-next: drivers/infiniband/hw/hns/hns_roce_hw_v2.c drivers/infiniband/hw/hns/hns_roce_main.c https://lore.kernel.org/r/20220929124005.105149-1-broonie@kernel.org Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>