Age  Commit message  Author
2024-04-17  Merge branch 'svm' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. The "standard" __svm_vcpu_run() can't be made 100% bulletproof, as RBP isn't restored on #VMEXIT, but that's also the case for __vmx_vcpu_run(), and getting "close enough" is better than not even trying. As for SEV-ES, after yet another refresher on swap types, I realized KVM can simply let the hardware restore registers after #VMEXIT, all that's missing is storing the current values to the host save area (they are swap type B). This should provide 100% accuracy when using stack frames for unwinding, and requires less assembly. In between, build the SEV-ES code iff CONFIG_KVM_AMD_SEV=y, and yank out "support" for 32-bit kernels in __svm_sev_es_vcpu_run, which was unnecessarily polluting the code for a configuration that is disabled at build time. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-17  netfilter: nf_tables: restore set elements when delete set fails  (Pablo Neira Ayuso)
From abort path, nft_mapelem_activate() needs to restore refcounters to the original state. Currently, it uses the set->ops->walk() to iterate over these set elements. The existing set iterator skips inactive elements in the next generation, this does not work from the abort path to restore the original state since it has to skip active elements instead (not inactive ones). This patch moves the check for inactive elements to the set iterator callback, then it reverses the logic for the .activate case which needs to skip active elements. Toggle next generation bit for elements when delete set command is invoked and call nft_clear() from .activate (abort) path to restore the next generation bit. The splat below shows an object in mappings memleak: [43929.457523] ------------[ cut here ]------------ [43929.457532] WARNING: CPU: 0 PID: 1139 at include/net/netfilter/nf_tables.h:1237 nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [...] [43929.458014] RIP: 0010:nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [43929.458076] Code: 83 f8 01 77 ab 49 8d 7c 24 08 e8 37 5e d0 de 49 8b 6c 24 08 48 8d 7d 50 e8 e9 5c d0 de 8b 45 50 8d 50 ff 89 55 50 85 c0 75 86 <0f> 0b eb 82 0f 0b eb b3 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 [43929.458081] RSP: 0018:ffff888140f9f4b0 EFLAGS: 00010246 [43929.458086] RAX: 0000000000000000 RBX: ffff8881434f5288 RCX: dffffc0000000000 [43929.458090] RDX: 00000000ffffffff RSI: ffffffffa26d28a7 RDI: ffff88810ecc9550 [43929.458093] RBP: ffff88810ecc9500 R08: 0000000000000001 R09: ffffed10281f3e8f [43929.458096] R10: 0000000000000003 R11: ffff0000ffff0000 R12: ffff8881434f52a0 [43929.458100] R13: ffff888140f9f5f4 R14: ffff888151c7a800 R15: 0000000000000002 [43929.458103] FS: 00007f0c687c4740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000 [43929.458107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [43929.458111] CR2: 00007f58dbe5b008 CR3: 0000000123602005 CR4: 00000000001706f0 [43929.458114] Call Trace: [43929.458118] <TASK> [43929.458121] ? __warn+0x9f/0x1a0 [43929.458127] ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [43929.458188] ? report_bug+0x1b1/0x1e0 [43929.458196] ? handle_bug+0x3c/0x70 [43929.458200] ? exc_invalid_op+0x17/0x40 [43929.458211] ? nft_setelem_data_deactivate+0xd7/0xf0 [nf_tables] [43929.458271] ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [43929.458332] nft_mapelem_deactivate+0x24/0x30 [nf_tables] [43929.458392] nft_rhash_walk+0xdd/0x180 [nf_tables] [43929.458453] ? __pfx_nft_rhash_walk+0x10/0x10 [nf_tables] [43929.458512] ? rb_insert_color+0x2e/0x280 [43929.458520] nft_map_deactivate+0xdc/0x1e0 [nf_tables] [43929.458582] ? __pfx_nft_map_deactivate+0x10/0x10 [nf_tables] [43929.458642] ? __pfx_nft_mapelem_deactivate+0x10/0x10 [nf_tables] [43929.458701] ? __rcu_read_unlock+0x46/0x70 [43929.458709] nft_delset+0xff/0x110 [nf_tables] [43929.458769] nft_flush_table+0x16f/0x460 [nf_tables] [43929.458830] nf_tables_deltable+0x501/0x580 [nf_tables] Fixes: 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-17  netfilter: nf_tables: missing iterator type in lookup walk  (Pablo Neira Ayuso)
Add missing decorator type to lookup expression and tighten WARN_ON_ONCE check in pipapo to spot earlier that this is unset. Fixes: 29b359cf6d95 ("netfilter: nft_set_pipapo: walk over current view on netlink dump") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-17  s390/mm: Fix NULL pointer dereference  (Sven Schnelle)
The recently added check to figure out if a fault happened on gmap ASCE dereferences the gmap pointer in lowcore without checking that it is not NULL. For all non-KVM processes the pointer is NULL, so that some value from lowcore will be read. With the current layouts of struct gmap and struct lowcore the read value (aka ASCE) is zero, so that this doesn't lead to any observable bug; at least currently. Fix this by adding the missing NULL pointer check. Fixes: 64c3431808bd ("s390/entry: compare gmap asce to determine guest/host fault") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17  io_uring/rw: ensure retry condition isn't lost  (Jens Axboe)
A previous commit removed the check on whether or not it was possible to retry a request, since it's now possible to retry any of them. This would previously have caused the request to have been ended with an error, but now the retry condition can simply get lost instead. Clean up the retry handling and always just punt it to task_work, which will queue it with io-wq appropriately. Reported-by: Changhui Zhong <czhong@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Fixes: cca6571381a0 ("io_uring/rw: cleanup retry path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  Revert "drm/amd/display: fix USB-C flag update after enc10 feature init"  (Alex Deucher)
This reverts commit b5abd7f983e14054593dc91d6df2aa5f8cc67652. This change breaks DSC on 4k monitors at 144Hz over USB-C. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3254 Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: Muhammad Ahmed <ahmed.ahmed@amd.com> Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Charlene Liu <charlene.liu@amd.com> Cc: Hamza Mahfooz <hamza.mahfooz@amd.com> Cc: Harry Wentland <harry.wentland@amd.com> Cc: stable@vger.kernel.org
2024-04-17  drm/amdkfd: Fix memory leak in create_process failure  (Felix Kuehling)
Fix memory leak due to a leaked mmget reference on an error handling code path that is triggered when attempting to create KFD processes while a GPU reset is in progress. Fixes: 0ab2d7532b05 ("drm/amdkfd: prepare per-process debug enable and disable") CC: Xiaogang Chen <xiaogang.chen@amd.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Tested-by: Harish Kasiviswanthan <Harish.Kasiviswanthan@amd.com> Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2024-04-17  drm/amdgpu: remove invalid resource->start check v2  (Christian König)
The majority of those were removed in the commit aed01a68047b ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2"). But this one was missed because it's working on the resource and not the BO. Since we also no longer use a fake start address for visible BOs, this will now trigger invalid mapping errors. v2: also remove the unused variable Signed-off-by: Christian König <christian.koenig@amd.com> Fixes: aed01a68047b ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2") CC: stable@vger.kernel.org Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-17  null_blk: Simplify null_zone_write()  (Damien Le Moal)
In null_zone_write(), we do not need to first check if the target zone condition is FULL, READONLY or OFFLINE: for these conditions, the check of the command sector against the zone write pointer will always result in the command failing. Remove these checks. We still however need to check that the target zone write pointer is not invalid for zone append operations. To do so, add the macro NULL_ZONE_INVALID_WP and use it in null_set_zone_cond() when changing a zone to READONLY or OFFLINE condition. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240411085502.728558-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
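For illustration, a minimal stand-alone C model (not the null_blk source; names and sizes are made up) of why the single write pointer check is enough once READONLY/OFFLINE zones carry an invalid write pointer sentinel and FULL zones have their write pointer at the zone end:

    /* Stand-alone model: a single "sector == wp and wp inside the zone"
     * check rejects writes to FULL, READONLY/OFFLINE and misaligned zones. */
    #include <stdio.h>
    #include <stdbool.h>

    #define ZONE_SIZE  256u          /* sectors, illustrative */
    #define INVALID_WP (~0u)         /* models NULL_ZONE_INVALID_WP */

    struct zone { unsigned int start, wp; };

    static bool zone_write_ok(const struct zone *z, unsigned int sector)
    {
            /* The only check needed. */
            return sector == z->wp && z->wp < z->start + ZONE_SIZE;
    }

    int main(void)
    {
            struct zone full     = { .start = 0,   .wp = ZONE_SIZE };  /* FULL */
            struct zone readonly = { .start = 256, .wp = INVALID_WP }; /* READONLY/OFFLINE */
            struct zone open     = { .start = 512, .wp = 520 };

            printf("full: %d ro: %d open@wp: %d open@wrong: %d\n",
                   zone_write_ok(&full, 0), zone_write_ok(&readonly, 256),
                   zone_write_ok(&open, 520), zone_write_ok(&open, 512));
            return 0;
    }

All the removed special cases fail this one comparison, which is the simplification the patch relies on.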
2024-04-17  null_blk: Do zone resource management only if necessary  (Damien Le Moal)
For zoned null_blk devices set up without any limit on the maximum number of open and active zones, there is no need to count the number of zones that are implicitly open, explicitly open and closed. This is indicated by the boolean field need_zone_res_mgmt of struct nullb_device. Modify the zone management functions null_reset_zone(), null_finish_zone(), null_open_zone() and null_close_zone() to manage the zone condition counters only if the device need_zone_res_mgmt field is true. With this change, the function __null_close_zone() is removed and integrated into the 2 caller sites directly, with the null_close_imp_open_zone() call site greatly simplified as this function closes zones that are known to be in the implicit open condition. null_zone_write() is modified in a similar manner to do zone condition accounting only when the device need_zone_res_mgmt field is true. With these changes, the inline helpers null_lock_zone_res() and null_unlock_zone_res() are removed and replaced with direct calls to spin_lock()/spin_unlock(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240411085502.728558-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  null_blk: Have all null_handle_xxx() return a blk_status_t  (Damien Le Moal)
Modify the null_handle_flush() and null_handle_rq() functions to return a blk_status_t instead of an errno to simplify the call sites of these functions and to be consistent with other null_handle_xxx() functions. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240411085502.728558-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Do not special-case plugging of zone write operations  (Damien Le Moal)
With the block layer zone write plugging being automatically done for any write operation to a zone of a zoned block device, a regular request plugging handled through current->plug can only ever see at most a single write request per zone. In such case, any potential reordering of the plugged requests will be harmless. We can thus remove the special casing for write operations to zones and have these requests plugged as well. This allows removing the function blk_mq_plug and instead directly using current->plug where needed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-29-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED  (Damien Le Moal)
Now that zoned block device write ordering control no longer depends on mq-deadline and zone write locking, there is no need to force select the mq-deadline scheduler when CONFIG_BLK_DEV_ZONED is enabled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-28-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Remove zone write locking  (Damien Le Moal)
Zone write locking is now unused and replaced with zone write plugging. Remove all code that was implementing zone write locking, that is, the various helper functions controlling request zone write locking and the gendisk attached zone bitmaps. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-27-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Replace zone_wlock debugfs entry with zone_wplugs entry  (Damien Le Moal)
In preparation to completely remove zone write locking, replace the "zone_wlock" mq-debugfs entry that was listing zones that are write-locked with the zone_wplugs entry which lists the zones that currently have a write plug allocated. The write plug information provided is: the zone number, the zone write plug flags, the zone write plug write pointer offset and the number of BIOs currently waiting for execution in the zone write plug BIO list. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-26-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Move zone related debugfs attribute to blk-zoned.c  (Damien Le Moal)
block/blk-mq-debugfs-zone.c contains a single debugfs attribute function. Defining this outside of block/blk-zoned.c does not really help in any way, so move this zone related debugfs attribute to block/blk-zoned.c and delete block/blk-mq-debugfs-zone.c. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-25-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Do not check zone type in blk_check_zone_append()  (Damien Le Moal)
Zone append operations are only allowed to target sequential write required zones. blk_check_zone_append() uses bio_zone_is_seq() to check this. However, this check is not necessary because: 1) For NVMe ZNS namespace devices, only sequential write required zones exist, making the zone type check useless. 2) For null_blk, the driver will fail the request anyway, thus notifying the user that a conventional zone was targeted. 3) For all other zoned devices, zone append is now emulated using zone write plugging, which checks that a zone append operation does not target a conventional zone. In preparation for the removal of zone write locking and its conventional zone bitmap (used by bio_zone_is_seq()), remove the bio_zone_is_seq() call from blk_check_zone_append(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-24-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Remove elevator required features  (Damien Le Moal)
The only elevator feature ever implemented is ELEVATOR_F_ZBD_SEQ_WRITE for signaling that a scheduler implements zone write locking to tightly control the dispatching order of write operations to zoned block devices. With the removal of zone write locking support in mq-deadline and the reliance of all block device drivers on the block layer zone write plugging to control ordering of write operations to zones, the elevator feature ELEVATOR_F_ZBD_SEQ_WRITE is completely unused. Remove it, and also remove the now unused code for filtering the possible schedulers for a block device based on required features. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-23-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: mq-deadline: Remove support for zone write locking  (Damien Le Moal)
With the block layer generic plugging of write operations for zoned block devices, mq-deadline, or any other scheduler, can only ever see at most one write operation per zone at any time. There are thus no sequentiality requirements for these writes and no need to tightly control the dispatching of write requests using zone write locking. Remove all the code that implements this control in the mq-deadline scheduler and remove advertising support for the ELEVATOR_F_ZBD_SEQ_WRITE elevator feature. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-22-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Simplify blk_revalidate_disk_zones() interface  (Damien Le Moal)
The only user of blk_revalidate_disk_zones() second argument was the SCSI disk driver (sd). Now that this driver does not require this update_driver_data argument, remove it to simplify the interface of blk_revalidate_disk_zones(). Also update the function kdoc comment to be more accurate (i.e. there is no gendisk ->revalidate method). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Remove BLK_STS_ZONE_RESOURCE  (Damien Le Moal)
The zone append emulation of the scsi disk driver was the only driver using BLK_STS_ZONE_RESOURCE. With this code removed, BLK_STS_ZONE_RESOURCE is now unused. Remove this macro definition and simplify blk_mq_dispatch_rq_list() where this status code was handled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-20-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  nvmet: zns: Do not reference the gendisk conv_zones_bitmap  (Damien Le Moal)
The gendisk conventional zone bitmap is going away. So to check for the presence of conventional zones on a zoned target device, always use report zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-19-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  null_blk: Introduce fua attribute  (Damien Le Moal)
Add the fua configfs attribute and module parameter to allow configuring if the device supports FUA or not. Using this attribute has an effect on the null_blk device only if memory backing is enabled together with a write cache (cache_size option). This new attribute allows configuring a null_blk device with a write cache but without FUA support. This is convenient to test the block layer flush machinery. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-18-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  null_blk: Introduce zone_append_max_sectors attribute  (Damien Le Moal)
Add the zone_append_max_sectors configfs attribute and module parameter to allow configuring the maximum number of 512B sectors of zone append operations. This attribute is meaningful only for zoned null block devices. If not specified, the default is unchanged and the zoned device max append sectors limit is set to the device max sectors limit. If a non-zero value is used for this attribute, which is the default, then native support for zone append operations is enabled. Setting a 0 value disables native zone append operations support to instead use the block layer emulation. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-17-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature  (Damien Le Moal)
With zone write plugging enabled at the block layer level, a zoned device can only ever see at most a single write operation per zone. There is thus no need to request a block scheduler with strict per-zone sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE feature. Removing this allows using a zoned null_blk device with any scheduler, including "none". Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-16-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature  (Damien Le Moal)
With zone write plugging enabled at the block layer level, any zoned device can only ever see at most a single write operation per zone. There is thus no need to request a block scheduler with strict per-zone sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE feature. Removing this allows using a zoned ublk device with any scheduler, including "none". Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-15-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  scsi: sd: Use the block layer zone append emulation  (Damien Le Moal)
Set the request queue of a TYPE_ZBC device as needing zone append emulation by setting the device queue max_zone_append_sectors limit to 0. This enables the block layer generic implementation provided by zone write plugging. With this, the sd driver will never see a REQ_OP_ZONE_APPEND request and the zone append emulation code implemented in sd_zbc.c can be removed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-14-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  dm: Use the block layer zone append emulation  (Damien Le Moal)
For targets requiring zone append operation emulation with regular writes (e.g. dm-crypt), we can use the block layer emulation provided by zone write plugging. Remove DM implemented zone append emulation and enable the block layer one. This is done by setting the max_zone_append_sectors limit of the mapped device queue to 0 for mapped devices that have a target table that cannot support native zone append operations (e.g. dm-crypt). Such mapped devices are flagged with the DMF_EMULATE_ZONE_APPEND flag. dm_split_and_process_bio() is modified to execute blk_zone_write_plug_bio() for such devices to let the block layer transform zone append operations into regular writes. This is done after ensuring that the submitted BIO is split if it straddles zone boundaries. Both changes are implemented using the inline helpers dm_zone_write_plug_bio() and dm_zone_bio_needs_split() respectively. dm_revalidate_zones() is also modified to use the block layer provided function blk_revalidate_disk_zones() so that all zone resources needed for zone append emulation are initialized by the block layer without DM core needing to do anything. Since the device table is not yet live when dm_revalidate_zones() is executed, enabling the use of blk_revalidate_disk_zones() requires adding a pointer to the device table in struct mapped_device. This avoids errors in dm_blk_report_zones() trying to get the table with dm_get_live_table(). The mapped device table pointer is set to the table passed as argument to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and reset to NULL after this function returns to restore the live table handling for user calls of report zones. All the code related to zone append emulation is removed from dm-zone.c. This leads to simplifications of the functions __map_bio() and dm_zone_endio(). The latter function now only needs to deal with completions of real zone append operations for targets that support it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-13-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Allow BIO-based drivers to use blk_revalidate_disk_zones()  (Damien Le Moal)
In preparation for allowing BIO based device drivers to use zone write plugging and its zone append emulation, allow these drivers to call blk_revalidate_disk_zones() so that all zone resources necessary to zone write plugging can be initialized. To do so, remove the check in blk_revalidate_disk_zones() restricting the use of this function to mq request-based drivers to allow also BIO-based drivers to use it. This is safe to do as long as the BIO-based block device queue is already setup and usable, as it should, and can be safely frozen. The helper function disk_need_zone_resources() is added to control the allocation and initialization of the zone write plug hash table and of the conventional zone bitmap only for mq devices and for BIO-based devices that require zone append emulation. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-12-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Implement zone append emulation  (Damien Le Moal)
Given that zone write plugging manages all writes to zones of a zoned block device and tracks the write pointer position of all zones that are not full nor empty, emulating zone append operations using regular writes can be implemented generically, without relying on the underlying device driver to implement such emulation. This is needed for devices that do not natively support the zone append command (e.g. SMR hard-disks). A device may request zone append emulation by setting its max_zone_append_sectors queue limit to 0. For such a device, the function blk_zone_wplug_prepare_bio() changes zone append BIOs into non-mergeable regular write BIOs. Modified zone append BIOs are flagged with the new BIO flag BIO_EMULATES_ZONE_APPEND. This flag is checked on completion of the BIO in blk_zone_write_plug_bio_endio() to restore the original REQ_OP_ZONE_APPEND operation code of the BIO. The block layer internal inline helper function bio_is_zone_append() is added to test if a BIO is either a native zone append operation (REQ_OP_ZONE_APPEND operation code) or if it is flagged with BIO_EMULATES_ZONE_APPEND. Given that both native and emulated zone append BIO completion handling should be similar, the functions blk_update_request() and blk_zone_complete_request_bio() are modified to use bio_is_zone_append() to execute blk_zone_update_request_bio() for both native and emulated zone append operations. This commit contains contributions from Christoph Hellwig <hch@lst.de>. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-11-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Allow zero value of max_zone_append_sectors queue limit  (Damien Le Moal)
In preparation for adding a generic zone append emulation using zone write plugging, allow device drivers supporting zoned block devices to set the max_zone_append_sectors queue limit of a device to 0 to indicate the lack of native support for zone append operations and that the block layer should emulate these operations using regular write operations. blk_queue_max_zone_append_sectors() is modified to allow passing 0 as the max_zone_append_sectors argument. The function queue_max_zone_append_sectors() is also modified to ensure that the minimum of the max_hw_sectors and chunk_sectors limit is used whenever the max_zone_append_sectors limit is 0. This minimum is consistent with the value set for the max_zone_append_sectors limit by the function blk_validate_zoned_limits() when limits for a queue are validated. The helper functions queue_emulates_zone_append() and bdev_emulates_zone_append() are added to test if a queue (or block device) emulates zone append operations. In order for blk_revalidate_disk_zones() to accept zoned block devices relying on zone append emulation, the direct check of the max_zone_append_sectors queue limit of the disk is replaced by a check using the value returned by queue_max_zone_append_sectors(). Similarly, queue_zone_append_max_show() is modified to use the same accessor so that the sysfs attribute advertises the non-zero limit that will be used, regardless if it is for native or emulated commands. For stacking drivers, a top device should not need to care if the underlying devices have native or emulated zone append operations. blk_stack_limits() is thus modified to set the top device max_zone_append_sectors limit using the new accessor queue_limits_max_zone_append_sectors(). queue_max_zone_append_sectors() is modified to use this function as well. Stacking drivers that require zone append emulation, e.g. dm-crypt, can still request this feature by calling blk_queue_max_zone_append_sectors() with a 0 limit. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-10-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
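For illustration, a stand-alone C sketch (not the block layer code; the structure and values are made up) of the accessor semantics described above, where a zero max_zone_append_sectors limit means "emulated" and the reported limit falls back to min(max_hw_sectors, chunk_sectors):

    #include <stdio.h>

    struct queue_limits {
            unsigned int max_hw_sectors;
            unsigned int chunk_sectors;            /* zone size in sectors */
            unsigned int max_zone_append_sectors;  /* 0 => emulate zone append */
    };

    static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }

    /* Effective limit reported to users, never zero for a zoned device. */
    static unsigned int effective_zone_append_sectors(const struct queue_limits *l)
    {
            if (!l->max_zone_append_sectors)
                    return min_u(l->max_hw_sectors, l->chunk_sectors);
            return l->max_zone_append_sectors;
    }

    static int emulates_zone_append(const struct queue_limits *l)
    {
            return l->max_zone_append_sectors == 0;
    }

    int main(void)
    {
            struct queue_limits native  = { 65536, 524288, 1024 };
            struct queue_limits emulate = { 65536, 524288, 0 };

            printf("native: %u (emulated=%d)\n",
                   effective_zone_append_sectors(&native), emulates_zone_append(&native));
            printf("emulated: %u (emulated=%d)\n",
                   effective_zone_append_sectors(&emulate), emulates_zone_append(&emulate));
            return 0;
    }

The key design point is that users and stacking drivers only ever see the non-zero effective limit, while the zero raw value is an internal signal meaning "emulate".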
2024-04-17  block: Fake max open zones limit when there is no limit  (Damien Le Moal)
For a zoned block device that has no limit on the number of open zones and no limit on the number of active zones, the zone write plug mempool is created with a size of 128 zone write plugs. For such case, set the device max_open_zones queue limit to this value to indicate to the user the potential performance penalty that may happen when writing simultaneously to more zones than the mempool size. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-9-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Introduce zone write plugging  (Damien Le Moal)
Zone write plugging implements a per-zone "plug" for write operations to control the submission and execution order of write operations to sequential write required zones of a zoned block device. Per-zone plugging guarantees that at any time there is at most only one write request per zone being executed. This mechanism is intended to replace zone write locking which implements a similar per-zone write throttling at the scheduler level, but is implemented only by mq-deadline. Unlike zone write locking which operates on requests, zone write plugging operates on BIOs. A zone write plug is simply a BIO list that is atomically manipulated using a spinlock and a kblockd submission work. A write BIO to a zone is "plugged" to delay its execution if a write BIO for the same zone was already issued, that is, if a write request for the same zone is being executed. The next plugged BIO is unplugged and issued once the write request completes. This mechanism allows to:
- Untangle zone write ordering from block IO schedulers. This allows removing the restriction on using mq-deadline for writing to zoned block devices. Any block IO scheduler, including "none", can be used.
- Zone write plugging operates on BIOs instead of requests. Plugged BIOs waiting for execution thus do not hold scheduling tags and thus are not preventing other BIOs from executing (reads or writes to other zones). Depending on the workload, this can significantly improve the device use (higher queue depth operation) and performance.
- Both blk-mq (request based) zoned devices and BIO-based zoned devices (e.g. device mapper) can use zone write plugging. It is mandatory for the former but optional for the latter. BIO-based drivers can use zone write plugging to implement write ordering guarantees, or the drivers can implement their own if needed.
- The code is less invasive in the block layer and is mostly limited to blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and bio.c.
Zone write plugging is implemented using struct blk_zone_wplug. This structure includes a spinlock, a BIO list and a work structure to handle the submission of plugged BIOs. Zone write plug structures are managed using a per-disk hash table. Plugging of zone write BIOs is done using the function blk_zone_write_plug_bio() which returns false if a BIO execution does not need to be delayed and true otherwise. This function is called from blk_mq_submit_bio() after a BIO is split to avoid large BIOs spanning multiple zones which would cause mishandling of zone write plugs. This change enables zone write plugging by default for any mq request-based block device. BIO-based device drivers can also use zone write plugging by explicitly calling blk_zone_write_plug_bio() in their ->submit_bio method. For such devices, the driver must ensure that a BIO passed to blk_zone_write_plug_bio() is already split and not straddling zone boundaries. Only write and write zeroes BIOs are plugged. Zone write plugging does not introduce any significant overhead for other operations. A BIO that is being handled through zone write plugging is flagged using the new BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag. The completion of flagged BIOs and requests triggers calls to the functions blk_zone_write_bio_endio() and blk_zone_write_complete_request(), respectively. The latter function is used to trigger submission of the next plugged BIO using the zone plug work.
blk_zone_write_bio_endio() does the same for BIO-based devices. This ensures that at any time, at most one request (blk-mq devices) or one BIO (BIO-based devices) is being executed for any zone. The handling of zone write plugs using a per-zone plug spinlock maximizes parallelism and device usage by allowing multiple zones to be written simultaneously without lock contention. Zone write plugging ignores flush BIOs without data. However, any flush BIO that has data is always plugged so that the write part of the flush sequence is serialized with other regular writes. Given that any BIO handled through zone write plugging will be the only BIO in flight for the target zone when it is executed, the unplugging and submission of a BIO will have no chance of successfully merging with plugged requests or requests in the scheduler. To overcome this potential performance degradation, blk_mq_submit_bio() calls the function blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs with the one just unplugged and submitted. Successful merging is signaled using blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge(). Furthermore, to avoid recalculating the number of segments of plugged BIOs to attempt merging, the number of segments of a plugged BIO is saved using the new struct bio field __bi_nr_segments. To avoid growing the size of struct bio, this field is added as a union with the bio_cookie field. This is safe to do as polling is always disabled for plugged BIOs. When BIOs are plugged in a zone write plug, the device request queue usage counter is always incremented. This reference is kept and reused for blk-mq devices when the plugged BIO is unplugged and submitted again using submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds directly to allocating a new request for the BIO, re-using the usage reference count taken when the BIO was plugged. This extra reference count is dropped in blk_zone_write_plug_attempt_merge() for any plugged BIO that is successfully merged. Given that BIO-based devices will not take this path, the extra reference is dropped after a plugged BIO is unplugged and submitted. Zone write plugs are dynamically allocated and managed using a hash table (an array of struct hlist_head) with RCU protection. A zone write plug is allocated when a write BIO is received for the zone and not freed until the zone is fully written, reset or finished. To detect when a zone write plug can be freed, the write state of each zone is tracked using a write pointer offset which corresponds to the offset of a zone write pointer relative to the zone start. Write operations always increment this write pointer offset. Zone reset operations set it to 0 and zone finish operations set it to the zone size. If a write error happens, the wp_offset value of a zone write plug may become incorrect and out of sync with the device managed write pointer. This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR. The function blk_zone_wplug_handle_error() is called from the new disk zone write plug work when this flag is set. This function executes a report zone to update the zone write pointer offset to the current value as indicated by the device. The disk zone write plug work is scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes with an error or when bio_zone_wplug_prepare_bio() detects an unaligned write.
Once scheduled, the disk zone write plugs work keeps running until all zone errors are handled. To match the new data structures used for zoned disks, the function disk_free_zone_bitmaps() is renamed to the more generic disk_free_zone_resources(). The function disk_init_zone_resources() is also introduced to initialize zone write plugs resources when a gendisk is allocated. In order to guarantee that the user can simultaneously write up to a number of zones equal to a device max active zone limit or max open zone limit, zone write plugs are allocated using a mempool sized to the maximum of these 2 device limits. For a device that does not have active and open zone limits, 128 is used as the default mempool size. If a change to the device active and open zone limits is detected, the disk mempool is resized when blk_revalidate_disk_zones() is executed. This commit contains contributions from Christoph Hellwig <hch@lst.de>. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
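For illustration, a stand-alone, single-threaded C model of the per-zone plug idea described in this entry (not the blk-zoned.c implementation; the names, fixed-size queue and printf reporting are made up): at most one write is in flight per zone, later writes wait on the plug and are issued one at a time as completions arrive.

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_PENDING 8

    struct zone_wplug {
            bool write_in_flight;
            unsigned int wp;                   /* models the write pointer offset */
            unsigned int pending[MAX_PENDING]; /* queued write sizes (sectors) */
            int head, tail;
    };

    /* Returns true if the write was plugged (delayed), false if issued now. */
    static bool submit_write(struct zone_wplug *p, unsigned int sectors)
    {
            if (p->write_in_flight) {
                    p->pending[p->tail++ % MAX_PENDING] = sectors;
                    return true;
            }
            p->write_in_flight = true;
            printf("issue write of %u sectors at wp %u\n", sectors, p->wp);
            return false;
    }

    /* Called when the in-flight write completes: advance wp, unplug next. */
    static void complete_write(struct zone_wplug *p, unsigned int sectors)
    {
            p->wp += sectors;
            p->write_in_flight = false;
            if (p->head != p->tail)
                    submit_write(p, p->pending[p->head++ % MAX_PENDING]);
    }

    int main(void)
    {
            struct zone_wplug plug = { 0 };
            submit_write(&plug, 8);    /* issued immediately */
            submit_write(&plug, 16);   /* plugged */
            submit_write(&plug, 8);    /* plugged */
            complete_write(&plug, 8);  /* unplugs the 16-sector write */
            complete_write(&plug, 16); /* unplugs the last plugged write */
            complete_write(&plug, 8);
            return 0;
    }

The real implementation adds a spinlock, a kblockd work item, request queue reference handling and error recovery around this basic serialization, but the ordering guarantee is the same.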
2024-04-17  block: Remember zone capacity when revalidating zones  (Damien Le Moal)
In preparation for adding zone write plugging, modify blk_revalidate_disk_zones() to get the capacity of zones of a zoned block device. This capacity value as a number of 512B sectors is stored in the gendisk zone_capacity field. Given that host-managed SMR disks (including zoned UFS drives) and all known NVMe ZNS devices have the same zone capacity for all zones, blk_revalidate_disk_zones() returns an error if different capacities are detected for different zones. This also adds checks to verify that the values reported by the device for zone capacities are correct, that is, that the zone capacity is never 0, does not exceed the zone size and is equal to the zone size for conventional zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-7-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
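For illustration, a stand-alone C model of the capacity checks described above (not the blk_revalidate_disk_zones() code; the struct and sample values are made up):

    #include <stdio.h>
    #include <stdbool.h>

    struct zone_info {
            unsigned long long len;       /* zone size in sectors */
            unsigned long long capacity;  /* usable sectors */
            bool conventional;
    };

    static bool validate_zones(const struct zone_info *z, int nr)
    {
            unsigned long long cap0 = z[0].capacity;

            for (int i = 0; i < nr; i++) {
                    if (!z[i].capacity || z[i].capacity > z[i].len)
                            return false;               /* never 0, never > zone size */
                    if (z[i].conventional && z[i].capacity != z[i].len)
                            return false;               /* conventional: cap == size */
                    if (z[i].capacity != cap0)
                            return false;               /* all zones must match */
            }
            return true;
    }

    int main(void)
    {
            struct zone_info ok[]  = { { 524288, 524288, true }, { 524288, 524288, false } };
            struct zone_info bad[] = { { 524288, 524288, false }, { 524288, 262144, false } };

            printf("ok=%d bad=%d\n", validate_zones(ok, 2), validate_zones(bad, 2));
            return 0;
    }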
2024-04-17  block: Allow using bio_attempt_back_merge() internally  (Damien Le Moal)
Remove "static" from the definition of bio_attempt_back_merge() and declare this function in block/blk.h to allow using it internally from other block layer files. The definition of enum bio_merge_status is also moved to block/blk.h. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Introduce bio_straddles_zones() and bio_offset_from_zone_start()  (Damien Le Moal)
Implement the inline helper functions bio_straddles_zones() and bio_offset_from_zone_start() to respectively test if a BIO crosses a zone boundary (the start sector and last sector belong to different zones) and to obtain the offset of a BIO from the start sector of its target zone. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
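For illustration, a stand-alone C sketch of what the two helpers compute (not the block layer code; it assumes a power-of-two zone size so the offset can be derived with a mask, and the names are made up):

    #include <stdio.h>
    #include <stdbool.h>

    typedef unsigned long long sector_t;

    static const sector_t zone_sectors = 524288; /* 256 MiB zones, power of two */

    static sector_t offset_from_zone_start(sector_t sector)
    {
            return sector & (zone_sectors - 1);
    }

    static bool straddles_zones(sector_t start, sector_t nr_sectors)
    {
            sector_t last = start + nr_sectors - 1;
            return (start / zone_sectors) != (last / zone_sectors);
    }

    int main(void)
    {
            printf("offset=%llu straddles=%d\n",
                   offset_from_zone_start(524296), straddles_zones(524280, 16));
            return 0;
    }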
2024-04-17  block: Introduce blk_zone_update_request_bio()  (Damien Le Moal)
On completion of a zone append request, the request sector indicates the location of the written data. This value must be returned to the user through the BIO iter sector. This is done in 2 places: in blk_complete_request() and in blk_update_request(). Introduce the inline helper function blk_zone_update_request_bio() to avoid duplicating this BIO update for zone append requests, and to compile out this helper call when CONFIG_BLK_DEV_ZONED is not enabled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Remove req_bio_endio()  (Damien Le Moal)
Moving req_bio_endio() code into its only caller, blk_update_request(), allows reducing accesses to and tests of bio and request fields. Also, given that partial completion of zone append operations is not possible and that zone append operations cannot be merged, the update of the BIO sector using the request sector for these operations can be moved directly before the call to bio_endio(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  block: Restore sector of flush requests  (Damien Le Moal)
On completion of a flush sequence, blk_flush_restore_request() restores the bio of a request to the original submitted BIO. However, the last use of the request in the flush sequence may have been for a POSTFLUSH which does not have a sector. So make sure to restore the request sector using the iter sector of the original BIO. This BIO has not changed yet since the completions of the flush sequence intermediate steps use requeueing of the request until all steps are completed. Restoring the request sector ensures that blk_mq_end_request() will see a valid sector as originally set when the flush BIO was submitted. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  io-wq: Drop intermediate step between pending list and active work  (Gabriel Krisman Bertazi)
next_work is only used to make the work visible for cancellation. Instead, we can just directly write to cur_work before dropping the acct_lock and avoid the extra hop. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240416021054.3940-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17  io-wq: write next_work before dropping acct_lock  (Gabriel Krisman Bertazi)
Commit 361aee450c6e ("io-wq: add intermediate work step between pending list and active work") closed a race between a cancellation and the work being removed from the wq for execution. To ensure the request is always reachable by the cancellation, we need to move it within the wq lock, which also synchronizes the cancellation. But commit 42abc95f05bf ("io-wq: decouple work_list protection from the big wqe->lock") replaced the wq lock here and accidentally reintroduced the race by releasing the acct_lock too early. In other words:

    worker                        | cancellation
    work = io_get_next_work()     |
    raw_spin_unlock(&acct->lock); |
                                  |
                                  | io_acct_cancel_pending_work
                                  | io_wq_worker_cancel()
    worker->next_work = work      |

Using acct_lock is still enough since we synchronize on it on io_acct_cancel_pending_work. Fixes: 42abc95f05bf ("io-wq: decouple work_list protection from the big wqe->lock") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240416021054.3940-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
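For illustration, a stand-alone C model of the locking order this fix restores (not the io-wq code; a pthread mutex stands in for the raw spinlock and the names are simplified): the worker publishes the dequeued work in next_work before dropping the lock, so a canceller that takes the same lock always finds the item either on the pending list or in next_work.

    #include <pthread.h>
    #include <stdio.h>

    struct work { int id; struct work *next; };

    static pthread_mutex_t acct_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct work *pending;     /* singly linked pending list */
    static struct work *next_work;   /* work taken but not yet running */

    static struct work *worker_get_next(void)
    {
            pthread_mutex_lock(&acct_lock);
            struct work *w = pending;
            if (w) {
                    pending = w->next;
                    next_work = w;   /* publish before unlocking */
            }
            pthread_mutex_unlock(&acct_lock);
            return w;
    }

    static int cancel(int id)
    {
            int found = 0;
            pthread_mutex_lock(&acct_lock);
            for (struct work *w = pending; w; w = w->next)
                    if (w->id == id)
                            found = 1;
            if (next_work && next_work->id == id)
                    found = 1;       /* still reachable even if dequeued */
            pthread_mutex_unlock(&acct_lock);
            return found;
    }

    int main(void)
    {
            struct work w = { .id = 42, .next = NULL };
            pending = &w;
            worker_get_next();
            printf("cancellable after dequeue: %d\n", cancel(42));
            return 0;
    }

If next_work were assigned only after the unlock, as in the buggy ordering, a cancel running in the gap would see the item on neither the list nor next_work.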
2024-04-17  platform/x86/amd/pmc: Extend Framework 13 quirk to more BIOSes  (Mario Limonciello)
BIOS 03.05 still hasn't fixed the spurious IRQ1 issue. As it's still being worked on, there is still a possibility that the quirk won't need to apply to future BIOS releases. Add a quirk for BIOS 03.05 as well. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20240410141046.433-1-mario.limonciello@amd.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-04-17  ASoC: SOF: Core: Handle error returned by sof_select_ipc_and_paths  (Peter Ujfalusi)
The patch which fixed the missing remove_late() calls missed a case when sof_select_ipc_and_paths() could return with error and in this case sof_init_environment() would just return with 0. Do not ignore the error code returned by sof_select_ipc_and_paths(). Fixes: 90f8917e7a15 ("ASoC: SOF: Core: Add remove_late() to sof_init_environment failure path") Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20240417075804.10829-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-04-17  drm/xe/vm: prevent UAF with asid based lookup  (Matthew Auld)
The asid is only erased from the xarray when the vm refcount reaches zero, however this leads to a potential UAF since xe_vm_get() only works on a vm with refcount != 0. Since the asid is allocated in the vm create ioctl, erase it instead when closing the vm, prior to dropping the potential last ref. This should also work when the user closes the driver fd without an explicit vm destroy. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1594 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240412113144.259426-4-matthew.auld@intel.com (cherry picked from commit 83967c57320d0d01ae512f10e79213f81e4bf594) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-17  drm/xe: Fix bo leak in intel_fb_bo_framebuffer_init  (Maarten Lankhorst)
Add a bo unreference in the error path to prevent leaking a bo ref. Return 0 on success to clarify the success path. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Fixes: 44e694958b95 ("drm/xe/display: Implement display support") Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240404090302.68422-1-maarten.lankhorst@linux.intel.com (cherry picked from commit a2f3d731be3893e730417ae3190760fcaffdf549) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-17  s390/ism: Properly fix receive message buffer allocation  (Gerd Bayer)
Since [1], dma_alloc_coherent() does not accept requests for GFP_COMP anymore, even on archs that may be able to fulfill this. Functionality that relied on the receive buffer being a compound page broke at that point: The SMC-D protocol, that utilizes the ism device driver, passes receive buffers to the splice processor in a struct splice_pipe_desc with a single entry list of struct pages. As the buffer is no longer a compound page, the splice processor now rejects requests to handle more than a page worth of data. Replace dma_alloc_coherent() and allocate a buffer with folio_alloc and create a DMA map for it with dma_map_page(). Since only receive buffers on ISM devices use DMA, qualify the mapping as FROM_DEVICE. Since ISM devices are available on arch s390, only, and on that arch all DMA is coherent, there is no need to introduce and export some kind of dma_sync_to_cpu() method to be called by the SMC-D protocol layer. Analogously, replace dma_free_coherent by a two step dma_unmap_page, then folio_put to free the receive buffer. [1] https://lore.kernel.org/all/20221113163535.884299-1-hch@lst.de/ Fixes: c08004eede4b ("s390/ism: don't pass bogus GFP_ flags to dma_alloc_coherent") Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
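For illustration, a hedged kernel-style sketch of the allocation/mapping pattern described above (not the actual ism driver code; the function and parameter names here are illustrative): allocate the receive buffer as a folio and map it for device-to-CPU DMA instead of using dma_alloc_coherent().

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/dma-mapping.h>

    /* Illustrative helper: allocate a compound receive buffer as a folio
     * and create a FROM_DEVICE DMA mapping for it. */
    static void *alloc_rx_buf(struct device *dev, unsigned int order,
                              dma_addr_t *dma)
    {
            struct folio *folio = folio_alloc(GFP_KERNEL | __GFP_NOWARN, order);

            if (!folio)
                    return NULL;

            *dma = dma_map_page(dev, folio_page(folio, 0), 0,
                                PAGE_SIZE << order, DMA_FROM_DEVICE);
            if (dma_mapping_error(dev, *dma)) {
                    folio_put(folio);
                    return NULL;
            }
            return folio_address(folio);
    }

    /* Illustrative helper: two-step teardown, unmap then drop the folio ref. */
    static void free_rx_buf(struct device *dev, void *buf, dma_addr_t dma,
                            unsigned int order)
    {
            dma_unmap_page(dev, dma, PAGE_SIZE << order, DMA_FROM_DEVICE);
            folio_put(virt_to_folio(buf));
    }

Because the buffer is a real compound page again, upper layers such as the SMC-D splice path can hand out more than a page of data at a time.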
2024-04-17  random: handle creditable entropy from atomic process context  (Jason A. Donenfeld)
The entropy accounting changes a static key when the RNG has initialized, since it only ever initializes once. Static key changes, however, cannot be made from atomic context, so depending on where the last creditable entropy comes from, the static key change might need to be deferred to a worker. Previously the code used the execute_in_process_context() helper function, which accounts for whether or not the caller is in_interrupt(). However, that doesn't account for the case where the caller is actually in process context but is holding a spinlock. This turned out to be the case with input_handle_event() in drivers/input/input.c contributing entropy:
    [<ffffffd613025ba0>] die+0xa8/0x2fc
    [<ffffffd613027428>] bug_handler+0x44/0xec
    [<ffffffd613016964>] brk_handler+0x90/0x144
    [<ffffffd613041e58>] do_debug_exception+0xa0/0x148
    [<ffffffd61400c208>] el1_dbg+0x60/0x7c
    [<ffffffd61400c000>] el1h_64_sync_handler+0x38/0x90
    [<ffffffd613011294>] el1h_64_sync+0x64/0x6c
    [<ffffffd613102d88>] __might_resched+0x1fc/0x2e8
    [<ffffffd613102b54>] __might_sleep+0x44/0x7c
    [<ffffffd6130b6eac>] cpus_read_lock+0x1c/0xec
    [<ffffffd6132c2820>] static_key_enable+0x14/0x38
    [<ffffffd61400ac08>] crng_set_ready+0x14/0x28
    [<ffffffd6130df4dc>] execute_in_process_context+0xb8/0xf8
    [<ffffffd61400ab30>] _credit_init_bits+0x118/0x1dc
    [<ffffffd6138580c8>] add_timer_randomness+0x264/0x270
    [<ffffffd613857e54>] add_input_randomness+0x38/0x48
    [<ffffffd613a80f94>] input_handle_event+0x2b8/0x490
    [<ffffffd613a81310>] input_event+0x6c/0x98
According to Guoyong, it's not really possible to refactor the various drivers to never hold a spinlock there. And in_atomic() isn't reliable. So, rather than trying to be too fancy, just punt the change in the static key to a workqueue always. There's basically no drawback of doing this, as the code already needed to account for the static key not changing immediately, and given that it's just an optimization, there's not exactly a hurry to change the static key right away, so deferral is fine. Reported-by: Guoyong Wang <guoyong.wang@mediatek.com> Cc: stable@vger.kernel.org Fixes: f5bda35fba61 ("random: use static branch for crng_ready()") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
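For illustration, a hedged kernel-style sketch of the approach described above (not the random.c code; the identifiers are made up): never flip the static key inline, always punt it to a workqueue, since a caller may hold a spinlock even when in_interrupt() is false.

    #include <linux/workqueue.h>
    #include <linux/jump_label.h>

    static DEFINE_STATIC_KEY_FALSE(crng_is_ready);   /* illustrative key */

    static void crng_set_ready_workfn(struct work_struct *work)
    {
            static_branch_enable(&crng_is_ready);    /* may sleep, safe in a worker */
    }
    static DECLARE_WORK(crng_set_ready_work, crng_set_ready_workfn);

    static void credit_init_bits_done(void)
    {
            /* Defer unconditionally: atomic or lock-holding callers are fine. */
            schedule_work(&crng_set_ready_work);
    }

The cost is only that readers keep seeing the key as "not ready" for a short while after the last entropy credit, which the code already had to tolerate.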
2024-04-17  Merge patch series 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org  (Christian Brauner)
Pull shmem_rename2() offset fixes from Chuck Lever: The existing code in shmem_rename2() allocates a fresh directory offset value when renaming over an existing destination entry. User space does not expect this behavior. In particular, applications that rename while walking a directory can loop indefinitely because they never reach the end of the directory.
* 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org: (3 commits)
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()
 fs/libfs.c         | 55 +++++++++++++++++++++++++++++++++++++++++-----
 include/linux/fs.h |  2 ++
 mm/shmem.c         |  3 +--
 3 files changed, 52 insertions(+), 8 deletions(-)
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17  shmem: Fix shmem_rename2()  (Chuck Lever)
When renaming onto an existing directory entry, user space expects the replacement entry to have the same directory offset as the original one. Link: https://gitlab.alpinelinux.org/alpine/aports/-/issues/15966 Fixes: a2e459555c5f ("shmem: stable directory offsets") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-4-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17  libfs: Add simple_offset_rename() API  (Chuck Lever)
I'm about to fix a tmpfs rename bug that requires the use of internal simple_offset helpers that are not available in mm/shmem.c. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-3-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>