summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-02-10nvmet: make nvmet_find_namespace() req basedChaitanya Kulkarni
The six callers of nvmet_find_namespace() duplicate the error log page update and status setting code for each call on failure. All callers are nvmet requests based functions, so we can pass req to the nvmet_find_namesapce() & derive ctrl from req, that'll allow us to update the error log page in nvmet_find_namespace(). Now that we pass the request we can also get rid of the local variable in nvmet_find_namespace() and use the req->ns and return the error code. Replace the ctrl parameter with nvmet_req for nvmet_find_namespace(), centralize the error log page update for non allocated namesapces, and return uniform error for non-allocated namespace. The nvmet_find_namespace() takes nsid parameter which is from NVMe commands structures such as get_log_page, identify, rw and common. All these commands have same offset for the nsid field. Derive nsid from req->cmd->common.nsid) & remove the extra parameter from the nvmet_find_namespace(). Lastly now we associate the ns to the req parameter that we pass to the nvmet_find_namespace(), rename nvmet_find_namespace() to nvmet_req_find_ns(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvmet: return uniform error for invalid nsChaitanya Kulkarni
For nvmet_find_namespace() error case we have inconsistent error code mapping in the function nvmet_get_smart_log_nsid() and nvmet_set_feat_write_protect(). There is no point in retrying for the invalid namesapce from the host side. Set the error code to the NVME_SC_INVALID_NS | NVME_SC_DNR which matches what we have in nvmet_execute_identify_desclist(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvmet: set status to 0 in case for invalid nsidChaitanya Kulkarni
For unallocated namespace in nvmet_execute_identify_ns() don't set the status to NVME_SC_INVALID_NS, set it to zero. Fixes: bffcd507780e ("nvmet: set right status on error in id-ns handler") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvmet-fc: add a missing __rcu annotation to nvmet_fc_tgt_assoc.queuesChristoph Hellwig
Make sparse happy after the recent conversion to RCU lookups. Fixes: 4e2f02bf77da ("nvmet-fc: use RCU proctection for assoc_list") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: James Smart <james.smart@broadcom.com>
2021-02-10nvme-multipath: set nr_zones for zoned namespacesKeith Busch
The bio based drivers only require the request_queue's nr_zones is set, so set this field in the head if the namespace path is zoned. Fixes: 240e6ee272c07 ("nvme: support for zoned namespaces") Reported-by: Minwoo Im <minwoo.im.dev@gmail.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvmet-tcp: fix potential race of tcp socket closing accept_workSagi Grimberg
When we accept a TCP connection and allocate an nvmet-tcp queue we should make sure not to fully establish it or reference it as the connection may be already closing, which triggers queue release work, which does not fence against queue establishment. In order to address such a race, we make sure to check the sk_state and contain the queue reference to be done underneath the sk_callback_lock such that the queue release work correctly fences against it. Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Reported-by: Elad Grupi <elad.grupi@dell.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvmet-tcp: fix receive data digest calculation for multiple h2cdata PDUsSagi Grimberg
When a host sends multiple h2cdata PDUs for a single command, we should verify the data digest calculation per PDU and not per command. Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Reported-by: Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com> Tested-by: Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvme-rdma: handle nvme_rdma_post_send failures betterChao Leng
nvme_rdma_post_send failing is a path related error and should bounce to another path when using nvme-multipath. Call nvme_host_path_error when nvme_rdma_post_send returns -EIO to ensure nvme_complete_rq gets invoked to fail over to another path if there is one. Signed-off-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvme-fabrics: avoid double completions in nvmf_fail_nonready_commandChao Leng
When reconnecting, the request may be completed with NVME_SC_HOST_PATH_ERROR in nvmf_fail_nonready_command, which currently set the state of the request to MQ_RQ_IN_FLIGHT before calling nvme_complete_rq. When this happens for a request that is freed by the caller, such as nvme_submit_user_cmd, in the worst case the request could be completed again in tear down process. Instead of calling blk_mq_start_request from nvmf_fail_nonready_command, just use the new nvme_host_path_error helper to complete the command without starting it. Signed-off-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvme: introduce a nvme_host_path_error helperChao Leng
When using nvme native multipathing, if a path related error occurs during ->queue_rq, the request needs to be completed with NVME_SC_HOST_PATH_ERROR so that the request can be failed over. Introduce a helper to complete the command from ->queue_rq in a wait that invokes nvme_complete_rq. Signed-off-by: Chao Leng <lengchao@huawei.com> [hch: renamed, added a return value to clean up the callers a bit] Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10blk-mq: introduce blk_mq_set_request_completeChao Leng
nvme drivers need to set the state of request to MQ_RQ_COMPLETE when directly complete request in queue_rq. So add blk_mq_set_request_complete. Signed-off-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10nvme: convert sysfs sprintf/snprintf family to sysfs_emitJiapeng Chong
Fix the following coccicheck warning: ./drivers/nvme/host/core.c:3580:8-16: WARNING: use scnprintf or sprintf. ./drivers/nvme/host/core.c:3570:8-16: WARNING: use scnprintf or sprintf. ./drivers/nvme/host/core.c:3560:8-16: WARNING: use scnprintf or sprintf. ./drivers/nvme/host/core.c:3526:8-16: WARNING: use scnprintf or sprintf. ./drivers/nvme/host/core.c:2833:8-16: WARNING: use scnprintf or sprintf. Reported-by: Abaci Robot<abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-10x86/fault: Don't run fixups for SMAP violationsAndy Lutomirski
A SMAP-violating kernel access is not a recoverable condition. Imagine kernel code that, outside of a uaccess region, dereferences a pointer to the user range by accident. If SMAP is on, this will reliably generate as an intentional user access. This makes it easy for bugs to be overlooked if code is inadequately tested both with and without SMAP. This was discovered because BPF can generate invalid accesses to user memory, but those warnings only got printed if SMAP was off. Make it so that this type of error will be discovered with SMAP on as well. [ bp: Massage commit message. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/66a02343624b1ff46f02a838c497fc05c1a871b3.1612924255.git.luto@kernel.org
2021-02-10mm: simplify swapdev_blockChristoph Hellwig
Open code the parts of map_swap_entry that was actually used by swapdev_block, and remove the now unused map_swap_entry function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10bcache: Avoid comma separated statementsJoe Perches
Use semicolons and braces. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10bcache: Move journal work to new flush wqKai Krakow
This is potentially long running and not latency sensitive, let's get it out of the way of other latency sensitive events. As observed in the previous commit, the `system_wq` comes easily congested by bcache, and this fixes a few more stalls I was observing every once in a while. Let's not make this `WQ_MEM_RECLAIM` as it showed to reduce performance of boot and file system operations in my tests. Also, without `WQ_MEM_RECLAIM`, I no longer see desktop stalls. This matches the previous behavior as `system_wq` also does no memory reclaim: > // workqueue.c: > system_wq = alloc_workqueue("events", 0, 0); Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow <kai@kaishome.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10bcache: Give btree_io_wq correct semantics againKai Krakow
Before killing `btree_io_wq`, the queue was allocated using `create_singlethread_workqueue()` which has `WQ_MEM_RECLAIM`. After killing it, it no longer had this property but `system_wq` is not single threaded. Let's combine both worlds and make it multi threaded but able to reclaim memory. Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow <kai@kaishome.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10Revert "bcache: Kill btree_io_wq"Kai Krakow
This reverts commit 56b30770b27d54d68ad51eccc6d888282b568cee. With the btree using the `system_wq`, I seem to see a lot more desktop latency than I should. After some more investigation, it looks like the original assumption of 56b3077 no longer is true, and bcache has a very high potential of congesting the `system_wq`. In turn, this introduces laggy desktop performance, IO stalls (at least with btrfs), and input events may be delayed. So let's revert this. It's important to note that the semantics of using `system_wq` previously mean that `btree_io_wq` should be created before and destroyed after other bcache wqs to keep the same assumptions. Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow <kai@kaishome.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10bcache: Fix register_device_aync typoKai Krakow
Should be `register_device_async`. Cc: Coly Li <colyli@suse.de> Signed-off-by: Kai Krakow <kai@kaishome.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10bcache: consider the fragmentation when update the writeback ratedongdong tao
Current way to calculate the writeback rate only considered the dirty sectors, this usually works fine when the fragmentation is not high, but it will give us unreasonable small rate when we are under a situation that very few dirty sectors consumed a lot dirty buckets. In some case, the dirty bucekts can reached to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) not even reached the writeback_percent, the writeback rate will still be the minimum value (4k), thus it will cause all the writes to be stucked in a non-writeback mode because of the slow writeback. We accelerate the rate in 3 stages with different aggressiveness, the first stage starts when dirty buckets percent reach above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second is BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third is BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first stage tries to writeback the amount of dirty data in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second, the second stage tries to writeback the amount of dirty data in one bucket in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, the third stage tries to writeback the amount of dirty data in one bucket in (1 / (dirty_buckets_percent - 64)) millisecond. the initial rate at each stage can be controlled by 3 configurable parameters writeback_rate_fp_term_{low|mid|high}, they are by default 1, 10, 1000, the hint of IO throughput that these values are trying to achieve is described by above paragraph, the reason that I choose those value as default is based on the testing and the production data, below is some details: A. When it comes to the low stage, there is still a bit far from the 70 threshold, so we only want to give it a little bit push by setting the term to 1, it means the initial rate will be 170 if the fragment is 6, it is calculated by bucket_size/fragment, this rate is very small, but still much reasonable than the minimum 8. For a production bcache with unheavy workload, if the cache device is bigger than 1 TB, it may take hours to consume 1% buckets, so it is very possible to reclaim enough dirty buckets in this stage, thus to avoid entering the next stage. B. If the dirty buckets ratio didn't turn around during the first stage, it comes to the mid stage, then it is necessary for mid stage to be more aggressive than low stage, so i choose the initial rate to be 10 times more than low stage, that means 1700 as the initial rate if the fragment is 6. This is some normal rate we usually see for a normal workload when writeback happens because of writeback_percent. C. If the dirty buckets ratio didn't turn around during the low and mid stages, it comes to the third stage, and it is the last chance that we can turn around to avoid the horrible cutoff writeback sync issue, then we choose 100 times more aggressive than the mid stage, that means 170000 as the initial rate if the fragment is 6. This is also inferred from a production bcache, I've got one week's writeback rate data from a production bcache which has quite heavy workloads, again, the writeback is triggered by the writeback percent, the highest rate area is around 100000 to 240000, so I believe this kind aggressiveness at this stage is reasonable for production. And it should be mostly enough because the hint is trying to reclaim 1000 bucket per second, and from that heavy production env, it is consuming 50 bucket per second on average in one week's data. Option writeback_consider_fragment is to control whether we want this feature to be on or off, it's on by default. Lastly, below is the performance data for all the testing result, including the data from production env: https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing Signed-off-by: dongdong tao <dongdong.tao@canonical.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10usb: quirks: add quirk to start video capture on ELMO L-12F document camera ↵Stefan Ursella
reliable Without this quirk starting a video capture from the device often fails with kernel: uvcvideo: Failed to set UVC probe control : -110 (exp. 34). Signed-off-by: Stefan Ursella <stefan.ursella@wolfvision.net> Link: https://lore.kernel.org/r/20210210140713.18711-1-stefan.ursella@wolfvision.net Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10Merge tag 'usb-serial-5.12-rc1' of ↵Greg Kroah-Hartman
https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-next Johan writes: USB-serial updates for 5.12-rc1 Here are the USB-serial updates for 5.12-rc1, including: - a line-speed fix for newer pl2303 devices - a line-speed fix for FTDI FT-X devices - a new xr_serial driver for MaxLinear/Exar devices (non-ACM mode) - a cdc-acm blacklist entry for when the xr_serial driver is enabled - cp210x support for software flow control - various cp210x modem-control fixes - an updated ZTE P685M modem entry to stop claiming the QMI interface - an update to drop the port_remove() driver-callback return value Included are also various clean ups. All have been in linux-next with no reported issues. * tag 'usb-serial-5.12-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial: (41 commits) USB: serial: drop bogus to_usb_serial_port() checks USB: serial: make remove callback return void USB: serial: drop if with an always false condition USB: serial: option: update interface mapping for ZTE P685M USB: serial: ftdi_sio: restore divisor-encoding comments USB: serial: ftdi_sio: fix FTX sub-integer prescaler USB: serial: cp210x: clean up auto-RTS handling USB: serial: cp210x: fix RTS handling USB: serial: cp210x: clean up printk zero padding USB: serial: cp210x: clean up flow-control debug message USB: serial: cp210x: drop shift macros USB: serial: cp210x: fix modem-control handling USB: serial: cp210x: suppress modem-control errors USB: serial: mos7720: fix error code in mos7720_write() USB: serial: xr: fix B0 handling USB: serial: xr: fix pin configuration USB: serial: xr: fix gpio-mode handling USB: serial: xr: simplify line-speed logic USB: serial: xr: clean up line-settings handling USB: serial: xr: document vendor-request recipient ...
2021-02-10sd_zbc: clear zone resources for non-zoned caseDamien Le Moal
For host-aware ZBC disk, setting the device zoned model to BLK_ZONED_HA using blk_queue_set_zoned() in sd_read_block_characteristics() may result in the block device effective zoned model to be "none" (BLK_ZONED_NONE) if partitions are present on the device. In this case, sd_zbc_read_zones() should not setup the zone related queue limits for the disk so that the device limits and configuration is consistent with a regular disk and resources not uselessly allocated (e.g. the zone write pointer tracking array for zone append emulation). Furthermore, if the disk zoned model changes at run time due to the creation of a partition by the user, the zone related resources can be released. Fix both problems by introducing the function sd_zbc_clear_zone_info() to reset the scsi disk zone information and free resources and by returning early in sd_zbc_read_zones() for a block device that has a zoned model equal to BLK_ZONED_NONE. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10block: introduce blk_queue_clear_zone_settings()Damien Le Moal
Introduce the internal function blk_queue_clear_zone_settings() to cleanup all limits and resources related to zoned block devices. This new function is called from blk_queue_set_zoned() when a disk zoned model is set to BLK_ZONED_NONE. This particular case can happens when a partition is created on a host-aware scsi disk. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10zonefs: use zone write granularity as block sizeDamien Le Moal
Zoned block devices have different granularity constraints for write operations into sequential zones. E.g. ZBC and ZAC devices require that writes be aligned to the device physical block size while NVMe ZNS devices allow logical block size aligned write operations. To correctly handle such difference, use the device zone write granularity limit to set the block size of a zonefs volume, thus allowing the smallest possible write unit for all zoned device types. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10block: introduce zone_write_granularity limitDamien Le Moal
Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that all writes into sequential write required zones be aligned to the device physical block size. However, NVMe ZNS does not have this constraint and allows write operations into sequential zones to be aligned to the device logical block size. This inconsistency does not help with software portability across device types. To solve this, introduce the zone_write_granularity queue limit to indicate the alignment constraint, in bytes, of write operations into zones of a zoned block device. This new limit is exported as a read-only sysfs queue attribute and the helper blk_queue_zone_write_granularity() introduced for drivers to set this limit. The function blk_queue_set_zoned() is modified to set this new limit to the device logical block size by default. NVMe ZNS devices as well as zoned nullb devices use this default value as is. The scsi disk driver is modified to execute the blk_queue_zone_write_granularity() helper to set the zone write granularity of host-managed SMR disks to the disk physical block size. The accessor functions queue_zone_write_granularity() and bdev_zone_write_granularity() are also introduced. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10block: use blk_queue_set_zoned in add_partition()Damien Le Moal
When changing the zoned model of host-aware zoned block devices, use blk_queue_set_zoned() instead of directly assigning the gendisk queue zoned limit. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10nullb: use blk_queue_set_zoned() to setup zoned devicesDamien Le Moal
Use blk_queue_set_zoned() to set a nullb device zone model instead of directly assigning the device queue zoned limit. This initialization of the devicve zoned model as well as the setup of the queue flag QUEUE_FLAG_ZONE_RESETALL and of the device queue elevator feature are moved from null_init_zoned_dev() to null_register_zoned_dev() so that the initialization of the queue limits is done when the gendisk of the nullb device is available. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10nvme: cleanup zone information initializationDamien Le Moal
For a zoned namespace, in nvme_update_ns_info(), call nvme_update_zone_info() after executing nvme_update_disk_info() so that the namespace queue logical and physical block size limits are set. This allows setting the namespace queue max_zone_append_sectors limit in nvme_update_zone_info() instead of nvme_revalidate_zones(), simplifying this function. Also use blk_queue_set_zoned() to set the namespace zoned model. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10block: document zone_append_max_bytes attributeDamien Le Moal
The description of the zone_append_max_bytes sysfs queue attribute is missing from Documentation/block/queue-sysfs.rst. Add it. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: place ring SQ/CQ arrays under memcg memory limitsJens Axboe
Instead of imposing rlimit memlock limits for the rings themselves, ensure that we account them properly under memcg with __GFP_ACCOUNT. We retain rlimit memlock for registered buffers, this is just for the ring arrays themselves. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: enable kmemcg account for io_uring requestsJens Axboe
This puts io_uring under the memory cgroups accounting and limits for requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: enable req cache for IRQ driven IOJens Axboe
This is the last class of requests that cannot utilize the req alloc cache. Add a per-ctx req cache that is protected by the completion_lock, and refill our submit side cache when it gets over our batch count. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: fix possible deadlock in io_uring_pollHao Xu
Abaci reported follow issue: [ 30.615891] ====================================================== [ 30.616648] WARNING: possible circular locking dependency detected [ 30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted [ 30.618035] ------------------------------------------------------ [ 30.618914] a.out/1128 is trying to acquire lock: [ 30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220 [ 30.620505] [ 30.620505] but task is already holding lock: [ 30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 30.622349] [ 30.622349] which lock already depends on the new lock. [ 30.622349] [ 30.623289] [ 30.623289] the existing dependency chain (in reverse order) is: [ 30.624243] [ 30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}: [ 30.625263] lock_acquire+0x2c7/0x390 [ 30.625868] __mutex_lock+0xae/0x9f0 [ 30.626451] io_cqring_overflow_flush.part.95+0x6d/0x70 [ 30.627278] io_uring_poll+0xcb/0xd0 [ 30.627890] ep_item_poll.isra.14+0x4e/0x90 [ 30.628531] do_epoll_ctl+0xb7e/0x1120 [ 30.629122] __x64_sys_epoll_ctl+0x70/0xb0 [ 30.629770] do_syscall_64+0x2d/0x40 [ 30.630332] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.631187] [ 30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}: [ 30.631985] check_prevs_add+0x226/0xb00 [ 30.632584] __lock_acquire+0x1237/0x13a0 [ 30.633207] lock_acquire+0x2c7/0x390 [ 30.633740] __mutex_lock+0xae/0x9f0 [ 30.634258] __ep_eventpoll_poll+0x9f/0x220 [ 30.634879] __io_arm_poll_handler+0xbf/0x220 [ 30.635462] io_issue_sqe+0xa6b/0x13e0 [ 30.635982] __io_queue_sqe+0x10b/0x550 [ 30.636648] io_queue_sqe+0x235/0x470 [ 30.637281] io_submit_sqes+0xcce/0xf10 [ 30.637839] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 30.638465] do_syscall_64+0x2d/0x40 [ 30.638999] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.639643] [ 30.639643] other info that might help us debug this: [ 30.639643] [ 30.640618] Possible unsafe locking scenario: [ 30.640618] [ 30.641402] CPU0 CPU1 [ 30.641938] ---- ---- [ 30.642664] lock(&ctx->uring_lock); [ 30.643425] lock(&ep->mtx); [ 30.644498] lock(&ctx->uring_lock); [ 30.645668] lock(&ep->mtx); [ 30.646321] [ 30.646321] *** DEADLOCK *** [ 30.646321] [ 30.647642] 1 lock held by a.out/1128: [ 30.648424] #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 30.649954] [ 30.649954] stack backtrace: [ 30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1 [ 30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 30.652290] Call Trace: [ 30.652688] dump_stack+0xac/0xe3 [ 30.653164] check_noncircular+0x11e/0x130 [ 30.653747] ? check_prevs_add+0x226/0xb00 [ 30.654303] check_prevs_add+0x226/0xb00 [ 30.654845] ? add_lock_to_list.constprop.49+0xac/0x1d0 [ 30.655564] __lock_acquire+0x1237/0x13a0 [ 30.656262] lock_acquire+0x2c7/0x390 [ 30.656788] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.657379] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.658014] __mutex_lock+0xae/0x9f0 [ 30.658524] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.659112] ? mark_held_locks+0x5a/0x80 [ 30.659648] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.660229] ? _raw_spin_unlock_irqrestore+0x2d/0x40 [ 30.660885] ? trace_hardirqs_on+0x46/0x110 [ 30.661471] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.662102] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.662696] __ep_eventpoll_poll+0x9f/0x220 [ 30.663273] ? __ep_eventpoll_poll+0x220/0x220 [ 30.663875] __io_arm_poll_handler+0xbf/0x220 [ 30.664463] io_issue_sqe+0xa6b/0x13e0 [ 30.664984] ? __lock_acquire+0x782/0x13a0 [ 30.665544] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.666170] ? __io_queue_sqe+0x10b/0x550 [ 30.666725] __io_queue_sqe+0x10b/0x550 [ 30.667252] ? __fget_files+0x131/0x260 [ 30.667791] ? io_req_prep+0xd8/0x1090 [ 30.668316] ? io_queue_sqe+0x235/0x470 [ 30.668868] io_queue_sqe+0x235/0x470 [ 30.669398] io_submit_sqes+0xcce/0xf10 [ 30.669931] ? xa_load+0xe4/0x1c0 [ 30.670425] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 30.671051] ? lockdep_hardirqs_on_prepare+0xde/0x180 [ 30.671719] ? syscall_enter_from_user_mode+0x2b/0x80 [ 30.672380] do_syscall_64+0x2d/0x40 [ 30.672901] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.673503] RIP: 0033:0x7fd89c813239 [ 30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48 [ 30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa [ 30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239 [ 30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003 [ 30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000 [ 30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0 [ 30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000 This might happen if we do epoll_wait on a uring fd while reading/writing the former epoll fd in a sqe in the former uring instance. So let's don't flush cqring overflow list, just do a simple check. Reported-by: Abaci <abaci@linux.alibaba.com> Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: defer flushing cached reqsPavel Begunkov
Awhile there are requests in the allocation cache -- use them, only if those ended go for the stashed memory in comp.free_list. As list manipulation are generally heavy and are not good for caches, flush them all or as much as can in one go. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: return success/failure from io_flush_cached_reqs()] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: take comp_state from ctxPavel Begunkov
__io_queue_sqe() is always called with a non-NULL comp_state, which is taken directly from context. Don't pass it around but infer from ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: enable req cache for task_work itemsJens Axboe
task_work is run without utilizing the req alloc cache, so any deferred items don't get to take advantage of either the alloc or free side of it. With task_work now being wrapped by io_uring, we can use the ctx completion state to both use the req cache and the completion flush batching. With this, the only request type that cannot take advantage of the req cache is IRQ driven IO for regular files / block devices. Anything else, including IOPOLL polled IO to those same tyes, will take advantage of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: provide FIFO ordering for task_workJens Axboe
task_work is a LIFO list, due to how it's implemented as a lockless list. For long chains of task_work, this can be problematic as the first entry added is the last one processed. Similarly, we'd waste a lot of CPU cycles reversing this list. Wrap the task_work so we have a single task_work entry per task per ctx, and use that to run it in the right order. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: use persistent request cacheJens Axboe
Now that we have the submit_state in the ring itself, we can have io_kiocb allocations that are persistent across invocations. This reduces the time spent doing slab allocations and frees. [sil: rebased] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: feed reqs back into alloc cachePavel Begunkov
Make io_req_free_batch(), which is used for inline executed requests and IOPOLL, to return requests back into the allocation cache, so avoid most of kmalloc()/kfree() for those cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: persistent req cachePavel Begunkov
Don't free batch-allocated requests across syscalls. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: count ctx refs separately from reqsPavel Begunkov
Currently batch free handles request memory freeing and ctx ref putting together. Separate them and use different counters, that will be needed for reusing reqs memory. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: remove fallback_reqPavel Begunkov
Remove fallback_req for now, it gets in the way of other changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: submit-completion free batchingPavel Begunkov
io_submit_flush_completions() does completion batching, but may also use free batching as iopoll does. The main beneficiaries should be buffered reads/writes and send/recv. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: replace list with array for compl batchPavel Begunkov
Reincarnation of an old patch that replaces a list in struct io_compl_batch with an array. It's needed to avoid hooking requests via their compl.list, because it won't be always available in the future. It's also nice to split io_submit_flush_completions() to avoid free under locks and remove unlock/lock with a long comment describing when it can be done. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: don't reinit submit state every timePavel Begunkov
As now submit_state is retained across syscalls, we can save ourself from initialising it from ground up for each io_submit_sqes(). Set some fields during ctx allocation, and just keep them always consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: remove unnecessary zeroing of ctx members] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: remove ctx from comp_statePavel Begunkov
completion state is closely bound to ctx, we don't need to store ctx inside as we always have it around to pass to flush. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: don't keep submit_state on stackPavel Begunkov
struct io_submit_state is quite big (168 bytes) and going to grow. It's better to not keep it on stack as it is now. Move it to context, it's always protected by uring_lock, so it's fine to have only one instance of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10io_uring: don't propagate io_comp_statePavel Begunkov
There is no reason to drag io_comp_state into opcode handlers, we just need a flag and the actual work will be done in __io_queue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10Revert "drm/scheduler: Job timeout handler returns status (v3)"Maarten Lankhorst
This reverts commit c10983e14e8f5d7c8dab0415e0cb7fe8d10aa9e3. This commit is not meant for drm-misc-next-fixes, and was accidentally cherry picked over. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>