Age | Commit message | Author
2021-02-10 | bcache: Fix register_device_aync typo | Kai Krakow
Should be `register_device_async`. Cc: Coly Li <colyli@suse.de> Signed-off-by: Kai Krakow <kai@kaishome.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | bcache: consider the fragmentation when update the writeback rate | dongdong tao
The current way of calculating the writeback rate considers only the dirty sectors. This usually works fine when fragmentation is low, but it gives an unreasonably small rate when very few dirty sectors consume a lot of dirty buckets. In some cases the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even reached writeback_percent; the writeback rate then stays at the minimum value (4k), so all writes get stuck in a non-writeback mode because writeback is too slow.
We accelerate the rate in 3 stages with different aggressiveness: the first stage starts when the dirty buckets percent rises above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first stage tries to write back the amount of dirty data in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second, the second stage in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third stage in (1 / (dirty_buckets_percent - 64)) millisecond.
The initial rate at each stage is controlled by 3 configurable parameters, writeback_rate_fp_term_{low|mid|high}, which default to 1, 10 and 1000. The IO throughput these values aim for is described in the paragraph above. I chose these defaults based on testing and production data; some details follow.
A. In the low stage we are still some way from the 70 threshold, so we only want to give a small push by setting the term to 1. This means the initial rate is 170 if the fragment is 6 (calculated as bucket_size/fragment). This rate is very small, but still much more reasonable than the minimum of 8. For a production bcache with a light workload, if the cache device is bigger than 1 TB, it may take hours to consume 1% of the buckets, so it is quite possible to reclaim enough dirty buckets in this stage and avoid entering the next one.
B. If the dirty buckets ratio does not turn around during the first stage, we enter the mid stage, which needs to be more aggressive than the low stage, so I chose an initial rate 10 times that of the low stage: 1700 if the fragment is 6. This is a rate we commonly see for a normal workload when writeback happens because of writeback_percent.
C. If the dirty buckets ratio does not turn around during the low and mid stages, we enter the third stage. It is the last chance to turn around and avoid the horrible cutoff writeback sync issue, so we choose a rate 100 times more aggressive than the mid stage: 170000 if the fragment is 6. This is also inferred from a production bcache: I have one week of writeback rate data from a production bcache with quite heavy workloads, where writeback is again triggered by the writeback percent, and the highest rates are around 100000 to 240000, so I believe this level of aggressiveness is reasonable for production. It should also mostly be enough, because the hint tries to reclaim 1000 buckets per second, while that heavy production environment consumes 50 buckets per second on average over the week.
The option writeback_consider_fragment controls whether this feature is on or off; it is on by default. Lastly, the performance data for all the test results, including the data from the production environment, is at: https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
Signed-off-by: dongdong tao <dongdong.tao@canonical.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
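The three-stage acceleration described in this commit message can be sketched in user-space C. This is a minimal sketch, not the kernel code: the function name fragment_rate() and the struct fp_terms are hypothetical, and the rate is assumed to be counted in 512-byte sectors (which is how the 4k minimum corresponds to 8). Only the thresholds and the default terms 1, 10 and 1000 come from the message.

```c
#include <stdint.h>

/* Thresholds named in the commit message (percent of dirty buckets). */
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW  50
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID  57
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64

/* Hypothetical container for the three tunables (defaults: 1, 10, 1000). */
struct fp_terms {
	int64_t low;
	int64_t mid;
	int64_t high;
};

/*
 * Hypothetical helper: fragmentation-based rate, assumed to be in sectors.
 * Returns 0 when the dirty buckets percent has not passed the low threshold.
 */
static int64_t fragment_rate(int64_t dirty_sectors, int64_t dirty_buckets,
			     int dirty_buckets_percent,
			     const struct fp_terms *t)
{
	int64_t fp_term;

	if (dirty_sectors <= 0 || dirty_buckets <= 0 ||
	    dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW)
		return 0;

	/* Pick the stage, then scale its term by the distance past the threshold. */
	if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID)
		fp_term = t->low * (dirty_buckets_percent -
				    BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
	else if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH)
		fp_term = t->mid * (dirty_buckets_percent -
				    BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
	else
		fp_term = t->high * (dirty_buckets_percent -
				     BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);

	/* Average dirty sectors per dirty bucket, times the stage term. */
	return (dirty_sectors / dirty_buckets) * fp_term;
}
```

With an average of 170 dirty sectors per bucket (bucket_size/fragment for a fragment of 6, assuming 1024-sector buckets), the sketch yields 170, 1700 and 170000 at one percent above each threshold, which matches the initial rates quoted in A, B and C.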
2021-02-10 | usb: quirks: add quirk to start video capture on ELMO L-12F document camera reliable | Stefan Ursella
Without this quirk starting a video capture from the device often fails with kernel: uvcvideo: Failed to set UVC probe control : -110 (exp. 34). Signed-off-by: Stefan Ursella <stefan.ursella@wolfvision.net> Link: https://lore.kernel.org/r/20210210140713.18711-1-stefan.ursella@wolfvision.net Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10 | Merge tag 'usb-serial-5.12-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-next | Greg Kroah-Hartman
Johan writes:
USB-serial updates for 5.12-rc1
Here are the USB-serial updates for 5.12-rc1, including:
- a line-speed fix for newer pl2303 devices
- a line-speed fix for FTDI FT-X devices
- a new xr_serial driver for MaxLinear/Exar devices (non-ACM mode)
- a cdc-acm blacklist entry for when the xr_serial driver is enabled
- cp210x support for software flow control
- various cp210x modem-control fixes
- an updated ZTE P685M modem entry to stop claiming the QMI interface
- an update to drop the port_remove() driver-callback return value
Included are also various clean ups.
All have been in linux-next with no reported issues.
* tag 'usb-serial-5.12-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial: (41 commits)
  USB: serial: drop bogus to_usb_serial_port() checks
  USB: serial: make remove callback return void
  USB: serial: drop if with an always false condition
  USB: serial: option: update interface mapping for ZTE P685M
  USB: serial: ftdi_sio: restore divisor-encoding comments
  USB: serial: ftdi_sio: fix FTX sub-integer prescaler
  USB: serial: cp210x: clean up auto-RTS handling
  USB: serial: cp210x: fix RTS handling
  USB: serial: cp210x: clean up printk zero padding
  USB: serial: cp210x: clean up flow-control debug message
  USB: serial: cp210x: drop shift macros
  USB: serial: cp210x: fix modem-control handling
  USB: serial: cp210x: suppress modem-control errors
  USB: serial: mos7720: fix error code in mos7720_write()
  USB: serial: xr: fix B0 handling
  USB: serial: xr: fix pin configuration
  USB: serial: xr: fix gpio-mode handling
  USB: serial: xr: simplify line-speed logic
  USB: serial: xr: clean up line-settings handling
  USB: serial: xr: document vendor-request recipient
  ...
2021-02-10 | sd_zbc: clear zone resources for non-zoned case | Damien Le Moal
For a host-aware ZBC disk, setting the device zoned model to BLK_ZONED_HA using blk_queue_set_zoned() in sd_read_block_characteristics() may result in the block device effective zoned model being "none" (BLK_ZONED_NONE) if partitions are present on the device. In this case, sd_zbc_read_zones() should not set up the zone related queue limits for the disk, so that the device limits and configuration are consistent with a regular disk and resources are not uselessly allocated (e.g. the zone write pointer tracking array for zone append emulation). Furthermore, if the disk zoned model changes at run time due to the creation of a partition by the user, the zone related resources can be released. Fix both problems by introducing the function sd_zbc_clear_zone_info() to reset the scsi disk zone information and free resources and by returning early in sd_zbc_read_zones() for a block device that has a zoned model equal to BLK_ZONED_NONE. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | block: introduce blk_queue_clear_zone_settings() | Damien Le Moal
Introduce the internal function blk_queue_clear_zone_settings() to clean up all limits and resources related to zoned block devices. This new function is called from blk_queue_set_zoned() when a disk zoned model is set to BLK_ZONED_NONE. This particular case can happen when a partition is created on a host-aware scsi disk. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | zonefs: use zone write granularity as block size | Damien Le Moal
Zoned block devices have different granularity constraints for write operations into sequential zones. E.g. ZBC and ZAC devices require that writes be aligned to the device physical block size while NVMe ZNS devices allow logical block size aligned write operations. To correctly handle such difference, use the device zone write granularity limit to set the block size of a zonefs volume, thus allowing the smallest possible write unit for all zoned device types. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | block: introduce zone_write_granularity limit | Damien Le Moal
Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that all writes into sequential write required zones be aligned to the device physical block size. However, NVMe ZNS does not have this constraint and allows write operations into sequential zones to be aligned to the device logical block size. This inconsistency does not help with software portability across device types. To solve this, introduce the zone_write_granularity queue limit to indicate the alignment constraint, in bytes, of write operations into zones of a zoned block device. This new limit is exported as a read-only sysfs queue attribute and the helper blk_queue_zone_write_granularity() introduced for drivers to set this limit. The function blk_queue_set_zoned() is modified to set this new limit to the device logical block size by default. NVMe ZNS devices as well as zoned nullb devices use this default value as is. The scsi disk driver is modified to execute the blk_queue_zone_write_granularity() helper to set the zone write granularity of host-managed SMR disks to the disk physical block size. The accessor functions queue_zone_write_granularity() and bdev_zone_write_granularity() are also introduced. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
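As a rough illustration of what the new limit means for a writer, the check below tests a write against a given granularity. It is a sketch under assumptions: zone_write_is_aligned() is a hypothetical user-space helper, not a kernel function, and the caller is assumed to have already obtained the granularity in bytes from the zone_write_granularity queue attribute.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a write of len bytes at byte offset
 * offset respects the zone write granularity (in bytes).
 */
static bool zone_write_is_aligned(uint64_t offset, uint64_t len,
				  uint32_t granularity)
{
	/* A granularity of 0 would mean "no zoned limit set"; reject it here. */
	if (granularity == 0)
		return false;

	return offset % granularity == 0 && len % granularity == 0;
}
```

A 512-byte write at offset 512 passes with a granularity of 512 (for example a device whose limit is left at a 512-byte logical block size) but fails with a granularity of 4096 (for example a host-managed SMR disk with 4096-byte physical blocks).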
2021-02-10 | block: use blk_queue_set_zoned in add_partition() | Damien Le Moal
When changing the zoned model of host-aware zoned block devices, use blk_queue_set_zoned() instead of directly assigning the gendisk queue zoned limit. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | nullb: use blk_queue_set_zoned() to setup zoned devices | Damien Le Moal
Use blk_queue_set_zoned() to set a nullb device zone model instead of directly assigning the device queue zoned limit. This initialization of the device zoned model as well as the setup of the queue flag QUEUE_FLAG_ZONE_RESETALL and of the device queue elevator feature are moved from null_init_zoned_dev() to null_register_zoned_dev() so that the initialization of the queue limits is done when the gendisk of the nullb device is available. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | nvme: cleanup zone information initialization | Damien Le Moal
For a zoned namespace, in nvme_update_ns_info(), call nvme_update_zone_info() after executing nvme_update_disk_info() so that the namespace queue logical and physical block size limits are set. This allows setting the namespace queue max_zone_append_sectors limit in nvme_update_zone_info() instead of nvme_revalidate_zones(), simplifying this function. Also use blk_queue_set_zoned() to set the namespace zoned model. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | block: document zone_append_max_bytes attribute | Damien Le Moal
The description of the zone_append_max_bytes sysfs queue attribute is missing from Documentation/block/queue-sysfs.rst. Add it. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: place ring SQ/CQ arrays under memcg memory limits | Jens Axboe
Instead of imposing rlimit memlock limits for the rings themselves, ensure that we account them properly under memcg with __GFP_ACCOUNT. We retain rlimit memlock for registered buffers, this is just for the ring arrays themselves. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: enable kmemcg account for io_uring requests | Jens Axboe
This puts io_uring under the memory cgroups accounting and limits for requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: enable req cache for IRQ driven IO | Jens Axboe
This is the last class of requests that cannot utilize the req alloc cache. Add a per-ctx req cache that is protected by the completion_lock, and refill our submit side cache when it gets over our batch count. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: fix possible deadlock in io_uring_poll | Hao Xu
Abaci reported the following issue:
[ 30.615891] ======================================================
[ 30.616648] WARNING: possible circular locking dependency detected
[ 30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
[ 30.618035] ------------------------------------------------------
[ 30.618914] a.out/1128 is trying to acquire lock:
[ 30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
[ 30.620505]
[ 30.620505] but task is already holding lock:
[ 30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[ 30.622349]
[ 30.622349] which lock already depends on the new lock.
[ 30.622349]
[ 30.623289]
[ 30.623289] the existing dependency chain (in reverse order) is:
[ 30.624243]
[ 30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
[ 30.625263]        lock_acquire+0x2c7/0x390
[ 30.625868]        __mutex_lock+0xae/0x9f0
[ 30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
[ 30.627278]        io_uring_poll+0xcb/0xd0
[ 30.627890]        ep_item_poll.isra.14+0x4e/0x90
[ 30.628531]        do_epoll_ctl+0xb7e/0x1120
[ 30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
[ 30.629770]        do_syscall_64+0x2d/0x40
[ 30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.631187]
[ 30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
[ 30.631985]        check_prevs_add+0x226/0xb00
[ 30.632584]        __lock_acquire+0x1237/0x13a0
[ 30.633207]        lock_acquire+0x2c7/0x390
[ 30.633740]        __mutex_lock+0xae/0x9f0
[ 30.634258]        __ep_eventpoll_poll+0x9f/0x220
[ 30.634879]        __io_arm_poll_handler+0xbf/0x220
[ 30.635462]        io_issue_sqe+0xa6b/0x13e0
[ 30.635982]        __io_queue_sqe+0x10b/0x550
[ 30.636648]        io_queue_sqe+0x235/0x470
[ 30.637281]        io_submit_sqes+0xcce/0xf10
[ 30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
[ 30.638465]        do_syscall_64+0x2d/0x40
[ 30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.639643]
[ 30.639643] other info that might help us debug this:
[ 30.639643]
[ 30.640618]  Possible unsafe locking scenario:
[ 30.640618]
[ 30.641402]        CPU0                    CPU1
[ 30.641938]        ----                    ----
[ 30.642664]   lock(&ctx->uring_lock);
[ 30.643425]                                lock(&ep->mtx);
[ 30.644498]                                lock(&ctx->uring_lock);
[ 30.645668]   lock(&ep->mtx);
[ 30.646321]
[ 30.646321]  *** DEADLOCK ***
[ 30.646321]
[ 30.647642] 1 lock held by a.out/1128:
[ 30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[ 30.649954]
[ 30.649954] stack backtrace:
[ 30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
[ 30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 30.652290] Call Trace:
[ 30.652688]  dump_stack+0xac/0xe3
[ 30.653164]  check_noncircular+0x11e/0x130
[ 30.653747]  ? check_prevs_add+0x226/0xb00
[ 30.654303]  check_prevs_add+0x226/0xb00
[ 30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
[ 30.655564]  __lock_acquire+0x1237/0x13a0
[ 30.656262]  lock_acquire+0x2c7/0x390
[ 30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
[ 30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
[ 30.658014]  __mutex_lock+0xae/0x9f0
[ 30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
[ 30.659112]  ? mark_held_locks+0x5a/0x80
[ 30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
[ 30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
[ 30.660885]  ? trace_hardirqs_on+0x46/0x110
[ 30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
[ 30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
[ 30.662696]  __ep_eventpoll_poll+0x9f/0x220
[ 30.663273]  ? __ep_eventpoll_poll+0x220/0x220
[ 30.663875]  __io_arm_poll_handler+0xbf/0x220
[ 30.664463]  io_issue_sqe+0xa6b/0x13e0
[ 30.664984]  ? __lock_acquire+0x782/0x13a0
[ 30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
[ 30.666170]  ? __io_queue_sqe+0x10b/0x550
[ 30.666725]  __io_queue_sqe+0x10b/0x550
[ 30.667252]  ? __fget_files+0x131/0x260
[ 30.667791]  ? io_req_prep+0xd8/0x1090
[ 30.668316]  ? io_queue_sqe+0x235/0x470
[ 30.668868]  io_queue_sqe+0x235/0x470
[ 30.669398]  io_submit_sqes+0xcce/0xf10
[ 30.669931]  ? xa_load+0xe4/0x1c0
[ 30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[ 30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
[ 30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
[ 30.672380]  do_syscall_64+0x2d/0x40
[ 30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 30.673503] RIP: 0033:0x7fd89c813239
[ 30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[ 30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
[ 30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
[ 30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
[ 30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
[ 30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
[ 30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000
This might happen if we do epoll_wait on a uring fd while reading/writing the former epoll fd in a sqe in the former uring instance. So let's not flush the cqring overflow list; just do a simple check. Reported-by: Abaci <abaci@linux.alibaba.com> Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: defer flushing cached reqs | Pavel Begunkov
While there are requests in the allocation cache, use them; only once those run out, go for the stashed memory in comp.free_list. As list manipulations are generally heavy and not good for caches, flush them all, or as many as possible, in one go. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: return success/failure from io_flush_cached_reqs()] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: take comp_state from ctx | Pavel Begunkov
__io_queue_sqe() is always called with a non-NULL comp_state, which is taken directly from context. Don't pass it around but infer from ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: enable req cache for task_work items | Jens Axboe
task_work is run without utilizing the req alloc cache, so any deferred items don't get to take advantage of either the alloc or free side of it. With task_work now being wrapped by io_uring, we can use the ctx completion state to both use the req cache and the completion flush batching. With this, the only request type that cannot take advantage of the req cache is IRQ driven IO for regular files / block devices. Anything else, including IOPOLL polled IO to those same types, will take advantage of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: provide FIFO ordering for task_work | Jens Axboe
task_work is a LIFO list, due to how it's implemented as a lockless list. For long chains of task_work, this can be problematic as the first entry added is the last one processed. Similarly, we'd waste a lot of CPU cycles reversing this list. Wrap the task_work so we have a single task_work entry per task per ctx, and use that to run it in the right order. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
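The ordering problem described here can be seen in a small single-threaded sketch. It does not show the commit's actual change (a single wrapped task_work entry per task per ctx); it only illustrates why a head-insertion list hands entries back newest first and what an explicit reversal costs. The names struct work, work_push() and work_reverse() are hypothetical, and a real lockless list would use an atomic compare-and-swap where this sketch uses plain assignments.

```c
#include <stddef.h>

/* Hypothetical work item linked through a single next pointer. */
struct work {
	struct work *next;
	int id;
};

/* Push at the head: the newest entry ends up first (LIFO order). */
static void work_push(struct work **head, struct work *w)
{
	w->next = *head;
	*head = w;
}

/* Reverse in place so entries run oldest first (FIFO); costs O(n). */
static struct work *work_reverse(struct work *head)
{
	struct work *prev = NULL;

	while (head) {
		struct work *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

Pushing items 1, 2, 3 leaves 3 at the head; only after walking the whole chain in work_reverse() does item 1 come first again, which is the extra CPU work the message refers to for long chains.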
2021-02-10 | io_uring: use persistent request cache | Jens Axboe
Now that we have the submit_state in the ring itself, we can have io_kiocb allocations that are persistent across invocations. This reduces the time spent doing slab allocations and frees. [sil: rebased] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: feed reqs back into alloc cache | Pavel Begunkov
Make io_req_free_batch(), which is used for inline executed requests and IOPOLL, return requests back into the allocation cache, avoiding most of the kmalloc()/kfree() calls for those cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
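The idea of feeding freed requests back into an allocation cache can be sketched in user space as follows. This is only an illustration under assumptions: struct req, struct req_cache, req_get(), req_put() and the REQ_CACHE_MAX bound are all hypothetical, the sketch uses malloc()/free() in place of the kernel allocator, and it ignores the locking a shared cache would need.

```c
#include <stdlib.h>

#define REQ_CACHE_MAX 32	/* hypothetical upper bound on cached requests */

struct req {
	int opcode;
};

/* Hypothetical cache: a small stack of previously freed requests. */
struct req_cache {
	struct req *slots[REQ_CACHE_MAX];
	unsigned int nr;
};

/* Take a request from the cache if possible, else allocate a new one. */
static struct req *req_get(struct req_cache *c)
{
	if (c->nr)
		return c->slots[--c->nr];
	return malloc(sizeof(struct req));
}

/* Return a request to the cache; really free it only when the cache is full. */
static void req_put(struct req_cache *c, struct req *r)
{
	if (c->nr < REQ_CACHE_MAX)
		c->slots[c->nr++] = r;
	else
		free(r);
}
```

A req_put() followed by a req_get() on the same cache returns the same memory without touching the allocator, which is the saving the commit message describes for inline executed requests and IOPOLL.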
2021-02-10 | io_uring: persistent req cache | Pavel Begunkov
Don't free batch-allocated requests across syscalls. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: count ctx refs separately from reqs | Pavel Begunkov
Currently batch free handles request memory freeing and ctx ref putting together. Separate them and use different counters, that will be needed for reusing reqs memory. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: remove fallback_req | Pavel Begunkov
Remove fallback_req for now, it gets in the way of other changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: submit-completion free batching | Pavel Begunkov
io_submit_flush_completions() does completion batching, but may also use free batching as iopoll does. The main beneficiaries should be buffered reads/writes and send/recv. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: replace list with array for compl batch | Pavel Begunkov
Reincarnation of an old patch that replaces a list in struct io_compl_batch with an array. It's needed to avoid hooking requests via their compl.list, because it won't be always available in the future. It's also nice to split io_submit_flush_completions() to avoid free under locks and remove unlock/lock with a long comment describing when it can be done. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: don't reinit submit state every time | Pavel Begunkov
As submit_state is now retained across syscalls, we can save ourselves from initialising it from the ground up for each io_submit_sqes(). Set some fields during ctx allocation, and just keep them always consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: remove unnecessary zeroing of ctx members] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: remove ctx from comp_state | Pavel Begunkov
completion state is closely bound to ctx, we don't need to store ctx inside as we always have it around to pass to flush. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: don't keep submit_state on stack | Pavel Begunkov
struct io_submit_state is quite big (168 bytes) and going to grow. It's better to not keep it on stack as it is now. Move it to context, it's always protected by uring_lock, so it's fine to have only one instance of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | io_uring: don't propagate io_comp_state | Pavel Begunkov
There is no reason to drag io_comp_state into opcode handlers, we just need a flag and the actual work will be done in __io_queue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 | Revert "drm/scheduler: Job timeout handler returns status (v3)" | Maarten Lankhorst
This reverts commit c10983e14e8f5d7c8dab0415e0cb7fe8d10aa9e3. This commit is not meant for drm-misc-next-fixes, and was accidentally cherry picked over. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
2021-02-10 | gpio: ep93xx: Fix single irqchip with multi gpiochips | Nikita Shubin
Fixes the following warnings, which result in interrupts being disabled on port B/F:
gpio gpiochip1: (B): detected irqchip that is shared with multiple gpiochips: please fix the driver.
gpio gpiochip5: (F): detected irqchip that is shared with multiple gpiochips: please fix the driver.
- added separate irqchip for each interrupt capable gpiochip
- provided unique names for each irqchip
Fixes: d2b091961510 ("gpio: ep93xx: Pass irqchip when adding gpiochip") Cc: <stable@vger.kernel.org> Signed-off-by: Nikita Shubin <nikita.shubin@maquefel.me> Tested-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
2021-02-10 | gpio: ep93xx: fix BUG_ON port F usage | Nikita Shubin
Two index spaces and ep93xx_gpio_port are confusing. Instead add a separate struct to store necessary data and remove ep93xx_gpio_port. - add struct to store IRQ related data for each IRQ capable chip - replace offset array with defined offsets - add IRQ registers offset for each IRQ capable chip into ep93xx_gpio_banks ------------[ cut here ]------------ kernel BUG at drivers/gpio/gpio-ep93xx.c:64! ---[ end trace 3f6544e133e9f5ae ]--- Fixes: fd935fc421e74 ("gpio: ep93xx: Do not pingpong irq numbers") Cc: <stable@vger.kernel.org> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Tested-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Signed-off-by: Nikita Shubin <nikita.shubin@maquefel.me> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
2021-02-10 | x86/fault: Don't look for extable entries for SMEP violations | Andy Lutomirski
If the kernel gets a SMEP violation or a fault that would have been a SMEP violation if it had SMEP support, it shouldn't run fixups. Just OOPS. [ bp: Massage commit message. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/46160d8babce2abf1d6daa052146002efa24ac56.1612924255.git.luto@kernel.org
2021-02-10 | perf/x86/rapl: Fix psys-energy event on Intel SPR platform | Zhang Rui
There are several special things about the RAPL Psys energy counter on the Intel Sapphire Rapids platform. 1. It contains one Psys master package, and only CPUs on the master package can read a valid value of the Psys energy counter; reading the MSR on CPUs in the slave package returns 0. 2. The master package does not have to be physical package 0. And when all the CPUs on the Psys master package are offlined, we lose the Psys energy counter at runtime. 3. The Psys energy counter can be disabled by BIOS, while all the other energy counters are not affected. It is not easy to handle all of these in the current RAPL PMU design because a) perf_msr_probe() validates the MSR on some random CPU, which may be in either the Psys master package or the Psys slave package. b) all the RAPL events share the same PMU, and there is no API to remove the psys-energy event cleanly without affecting the other events in the same PMU. This patch addresses the problems in a simple way. First, by setting the .no_check bit for the RAPL Psys MSR, the psys-energy event is always added, so we don't have to check the Psys ENERGY_STATUS MSR on the master package. Then, by removing rapl_not_visible(), the psys-energy event is always available in sysfs. This does not affect the previous code because, for the RAPL MSRs with .no_check cleared, the .is_visible() callback is always overridden in the perf_msr_probe() function. Note, although the RAPL PMU is die-based, and the Psys energy counter MSR on Intel SPR is package scope, this is not a problem because there is only one die in each package on SPR. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20210204161816.12649-3-rui.zhang@intel.com
2021-02-10 | perf/x86/rapl: Only check lower 32bits for RAPL energy counters | Zhang Rui
In the RAPL ENERGY_COUNTER MSR, only the lower 32bits represent the energy counter. On previous platforms, the higher 32bits are reserved and always return zero. But on the Intel SapphireRapids platform, the higher 32bits are reused for other purposes and return a non-zero value. Thus check the lower 32bits only for these ENERGY_COUNTER MSRs, to make sure the RAPL PMU events are not added erroneously when the higher 32bits contain a non-zero value. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20210204161816.12649-2-rui.zhang@intel.com
2021-02-10 | perf/x86/rapl: Add msr mask support | Zhang Rui
In some cases, when probing a perf MSR, we're probing certain bits of the MSR instead of the whole register, thus only these bits should be checked. For example, for the RAPL ENERGY_STATUS MSR, only the lower 32 bits represent the energy counter, and the higher 32 bits are reserved. Introduce a new mask field in struct perf_msr to allow probing certain bits of an MSR. This change is transparent to the current perf_msr_probe() users. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20210204161816.12649-1-rui.zhang@intel.com
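The effect of probing only certain bits can be shown with a short sketch. It is not the perf_msr_probe() code: msr_counter_nonzero() is a hypothetical helper, and treating a zero mask as "check the whole register" is an assumption made here so that callers which never set a mask keep their old behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the bits of val selected by mask
 * hold a non-zero value.
 */
static bool msr_counter_nonzero(uint64_t val, uint64_t mask)
{
	/* Assumption: a zero mask means every bit of the register counts. */
	if (!mask)
		mask = ~0ULL;

	return (val & mask) != 0;
}
```

With a mask of 0xffffffff, a register whose only set bit is in the upper 32 bits no longer counts as a live counter, while the same value checked without a mask would.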
2021-02-10 | perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[] | Jim Mattson
Cascade Lake Xeon parts have the same model number as Skylake Xeon parts, so they are tagged with the intel_pebs_isolation quirk. However, as with Skylake Xeon H0 stepping parts, the PEBS isolation issue is fixed in all microcode versions. Add the Cascade Lake Xeon steppings (5, 6, and 7) to the isolation_ucodes[] table so that these parts benefit from Andi's optimization in commit 9b545c04abd4f ("perf/x86/kvm: Avoid unnecessary work in guest filtering"). Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20210205191324.2889006-1-jmattson@google.com
2021-02-10checkpatch: Don't check for mutex_trylock_recursive()Sebastian Andrzej Siewior
mutex_trylock_recursive() has been removed from the tree, so there is no need to check for it. Remove traces of mutex_trylock_recursive()'s existence. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210210085248.219210-3-bigeasy@linutronix.de
2021-02-10locking/mutex: Kill mutex_trylock_recursive()Sebastian Andrzej Siewior
There are no users of mutex_trylock_recursive() in tree as of v5.11-rc7. Remove it. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210210085248.219210-2-bigeasy@linutronix.de
2021-02-10s390: Use arch_local_irq_{save,restore}() in early boot codeSven Schnelle
Commit 997acaf6b4b5 ("lockdep: report broken irq restoration") makes compiling s390 fail because the irq enable/disable functions are now no longer fully contained in header files. Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-02-10lockdep: Noinstr annotate warn_bogus_irq_restore()Peter Zijlstra
vmlinux.o: warning: objtool: lock_is_held_type()+0x107: call to warn_bogus_irq_restore() leaves .noinstr.text section As per the general rule that WARNs are allowed to violate noinstr to get out, annotate it away. Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Link: https://lkml.kernel.org/r/YCKyYg53mMp4E7YI@hirez.programming.kicks-ass.net
2021-02-10x86/fault: Rename no_context() to kernelmode_fixup_or_oops()Andy Lutomirski
The name no_context() has never been very clear. It's only called for faults from kernel mode, so rename it and change the no-longer-useful user_mode(regs) check to a WARN_ON_ONCE. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/c21940efe676024bb4bc721f7d70c29c420e127e.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Bypass no_context() for implicit kernel faults from usermodeAndy Lutomirski
Drop an indentation level and remove the last user_mode(regs) == true caller of no_context() by directly OOPSing for implicit kernel faults from usermode. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/6e3d1129494a8de1e59d28012286e3a292a2296e.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Split the OOPS code out from no_context()Andy Lutomirski
Not all callers of no_context() want to run exception fixups. Separate the OOPS code out from the fixup code in no_context(). Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/450f8d8eabafb83a5df349108c8e5ea83a2f939d.1612924255.git.luto@kernel.org
2021-02-10gpio: mxs: GPIO_MXS should not default to y unconditionallyGeert Uytterhoeven
Merely enabling CONFIG_COMPILE_TEST should not enable additional code. To fix this, restrict the automatic enabling of GPIO_MXS to ARCH_MXS, and ask the user in case of compile-testing. Fixes: 6876ca311bfca5d7 ("gpio: mxs: add COMPILE_TEST support for GPIO_MXS") Cc: <stable@vger.kernel.org> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
2021-02-10x86/fault: Improve kernel-executing-user-memory handlingAndy Lutomirski
Right now, the case of the kernel trying to execute from user memory is treated more or less just like the kernel getting a page fault on a user access. In the failure path, it checks for erratum #93, tries to otherwise fix up the error, and then oopses. If it manages to jump to the user address space, with or without SMEP, it should not try to resolve the page fault. This is an error, pure and simple. Rearrange the code so that this case is caught early, check for erratum #93, and bail out. [ bp: Massage commit message. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/ab8719c7afb8bd501c4eee0e36493150fbbe5f6a.1612924255.git.luto@kernel.org
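As a rough sketch of what catching this case early could look like, the check below classifies a fault purely from the x86 page-fault error code: an instruction fetch that was not made with user privilege. The function name `kernel_exec_fault()` is hypothetical and the snippet is a userspace illustration under that assumption, not the patch itself; only the error-code bit positions are taken from the x86 architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural positions). */
#define PF_WRITE (1u << 1)  /* fault was caused by a write */
#define PF_USER  (1u << 2)  /* access was made with user privilege */
#define PF_INSTR (1u << 4)  /* fault was caused by an instruction fetch */

/*
 * Hypothetical helper: true when the fault is an instruction fetch
 * performed without user privilege, i.e. the kernel itself tried to
 * execute the faulting address. Such a fault is a candidate for an
 * early bail-out instead of the normal page-fault resolution path.
 */
static bool kernel_exec_fault(uint32_t error_code)
{
    return (error_code & (PF_USER | PF_INSTR)) == PF_INSTR;
}
```

A caller would still need to confirm that the faulting address lies in the user address range and apply the erratum #93 check before treating the fault as fatal; both steps are omitted here.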
2021-02-10x86/fault: Correct a few user vs kernel checks wrt WRUSSAndy Lutomirski
In general, page fault errors for WRUSS should be just like get_user(), etc. Fix three bugs in this area: (1) There is a comment that says that, if the kernel can't handle a page fault on a user address due to OOM, the OOM-kill-and-retry logic would be skipped. The code checked kernel *privilege*, not kernel mode, so it missed WRUSS. This means that the kernel would malfunction if it got OOM on a WRUSS fault -- this would be a kernel-mode, user-privilege fault, and the OOM killer would be invoked and the handler would retry the faulting instruction. (2) A failed user access from kernel while a fatal signal is pending should fail even if the instruction in question was WRUSS. (3) do_sigbus() should not send SIGBUS for WRUSS -- it should handle it like any other kernel mode failure. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/a7b7bcea730bd4069e6b7e629236bb2cf526c2fb.1612924255.git.luto@kernel.org
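The distinction between kernel privilege and kernel mode can be made concrete with a small sketch. The struct and both function names below are hypothetical, invented for illustration; `cpu_in_user_mode` stands in for what user_mode(regs) would report in the kernel. The sketch assumes, as the message describes, that a WRUSS fault is taken in kernel mode but carries the user-privilege bit in its error code.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bit: access was made with user privilege. */
#define PF_USER (1u << 2)

/* Hypothetical summary of a fault, for illustration only. */
struct fault_info {
    uint32_t error_code;     /* hardware page-fault error code */
    bool cpu_in_user_mode;   /* stand-in for user_mode(regs) */
};

/*
 * Privilege-based check: classifies the fault by the access privilege.
 * A WRUSS fault has PF_USER set, so this reports "not a kernel fault"
 * even though the CPU was running kernel code.
 */
static bool kernel_fault_by_privilege(const struct fault_info *f)
{
    return !(f->error_code & PF_USER);
}

/*
 * Mode-based check: classifies the fault by the mode the CPU was in,
 * so a WRUSS fault is treated like any other kernel-mode user access.
 */
static bool kernel_fault_by_mode(const struct fault_info *f)
{
    return !f->cpu_in_user_mode;
}
```

For an ordinary user fault and an ordinary kernel fault the two checks agree; they differ only for the kernel-mode, user-privilege combination, which is the WRUSS case.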
2021-02-10x86/fault: Document the locking in the fault_signal_pending() pathAndy Lutomirski
If fault_signal_pending() returns true, then the core mm has unlocked the mm for us. Add a comment to help future readers of this code. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/c56de3d103f40e6304437b150aa7b215530d23f7.1612924255.git.luto@kernel.org