linux-arm.git - Russell King's ARM Linux kernel tree

Age	Commit message (Collapse)	Author
2019-02-09	bcache: use sysfs_strtoul_bool() to set bit-field variables	Coly Li
	When setting bcache parameters via sysfs, there are some variables are defined as bit-field value. Current bcache code in sysfs.c uses either d_strtoul() or sysfs_strtoul() to convert the input string to unsigned integer value and set it to the corresponded bit-field value. The problem is, the bit-field value only takes the lowest bit of the converted value. If input is 2, the expected value (like bool value) of the bit-field value should be 1, but indeed it is 0. The following sysfs files for bit-field variables have such problem, bypass_torture_test, for dc->bypass_torture_test writeback_metadata, for dc->writeback_metadata writeback_running, for dc->writeback_running verify, for c->verify key_merging_disabled, for c->key_merging_disabled gc_always_rewrite, for c->gc_always_rewrite btree_shrinker_disabled,for c->shrinker_disabled copy_gc_enabled, for c->copy_gc_enabled This patch uses sysfs_strtoul_bool() to set such bit-field variables, then if the converted value is non-zero, the bit-field variables will be set to 1, like setting a bool value like expensive_debug_checks. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: add sysfs_strtoul_bool() for setting bit-field variables	Coly Li
	When setting bool values via sysfs interface, e.g. writeback_metadata, if writing 1 into writeback_metadata file, dc->writeback_metadata is set to 1, but if writing 2 into the file, dc->writeback_metadata is 0. This is misleading, a better result should be 1 for all non-zero input value. It is because dc->writeback_metadata is a bit-field variable, and current code simply use d_strtoul() to convert a string into integer and takes the lowest bit value. To fix such error, we need a routine to convert the input string into unsigned integer, and set target variable to 1 if the converted integer is non-zero. This patch introduces a new macro called sysfs_strtoul_bool(), it can be used to convert input string into bool value, we can use it to set bool value for bit-field vairables. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: fix input overflow to sequential_cutoff	Coly Li
	People may set sequential_cutoff of a cached device via sysfs file, but current code does not check input value overflow. E.g. if value 4294967295 (UINT_MAX) is written to file sequential_cutoff, its value is 4GB, but if 4294967296 (UINT_MAX + 1) is written into, its value will be 0. This is an unexpected behavior. This patch replaces d_strtoi_h() by sysfs_strtoul_clamp() to convert input string to unsigned integer value, and limit its range in [0, UINT_MAX]. Then the input overflow can be fixed. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: fix input integer overflow of congested threshold	Coly Li
	Cache set congested threshold values congested_read_threshold_us and congested_write_threshold_us can be set via sysfs interface. These two values are 'unsigned int' type, but sysfs interface uses strtoul to convert input string. So if people input a large number like 9999999999, the value indeed set is 1410065407, which is not expected behavior. This patch replaces sysfs_strtoul() by sysfs_strtoul_clamp() when convert input string to unsigned int value, and set value range in [0, UINT_MAX], to avoid the above integer overflow errors. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: improve sysfs_strtoul_clamp()	Coly Li
	Currently sysfs_strtoul_clamp() is defined as, 82 #define sysfs_strtoul_clamp(file, var, min, max) \ 83 do { \ 84 if (attr == &sysfs_ ## file) \ 85 return strtoul_safe_clamp(buf, var, min, max) \ 86 ?: (ssize_t) size; \ 87 } while (0) The problem is, if bit width of var is less then unsigned long, min and max may not protect var from integer overflow, because overflow happens in strtoul_safe_clamp() before checking min and max. To fix such overflow in sysfs_strtoul_clamp(), to make min and max take effect, this patch adds an unsigned long variable, and uses it to macro strtoul_safe_clamp() to convert an unsigned long value in range defined by [min, max]. Then assign this value to var. By this method, if bit width of var is less than unsigned long, integer overflow won't happen before min and max are checking. Now sysfs_strtoul_clamp() can properly handle smaller data type like unsigned int, of cause min and max should be defined in range of unsigned int too. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: treat stale && dirty keys as bad keys	Tang Junhui
	Stale && dirty keys can be produced in the follow way: After writeback in write_dirty_finish(), dirty keys k1 will replace by clean keys k2 ==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key); ==>btree_insert_fn(struct btree_op b_op, struct btree b) ==>static int bch_btree_insert_node(struct btree b, struct btree_op op, struct keylist insert_keys, atomic_t journal_ref, Then two steps: A) update k1 to k2 in btree node memory; bch_btree_insert_keys(b, op, insert_keys, replace_key) B) Write the bset(contains k2) to cache disk by a 30s delay work bch_btree_leaf_dirty(b, journal_ref). But before the 30s delay work write the bset to cache device, these things happened: A) GC works, and reclaim the bucket k2 point to; B) Allocator works, and invalidate the bucket k2 point to, and increase the gen of the bucket, and place it into free_inc fifo; C) Until now, the 30s delay work still does not finish work, so in the disk, the key still is k1, it is dirty and stale (its gen is smaller than the gen of the bucket). and then the machine power off suddenly happens; D) When the machine power on again, after the btree reconstruction, the stale dirty key appear. In bch_extent_bad(), when expensive_debug_checks is off, it would treat the dirty key as good even it is stale keys, and it would cause bellow probelms: A) In read_dirty() it would cause machine crash: BUG_ON(ptr_stale(dc->disk.c, &w->key, 0)); B) It could be worse when reads hits stale dirty keys, it would read old incorrect data. This patch tolerate the existence of these stale && dirty keys, and treat them as bad key in bch_extent_bad(). (Coly Li: fix indent which was modified by sender's email client) Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: fix indentation issue, remove tabs on a hunk of code	Colin Ian King
	There is a hunk of code that is indented one level too deep, fix this by removing the extra tabs. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: export backing_dev_uuid via sysfs	Coly Li
	When there are multiple bcache devices, after a reboot the name of bcache devices may change (e.g. current /dev/bcache1 was /dev/bcache0 before reboot). Therefore we need the backing device UUID (sb.uuid) to identify each bcache device. Backing device uuid can be found by program bcache-super-show, but directly exporting backing_dev_uuid by sysfs file /sys/block/bcache<?>/bcache/backing_dev_uuid is a much simpler method. With backing_dev_uuid, and partition uuids from /dev/disk/by-partuuid/, now we can identify each bcache device and its partitions conveniently. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: export backing_dev_name via sysfs	Coly Li
	This patch export dc->backing_dev_name to sysfs file /sys/block/bcache<?>/bcache/backing_dev_name, then people or user space tools may know the backing device name of this bcache device. Of cause it can be done by parsing sysfs links, but this method can be much simpler to find the link between bcache device and backing device. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: not use hard coded memset size in bch_cache_accounting_clear()	Coly Li
	In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is used in memset(). It is because in struct cache_stats, there are 7 atomic_t type members. This is not good when new members added into struct stats, the hard coded number will only clear part of memory. This patch replaces 'sizeof(unsigned long) * 7' by more generic 'sizeof(struct cache_stats))', to avoid potential error if new member added into struct cache_stats. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09	bcache: never writeback a discard operation	Daniel Axtens
	Some users see panics like the following when performing fstrim on a bcached volume: [ 529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 530.183928] #PF error: [normal kernel read fault] [ 530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0 [ 530.750887] Oops: 0000 [#1] SMP PTI [ 530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3 [ 531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015 [ 531.693137] RIP: 0010:blk_queue_split+0x148/0x620 [ 531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48 8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3 [ 532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246 [ 533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000 [ 533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000 [ 533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000 [ 534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000 [ 534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020 [ 534.833319] FS: 00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000 [ 535.224098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0 [ 535.851759] Call Trace: [ 535.970308] ? mempool_alloc_slab+0x15/0x20 [ 536.174152] ? bch_data_insert+0x42/0xd0 [bcache] [ 536.403399] blk_mq_make_request+0x97/0x4f0 [ 536.607036] generic_make_request+0x1e2/0x410 [ 536.819164] submit_bio+0x73/0x150 [ 536.980168] ? submit_bio+0x73/0x150 [ 537.149731] ? bio_associate_blkg_from_css+0x3b/0x60 [ 537.391595] ? _cond_resched+0x1a/0x50 [ 537.573774] submit_bio_wait+0x59/0x90 [ 537.756105] blkdev_issue_discard+0x80/0xd0 [ 537.959590] ext4_trim_fs+0x4a9/0x9e0 [ 538.137636] ? ext4_trim_fs+0x4a9/0x9e0 [ 538.324087] ext4_ioctl+0xea4/0x1530 [ 538.497712] ? _copy_to_user+0x2a/0x40 [ 538.679632] do_vfs_ioctl+0xa6/0x600 [ 538.853127] ? __do_sys_newfstat+0x44/0x70 [ 539.051951] ksys_ioctl+0x6d/0x80 [ 539.212785] __x64_sys_ioctl+0x1a/0x20 [ 539.394918] do_syscall_64+0x5a/0x110 [ 539.568674] entry_SYSCALL_64_after_hwframe+0x44/0xa9 We have observed it where both: 1) LVM/devmapper is involved (bcache backing device is LVM volume) and 2) writeback cache is involved (bcache cache_mode is writeback) On one machine, we can reliably reproduce it with: # echo writeback > /sys/block/bcache0/bcache/cache_mode (not sure whether above line is required) # mount /dev/bcache0 /test # for i in {0..10}; do file="$(mktemp /test/zero.XXX)" dd if=/dev/zero of="$file" bs=1M count=256 sync rm $file done # fstrim -v /test Observing this with tracepoints on, we see the following writes: fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4260112 + 196352 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4456464 + 262144 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4718608 + 81920 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5324816 + 180224 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5505040 + 262144 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5767184 + 81920 hit 0 bypass 1 fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 6373392 + 180224 hit 1 bypass 0 <crash> Note the final one has different hit/bypass flags. This is because in should_writeback(), we were hitting a case where the partial stripe condition was returning true and so should_writeback() was returning true early. If that hadn't been the case, it would have hit the would_skip test, and as would_skip == s->iop.bypass == true, should_writeback() would have returned false. Looking at the git history from 'commit 72c270612bd3 ("bcache: Write out full stripes")', it looks like the idea was to optimise for raid5/6: * If a stripe is already dirty, force writes to that stripe to writeback mode - to help build up full stripes of dirty data To fix this issue, make sure that should_writeback() on a discard op never returns true. More details of debugging: https://www.spinics.net/lists/linux-bcache/msg06996.html Previous reports: - https://bugzilla.kernel.org/show_bug.cgi?id=201051 - https://bugzilla.kernel.org/show_bug.cgi?id=196103 - https://www.spinics.net/lists/linux-bcache/msg06885.html (Coly Li: minor modification to follow maximum 75 chars per line rule) Cc: Kent Overstreet <koverstreet@google.com> Cc: stable@vger.kernel.org Fixes: 72c270612bd3 ("bcache: Write out full stripes") Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-08	block: avoid setting nr_requests to current value	Aleksei Zakharov
	There's no reason to freeze queue and set nr_requests value if current value is the same. Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-08	cdrom: Fix race condition in cdrom_sysctl_register	Guenter Roeck
	The following traceback is sometimes seen when booting an image in qemu: [ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20 [ 54.611085] Fusion MPT base driver 3.04.20 [ 54.611877] Copyright (c) 1999-2008 LSI Corporation [ 54.616234] Fusion MPT SAS Host driver 3.04.20 [ 54.635139] sysctl duplicate entry: /dev/cdrom//info [ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1 [ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 54.641273] Workqueue: events_unbound async_run_entry_fn [ 54.641273] Call Trace: [ 54.641273] dump_stack+0x67/0x90 [ 54.641273] __register_sysctl_table+0x50b/0x570 [ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80 [ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0 [ 54.646814] __register_sysctl_paths+0x1c8/0x1f0 [ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f [ 54.646814] register_cdrom.cold.24+0x2a/0x33 [ 54.646814] sr_probe+0x4bd/0x580 [ 54.646814] ? __driver_attach+0xd0/0xd0 [ 54.646814] really_probe+0xd6/0x260 [ 54.646814] ? __driver_attach+0xd0/0xd0 [ 54.646814] driver_probe_device+0x4a/0xb0 [ 54.646814] ? __driver_attach+0xd0/0xd0 [ 54.646814] bus_for_each_drv+0x73/0xc0 [ 54.646814] __device_attach+0xd6/0x130 [ 54.646814] bus_probe_device+0x9a/0xb0 [ 54.646814] device_add+0x40c/0x670 [ 54.646814] ? __pm_runtime_resume+0x4f/0x80 [ 54.646814] scsi_sysfs_add_sdev+0x81/0x290 [ 54.646814] scsi_probe_and_add_lun+0x888/0xc00 [ 54.646814] ? scsi_autopm_get_host+0x21/0x40 [ 54.646814] __scsi_add_device+0x116/0x130 [ 54.646814] ata_scsi_scan_host+0x93/0x1c0 [ 54.646814] async_run_entry_fn+0x34/0x100 [ 54.646814] process_one_work+0x237/0x5e0 [ 54.646814] worker_thread+0x37/0x380 [ 54.646814] ? rescuer_thread+0x360/0x360 [ 54.646814] kthread+0x118/0x130 [ 54.646814] ? kthread_create_on_node+0x60/0x60 [ 54.646814] ret_from_fork+0x3a/0x50 The only sensible explanation is that cdrom_sysctl_register() is called twice, once from the module init function and once from register_cdrom(). cdrom_sysctl_register() is not mutex protected and may happily execute twice if the second call is made before the first call is complete. Use a static atomic to ensure that the function is executed exactly once. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-04	Merge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.1/block	Jens Axboe
	Pull MD changes for 5.1 from Song. * 'md-next' of https://github.com/liu-song-6/linux: raid1: simplify raid1_error function md-linear: use struct_size() in kzalloc()
2019-02-04	raid1: simplify raid1_error function	Yufen Yu
	Remove redundance set_bit and let code simplify. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-02-04	md-linear: use struct_size() in kzalloc()	Gustavo A. R. Silva
	One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void entry[]; }; instance = kzalloc(sizeof(struct foo) + sizeof(void ) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-02-04	Merge branch 'nvme-5.1' of git://git.infradead.org/nvme into for-5.1/block	Jens Axboe
	Pull 5.1 NVMe material from Christoph: "Below is our current (small) queue of NVMe patches for Linux 5.1. We want the re-addition of the Write Zeroes support to be in linu-next for a few weeks as it caused some problems last time. The only other patch is a cleanup from Sagi." * 'nvme-5.1' of git://git.infradead.org/nvme: nvme: remove the .stop_ctrl callout nvme: add support for the Write Zeroes command
2019-02-04	nvme: remove the .stop_ctrl callout	Sagi Grimberg
	It is used now just to flush error recovery and reconnect work items in the RDMA and TCP transports, which can simply be moved to the corresponding teardown routines. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-04	nvme: add support for the Write Zeroes command	Chaitanya Kulkarni
	Allow write zeroes operations (REQ_OP_WRITE_ZEROES) on the block device, if the device supports an optional command bit set for write zeroes. Add support to setup write zeroes command. Set maximum possible write zeroes sectors in one write zeroes command according to nvme write zeroes command definition. This patch was posted as a part of block-write-zeroes support implementation (https://patchwork.kernel.org/patch/9454859/), but did not make into mainline kernel as it got reverted due to failure on the Linus's machine. In this patch in order to be more cautious, we use NVMe controller's maximum hardware sector size which is calculated based on the controller's MDTS (Maximum Data Transfer Size) field to calculate the maximum sectors for the write zeroes request. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [folded a fix from Keith Busch to properly respect NVME_QUIRK_DEALLOCATE_ZEROES] Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-01	blk-mq: save default hctx into ctx->hctxs for not-supported type	Jianchao Wang
	Currently, we check whether the hctx type is supported every time in hot path. Actually, this is not necessary, we could save the default hctx into ctx->hctxs if the type is not supported when map swqueues and use it directly with ctx->hctxs[type]. We also needn't check whether the poll is enabled or not, because the caller would clear the REQ_HIPRI in that case. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-01	blk-mq: save queue mapping result into ctx directly	Jianchao Wang
	Currently, the queue mapping result is saved in a two-dimensional array. In the hot path, to get a hctx, we need do following: q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]] This isn't very efficient. We could save the queue mapping result into ctx directly with different hctx type, like, ctx->hctxs[type] Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: fix in-service-queue check for queue merging	Paolo Valente
	When a new I/O request arrives for a bfq_queue, say Q, bfq checks whether that request is close to (a) the head request of some other queue waiting to be served, or (b) the last request dispatched for the in-service queue (in case Q itself is not the in-service queue) If a queue, say Q2, is found for which the above condition holds, then bfq merges Q and Q2, to hopefully get a more sequential I/O in the resulting merged queue, and thus a possibly higher throughput. Case (b) is checked by comparing the new request for Q with the last request dispatched, assuming that the latter necessarily belonged to the in-service queue. Unfortunately, this assumption is no longer always correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash"). When the assumption does not hold, queues that must not be merged may be merged, causing unexpected loss of control on per-queue service guarantees. This commit solves this problem by adding an extra field, which stores the actual last request dispatched for the in-service queue, and by using this new field to correctly check case (b). Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: do not overcharge writes in asymmetric scenarios	Paolo Valente
	Writes tend to starve reads. bfq counters this problem by overcharging writes with an inflated service w.r.t. the actual service (number of sector written) they receive. Yet his overcharging is useless, and actually causes unfairness in the opposite direction, when bfq happens to be enforcing strong I/O control. bfq does this enforcing when the scenario is asymmetric, i.e., when some bfq_queue or group of bfq_queues is to be granted a different bandwidth than some other bfq_queue or group of bfq_queues. So, in such a scenario, this commit disables write overcharging. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: port commit "cfq-iosched: improve hw_tag detection"	Paolo Valente
	The original commit is commit 1a1238a7dd48 ("cfq-iosched: improve hw_tag detection") and has the following commit message: If active queue hasn't enough requests and idle window opens, cfq will not dispatch sufficient requests to hardware. In such situation, current code will zero hw_tag. But this is because cfq doesn't dispatch enough requests instead of hardware queue doesn't work. Don't zero hw_tag in such case. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: reduce threshold for detecting command queueing	Paolo Valente
	bfq simple heuristic from cfq for detecting whether the drive performs command queueing: check whether the average number of in-flight requests is above a given threshold. Unfortunately this heuristic does fail to detect queueing (on drives with queueing) if processes doing I/O are few and issue I/O with a low depth. To reduce false negatives, this commit lowers the threshold. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: fix queue removal from weights tree	Paolo Valente
	bfq maintains an ordered list, through a red-black tree, of unique weights of active bfq_queues. This list is used to detect whether there are active queues with differentiated weights. The weight of a queue is removed from the list when both the following two conditions become true: (1) the bfq_queue is flagged as inactive (2) the has no in-flight request any longer; Unfortunately, in the rare cases where condition (2) becomes true before condition (1), the removal fails, because the function to remove the weight of the queue (bfq_weights_tree_remove) is rightly invoked in the path that deactivates the bfq_queue, but mistakenly invoked before the function that actually performs the deactivation (bfq_deactivate_bfqq). This commits moves the invocation of bfq_weights_tree_remove for condition (1) to after bfq_deactivate_bfqq. As a consequence of this move, it is necessary to add a further reference to the queue when the weight of a queue is added, because the queue might otherwise be freed before bfq_weights_tree_remove is invoked. This commit adds this reference and makes all related modifications. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: fix sequential rq detection in rate estimation	Paolo Valente
	In bfq_update_peak_rate, to check whether an I/O request rq is sequential, only the seek distance of rq w.r.t. the last request dispatched is controlled. This is not sufficient for non-rotational storage, where the size of rq is at least as relevant. This commit adds the missing control. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: unconditionally plug I/O in asymmetric scenarios	Paolo Valente
	bfq detects the creation of multiple bfq_queues shortly after each other, namely a burst of queue creations in the terminology used in the code. If the burst is large, then no queue in the burst is granted - either I/O-dispatch plugging when the queue remains temporarily idle while in service; - or weight raising, because it causes even longer plugging. In fact, such a plugging tends to lower throughput, while these bursts are typically due to applications or services that spawn multiple processes, to reach a common goal as soon as possible. Examples are a "git grep" or the booting of a system. Unfortunately, disabling plugging may cause a loss of service guarantees in asymmetric scenarios, i.e., if queue weights are differentiated or if more than one group is active. This commit addresses this issue by no longer disabling I/O-dispatch plugging for queues in large bursts. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: do not plug I/O of in-service queue when harmful	Paolo Valente
	If the in-service bfq_queue is sync and remains temporarily idle, then I/O dispatching (from other queues) may be plugged. It may be dome for two reasons: either to boost throughput, or to preserve the bandwidth share of the in-service queue. In the first case, if the I/O of the in-service queue, when it finally arrives, consists only of one small I/O request, then it makes sense to plug even the I/O of the in-service queue. In fact, serving such a small request immediately is likely to lower throughput instead of boosting it, whereas waiting a little bit is likely to let that request grow, thanks to request merging, and become more profitable in terms of throughput (this is likely to happen exactly because the I/O of the queue has been detected to boost throughput). On the opposite end, if I/O dispatching is being plugged only to preserve the bandwidth of the in-service queue, then it would be better not to plug also the I/O of the in-service queue, because such a plugging is likely to cause only loss of bandwidth for the queue. Unfortunately, no distinction is made between the two cases, and the I/O of the in-service queue is always plugged in case just a small I/O request arrives. This commit draws this missing distinction and does not perform harmful plugging. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: split function bfq_better_to_idle	Paolo Valente
	This is a preparatory commit for commits that need to check only one of the two main reasons for idling. This change should also improve the quality of the code a little bit, by splitting a function that contains very long, non-trivial and little related comments. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: consider also ioprio classes in symmetry detection	Paolo Valente
	In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to be guaranteed a different bandwidth than other bfq_queues or bfq_groups, these service guaranteed can be provided only by plugging I/O dispatch, completely or partially, when the queue in service remains temporarily empty. A case where asymmetry is particularly strong is when some active bfq_queues belong to a higher-priority class than some other active bfq_queues. Unfortunately, this important case is not considered at all in the code for detecting asymmetric scenarios. This commit adds the missing logic. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: remove case of redirected bic from insert_request	Paolo Valente
	Before commit 18e5a57d7987 ("block, bfq: postpone rq preparation to insert or merge"), the destination queue for a request was chosen by a different hook than the one that then inserted the request. So, between the execution of the two hooks, the bic of the process generating the request could happen to be redirected to a different bfq_queue. As a consequence, the destination bfq_queue stored in the request could be wrong. Such an event does not need to ba handled any longer. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: make sure queue budgets are not below service received	Paolo Valente
	With some unlucky sequences of events, the function bfq_updated_next_req updates the current budget of a bfq_queue to a lower value than the service received by the queue using such a budget. Unfortunately, if this happens, then the return value of the function bfq_bfqq_budget_left becomes inconsistent. This commit solves this problem by lower-bounding the budget computed in bfq_updated_next_req to the service currently charged to the queue. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: avoid selecting a queue w/o budget	Paolo Valente
	To boost throughput on devices with internal queueing and in scenarios where device idling is not strictly needed, bfq immediately starts serving a new bfq_queue if the in-service bfq_queue remains without pending I/O, even if new I/O may arrive soon for the latter queue. Then, if such I/O actually arrives soon, bfq preempts the new in-service bfq_queue so as to give the previous queue a chance to go on being served (in case the previous queue should actually be the one to be served, according to its timestamps). However, the in-service bfq_queue, say Q, may also be without further budget when it remains also pending I/O. Since bfq changes budgets dynamically to fit the needs of bfq_queues, this happens more often than one may expect. If this happens, then there is no point in trying to go on serving Q when new I/O arrives for it soon: Q would be expired immediately after being selected for service. This would only cause useless overhead. This commit avoids such a useless selection. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	block, bfq: do not consider interactive queues in srt filtering	Paolo Valente
	The speed at which a bfq_queue receives I/O is one of the parameters by which bfq decides whether the queue is soft real-time (i.e., whether the queue contains the I/O of a soft real-time application). In particular, when a bfq_queue remains without outstanding I/O requests, bfq computes the minimum time instant, named soft_rt_next_start, at which the next request of the queue may arrive for the queue to be deemed as soft real time. Unfortunately this filtering may cause problems with a queue in interactive weight raising. In fact, such a queue may be conveying the I/O needed to load a soft real-time application. The latter will actually exhibit a soft real-time I/O pattern after it finally starts doing its job. But, if soft_rt_next_start is updated for an interactive bfq_queue, and the queue has received a lot of service before remaining with no outstanding request (likely to happen on a fast device), then soft_rt_next_start is assigned such a high value that, for a very long time, the queue is prevented from being possibly considered as soft real time. This commit removes the updating of soft_rt_next_start for bfq_queues in interactive weight raising. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31	mtip32xx: ѕtop abusing the managed resource APIs	Christoph Hellwig
	The mtip32xx driver uses managed resources for DMA coherent memory and irqs, but then always pairs them with free calls anyway, making the resource tracking rather pointless. Given some DMA allocations are transient anyway, the irq freeing seems to require ordering vs other hardware access the best solution seems to be to stop using the managed resource API entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-27	Linux 5.0-rc4	Linus Torvalds

2019-01-27	Merge branch 'x86-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of fixes for x86: - Fix the swapped outb() parameters in the KASLR code - Fix the PKEY handling at fork which missed to preserve the pkey state for the child. Comes with a test case to validate that. - Fix the entry stack handling for XEN PV to respect that XEN PV systems enter the function already on the current thread stack and not on the trampoline. - Fix kexec load failure caused by using a stale value when the kexec_buf structure is reused for subsequent allocations. - Fix a bogus sizeof() in the memory encryption code - Enforce PCI dependency for the Intel Low Power Subsystem - Enforce PCI_LOCKLESS_CONFIG when PCI is enabled" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/Kconfig: Select PCI_LOCKLESS_CONFIG if PCI is enabled x86/entry/64/compat: Fix stack switching for XEN PV x86/kexec: Fix a kexec_file_load() failure x86/mm/mem_encrypt: Fix erroneous sizeof() x86/selftests/pkeys: Fork() to check for state being preserved x86/pkeys: Properly copy pkey state at fork() x86/kaslr: Fix incorrect i8254 outb() parameters x86/intel/lpss: Make PCI dependency explicit
2019-01-27	Merge branch 'x86-timers-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 timer fixes from Thomas Gleixner: "Two commits which were missed to be sent during the merge window. - The TSC calibration fix turns out to be more urgent as recent Skylake-X systems seem to have massive trouble with calibration disturbance. This should go back into stable for that reason and it the risk of breakage is rather low. - Drop an unused define" * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/hpet: Remove unused FSEC_PER_NSEC define x86/tsc: Make calibration refinement more robust
2019-01-27	Merge branch 'timers-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Glexiner: "A single regression fix to address the unintended breakage of posix cpu timers. This is caused by a new sanity check in the common code, which fails for posix cpu timers under certain conditions because the posix cpu timer code never updates the variable which is checked" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Unbreak timer rearming
2019-01-27	Merge branch 'locking-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "A small series of fixes which all address possible missed wakeups: - Document and fix the wakeup ordering of wake_q - Add the missing barrier in rcuwait_wake_up(), which was documented in the comment but missing in the code - Fix the possible missed wakeups in the rwsem and futex code" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Fix (possible) missed wakeup futex: Fix (possible) missed wakeup sched/wake_q: Fix wakeup ordering for wake_q sched/wake_q: Document wake_q_add() sched/wait: Fix rcuwait_wake_up() ordering
2019-01-27	Merge branch 'irq-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A small set of fixes for the interrupt subsystem: - Fix a double increment in the irq descriptor allocator which resulted in a sanity check only being done for every second affinity mask - Add a missing device tree translation in the stm32-exti driver. Without that the interrupt association is completely wrong. - Initialize the mutex in the GIC-V3 MBI driver - Fix the alignment for aliasing devices in the GIC-V3-ITS driver so multi MSI allocations work correctly - Ensure that the initial affinity of a interrupt is not empty at startup time. - Drop bogus include in the madera irq chip driver - Fix KernelDoc regression" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size genirq/irqdesc: Fix double increment in alloc_descs() genirq: Fix the kerneldoc comment for struct irq_affinity_desc irqchip/madera: Drop GPIO includes irqchip/gic-v3-mbi: Fix uninitialized mbi_lock irqchip/stm32-exti: Add domain translate function genirq: Make sure the initial affinity is not empty
2019-01-27	Merge tag 'edac_fix_for_5.0' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp Pull EDAC fix from Borislav Petkov: "Fix persistent register offsets of altera_edac, from Thor Thayer" * tag 'edac_fix_for_5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: EDAC, altera: Fix S10 persistent register offset
2019-01-27	Merge tag 'for-linus-20190127' of git://git.kernel.dk/linux-block	Linus Torvalds
	Pull block revert from Jens Axboe: "Silly error snuck into a patch from the last series, let's do a revert to avoid a potential use-after-free" * tag 'for-linus-20190127' of git://git.kernel.dk/linux-block: Revert "block: cover another queue enter recursion via BIO_QUEUE_ENTERED"
2019-01-27	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm	Linus Torvalds
	Pull KVM fixes from Paolo Bonzini: "Quite a few fixes for x86: nested virtualization save/restore, AMD nested virtualization and virtual APIC, 32-bit fixes, an important fix to restore operation on older processors, and a bunch of hyper-v bugfixes. Several are marked stable. There are also fixes for GCC warnings and for a GCC/objtool interaction" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Mark expected switch fall-throughs KVM: x86: fix TRACE_INCLUDE_PATH and remove -I. header search paths KVM: selftests: check returned evmcs version range x86/kvm/hyper-v: nested_enable_evmcs() sets vmcs_version incorrectly KVM: VMX: Move vmx_vcpu_run()'s VM-Enter asm blob to a helper function kvm: selftests: Fix region overlap check in kvm_util kvm: vmx: fix some -Wmissing-prototypes warnings KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1 svm: Fix AVIC incomplete IPI emulation svm: Add warning message for AVIC IPI invalid target KVM: x86: WARN_ONCE if sending a PV IPI returns a fatal error KVM: x86: Fix PV IPIs for 32-bit KVM host x86/kvm/hyper-v: recommend using eVMCS only when it is enabled x86/kvm/hyper-v: don't recommend doing reset via synthetic MSR kvm: x86/vmx: Use kzalloc for cached_vmcs12 KVM: VMX: Use the correct field var when clearing VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL KVM: x86: Fix single-step debugging x86/kvm/hyper-v: don't announce GUEST IDLE MSR support
2019-01-27	Merge tag 'dma-mapping-5.0-2' of git://git.infradead.org/users/hch/dma-mapping	Linus Torvalds
	Pull dma-mapping fix from Christoph Hellwig: "Fix a xen-swiotlb regression on arm64" * tag 'dma-mapping-5.0-2' of git://git.infradead.org/users/hch/dma-mapping: arm64/xen: fix xen-swiotlb cache flushing
2019-01-27	Merge tag 'libnvdimm-fixes-5.0-rc4' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "A fix for namespace label support for non-Intel NVDIMMs that implement the ACPI standard label method. This has apparently never worked and could wait for v5.1. However it has enough visibility with hardware vendors [1] and distro bug trackers [2], and low enough risk that I decided it should go in for -rc4. The other fixups target the new, for v5.0, nvdimm security functionality. The larger init path fixup closes a memory leak and a potential userspace lockup due to missed notifications. [1] https://github.com/pmem/ndctl/issues/78 [2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1811785 These have all soaked in -next for a week with no reported issues. Summary: - Fix support for NVDIMMs that implement the ACPI standard label methods. - Fix error handling for security overwrite (memory leak / userspace hang condition), and another one-line security cleanup" * tag 'libnvdimm-fixes-5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: acpi/nfit: Fix command-supported detection acpi/nfit: Block function zero DSMs libnvdimm/security: Require nvdimm_security_setup_events() to succeed nfit_test: fix security state pull for nvdimm security nfit_test
2019-01-27	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov: "A fixup for the input_event fix for y2038 Sparc64, and couple other minor fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: input_event - fix the CONFIG_SPARC64 mixup Input: olpc_apsp - assign priv->dev earlier Input: uinput - fix undefined behavior in uinput_validate_absinfo() Input: raspberrypi-ts - fix link error Input: xpad - add support for SteelSeries Stratus Duo Input: input_event - provide override for sparc64
2019-01-27	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	Linus Torvalds
	Pull networking fixes from David Miller: 1) Count ttl-dropped frames properly in mac80211, from Bob Copeland. 2) Integer overflow in ktime handling of bcm can code, from Oliver Hartkopp. 3) Fix RX desc handling wrt. hw checksumming in ravb, from Simon Horman. 4) Various hash key fixes in hv_netvsc, from Haiyang Zhang. 5) Use after free in ax25, from Eric Dumazet. 6) Several fixes to the SSN support in SCTP, from Xin Long. 7) Do not process frames after a NAPI reschedule in ibmveth, from Thomas Falcon. 8) Fix NLA_POLICY_NESTED arguments, from Johannes Berg. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (42 commits) qed: Revert error handling changes. cfg80211: extend range deviation for DMG cfg80211: reg: remove warn_on for a normal case mac80211: Add attribute aligned(2) to struct 'action' mac80211: don't initiate TDLS connection if station is not associated to AP nl80211: fix NLA_POLICY_NESTED() arguments ibmveth: Do not process frames after calling napi_reschedule net: dev_is_mac_header_xmit() true for ARPHRD_RAWIP net: usb: asix: ax88772_bind return error when hw_reset fail MAINTAINERS: Update cavium networking drivers net/mlx4_core: Fix error handling when initializing CQ bufs in the driver net/mlx4_core: Add masking for a few queries on HCA caps sctp: set flow sport from saddr only when it's 0 sctp: set chunk transport correctly when it's a new asoc sctp: improve the events for sctp stream adding sctp: improve the events for sctp stream reset ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel ax25: fix possible use-after-free sfc: suppress duplicate nvmem partition types in efx_ef10_mtd_probe hv_netvsc: fix typos in code comments ...
2019-01-27	Revert "block: cover another queue enter recursion via BIO_QUEUE_ENTERED"	Jens Axboe
	We can't touch a bio after ->make_request_fn(), for all we know it could already have been completed by the time this function returns. This reverts commit 698cef173983b086977e633e46476e0f925ca01e. Reported-by: syzbot+4df6ca820108fd248943@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>