summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2016-07-20block: do not merge requests without consulting with io schedulerTahsin Erdogan
Before merging a bio into an existing request, io scheduler is called to get its approval first. However, the requests that come from a plug flush may get merged by block layer without consulting with io scheduler. In case of CFQ, this can cause fairness problems. For instance, if a request gets merged into a low weight cgroup's request, high weight cgroup now will depend on low weight cgroup to get scheduled. If high weigt cgroup needs that io request to complete before submitting more requests, then it will also lose its timeslice. Following script demonstrates the problem. Group g1 has a low weight, g2 and g3 have equal high weights but g2's requests are adjacent to g1's requests so they are subject to merging. Due to these merges, g2 gets poor disk time allocation. cat > cfq-merge-repro.sh << "EOF" #!/bin/bash set -e IO_ROOT=/mnt-cgroup/io mkdir -p $IO_ROOT if ! mount | grep -qw $IO_ROOT; then mount -t cgroup none -oblkio $IO_ROOT fi cd $IO_ROOT for i in g1 g2 g3; do if [ -d $i ]; then rmdir $i fi done mkdir g1 && echo 10 > g1/blkio.weight mkdir g2 && echo 495 > g2/blkio.weight mkdir g3 && echo 495 > g3/blkio.weight RUNTIME=10 (echo $BASHPID > g1/cgroup.procs && fio --readonly --name name1 --filename /dev/sdb \ --rw read --size 64k --bs 64k --time_based \ --runtime=$RUNTIME --offset=0k &> /dev/null)& (echo $BASHPID > g2/cgroup.procs && fio --readonly --name name1 --filename /dev/sdb \ --rw read --size 64k --bs 64k --time_based \ --runtime=$RUNTIME --offset=64k &> /dev/null)& (echo $BASHPID > g3/cgroup.procs && fio --readonly --name name1 --filename /dev/sdb \ --rw read --size 64k --bs 64k --time_based \ --runtime=$RUNTIME --offset=256k &> /dev/null)& sleep $((RUNTIME+1)) for i in g1 g2 g3; do echo ---- $i ---- cat $i/blkio.time done EOF # ./cfq-merge-repro.sh ---- g1 ---- 8:16 162 ---- g2 ---- 8:16 165 ---- g3 ---- 8:16 686 After applying the patch: # ./cfq-merge-repro.sh ---- g1 ---- 8:16 90 ---- g2 ---- 8:16 445 ---- g3 ---- 8:16 471 Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20block: Fix spelling in a source code commentBart Van Assche
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20block: expose QUEUE_FLAG_DAX in sysfsYigal Korman
Provides the ability to identify DAX enabled devices in userspace. Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-13block: atari: Return early for unsupported sector sizeGabriel Krisman Bertazi
For 4K LBA or very large disks, atari_partition can easily get tricked into thinking it has found an Atari partition table. Depending on the data in the disk, it ends up creating partitions with awkward lengths. We saw logs like this while playing with fio. [5.625867] nvme2n1: AHDI p2 [5.625872] nvme2n1: p2 size 2910030523 extends beyond EOD, truncated People has had issues with misinterpreted AHDI partition tables for a long time, see this BSD thread from 1995, for example. https://mail-index.netbsd.org/port-atari/1995/11/19/0001.html Since the atari partition, according to the spec, doesn't even support sector sizes with more than 512, a quick sanity check is reasonable to just bail out early, before even attempting to read sector 0. Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-28cfq-iosched: Charge at least 1 jiffie instead of 1 nsJan Kara
Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds) could result in charging just 1 ns to a cgroup submitting IO instead of 1 jiffie we always charged before. It is arguable what is the right amount to change but for now lets retain the old behavior of always charging at least one jiffie. Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-28cfq-iosched: Fix regression in bonnie++ rewrite performanceJan Kara
Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds) broke the condition for detecting starved sync IO in cfq_completed_request() because rq->start_time remained in jiffies but we compared it with nanosecond values. This manifested as a regression in bonnie++ rewrite performance because we always ended up considering sync IO starved and thus never increased async IO queue depth. Since rq->start_time is used in a lot of places, converting it to ns values would be non-trivial. So just revert the condition in CFQ to use comparison with jiffies. This will lead to suboptimal results if cfq_fifo_expire[1] will ever come close to 1 jiffie but so far we are relatively far from that with the storage used with CFQ (the default value is 128 ms). Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-28cfq-iosched: Convert slice_resid from u64 to s64Jan Kara
slice_resid can be both positive and negative. Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds) converted it from long to u64. Although this did not introduce any functional regression (the operations just overflow and the result was fine), it is certainly wrong and could cause issues in future. So convert the type to more appropriate s64. Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-28block: Convert fifo_time from ulong to u64Jan Kara
Currently rq->fifo_time is unsigned long but CFQ stores nanosecond timestamp in it which would overflow on 32-bit archs. Convert it to u64 to avoid the overflow. Since the rq->fifo_time is unioned with struct call_single_data(), this does not change the size of struct request in any way. We have to slightly fixup block/deadline-iosched.c so that comparison happens in the right types. Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-14block/blk-cgroup.c: Declare local symbols staticBart Van Assche
Detected by sparse. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-14block/bio-integrity.c: Add #include "blk.h"Bart Van Assche
This patch avoids that building with W=1 C=2 triggers the following warning: block/bio-integrity.c:35:6: warning: symbol 'blk_flush_integrity' was not declared. Should it be static? Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-14block/partition-generic.c: Remove a set-but-not-used variableBart Van Assche
A value is assigned to the variable 'info' but that value is never used. Hence remove the variable 'info'. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09cfq-iosched: temporarily boost queue priority for idle classesJens Axboe
If we're queuing REQ_PRIO IO and the task is running at an idle IO class, then temporarily boost the priority. This prevents livelocks due to priority inversion, when a low priority task is holding file system resources while attempting to do IO. An example of that is shown below. An ioniced idle task is holding the directory mutex, while a normal priority task is trying to do a directory lookup. [478381.198925] ------------[ cut here ]------------ [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds. [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1 [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080 [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698 [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002 [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000 [478381.206874] Call Trace: [478381.207253] [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80 [478381.208175] [<ffffffff8177cea7>] schedule+0x37/0x90 [478381.208932] [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250 [478381.209805] [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50 [478381.210706] [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0 [478381.211489] [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110 [478381.212402] [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90 [478381.213280] [<ffffffff8177d616>] bit_wait_io+0x36/0x50 [478381.214063] [<ffffffff8177d325>] __wait_on_bit+0x65/0x90 [478381.214961] [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80 [478381.215872] [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90 [478381.216806] [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40 [478381.217773] [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30 [478381.218641] [<ffffffff8123c557>] ext4_bread+0x57/0x70 [478381.219425] [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380 [478381.220467] [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170 [478381.221357] [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0 [478381.222208] [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510 [478381.223090] [<ffffffff812471a2>] ext4_lookup+0x52/0x160 [478381.223882] [<ffffffff811c401d>] lookup_real+0x1d/0x60 [478381.224675] [<ffffffff811c4698>] __lookup_hash+0x38/0x50 [478381.225697] [<ffffffff817745bd>] lookup_slow+0x45/0xab [478381.226941] [<ffffffff811c690e>] link_path_walk+0x7ae/0x820 [478381.227880] [<ffffffff811c6a42>] path_init+0xc2/0x430 [478381.228677] [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20 [478381.229776] [<ffffffff811c8c57>] path_openat+0x77/0x620 [478381.230767] [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70 [478381.232019] [<ffffffff811cb253>] do_filp_open+0x43/0xa0 [478381.233016] [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70 [478381.234072] [<ffffffff811c0cb0>] do_open_execat+0x70/0x170 [478381.235039] [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0 [478381.236051] [<ffffffff811c214c>] do_execve+0x2c/0x30 [478381.236809] [<ffffffff811ca392>] ? getname+0x12/0x20 [478381.237564] [<ffffffff811c23be>] SyS_execve+0x2e/0x40 [478381.238338] [<ffffffff81780a1d>] stub_execve+0x6d/0xa0 [478381.239126] ------------[ cut here ]------------ [478381.239915] ------------[ cut here ]------------ [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds. [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1 [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080 [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0 [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8 [478381.248252] Call Trace: [478381.248630] [<ffffffff8177cea7>] schedule+0x37/0x90 [478381.249382] [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10 [478381.250465] [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100 [478381.251409] [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f [478381.252199] [<ffffffff817745ae>] lookup_slow+0x36/0xab [478381.253023] [<ffffffff811c690e>] link_path_walk+0x7ae/0x820 [478381.253877] [<ffffffff811aeb41>] ? try_charge+0xc1/0x700 [478381.254690] [<ffffffff811c6a42>] path_init+0xc2/0x430 [478381.255525] [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20 [478381.256450] [<ffffffff811c8c57>] path_openat+0x77/0x620 [478381.257256] [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0 [478381.258390] [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720 [478381.259309] [<ffffffff811cb253>] do_filp_open+0x43/0xa0 [478381.260139] [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120 [478381.260962] [<ffffffff811b95ac>] do_sys_open+0x13c/0x230 [478381.261779] [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170 [478381.262851] [<ffffffff811b96c2>] SyS_open+0x22/0x30 [478381.263598] [<ffffffff81780532>] system_call_fastpath+0x12/0x17 [478381.264551] ------------[ cut here ]------------ [478381.265377] ------------[ cut here ]------------ Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
2016-06-09blk-mq: actually hook up defer list when running requestsOmar Sandoval
If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip over the rest of the loop body. However, dptr is assigned later in the loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd want it for. NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other in-tree driver, but if the code's going to be there, it might as well work. Fixes: 74c450521dd8 ("blk-mq: add a 'list' parameter to ->queue_rq()") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-08cfq-iosched: Convert to use highres timersJan Kara
Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-08cfq-iosched: Expose microsecond interfacesJeff Moyer
Expose interfaces to tune time slices of CFQ IO scheduler in microseconds. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-08cfq-iosched: Convert from jiffies to nanosecondsJeff Moyer
Convert all time-keeping in CFQ IO scheduler from jiffies to nanoseconds so that we can later make the intervals more fine-grained than jiffies. One jiffie is several miliseconds and even for today's rotating disks that is a noticeable amount of time and thus we leave disk unnecessarily idle. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers: add REQ_OP_FLUSH operationMike Christie
This adds a REQ_OP_FLUSH operation that is sent to request_fn based drivers by the block layer's flush code, instead of sending requests with the request->cmd_flags REQ_FLUSH bit set. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, fs: shrink bi_rw from long to intMike Christie
We don't need bi_rw to be so large on 64 bit archs, so reduce it to unsigned int. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: convert is_sync helpers to use REQ_OPs.Mike Christie
This patch converts the is_sync helpers to use separate variables for the operation and flags. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: convert merge/insert code to check for REQ_OPs.Mike Christie
This patch converts the block layer merging code to use separate variables for the operation and flags, and to check req_op for the REQ_OP. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07blkg_rwstat: separate op from flagsMike Christie
The bio and request operation and flags are going to be separate definitions, so we cannot pass them in as a bitmap. This patch converts the blkg_rwstat code and its caller, cfq, to pass in the values separately. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: prepare elevator to use REQ_OPs.Mike Christie
This patch converts the elevator code to use separate variables for the operation and flags, and to check req_op for the REQ_OP. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: prepare mq request creation to use REQ_OPsMike Christie
This patch modifies the blk mq request creation code to use separate variables for the operation and flags, because in the the next patches the struct request users will be converted like was done for bios. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: prepare request creation/destruction code to use REQ_OPsMike Christie
This patch prepares *_get_request/*_put_request and freed_request, to use separate variables for the operation and flags. In the next patches the struct request users will be converted like was done for bios where the op and flags are set separately. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: copy bio op to request opMike Christie
The bio users should now always be setting up the bio op. This patch has the block layer copy that to the request. Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block discard: use bio set op accessorMike Christie
This converts the block issue discard helper and users to use the bio_set_op_attrs accessor and only pass in the operation flags like REQ_SEQURE. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, fs, mm, drivers: use bio set/get op accessorsMike Christie
This patch converts the simple bi_rw use cases in the block, drivers, mm and fs code to set/get the bio operation using bio_set_op_attrs/bio_op These should be simple one or two liner cases, so I just did them in one patch. The next patches handle the more complicated cases in a module per patch. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, cgroup: use op_is_write helper instead of checking for REQ_WRITEMike Christie
We currently set REQ_WRITE/WRITE for all non READ IOs like discard, flush, writesame, etc. In the next patches where we no longer set up the op as a bitmap, we will not be able to detect a operation direction like writesame by testing if REQ_WRITE is set. This patch converts the drivers and cgroup to use the op_is_write helper. This should just cover the simple cases. I did dm, md and bcache in their own patches because they were more involved. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block/fs/drivers: remove rw argument from submit_bioMike Christie
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-27Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A set of fixes that wasn't included in the first merge window pull request. This pull request contains: - A set of NVMe fixes from Keith, and one from Nic for the integrity side of it. - Fix from Ming, clearing ->mq_ops if we don't successfully setup a queue for multiqueue. - A set of stability fixes for bcache from Jiri, and also marking bcache as orphaned as it's no longer actively maintained (in mainline, at least)" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: clear q->mq_ops if init fail MAINTAINERS: mark bcache as orphan bcache: bch_gc_thread() is not freezable bcache: bch_allocator_thread() is not freezable bcache: bch_writeback_thread() is not freezable nvme/host: Add missing blk_integrity tag_size + flags assignments NVMe: Add device ID's with stripe quirk NVMe: Short-cut removal on surprise hot-unplug NVMe: Allow user initiated rescan NVMe: Reduce driver log spamming NVMe: Unbind driver on failure NVMe: Delete only created queues NVMe: Allocate queues only for online cpus
2016-05-26Merge tag 'dax-misc-for-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull misc DAX updates from Vishal Verma: "DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks" * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: fix a comment in dax_zero_page_range and dax_truncate_page dax: for truncate/hole-punch, do zeroing through the driver if possible dax: export a low-level __dax_zero_page_range helper dax: use sb_issue_zerout instead of calling dax_clear_sectors dax: enable dax in the presence of known media errors (badblocks) dax: fallback from pmd to pte on error block: Update blkdev_dax_capable() for consistency xfs: Add alignment check for DAX mount ext2: Add alignment check for DAX mount ext4: Add alignment check for DAX mount block: Add bdev_dax_supported() for dax mount checks block: Add vfs_msg() interface dax: Remove redundant inode size checks dax: Remove pointless writeback from dax_do_io() dax: Remove zeroing from dax_io() dax: Remove dead zeroing code from fault handlers ext2: Avoid DAX zeroing to corrupt data ext2: Fix block zeroing in ext2_get_blocks() for DAX dax: Remove complete_unwritten argument DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26blk-mq: clear q->mq_ops if init failMing Lin
blk_mq_init_queue() calls blk_mq_init_allocated_queue(), but q->mq_ops was not cleared when blk_mq_init_allocated_queue() fails. Then blk_cleanup_queue() calls blk_mq_free_queue() which will crash because: - q->all_q_node is not added to all_q_list yet - q->tag_set is NULL - hctx was not setup yet or already freed Fixed it by clearing q->mq_ops on error path. Signed-off-by: Ming Lin <ming.l@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-23Merge tag 'libnvdimm-for-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this update was stabilized before the merge window and appeared in -next. The "device dax" implementation was revised this week in response to review feedback, and to address failures detected by the recently expanded ndctl unit test suite. Not included in this pull request are two dax topic branches (dax error handling, and dax radix-tree locking). These topics were deferred to get a few more days of -next integration testing, and to coordinate a branch baseline with Ted and the ext4 tree. Vishal and Ross will send the error handling and locking topics respectively in the next few days. This branch has received a positive build result from the kbuild robot across 226 configs. Summary: - Device DAX for persistent memory: Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped without need of an intervening file system. Device DAX is strict, precise and predictable. Specifically this interface: a) Guarantees fault granularity with respect to a given page size (pte, pmd, or pud) set at configuration time. b) Enforces deterministic behavior by being strict about what fault scenarios are supported. Persistent memory is the first target, but the mechanism is also targeted for exclusive allocations of performance/feature differentiated memory ranges. - Support for the HPE DSM (device specific method) command formats. This enables management of these first generation devices until a unified DSM specification materializes. - Further ACPI 6.1 compliance with support for the common dimm identifier format. - Various fixes and cleanups across the subsystem" * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits) libnvdimm, dax: fix deletion libnvdimm, dax: fix alignment validation libnvdimm, dax: autodetect support libnvdimm: release ida resources Revert "block: enable dax for raw block devices" /dev/dax, core: file operations and dax-mmap /dev/dax, pmem: direct access to persistent memory libnvdimm: stop requiring a driver ->remove() method libnvdimm, dax: record the specified alignment of a dax-device instance libnvdimm, dax: reserve space to store labels for device-dax libnvdimm, dax: introduce device-dax infrastructure nfit: add sysfs dimm 'family' and 'dsm_mask' attributes tools/testing/nvdimm: ND_CMD_CALL support nfit: disable vendor specific commands nfit: export subsystem ids as attributes nfit: fix format interface code byte order per ACPI6.1 nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism nfit, libnvdimm: clarify "commands" vs "_DSMs" libnvdimm: increase max envelope size for ioctl acpi/nfit: Add sysfs "id" for NVDIMM ID ...
2016-05-20Revert "block: enable dax for raw block devices"Dan Williams
This reverts commit 5a023cdba50c5f5f2bc351783b3131699deb3937. The functionality is superseded by the new "Device DAX" facility. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-20block/partitions/ldm.c: use generic UUID libraryAndy Shevchenko
Instead of opencoding let's use generic UUID library functions here. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits) gitignore: fix wording mfd: ab8500-debugfs: fix "between" in printk memstick: trivial fix of spelling mistake on management cpupowerutils: bench: fix "average" treewide: Fix typos in printk IB/mlx4: printk fix pinctrl: sirf/atlas7: fix printk spelling serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/ w1: comment spelling s/minmum/minimum/ Blackfin: comment spelling s/divsor/divisor/ metag: Fix misspellings in comments. ia64: Fix misspellings in comments. hexagon: Fix misspellings in comments. tools/perf: Fix misspellings in comments. cris: Fix misspellings in comments. c6x: Fix misspellings in comments. blackfin: Fix misspelling of 'register' in comment. avr32: Fix misspelling of 'definitions' in comment. treewide: Fix typos in printk Doc: treewide : Fix typos in DocBook/filesystem.xml ...
2016-05-17Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...
2016-05-17Merge branch 'for-4.7/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block layer updates from Jens Axboe: "This is the core block IO changes for this merge window. Nothing earth shattering in here, it's mostly just fixes. In detail: - Fix for a long standing issue where wrong ordering in blk-mq caused order_to_size() to spew a warning. From Bart. - Async discard support from Christoph. Basically just splitting our sync interface into a submit + wait part. - Add a cleaner interface for flagging whether a device has a write back cache or not. We've previously overloaded blk_queue_flush() with this, but let's make it more explicit. Drivers cleaned up and updated in the drivers pull request. From me. - Fix for a double check for whether IO accounting is enabled or not. From Michael Callahan. - Fix for the async discard from Mike Snitzer, reinstating the early EOPNOTSUPP return if the device doesn't support discards. - Also from Mike, export bio_inc_remaining() so dm can drop it's private copy of it. - From Ming Lin, add support for passing in an offset for request payloads. - Tag function export from Sagi, which will be used in NVMe in the drivers pull. - Two blktrace related fixes from Shaohua. - Propagate NOMERGE flag when making a request from a bio, also from Shaohua. - An optimization to not parse cgroup paths in blk-throttle, if we don't need to. From Shaohua" * 'for-4.7/core' of git://git.kernel.dk/linux-block: blk-mq: fix undefined behaviour in order_to_size() blk-throttle: don't parse cgroup path if trace isn't enabled blktrace: add missed mask name blktrace: delete garbage for message trace block: make bio_inc_remaining() interface accessible again block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard block: Minor blk_account_io_start usage cleanup block: add __blkdev_issue_discard block: remove struct bio_batch block: copy NOMERGE flag from bio to request block: add ability to flag write back caching on a device blk-mq: Export tagset iter function block: add offset in blk_add_request_payload() writeback: Fix performance regression in wb_over_bg_thresh()
2016-05-17block: Update blkdev_dax_capable() for consistencyToshi Kani
blkdev_dax_capable() is similar to bdev_dax_supported(), but needs to remain as a separate interface for checking dax capability of a raw block device. Rename and relocate blkdev_dax_capable() to keep them maintained consistently, and call bdev_direct_access() for the dax capability check. There is no change in the behavior. Link: https://lkml.org/lkml/2016/5/9/950 Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@fb.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-16blk-mq: fix undefined behaviour in order_to_size()Bartlomiej Zolnierkiewicz
When this_order variable in blk_mq_init_rq_map() becomes zero the code incorrectly decrements the variable and passes the result to order_to_size() helper causing undefined behaviour: UBSAN: Undefined behaviour in block/blk-mq.c:1459:27 shift exponent 4294967295 is too large for 32-bit type 'unsigned int' CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc6-00072-g33656a1 #22 Fix the code by checking this_order variable for not having the zero value first. Reported-by: Meelis Roos <mroos@linux.ee> Fixes: 320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism") Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-11Merge branch 'ovl-fixes' into for-linusAl Viro
2016-05-10blk-throttle: don't parse cgroup path if trace isn't enabledShaohua Li
if trace isn't enabled, parsing cgroup path just wastes cpu Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-05block: make bio_inc_remaining() interface accessible againMike Snitzer
Commit 326e1dbb57 ("block: remove management of bi_remaining when restoring original bi_end_io") made bio_inc_remaining() private to bio.c because the only use-case that made sense was confined to the bio_chain() interface. Since that time DM thinp went on to use bio_chain() in its relatively complex implementation of async discard support. That implementation, even when converted over to use the new async __blkdev_issue_discard() interface, depends on deferred completion of the original discard bio -- which is most appropriately implemented using bio_inc_remaining(). DM thinp foolishly duplicated bio_inc_remaining(), local to dm-thin.c as __bio_inc_remaining(), so re-exporting bio_inc_remaining() allows us to put an end to that foolishness. All said, bio_inc_remaining() should really only be used in conjunction with bio_chain(). It isn't intended for generic bio reference counting. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-05block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discardMike Snitzer
Commit 38f25255330 ("block: add __blkdev_issue_discard") incorrectly disallowed the early return of -EOPNOTSUPP if the device doesn't support discard (or secure discard). This early return of -EOPNOTSUPP has always been part of blkdev_issue_discard() interface so there isn't a good reason to break that behaviour -- especially when it can be easily reinstated. The nuance of allowing early return of -EOPNOTSUPP vs disallowing late return of -EOPNOTSUPP is: if the overall device never advertised support for discards and one is issued to the device it is beneficial to inform the caller that discards are not supported via -EOPNOTSUPP. But if a device advertises discard support it means that at least a subset of the device does have discard support -- but it could be that discards issued to some regions of a stacked device will not be supported. In that case the late return of -EOPNOTSUPP must be disallowed. Fixes: 38f25255330 ("block: add __blkdev_issue_discard") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-03block: Minor blk_account_io_start usage cleanupMichael Callahan
blk_account_io_start does not need to be wrapped with blk_do_io_stat ais it already checks for that condition. Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02block: add __blkdev_issue_discardChristoph Hellwig
This is a version of blkdev_issue_discard which doesn't wait for the I/O to complete, but instead allows the caller to submit the final bio and/or chain it to others. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagig@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02block: remove struct bio_batchChristoph Hellwig
It can be replaced with a combination of bio_chain and submit_bio_wait. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagig@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-18treewide: Fix typos in printkMasanari Iida
This patch fix spelling typos found in printk within various part of the kernel sources. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-15Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A few fixes for the current series. This contains: - Two fixes for NVMe: One fixes a reset race that can be triggered by repeated insert/removal of the module. The other fixes an issue on some platforms, where we get probe timeouts since legacy interrupts isn't working. This used not to be a problem since we had the worker thread poll for completions, but since that was killed off, it means those poor souls can't successfully probe their NVMe device. Use a proper IRQ check and probe (msi-x -> msi ->legacy), like most other drivers to work around this. Both from Keith. - A loop corruption issue with offset in iters, from Ming Lei. - A fix for not having the partition stat per cpu ref count initialized before sending out the KOBJ_ADD, which could cause user space to access the counter prior to initialization. Also from Ming Lei. - A fix for using the wrong congestion state, from Kaixu Xia" * 'for-linus' of git://git.kernel.dk/linux-block: block: loop: fix filesystem corruption in case of aio/dio NVMe: Always use MSI/MSI-x interrupts NVMe: Fix reset/remove race writeback: fix the wrong congested state variable definition block: partition: initialize percpuref before sending out KOBJ_ADD