summaryrefslogtreecommitdiff
path: root/Documentation/block
AgeCommit message (Collapse)Author
2019-04-09block: null: Add documentation for "zone_nr_conv" paramMinwoo Im
zone_nr_conv module parameter was introduced by a commit ea2c18e1("null_blk: Add conventional zone configuration for zoned support"). This patch describes "zone_nr_conv" module parameter to the document. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01doc, block, bfq: add information on bfq execution timePaolo Valente
The execution time of BFQ has been slightly lowered. Report the new execution time in BFQ documentation. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: document usage of bio iterator helpersMing Lei
Now multi-page bvec is supported, some helpers may return page by page, meantime some may return segment by segment, this patch documents the usage. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-09block: doc: add slice_idle_us to bfq documentationJohn Pittman
Of the tunables available for the bfq I/O scheduler, the only one missing from the documentation in 'Documentation/block/bfq-iosched.txt' is slice_idle_us. Add this tunable to the documentation and a short explanation of its purpose. Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: John Pittman <jpittman@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-06null_blk: add zoned config support informationJohn Pittman
If the kernel is built without CONFIG_BLK_DEV_ZONED, a modprobe of the null_blk driver with zoned=1 fails with 'Invalid argument'. This can be confusing to users, prompting a search as to why the parameter is invalid. To assist in that search, add a bit more information to the failure, additionally adding to the documentation that CONFIG_BLK_DEV_ZONED is needed for zoned=1. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: John Pittman <jpittman@redhat.com> Added null_blk prefix to error message. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-04block: add documentation for io_timeoutWeiping Zhang
Add documentation for /sys/block/<disk>/queue/io_timeout. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-16block: update sysfs documentationDamien Le Moal
Add the description of the zoned, nr_zones and chunk_sectors sysfs queue attributes to Documentation/block/queue-sysfs.txt. The description of the zoned and chunk_sector attributes are mostly copied from ABI/testing/sysfs-block (added a typo fix). While at it, also fix a typo in the description of the io_poll_delay attribute. nr_zones description is also added to ABI/testing/sysfs-block and contact email address updated for the zoned attribute. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07block: remove legacy IO schedulersJens Axboe
Retain the deadline documentation, as that carries over to mq-deadline as well. Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07block: remove legacy rq taggingJens Axboe
It's now unused, kill it. Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-09Drop all 00-INDEX files from Documentation/Henrik Austad
This is a respin with a wider audience (all that get_maintainer returned) and I know this spams a *lot* of people. Not sure what would be the correct way, so my apologies for ruining your inbox. The 00-INDEX files are supposed to give a summary of all files present in a directory, but these files are horribly out of date and their usefulness is brought into question. Often a simple "ls" would reveal the same information as the filenames are generally quite descriptive as a short introduction to what the file covers (it should not surprise anyone what Documentation/sched/sched-design-CFS.txt covers) A few years back it was mentioned that these files were no longer really needed, and they have since then grown further out of date, so perhaps it is time to just throw them out. A short status yields the following _outdated_ 00-INDEX files, first counter is files listed in 00-INDEX but missing in the directory, last is files present but not listed in 00-INDEX. List of outdated 00-INDEX: Documentation: (4/10) Documentation/sysctl: (0/1) Documentation/timers: (1/0) Documentation/blockdev: (3/1) Documentation/w1/slaves: (0/1) Documentation/locking: (0/1) Documentation/devicetree: (0/5) Documentation/power: (1/1) Documentation/powerpc: (0/5) Documentation/arm: (1/0) Documentation/x86: (0/9) Documentation/x86/x86_64: (1/1) Documentation/scsi: (4/4) Documentation/filesystems: (2/9) Documentation/filesystems/nfs: (0/2) Documentation/cgroup-v1: (0/2) Documentation/kbuild: (0/4) Documentation/spi: (1/0) Documentation/virtual/kvm: (1/0) Documentation/scheduler: (0/2) Documentation/fb: (0/1) Documentation/block: (0/1) Documentation/networking: (6/37) Documentation/vm: (1/3) Then there are 364 subdirectories in Documentation/ with several files that are missing 00-INDEX alltogether (and another 120 with a single file and no 00-INDEX). I don't really have an opinion to whether or not we /should/ have 00-INDEX, but the above 00-INDEX should either be removed or be kept up to date. If we should keep the files, I can try to keep them updated, but I rather not if we just want to delete them anyway. As a starting point, remove all index-files and references to 00-INDEX and see where the discussion is going. Signed-off-by: Henrik Austad <henrik@austad.us> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Just-do-it-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: [Almost everybody else] Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-07-18block: Track DISCARD statistics and output them in stat and diskstatMichael Callahan
Add tracking of REQ_OP_DISCARD ios to the partition statistics and append them to the various stat files in /sys as well as /proc/diskstats. These are tracked with the same four stats as reads and writes: Number of discard ios completed. Number of discard ios merged Number of discard sectors completed Milliseconds spent on discard requests This is done via adding a new STAT_DISCARD define to genhd.h and then using it to index that stat field for discard requests. tj: Refreshed on top of v4.17 and other previous updates. Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09null_blk: add zone supportMatias Bjørling
Adds support for exposing a null_blk device through the zone device interface. The interface is managed with the parameters zoned and zone_size. If zoned is set, the null_blk instance registers as a zoned block device. The zone_size parameter defines how big each zone will be. Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-15block: remov blk_queue_invalidate_tagsChristoph Hellwig
This function is entirely unused, so remove it and the tag_queue_busy member of struct request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-04Merge tag 'docs-4.18' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "There's been a fair amount of work in the docs tree this time around, including: - Extensive RST conversions and organizational work in the memory-management docs thanks to Mike Rapoport. - An update of Documentation/features from Andrea Parri and a script to keep it updated. - Various LICENSES updates from Thomas, along with a script to check SPDX tags. - Work to fix dangling references to documentation files; this involved a fair number of one-liner comment changes outside of Documentation/ ... and the usual list of documentation improvements, typo fixes, etc" * tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits) Documentation: document hung_task_panic kernel parameter docs/admin-guide/mm: add high level concepts overview docs/vm: move ksm and transhuge from "user" to "internals" section. docs: Use the kerneldoc comments for memalloc_no*() doc: document scope NOFS, NOIO APIs docs: update kernel versions and dates in tables docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge docs/vm: transhuge: minor updates docs/vm: transhuge: change sections order Documentation: arm: clean up Marvell Berlin family info Documentation: gpio: driver: Fix a typo and some odd grammar docs: ranoops.rst: fix location of ramoops.txt scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode docs: uio-howto.rst: use a code block to solve a warning mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback w1: w1_io.c: fix a kernel-doc warning Documentation/process/posting: wrap text at 80 cols docs: admin-guide: add cgroup-v2 documentation Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'" Documentation: refcount-vs-atomic: Update reference to LKMM doc. ...
2018-05-25null_blk: add blocking description and remove lightnvmLiu Bo
- The description of 'blocking' is missing in null_blk.txt - The 'lightnvm' parameter has been removed in null_blk.c This updates both in null_blk.txt. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-08Documentation: block: cmdline-partition.txt fixes and additionsRandy Dunlap
Make the description of the kernel command line option "blkdevparts" a bit more flowing and readable. Fix a few typos. Add the optional <size> and <offset> suffixes. Note that size can be "-" to indicate all of the remaining space. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Cai Zhiyong <caizhiyong@huawei.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-11-14block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUPLuca Miccio
BFQ currently creates, and updates, its own instance of the whole set of blkio statistics that cfq creates. Yet, from the comments of Tejun Heo in [1], it turned out that most of these statistics are meant/useful only for debugging. This commit makes BFQ create the latter, debugging statistics only if the option CONFIG_DEBUG_BLK_CGROUP is set. By doing so, this commit also enables BFQ to enjoy a high perfomance boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then BFQ has to update far fewer statistics, and, in particular, not the heaviest to update. To give an idea of the benefits, if CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on null_blk (configured with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS (+30%). We have measured similar or even much higher boosts with other CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have been obtained and can be reproduced very easily with the script in [1]. [1] https://www.spinics.net/lists/linux-block/msg18943.html Suggested-by: Tejun Heo <tj@kernel.org> Suggested-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-14block, bfq: update blkio stats outside the scheduler lockPaolo Valente
bfq invokes various blkg_*stats_* functions to update the statistics contained in the special files blkio.bfq.* in the blkio controller groups, i.e., the I/O accounting related to the proportional-share policy provided by bfq. The execution of these functions takes a considerable percentage, about 40%, of the total per-request execution time of bfq (i.e., of the sum of the execution time of all the bfq functions that have to be executed to process an I/O request from its creation to its destruction). This reduces the request-processing rate sustainable by bfq noticeably, even on a multicore CPU. In fact, the bfq functions that invoke blkg_*stats_* functions cannot be executed in parallel with the rest of the code of bfq, because both are executed under the same same per-device scheduler lock. To reduce this slowdown, this commit moves, wherever possible, the invocation of these functions (more precisely, of the bfq functions that invoke blkg_*stats_* functions) outside the critical sections protected by the scheduler lock. With this change, and with all blkio.bfq.* statistics enabled, the throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel i7-4850HQ, in case of 8 threads doing random I/O in parallel on null_blk, with the latter configured with 0 latency. We obtained the same or higher throughput boosts, up to +30%, with other processors (some figures are reported in the documentation). For our tests, we used the script [1], with which our results can be easily reproduced. NOTE. This commit still protects the invocation of blkg_*stats_* functions with the request_queue lock, because the group these functions are invoked on may otherwise disappear before or while these functions are executed. Fortunately, tests without even this lock show, by difference, that the serialization caused by this lock has a little impact (at most ~5% of throughput reduction). [1] https://github.com/Algodev-github/IOSpeed Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-14doc, block, bfq: update max IOPS sustainable with BFQPaolo Valente
We have investigated more deeply the performance of BFQ, in terms of number of IOPS that can be processed by the CPU when BFQ is used as I/O scheduler. In more detail, using the script [1], we have measured the number of IOPS reached on top of a null block device configured with zero latency, as a function of the workload (sequential read, sequential write, random read, random write) and of the system (we considered desktops, laptops and embedded systems). Basing on the resulting figures, with this commit we update the current, conservative IOPS range reported in BFQ documentation. In particular, the documentation now reports, for each of three different systems, the lowest number of IOPS obtained for that system with the above test (namely, the value obtained with the workload leading to the lowest IOPS). [1] https://github.com/Algodev-github/IOSpeed Reviewed-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10block: remove __bio_kmap_atomicChristoph Hellwig
This helper doesn't buy us much over calling kmap_atomic directly. In fact in the only caller it does a bit of useless work as the caller already has the bvec at hand, and said caller would even buggy for a multi-segment bio due to the use of this helper. So just remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10block: kill bio_kmap/kunmap_irq()Jens Axboe
There are no users of it anymore. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-07null_blk: add an usage for shared tags in documentationMinwoo Im
Add usage explanation for a shared_tags, introduced by commit: 82f402fefa50 ("null_blk: add support for shared tags") Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reworded slightly. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-07null_blk: fix default values in documentationMinwoo Im
null_blk.c has initial value of (1) nr_devices as 1. (2) completion_nsec as 10,000ns, not 10.000ns. documentation should be updated for fixes above. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13null_blk: add usage hints for no_schedweiping zhang
This parameter provide an option to disable io scheduler when nullb* in multi-queue mode. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13null_blk: update usage hints for submit_queuesweiping zhang
update the range of submits_queues, and correct usage hints. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-31doc, block, bfq: better describe how to properly configure bfqPaolo Valente
Many users have reported the lack of an HOWTO for properly configuring bfq as a function of the goal one wants to achieve (max responsiveness, max throughput, ...). In fact, all needed details are already provided in the documentation file bfq-iosched.txt. Yet the document lacks guidance on which parameter descriptions to look at. This commit adds some simple direction. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Jeremy Hickman <jeremywh7@gmail.com> Reviewed-by: Laurentiu Nicola <lnicola@dend.ro> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-31doc, block, bfq: fix some typos and remove stale stuffPaolo Valente
In addition to containing some typos and stale sentences, the file bfq-iosched.txt still mentioned a set of sysfs parameters that have been removed from this version of bfq. This commit fixes all these issues. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Jeremy Hickman <jeremywh7@gmail.com> Reviewed-by: Laurentiu Nicola <lnicola@dend.ro> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03bio-integrity: fold bio_integrity_enabled to bio_integrity_prepDmitry Monakhov
Currently all integrity prep hooks are open-coded, and if prepare fails we ignore it's code and fail bio with EIO. Let's return real error to upper layer, so later caller may react accordingly. In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled, so it is reasonable to fold it in to one function. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [hch: merged with the latest block tree, return bool from bio_integrity_prep] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18block: remove bio_clone() and all references.NeilBrown
bio_clone() is no longer used. Only bio_clone_bioset() or bio_clone_fast(). This is for the best, as bio_clone() used fs_bio_set, and filesystems are unlikely to want to use bio_clone(). So remove bio_clone() and all references. This includes a fix to some incorrect documentation. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-05-10block, bfq: stress that low_latency must be off to get max throughputPaolo Valente
The introduction of the BFQ and Kyber I/O schedulers has triggered a new wave of I/O benchmarks. Unfortunately, comments and discussions on these benchmarks confirm that there is still little awareness that it is very hard to achieve, at the same time, a low latency and a high throughput. In particular, virtually all benchmarks measure throughput, or throughput-related figures of merit, but, for BFQ, they use the scheduler in its default configuration. This configuration is geared, instead, toward a low latency. This is evidently a sign that BFQ documentation is still too unclear on this important aspect. This commit addresses this issue by stressing how BFQ configuration must be (easily) changed if the only goal is maximum throughput. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: improve responsivenessPaolo Valente
This patch introduces a simple heuristic to load applications quickly, and to perform the I/O requested by interactive applications just as quickly. To this purpose, both a newly-created queue and a queue associated with an interactive application (we explain in a moment how BFQ decides whether the associated application is interactive), receive the following two special treatments: 1) The weight of the queue is raised. 2) The queue unconditionally enjoys device idling when it empties; in fact, if the requests of a queue are sync, then performing device idling for the queue is a necessary condition to guarantee that the queue receives a fraction of the throughput proportional to its weight (see [1] for details). For brevity, we call just weight-raising the combination of these two preferential treatments. For a newly-created queue, weight-raising starts immediately and lasts for a time interval that: 1) depends on the device speed and type (rotational or non-rotational), and 2) is equal to the time needed to load (start up) a large-size application on that device, with cold caches and with no additional workload. Finally, as for guaranteeing a fast execution to interactive, I/O-related tasks (such as opening a file), consider that any interactive application blocks and waits for user input both after starting up and after executing some task. After a while, the user may trigger new operations, after which the application stops again, and so on. Accordingly, the low-latency heuristic weight-raises again a queue in case it becomes backlogged after being idle for a sufficiently long (configurable) time. The weight-raising then lasts for the same time as for a just-created queue. According to our experiments, the combination of this low-latency heuristic and of the improvements described in the previous patch allows BFQ to guarantee a high application responsiveness. [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O Scheduler", Proceedings of the First Workshop on Mobile System Technologies (MST-2015), May 2015. http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: add full hierarchical scheduling and cgroups supportArianna Avanzini
Add complete support for full hierarchical scheduling, with a cgroups interface. Full hierarchical scheduling is implemented through the 'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues associated with processes, and groups are represented in general by entities. Given the bfq_queues associated with the processes belonging to a given group, the entities representing these queues are sons of the entity representing the group. At higher levels, if a group, say G, contains other groups, then the entity representing G is the parent entity of the entities representing the groups in G. Hierarchical scheduling is performed as follows: if the timestamps of a leaf entity (i.e., of a bfq_queue) change, and such a change lets the entity become the next-to-serve entity for its parent entity, then the timestamps of the parent entity are recomputed as a function of the budget of its new next-to-serve leaf entity. If the parent entity belongs, in its turn, to a group, and its new timestamps let it become the next-to-serve for its parent entity, then the timestamps of the latter parent entity are recomputed as well, and so on. When a new bfq_queue must be set in service, the reverse path is followed: the next-to-serve highest-level entity is chosen, then its next-to-serve child entity, and so on, until the next-to-serve leaf entity is reached, and the bfq_queue that this entity represents is set in service. Writeback is accounted for on a per-group basis, i.e., for each group, the async I/O requests of the processes of the group are enqueued in a distinct bfq_queue, and the entity associated with this queue is a child of the entity associated with the group. Weights can be assigned explicitly to groups and processes through the cgroups interface, differently from what happens, for single processes, if the cgroups interface is not used (as explained in the description of the previous patch). In particular, since each node has a full scheduler, each group can be assigned its own weight. Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: introduce the BFQ-v0 I/O scheduler as an extra schedulerPaolo Valente
We tag as v0 the version of BFQ containing only BFQ's engine plus hierarchical support. BFQ's engine is introduced by this commit, while hierarchical support is added by next commit. We use the v0 tag to distinguish this minimal version of BFQ from the versions containing also the features and the improvements added by next commits. BFQ-v0 coincides with the version of BFQ submitted a few years ago [1], apart from the introduction of preemption, described below. BFQ is a proportional-share I/O scheduler, whose general structure, plus a lot of code, are borrowed from CFQ. - Each process doing I/O on a device is associated with a weight and a (bfq_)queue. - BFQ grants exclusive access to the device, for a while, to one queue (process) at a time, and implements this service model by associating every queue with a budget, measured in number of sectors. - After a queue is granted access to the device, the budget of the queue is decremented, on each request dispatch, by the size of the request. - The in-service queue is expired, i.e., its service is suspended, only if one of the following events occurs: 1) the queue finishes its budget, 2) the queue empties, 3) a "budget timeout" fires. - The budget timeout prevents processes doing random I/O from holding the device for too long and dramatically reducing throughput. - Actually, as in CFQ, a queue associated with a process issuing sync requests may not be expired immediately when it empties. In contrast, BFQ may idle the device for a short time interval, giving the process the chance to go on being served if it issues a new request in time. Device idling typically boosts the throughput on rotational devices, if processes do synchronous and sequential I/O. In addition, under BFQ, device idling is also instrumental in guaranteeing the desired throughput fraction to processes issuing sync requests (see [2] for details). - With respect to idling for service guarantees, if several processes are competing for the device at the same time, but all processes (and groups, after the following commit) have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. Throughput is thus as high as possible in this common scenario. - Queues are scheduled according to a variant of WF2Q+, named B-WF2Q+, and implemented using an augmented rb-tree to preserve an O(log N) overall complexity. See [2] for more details. B-WF2Q+ is also ready for hierarchical scheduling. However, for a cleaner logical breakdown, the code that enables and completes hierarchical support is provided in the next commit, which focuses exactly on this feature. - B-WF2Q+ guarantees a tight deviation with respect to an ideal, perfectly fair, and smooth service. In particular, B-WF2Q+ guarantees that each queue receives a fraction of the device throughput proportional to its weight, even if the throughput fluctuates, and regardless of: the device parameters, the current workload and the budgets assigned to the queue. - The last, budget-independence, property (although probably counterintuitive in the first place) is definitely beneficial, for the following reasons: - First, with any proportional-share scheduler, the maximum deviation with respect to an ideal service is proportional to the maximum budget (slice) assigned to queues. As a consequence, BFQ can keep this deviation tight not only because of the accurate service of B-WF2Q+, but also because BFQ *does not* need to assign a larger budget to a queue to let the queue receive a higher fraction of the device throughput. - Second, BFQ is free to choose, for every process (queue), the budget that best fits the needs of the process, or best leverages the I/O pattern of the process. In particular, BFQ updates queue budgets with a simple feedback-loop algorithm that allows a high throughput to be achieved, while still providing tight latency guarantees to time-sensitive applications. When the in-service queue expires, this algorithm computes the next budget of the queue so as to: - Let large budgets be eventually assigned to the queues associated with I/O-bound applications performing sequential I/O: in fact, the longer these applications are served once got access to the device, the higher the throughput is. - Let small budgets be eventually assigned to the queues associated with time-sensitive applications (which typically perform sporadic and short I/O), because, the smaller the budget assigned to a queue waiting for service is, the sooner B-WF2Q+ will serve that queue (Subsec 3.3 in [2]). - Weights can be assigned to processes only indirectly, through I/O priorities, and according to the relation: weight = 10 * (IOPRIO_BE_NR - ioprio). The next patch provides, instead, a cgroups interface through which weights can be assigned explicitly. - If several processes are competing for the device at the same time, but all processes and groups have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. It uses preemption instead. Throughput is then much higher in this common scenario. - ioprio classes are served in strict priority order, i.e., lower-priority queues are not served as long as there are higher-priority queues. Among queues in the same class, the bandwidth is distributed in proportion to the weight of each queue. A very thin extra bandwidth is however guaranteed to the Idle class, to prevent it from starving. - If the strict_guarantees parameter is set (default: unset), then BFQ - always performs idling when the in-service queue becomes empty; - forces the device to serve one I/O request at a time, by dispatching a new request only if there is no outstanding request. In the presence of differentiated weights or I/O-request sizes, both the above conditions are needed to guarantee that every queue receives its allotted share of the bandwidth (see Documentation/block/bfq-iosched.txt for more details). Setting strict_guarantees may evidently affect throughput. [1] https://lkml.org/lkml/2008/4/1/234 https://lkml.org/lkml/2008/11/11/148 [2] P. Valente and M. Andreolini, "Improving Application Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR '12), June 2012. Slightly extended version: http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- results.pdf Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14blk-mq: introduce Kyber multiqueue I/O schedulerOmar Sandoval
The Kyber I/O scheduler is an I/O scheduler for fast devices designed to scale to multiple queues. Users configure only two knobs, the target read and synchronous write latencies, and the scheduler tunes itself to achieve that latency goal. The implementation is based on "tokens", built on top of the scalable bitmap library. Tokens serve as a mechanism for limiting requests. There are two tiers of tokens: queueing tokens and dispatch tokens. A queueing token is required to allocate a request. In fact, these tokens are actually the blk-mq internal scheduler tags, but the scheduler manages the allocation directly in order to implement its policy. Dispatch tokens are device-wide and split up into two scheduling domains: reads vs. writes. Each hardware queue dispatches batches round-robin between the scheduling domains as long as tokens are available for that domain. These tokens can be used as the mechanism to enable various policies. The policy Kyber uses is inspired by active queue management techniques for network routing, similar to blk-wbt. The scheduler monitors latencies and scales the number of dispatch tokens accordingly. Queueing tokens are used to prevent starvation of synchronous requests by asynchronous requests. Various extensions are possible, including better heuristics and ionice support. The new scheduler isn't set as the default yet. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08block: remove the discard_zeroes_data flagChristoph Hellwig
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28blk-throttle: make throtl_slice tunableShaohua Li
throtl_slice is important for blk-throttling. It's called slice internally but it really is a time window blk-throttling samples data. blk-throttling will make decision based on the samplings. An example is bandwidth measurement. A cgroup's bandwidth is measured in the time interval of throtl_slice. A small throtl_slice meanse cgroups have smoother throughput but burn more CPUs. It has 100ms default value, which is not appropriate for all disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes it tunable. Since throtl_slice isn't a time slice, the sysfs name 'throttle_sample_time' reflects its character better. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22Merge tag 'docs-4.11' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "A slightly quieter cycle for documentation this time around. Three more DocBook template files have been converted to RST; only 21 to go. There are various build improvements and the usual array of documentation improvements and fixes" * tag 'docs-4.11' of git://git.lwn.net/linux: (44 commits) docs / driver-api: Fix structure references in device_link.rst PM / docs: Fix structure references in device.rst Add a target to check broken external links in the Documentation Documentation: Fix linux-api list typo Documentation: DocBook/Makefile comment typo Improve sparse documentation Documentation: make Makefile.sphinx no-ops quieter Documentation: DMA-ISA-LPC.txt Documentation: input: fix path to input code definitions docs: Remove the copyright year from conf.py docs: Fix a warning in the Korean HOWTO.rst translation PM / sleep / docs: Convert PM notifiers document to reST PM / core / docs: Convert sleep states API document to reST PM / core: Update kerneldoc comments in pm.h doc-rst: Fix recursive make invocation from macros doc-rst: Delete output of failed dot-SVG conversion doc-rst: Break shell command sequences on failure Documentation/sphinx: make targets independent of Sphinx work for HAVE_SPHINX=0 doc-rst: fixed cleandoc target when used with O=dir Documentation/sphinx: prevent generation of .pyc files in the source tree ...
2017-01-26Doc: Fix double words in DocumentationMasanari Iida
This patch fix some double words found in Documentation. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-01-03block: fix up io_poll documentationJeff Moyer
/sys/block/<dev>/queue/io_poll is a boolean. Fix the docs. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-28blk-wbt: allow reset of default latency through sysfsJens Axboe
Allow a write of '-1' to reset the default latency target for a given device. This removes knowledge of the different default settings for rotational vs non-rotational from user space. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17block: document the 'io_poll_delay' queue sysfs fileJens Axboe
This was documented in the original commit, 64f1c21e86f7, but it never made it into the proper location for queue sysfs files. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16null_blk: add usage hints for NVMYasuaki Ishimatsu
If CONFIG_NVM is disabled, loading null_block module with use_lightnvm=1 fails. But there are no messages and documents related to the failure. Add the appropriate error message. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Massaged the text a bit. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10block: hook up writeback throttlingJens Axboe
Enable throttling of buffered writeback to make it a lot more smooth, and has way less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at the time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it. The algorithm for when to throttle takes its inspiration in the CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply increment the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior. Unlike CoDel, blk-wb allows the scale count to to negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-bw quickly snaps back to it's stable state of a zero scale count. The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency target to me met. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch this setting. We don't enable WBT on devices that are managed with CFQ, and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01block: replace REQ_NOIDLE with REQ_IDLEChristoph Hellwig
Noidle should be the default for writes as seen by all the compounds definitions in fs.h using it. In fact only direct I/O really should be using NODILE, so turn the whole flag around to get the defaults right, which will make our life much easier especially onces the WRITE_* defines go away. This assumes all the existing "raw" users of REQ_SYNC for writes want noidle behavior, which seems to be spot on from a quick audit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28block: better op and flags encodingChristoph Hellwig
Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28block: split out request-only flags into a new namespaceChristoph Hellwig
A lot of the REQ_* flags are only used on struct requests, and only of use to the block layer and a few drivers that dig into struct request internals. This patch adds a new req_flags_t rq_flags field to struct request for them, and thus dramatically shrinks the number of common requests. It also removes the unfortunate situation where we have to fit the fields from the same enum into 32 bits for struct bio and 64 bits for struct request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14block: remove remnant refs to hardsectLinus Walleij
commit e1defc4ff0cf57aca6c5e3ff99fa503f5943c1f1 "block: Do away with the notion of hardsect_size" removed the notion of "hardware sector size" from the kernel in favor of logical block size, but references remain in comments and documentation. Update the remaining sites mentioning hardsect. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-11doc: update block/queue-sysfs.txt entriesJoe Lawrence
Add descriptions for dax, io_poll, and write_same_max_bytes files. Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07block: rename bio bi_rw to bi_opfJens Axboe
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-28Merge branch 'work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted cleanups and fixes. Probably the most interesting part long-term is ->d_init() - that will have a bunch of followups in (at least) ceph and lustre, but we'll need to sort the barrier-related rules before it can get used for really non-trivial stuff. Another fun thing is the merge of ->d_iput() callers (dentry_iput() and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all except the one in __d_lookup_lru())" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) fs/dcache.c: avoid soft-lockup in dput() vfs: new d_init method vfs: Update lookup_dcache() comment bdev: get rid of ->bd_inodes Remove last traces of ->sync_page new helper: d_same_name() dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends() vfs: clean up documentation vfs: document ->d_real() vfs: merge .d_select_inode() into .d_real() unify dentry_iput() and dentry_unlink_inode() binfmt_misc: ->s_root is not going anywhere drop redundant ->owner initializations ufs: get rid of redundant checks orangefs: constify inode_operations missed comment updates from ->direct_IO() prototype change file_inode(f)->i_mapping is f->f_mapping trim fsnotify hooks a bit 9p: new helper - v9fs_parent_fid() debugfs: ->d_parent is never NULL or negative ...