From 68017e5d87a2477d40476f1a0a06f202ee79316b Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@linaro.org>
Date: Mon, 13 Nov 2017 07:34:07 +0100
Subject: doc, block, bfq: update max IOPS sustainable with BFQ

We have investigated more deeply the performance of BFQ, in terms of
number of IOPS that can be processed by the CPU when BFQ is used as
I/O scheduler. In more detail, using the script [1], we have measured
the number of IOPS reached on top of a null block device configured
with zero latency, as a function of the workload (sequential read,
sequential write, random read, random write) and of the system (we
considered desktops, laptops and embedded systems).

Basing on the resulting figures, with this commit we update the
current, conservative IOPS range reported in BFQ documentation. In
particular, the documentation now reports, for each of three different
systems, the lowest number of IOPS obtained for that system with the
above test (namely, the value obtained with the workload leading to
the lowest IOPS).

[1] https://github.com/Algodev-github/IOSpeed

Reviewed-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/block/bfq-iosched.txt | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 3d6951d63489..7a9361508157 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -20,12 +20,17 @@ for that device, by setting low_latency to 0. See Section 3 for
 details on how to configure BFQ for the desired tradeoff between
 latency and throughput, or on how to maximize throughput.
 
-On average CPUs, the current version of BFQ can handle devices
-performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
-reference, 30-50 KIOPS correspond to very high bandwidths with
-sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
-to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on
-multi-queue devices too.
+BFQ has a non-null overhead, which limits the maximum IOPS that the
+CPU can process for a device scheduled with BFQ. To give an idea of
+the limits on slow or average CPUs, here are BFQ limits for three
+different CPUs, on, respectively, an average laptop, an old desktop,
+and a cheap embedded system, in case full hierarchical support is
+enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set):
+- Intel i7-4850HQ: 250 KIOPS
+- AMD A8-3850: 170 KIOPS
+- ARM CortexTM-A53 Octa-core: 45 KIOPS
+
+BFQ works for multi-queue devices too.
 
 The table of contents follow. Impatients can just jump to Section 3.
 
-- 
cgit 


From 24bfd19bb7890255693ee5cb6dc100d8d215d00b Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@linaro.org>
Date: Mon, 13 Nov 2017 07:34:09 +0100
Subject: block, bfq: update blkio stats outside the scheduler lock

bfq invokes various blkg_*stats_* functions to update the statistics
contained in the special files blkio.bfq.* in the blkio controller
groups, i.e., the I/O accounting related to the proportional-share
policy provided by bfq. The execution of these functions takes a
considerable percentage, about 40%, of the total per-request execution
time of bfq (i.e., of the sum of the execution time of all the bfq
functions that have to be executed to process an I/O request from its
creation to its destruction).  This reduces the request-processing
rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
the bfq functions that invoke blkg_*stats_* functions cannot be
executed in parallel with the rest of the code of bfq, because both
are executed under the same same per-device scheduler lock.

To reduce this slowdown, this commit moves, wherever possible, the
invocation of these functions (more precisely, of the bfq functions
that invoke blkg_*stats_* functions) outside the critical sections
protected by the scheduler lock.

With this change, and with all blkio.bfq.* statistics enabled, the
throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
i7-4850HQ, in case of 8 threads doing random I/O in parallel on
null_blk, with the latter configured with 0 latency. We obtained the
same or higher throughput boosts, up to +30%, with other processors
(some figures are reported in the documentation). For our tests, we
used the script [1], with which our results can be easily reproduced.

NOTE. This commit still protects the invocation of blkg_*stats_*
functions with the request_queue lock, because the group these
functions are invoked on may otherwise disappear before or while these
functions are executed.  Fortunately, tests without even this lock
show, by difference, that the serialization caused by this lock has a
little impact (at most ~5% of throughput reduction).

[1] https://github.com/Algodev-github/IOSpeed

Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/block/bfq-iosched.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 7a9361508157..7fad6c061470 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -26,9 +26,9 @@ the limits on slow or average CPUs, here are BFQ limits for three
 different CPUs, on, respectively, an average laptop, an old desktop,
 and a cheap embedded system, in case full hierarchical support is
 enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set):
-- Intel i7-4850HQ: 250 KIOPS
-- AMD A8-3850: 170 KIOPS
-- ARM CortexTM-A53 Octa-core: 45 KIOPS
+- Intel i7-4850HQ: 310 KIOPS
+- AMD A8-3850: 200 KIOPS
+- ARM CortexTM-A53 Octa-core: 56 KIOPS
 
 BFQ works for multi-queue devices too.
 
-- 
cgit 


From a33801e8b4735b8d473f963e5854172f9cde3e8b Mon Sep 17 00:00:00 2001
From: Luca Miccio <lucmiccio@gmail.com>
Date: Mon, 13 Nov 2017 07:34:10 +0100
Subject: block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP

BFQ currently creates, and updates, its own instance of the whole
set of blkio statistics that cfq creates. Yet, from the comments
of Tejun Heo in [1], it turned out that most of these statistics
are meant/useful only for debugging. This commit makes BFQ create
the latter, debugging statistics only if the option
CONFIG_DEBUG_BLK_CGROUP is set.

By doing so, this commit also enables BFQ to enjoy a high perfomance
boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
BFQ has to update far fewer statistics, and, in particular, not the
heaviest to update.  To give an idea of the benefits, if
CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
with 8 threads doing random I/O in parallel on null_blk (configured
with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
(+30%). We have measured similar or even much higher boosts with other
CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
been obtained and can be reproduced very easily with the script in [1].

[1] https://www.spinics.net/lists/linux-block/msg18943.html

Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/block/bfq-iosched.txt | 38 +++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 7fad6c061470..8d8d8f06cab2 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -20,12 +20,22 @@ for that device, by setting low_latency to 0. See Section 3 for
 details on how to configure BFQ for the desired tradeoff between
 latency and throughput, or on how to maximize throughput.
 
-BFQ has a non-null overhead, which limits the maximum IOPS that the
-CPU can process for a device scheduled with BFQ. To give an idea of
-the limits on slow or average CPUs, here are BFQ limits for three
-different CPUs, on, respectively, an average laptop, an old desktop,
-and a cheap embedded system, in case full hierarchical support is
-enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set):
+BFQ has a non-null overhead, which limits the maximum IOPS that a CPU
+can process for a device scheduled with BFQ. To give an idea of the
+limits on slow or average CPUs, here are, first, the limits of BFQ for
+three different CPUs, on, respectively, an average laptop, an old
+desktop, and a cheap embedded system, in case full hierarchical
+support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but
+CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2):
+- Intel i7-4850HQ: 400 KIOPS
+- AMD A8-3850: 250 KIOPS
+- ARM CortexTM-A53 Octa-core: 80 KIOPS
+
+If CONFIG_DEBUG_BLK_CGROUP is set (and of course full hierarchical
+support is enabled), then the sustainable throughput with BFQ
+decreases, because all blkio.bfq* statistics are created and updated
+(Section 4-2). For BFQ, this leads to the following maximum
+sustainable throughputs, on the same systems as above:
 - Intel i7-4850HQ: 310 KIOPS
 - AMD A8-3850: 200 KIOPS
 - ARM CortexTM-A53 Octa-core: 56 KIOPS
@@ -505,6 +515,22 @@ BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
 parameter to set the weight of a group with BFQ is blkio.bfq.weight
 or io.bfq.weight.
 
+As for cgroups-v1 (blkio controller), the exact set of stat files
+created, and kept up-to-date by bfq, depends on whether
+CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
+the stat files documented in
+Documentation/cgroup-v1/blkio-controller.txt. If, instead,
+CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
+blkio.bfq.io_service_bytes
+blkio.bfq.io_service_bytes_recursive
+blkio.bfq.io_serviced
+blkio.bfq.io_serviced_recursive
+
+The value of CONFIG_DEBUG_BLK_CGROUP greatly influences the maximum
+throughput sustainable with bfq, because updating the blkio.bfq.*
+stats is rather costly, especially for some of the stats enabled by
+CONFIG_DEBUG_BLK_CGROUP.
+
 Parameters to set
 -----------------
 
-- 
cgit