From 778c02a236a8728bb992de10ed1f12c0be5b7b0e Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@linaro.org>
Date: Tue, 12 Mar 2019 09:59:27 +0100
Subject: block, bfq: increase idling for weight-raised queues
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If a sync bfq_queue has a higher weight than some other queue, and
remains temporarily empty while in service, then, to preserve the
bandwidth share of the queue, it is necessary to plug I/O dispatching
until a new request arrives for the queue. In addition, a timeout
needs to be set, to avoid waiting for ever if the process associated
with the queue has actually finished its I/O.

Even with the above timeout, the device is however not fed with new
I/O for a while, if the process has finished its I/O. If this happens
often, then throughput drops and latencies grow. For this reason, the
timeout is kept rather low: 8 ms is the current default.

Unfortunately, such a low value may cause, on the opposite end, a
violation of bandwidth guarantees for a process that happens to issue
new I/O too late. The higher the system load, the higher the
probability that this happens to some process. This is a problem in
scenarios where service guarantees matter more than throughput. One
important case are weight-raised queues, which need to be granted a
very high fraction of the bandwidth.

To address this issue, this commit lower-bounds the plugging timeout
for weight-raised queues to 20 ms. This simple change provides
relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
gnome-terminal starts in 0.6 seconds if there is no other I/O in
progress, the same applications starts in
- 0.8 seconds, instead of 1.2 seconds, if ten files are being read
  sequentially in parallel
- 1 second, instead of 2 seconds, if, in parallel, five files are
  being read sequentially, and five more files are being written
  sequentially

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'block')

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index fac188dd78fa..f30d1cb887d4 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2545,6 +2545,8 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
 	    bfq_symmetric_scenario(bfqd))
 		sl = min_t(u64, sl, BFQ_MIN_TT);
+	else if (bfqq->wr_coeff > 1)
+		sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
 
 	bfqd->last_idling_start = ktime_get();
 	hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
-- 
cgit 


From fb53ac6cd0269987b1b77f957db453b3ec7bf7e4 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@linaro.org>
Date: Tue, 12 Mar 2019 09:59:28 +0100
Subject: block, bfq: do not idle for lowest-weight queues
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

In most cases, it is detrimental for throughput to plug I/O dispatch
when the in-service bfq_queue becomes temporarily empty (plugging is
performed to wait for the possible arrival, soon, of new I/O from the
in-service queue). There is however a case where plugging is needed
for service guarantees. If a bfq_queue, say Q, has a higher weight
than some other active bfq_queue, and is sync, i.e., contains sync
I/O, then, to guarantee that Q does receive a higher share of the
throughput than other lower-weight queues, it is necessary to plug I/O
dispatch when Q remains temporarily empty while being served.

For this reason, BFQ performs I/O plugging when some active bfq_queue
has a higher weight than some other active bfq_queue. But this is
unnecessarily overkill. In fact, if the in-service bfq_queue actually
has a weight lower than or equal to the other queues, then the queue
*must not* be guaranteed a higher share of the throughput than the
other queues. So, not plugging I/O cannot cause any harm to the
queue. And can boost throughput.

Taking advantage of this fact, this commit does not plug I/O for sync
bfq_queues with a weight lower than or equal to the weights of the
other queues. Here is an example of the resulting throughput boost
with the dbench workload, which is particularly nasty for BFQ. With
the dbench test in the Phoronix suite, BFQ reaches its lowest total
throughput with 6 clients on a filesystem with journaling, in case the
journaling daemon has a higher weight than normal processes. Before
this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
after this commit it is ~100 MB/sec.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 204 +++++++++++++++++++++++++++++-----------------------
 block/bfq-iosched.h |   6 +-
 block/bfq-wf2q.c    |   2 +-
 3 files changed, 118 insertions(+), 94 deletions(-)

(limited to 'block')

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index f30d1cb887d4..2eb587fe7c1a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -629,12 +629,19 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 }
 
 /*
- * The following function returns true if every queue must receive the
- * same share of the throughput (this condition is used when deciding
- * whether idling may be disabled, see the comments in the function
- * bfq_better_to_idle()).
+ * The following function returns false either if every active queue
+ * must receive the same share of the throughput (symmetric scenario),
+ * or, as a special case, if bfqq must receive a share of the
+ * throughput lower than or equal to the share that every other active
+ * queue must receive.  If bfqq does sync I/O, then these are the only
+ * two cases where bfqq happens to be guaranteed its share of the
+ * throughput even if I/O dispatching is not plugged when bfqq remains
+ * temporarily empty (for more details, see the comments in the
+ * function bfq_better_to_idle()). For this reason, the return value
+ * of this function is used to check whether I/O-dispatch plugging can
+ * be avoided.
  *
- * Such a scenario occurs when:
+ * The above first case (symmetric scenario) occurs when:
  * 1) all active queues have the same weight,
  * 2) all active queues belong to the same I/O-priority class,
  * 3) all active groups at the same level in the groups tree have the same
@@ -654,30 +661,36 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  * support or the cgroups interface are not enabled, thus no state
  * needs to be maintained in this case.
  */
-static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
+static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq)
 {
+	bool smallest_weight = bfqq &&
+		bfqq->weight_counter &&
+		bfqq->weight_counter ==
+		container_of(
+			rb_first_cached(&bfqd->queue_weights_tree),
+			struct bfq_weight_counter,
+			weights_node);
+
 	/*
 	 * For queue weights to differ, queue_weights_tree must contain
 	 * at least two nodes.
 	 */
-	bool varied_queue_weights = !RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
-		(bfqd->queue_weights_tree.rb_node->rb_left ||
-		 bfqd->queue_weights_tree.rb_node->rb_right);
+	bool varied_queue_weights = !smallest_weight &&
+		!RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
+		(bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
+		 bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
 
 	bool multiple_classes_busy =
 		(bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
 		(bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
 		(bfqd->busy_queues[1] && bfqd->busy_queues[2]);
 
-	/*
-	 * For queue weights to differ, queue_weights_tree must contain
-	 * at least two nodes.
-	 */
-	return !(varied_queue_weights || multiple_classes_busy
+	return varied_queue_weights || multiple_classes_busy
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	       || bfqd->num_groups_with_pending_reqs > 0
 #endif
-		);
+		;
 }
 
 /*
@@ -694,10 +707,11 @@ static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
  * should be low too.
  */
 void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-			  struct rb_root *root)
+			  struct rb_root_cached *root)
 {
 	struct bfq_entity *entity = &bfqq->entity;
-	struct rb_node **new = &(root->rb_node), *parent = NULL;
+	struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL;
+	bool leftmost = true;
 
 	/*
 	 * Do not insert if the queue is already associated with a
@@ -726,8 +740,10 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 		}
 		if (entity->weight < __counter->weight)
 			new = &((*new)->rb_left);
-		else
+		else {
 			new = &((*new)->rb_right);
+			leftmost = false;
+		}
 	}
 
 	bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
@@ -736,7 +752,7 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	/*
 	 * In the unlucky event of an allocation failure, we just
 	 * exit. This will cause the weight of queue to not be
-	 * considered in bfq_symmetric_scenario, which, in its turn,
+	 * considered in bfq_asymmetric_scenario, which, in its turn,
 	 * causes the scenario to be deemed wrongly symmetric in case
 	 * bfqq's weight would have been the only weight making the
 	 * scenario asymmetric.  On the bright side, no unbalance will
@@ -750,7 +766,8 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfqq->weight_counter->weight = entity->weight;
 	rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
-	rb_insert_color(&bfqq->weight_counter->weights_node, root);
+	rb_insert_color_cached(&bfqq->weight_counter->weights_node, root,
+				leftmost);
 
 inc_counter:
 	bfqq->weight_counter->num_active++;
@@ -765,7 +782,7 @@ inc_counter:
  */
 void __bfq_weights_tree_remove(struct bfq_data *bfqd,
 			       struct bfq_queue *bfqq,
-			       struct rb_root *root)
+			       struct rb_root_cached *root)
 {
 	if (!bfqq->weight_counter)
 		return;
@@ -774,7 +791,7 @@ void __bfq_weights_tree_remove(struct bfq_data *bfqd,
 	if (bfqq->weight_counter->num_active > 0)
 		goto reset_entity_pointer;
 
-	rb_erase(&bfqq->weight_counter->weights_node, root);
+	rb_erase_cached(&bfqq->weight_counter->weights_node, root);
 	kfree(bfqq->weight_counter);
 
 reset_entity_pointer:
@@ -889,7 +906,7 @@ static unsigned long bfq_serv_to_charge(struct request *rq,
 					struct bfq_queue *bfqq)
 {
 	if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 ||
-	    !bfq_symmetric_scenario(bfqq->bfqd))
+	    bfq_asymmetric_scenario(bfqq->bfqd, bfqq))
 		return blk_rq_sectors(rq);
 
 	return blk_rq_sectors(rq) * bfq_async_charge_factor;
@@ -2543,7 +2560,7 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	 * queue).
 	 */
 	if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
-	    bfq_symmetric_scenario(bfqd))
+	    !bfq_asymmetric_scenario(bfqd, bfqq))
 		sl = min_t(u64, sl, BFQ_MIN_TT);
 	else if (bfqq->wr_coeff > 1)
 		sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
@@ -3500,8 +3517,9 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
 }
 
 /*
- * There is a case where idling must be performed not for
- * throughput concerns, but to preserve service guarantees.
+ * There is a case where idling does not have to be performed for
+ * throughput concerns, but to preserve the throughput share of
+ * the process associated with bfqq.
  *
  * To introduce this case, we can note that allowing the drive
  * to enqueue more than one request at a time, and hence
@@ -3517,77 +3535,83 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
  * concern about per-process throughput distribution, and
  * makes its decisions only on a per-request basis. Therefore,
  * the service distribution enforced by the drive's internal
- * scheduler is likely to coincide with the desired
- * device-throughput distribution only in a completely
- * symmetric scenario where:
- * (i)  each of these processes must get the same throughput as
- *      the others;
- * (ii) the I/O of each process has the same properties, in
- *      terms of locality (sequential or random), direction
- *      (reads or writes), request sizes, greediness
- *      (from I/O-bound to sporadic), and so on.
- * In fact, in such a scenario, the drive tends to treat
- * the requests of each of these processes in about the same
- * way as the requests of the others, and thus to provide
- * each of these processes with about the same throughput
- * (which is exactly the desired throughput distribution). In
- * contrast, in any asymmetric scenario, device idling is
- * certainly needed to guarantee that bfqq receives its
- * assigned fraction of the device throughput (see [1] for
- * details).
- * The problem is that idling may significantly reduce
- * throughput with certain combinations of types of I/O and
- * devices. An important example is sync random I/O, on flash
- * storage with command queueing. So, unless bfqq falls in the
- * above cases where idling also boosts throughput, it would
- * be important to check conditions (i) and (ii) accurately,
- * so as to avoid idling when not strictly needed for service
- * guarantees.
+ * scheduler is likely to coincide with the desired throughput
+ * distribution only in a completely symmetric, or favorably
+ * skewed scenario where:
+ * (i-a) each of these processes must get the same throughput as
+ *	 the others,
+ * (i-b) in case (i-a) does not hold, it holds that the process
+ *       associated with bfqq must receive a lower or equal
+ *	 throughput than any of the other processes;
+ * (ii)  the I/O of each process has the same properties, in
+ *       terms of locality (sequential or random), direction
+ *       (reads or writes), request sizes, greediness
+ *       (from I/O-bound to sporadic), and so on;
+
+ * In fact, in such a scenario, the drive tends to treat the requests
+ * of each process in about the same way as the requests of the
+ * others, and thus to provide each of these processes with about the
+ * same throughput.  This is exactly the desired throughput
+ * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
+ * even more convenient distribution for (the process associated with)
+ * bfqq.
+ *
+ * In contrast, in any asymmetric or unfavorable scenario, device
+ * idling (I/O-dispatch plugging) is certainly needed to guarantee
+ * that bfqq receives its assigned fraction of the device throughput
+ * (see [1] for details).
+ *
+ * The problem is that idling may significantly reduce throughput with
+ * certain combinations of types of I/O and devices. An important
+ * example is sync random I/O on flash storage with command
+ * queueing. So, unless bfqq falls in cases where idling also boosts
+ * throughput, it is important to check conditions (i-a), i(-b) and
+ * (ii) accurately, so as to avoid idling when not strictly needed for
+ * service guarantees.
  *
- * Unfortunately, it is extremely difficult to thoroughly
- * check condition (ii). And, in case there are active groups,
- * it becomes very difficult to check condition (i) too. In
- * fact, if there are active groups, then, for condition (i)
- * to become false, it is enough that an active group contains
- * more active processes or sub-groups than some other active
- * group. More precisely, for condition (i) to hold because of
- * such a group, it is not even necessary that the group is
- * (still) active: it is sufficient that, even if the group
- * has become inactive, some of its descendant processes still
- * have some request already dispatched but still waiting for
- * completion. In fact, requests have still to be guaranteed
- * their share of the throughput even after being
- * dispatched. In this respect, it is easy to show that, if a
- * group frequently becomes inactive while still having
- * in-flight requests, and if, when this happens, the group is
- * not considered in the calculation of whether the scenario
- * is asymmetric, then the group may fail to be guaranteed its
- * fair share of the throughput (basically because idling may
- * not be performed for the descendant processes of the group,
- * but it had to be).  We address this issue with the
- * following bi-modal behavior, implemented in the function
- * bfq_symmetric_scenario().
+ * Unfortunately, it is extremely difficult to thoroughly check
+ * condition (ii). And, in case there are active groups, it becomes
+ * very difficult to check conditions (i-a) and (i-b) too.  In fact,
+ * if there are active groups, then, for conditions (i-a) or (i-b) to
+ * become false 'indirectly', it is enough that an active group
+ * contains more active processes or sub-groups than some other active
+ * group. More precisely, for conditions (i-a) or (i-b) to become
+ * false because of such a group, it is not even necessary that the
+ * group is (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still have
+ * some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed their
+ * share of the throughput even after being dispatched. In this
+ * respect, it is easy to show that, if a group frequently becomes
+ * inactive while still having in-flight requests, and if, when this
+ * happens, the group is not considered in the calculation of whether
+ * the scenario is asymmetric, then the group may fail to be
+ * guaranteed its fair share of the throughput (basically because
+ * idling may not be performed for the descendant processes of the
+ * group, but it had to be).  We address this issue with the following
+ * bi-modal behavior, implemented in the function
+ * bfq_asymmetric_scenario().
  *
  * If there are groups with requests waiting for completion
  * (as commented above, some of these groups may even be
  * already inactive), then the scenario is tagged as
  * asymmetric, conservatively, without checking any of the
- * conditions (i) and (ii). So the device is idled for bfqq.
+ * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
  * This behavior matches also the fact that groups are created
  * exactly if controlling I/O is a primary concern (to
  * preserve bandwidth and latency guarantees).
  *
- * On the opposite end, if there are no groups with requests
- * waiting for completion, then only condition (i) is actually
- * controlled, i.e., provided that condition (i) holds, idling
- * is not performed, regardless of whether condition (ii)
- * holds. In other words, only if condition (i) does not hold,
- * then idling is allowed, and the device tends to be
- * prevented from queueing many requests, possibly of several
- * processes. Since there are no groups with requests waiting
- * for completion, then, to control condition (i) it is enough
- * to check just whether all the queues with requests waiting
- * for completion also have the same weight.
+ * On the opposite end, if there are no groups with requests waiting
+ * for completion, then only conditions (i-a) and (i-b) are actually
+ * controlled, i.e., provided that conditions (i-a) or (i-b) holds,
+ * idling is not performed, regardless of whether condition (ii)
+ * holds.  In other words, only if conditions (i-a) and (i-b) do not
+ * hold, then idling is allowed, and the device tends to be prevented
+ * from queueing many requests, possibly of several processes. Since
+ * there are no groups with requests waiting for completion, then, to
+ * control conditions (i-a) and (i-b) it is enough to check just
+ * whether all the queues with requests waiting for completion also
+ * have the same weight.
  *
  * Not checking condition (ii) evidently exposes bfqq to the
  * risk of getting less throughput than its fair share.
@@ -3639,7 +3663,7 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
  * compound condition that is checked below for deciding
  * whether the scenario is asymmetric. To explain this
  * compound condition, we need to add that the function
- * bfq_symmetric_scenario checks the weights of only
+ * bfq_asymmetric_scenario checks the weights of only
  * non-weight-raised queues, for efficiency reasons (see
  * comments on bfq_weights_tree_add()). Then the fact that
  * bfqq is weight-raised is checked explicitly here. More
@@ -3667,7 +3691,7 @@ static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
 	return (bfqq->wr_coeff > 1 &&
 		bfqd->wr_busy_queues <
 		bfq_tot_busy_queues(bfqd)) ||
-		!bfq_symmetric_scenario(bfqd);
+		bfq_asymmetric_scenario(bfqd, bfqq);
 }
 
 /*
@@ -5505,7 +5529,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 		     HRTIMER_MODE_REL);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 
-	bfqd->queue_weights_tree = RB_ROOT;
+	bfqd->queue_weights_tree = RB_ROOT_CACHED;
 	bfqd->num_groups_with_pending_reqs = 0;
 
 	INIT_LIST_HEAD(&bfqd->active_list);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 062e1c4787f4..81cabf51a87e 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -450,7 +450,7 @@ struct bfq_data {
 	 * weight-raised @bfq_queue (see the comments to the functions
 	 * bfq_weights_tree_[add|remove] for further details).
 	 */
-	struct rb_root queue_weights_tree;
+	struct rb_root_cached queue_weights_tree;
 
 	/*
 	 * Number of groups with at least one descendant process that
@@ -898,10 +898,10 @@ void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
 struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-			  struct rb_root *root);
+			  struct rb_root_cached *root);
 void __bfq_weights_tree_remove(struct bfq_data *bfqd,
 			       struct bfq_queue *bfqq,
-			       struct rb_root *root);
+			       struct rb_root_cached *root);
 void bfq_weights_tree_remove(struct bfq_data *bfqd,
 			     struct bfq_queue *bfqq);
 void bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index a11bef75483d..51ef1f00df80 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -737,7 +737,7 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 		unsigned int prev_weight, new_weight;
 		struct bfq_data *bfqd = NULL;
-		struct rb_root *root;
+		struct rb_root_cached *root;
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 		struct bfq_sched_data *sd;
 		struct bfq_group *bfqg;
-- 
cgit 


From 2341d662e9a2a5751ff8ac4ffa640fb493b0ee84 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@linaro.org>
Date: Tue, 12 Mar 2019 09:59:29 +0100
Subject: block, bfq: tune service injection basing on request service times
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The processes associated with a bfq_queue, say Q, may happen to
generate their cumulative I/O at a lower rate than the rate at which
the device could serve the same I/O. This is rather probable, e.g., if
only one process is associated with Q and the device is an SSD. It
results in Q becoming often empty while in service. If BFQ is not
allowed to switch to another queue when Q becomes empty, then, during
the service of Q, there will be frequent "service holes", i.e., time
intervals during which Q gets empty and the device can only consume
the I/O already queued in its hardware queues. This easily causes
considerable losses of throughput.

To counter this problem, BFQ implements a request injection mechanism,
which tries to fill the above service holes with I/O requests taken
from other bfq_queues. The hard part in this mechanism is finding the
right amount of I/O to inject, so as to both boost throughput and not
break Q's bandwidth and latency guarantees. To this goal, the current
version of this mechanism measures the bandwidth enjoyed by Q while it
is being served, and tries to inject the maximum possible amount of
extra service that does not cause Q's bandwidth to decrease too
much.

This solution has an important shortcoming. For bandwidth measurements
to be stable and reliable, Q must remain in service for a much longer
time than that needed to serve a single I/O request. Unfortunately,
this does not hold with many workloads. This commit addresses this
issue by changing the way the amount of injection allowed is
dynamically computed. It tunes injection as a function of the service
times of single I/O requests of Q, instead of Q's
bandwidth. Single-request service times are evidently meaningful even
if Q gets very few I/O requests completed while it is in service.

As a testbed for this new solution, we measured the throughput reached
by BFQ for one of the nastiest workloads and configurations for this
scheduler: the workload generated by the dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes.
With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
a PLEXTOR PX-256M5.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 417 +++++++++++++++++++++++++++++++++++++++++++++++-----
 block/bfq-iosched.h |  51 ++++---
 2 files changed, 409 insertions(+), 59 deletions(-)

(limited to 'block')

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2eb587fe7c1a..f59efee7a601 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1721,6 +1721,123 @@ static void bfq_add_request(struct request *rq)
 	bfqq->queued[rq_is_sync(rq)]++;
 	bfqd->queued++;
 
+	if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) {
+		/*
+		 * Periodically reset inject limit, to make sure that
+		 * the latter eventually drops in case workload
+		 * changes, see step (3) in the comments on
+		 * bfq_update_inject_limit().
+		 */
+		if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+					     msecs_to_jiffies(1000))) {
+			/* invalidate baseline total service time */
+			bfqq->last_serv_time_ns = 0;
+
+			/*
+			 * Reset pointer in case we are waiting for
+			 * some request completion.
+			 */
+			bfqd->waited_rq = NULL;
+
+			/*
+			 * If bfqq has a short think time, then start
+			 * by setting the inject limit to 0
+			 * prudentially, because the service time of
+			 * an injected I/O request may be higher than
+			 * the think time of bfqq, and therefore, if
+			 * one request was injected when bfqq remains
+			 * empty, this injected request might delay
+			 * the service of the next I/O request for
+			 * bfqq significantly. In case bfqq can
+			 * actually tolerate some injection, then the
+			 * adaptive update will however raise the
+			 * limit soon. This lucky circumstance holds
+			 * exactly because bfqq has a short think
+			 * time, and thus, after remaining empty, is
+			 * likely to get new I/O enqueued---and then
+			 * completed---before being expired. This is
+			 * the very pattern that gives the
+			 * limit-update algorithm the chance to
+			 * measure the effect of injection on request
+			 * service times, and then to update the limit
+			 * accordingly.
+			 *
+			 * On the opposite end, if bfqq has a long
+			 * think time, then start directly by 1,
+			 * because:
+			 * a) on the bright side, keeping at most one
+			 * request in service in the drive is unlikely
+			 * to cause any harm to the latency of bfqq's
+			 * requests, as the service time of a single
+			 * request is likely to be lower than the
+			 * think time of bfqq;
+			 * b) on the downside, after becoming empty,
+			 * bfqq is likely to expire before getting its
+			 * next request. With this request arrival
+			 * pattern, it is very hard to sample total
+			 * service times and update the inject limit
+			 * accordingly (see comments on
+			 * bfq_update_inject_limit()). So the limit is
+			 * likely to be never, or at least seldom,
+			 * updated.  As a consequence, by setting the
+			 * limit to 1, we avoid that no injection ever
+			 * occurs with bfqq. On the downside, this
+			 * proactive step further reduces chances to
+			 * actually compute the baseline total service
+			 * time. Thus it reduces chances to execute the
+			 * limit-update algorithm and possibly raise the
+			 * limit to more than 1.
+			 */
+			if (bfq_bfqq_has_short_ttime(bfqq))
+				bfqq->inject_limit = 0;
+			else
+				bfqq->inject_limit = 1;
+			bfqq->decrease_time_jif = jiffies;
+		}
+
+		/*
+		 * The following conditions must hold to setup a new
+		 * sampling of total service time, and then a new
+		 * update of the inject limit:
+		 * - bfqq is in service, because the total service
+		 *   time is evaluated only for the I/O requests of
+		 *   the queues in service;
+		 * - this is the right occasion to compute or to
+		 *   lower the baseline total service time, because
+		 *   there are actually no requests in the drive,
+		 *   or
+		 *   the baseline total service time is available, and
+		 *   this is the right occasion to compute the other
+		 *   quantity needed to update the inject limit, i.e.,
+		 *   the total service time caused by the amount of
+		 *   injection allowed by the current value of the
+		 *   limit. It is the right occasion because injection
+		 *   has actually been performed during the service
+		 *   hole, and there are still in-flight requests,
+		 *   which are very likely to be exactly the injected
+		 *   requests, or part of them;
+		 * - the minimum interval for sampling the total
+		 *   service time and updating the inject limit has
+		 *   elapsed.
+		 */
+		if (bfqq == bfqd->in_service_queue &&
+		    (bfqd->rq_in_driver == 0 ||
+		     (bfqq->last_serv_time_ns > 0 &&
+		      bfqd->rqs_injected && bfqd->rq_in_driver > 0)) &&
+		    time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+					      msecs_to_jiffies(100))) {
+			bfqd->last_empty_occupied_ns = ktime_get_ns();
+			/*
+			 * Start the state machine for measuring the
+			 * total service time of rq: setting
+			 * wait_dispatch will cause bfqd->waited_rq to
+			 * be set when rq will be dispatched.
+			 */
+			bfqd->wait_dispatch = true;
+			bfqd->rqs_injected = false;
+		}
+	}
+
 	elv_rb_add(&bfqq->sort_list, rq);
 
 	/*
@@ -2566,6 +2683,8 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 		sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
 
 	bfqd->last_idling_start = ktime_get();
+	bfqd->last_idling_start_jiffies = jiffies;
+
 	hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
 		      HRTIMER_MODE_REL);
 	bfqg_stats_set_start_idle_time(bfqq_group(bfqq));
@@ -3240,13 +3359,6 @@ static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
 		    jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
 }
 
-static bool bfq_bfqq_injectable(struct bfq_queue *bfqq)
-{
-	return BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
-		blk_queue_nonrot(bfqq->bfqd->queue) &&
-		bfqq->bfqd->hw_tag;
-}
-
 /**
  * bfq_bfqq_expire - expire a queue.
  * @bfqd: device owning the queue.
@@ -3361,6 +3473,14 @@ void bfq_bfqq_expire(struct bfq_data *bfqd,
 		"expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
 
+	/*
+	 * bfqq expired, so no total service time needs to be computed
+	 * any longer: reset state machine for measuring total service
+	 * times.
+	 */
+	bfqd->rqs_injected = bfqd->wait_dispatch = false;
+	bfqd->waited_rq = NULL;
+
 	/*
 	 * Increase, decrease or leave budget unchanged according to
 	 * reason.
@@ -3372,8 +3492,6 @@ void bfq_bfqq_expire(struct bfq_data *bfqd,
 	if (ref == 1) /* bfqq is gone, no more actions on it */
 		return;
 
-	bfqq->injected_service = 0;
-
 	/* mark bfqq as waiting a request only if a bic still points to it */
 	if (!bfq_bfqq_busy(bfqq) &&
 	    reason != BFQQE_BUDGET_TIMEOUT &&
@@ -3767,26 +3885,98 @@ static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
 	return RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_better_to_idle(bfqq);
 }
 
-static struct bfq_queue *bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
+/*
+ * This function chooses the queue from which to pick the next extra
+ * I/O request to inject, if it finds a compatible queue. See the
+ * comments on bfq_update_inject_limit() for details on the injection
+ * mechanism, and for the definitions of the quantities mentioned
+ * below.
+ */
+static struct bfq_queue *
+bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
 {
-	struct bfq_queue *bfqq;
+	struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue;
+	unsigned int limit = in_serv_bfqq->inject_limit;
+	/*
+	 * If
+	 * - bfqq is not weight-raised and therefore does not carry
+	 *   time-critical I/O,
+	 * or
+	 * - regardless of whether bfqq is weight-raised, bfqq has
+	 *   however a long think time, during which it can absorb the
+	 *   effect of an appropriate number of extra I/O requests
+	 *   from other queues (see bfq_update_inject_limit for
+	 *   details on the computation of this number);
+	 * then injection can be performed without restrictions.
+	 */
+	bool in_serv_always_inject = in_serv_bfqq->wr_coeff == 1 ||
+		!bfq_bfqq_has_short_ttime(in_serv_bfqq);
 
 	/*
-	 * A linear search; but, with a high probability, very few
-	 * steps are needed to find a candidate queue, i.e., a queue
-	 * with enough budget left for its next request. In fact:
+	 * If
+	 * - the baseline total service time could not be sampled yet,
+	 *   so the inject limit happens to be still 0, and
+	 * - a lot of time has elapsed since the plugging of I/O
+	 *   dispatching started, so drive speed is being wasted
+	 *   significantly;
+	 * then temporarily raise inject limit to one request.
+	 */
+	if (limit == 0 && in_serv_bfqq->last_serv_time_ns == 0 &&
+	    bfq_bfqq_wait_request(in_serv_bfqq) &&
+	    time_is_before_eq_jiffies(bfqd->last_idling_start_jiffies +
+				      bfqd->bfq_slice_idle)
+		)
+		limit = 1;
+
+	if (bfqd->rq_in_driver >= limit)
+		return NULL;
+
+	/*
+	 * Linear search of the source queue for injection; but, with
+	 * a high probability, very few steps are needed to find a
+	 * candidate queue, i.e., a queue with enough budget left for
+	 * its next request. In fact:
 	 * - BFQ dynamically updates the budget of every queue so as
 	 *   to accommodate the expected backlog of the queue;
 	 * - if a queue gets all its requests dispatched as injected
 	 *   service, then the queue is removed from the active list
-	 *   (and re-added only if it gets new requests, but with
-	 *   enough budget for its new backlog).
+	 *   (and re-added only if it gets new requests, but then it
+	 *   is assigned again enough budget for its new backlog).
 	 */
 	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
 		if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
+		    (in_serv_always_inject || bfqq->wr_coeff > 1) &&
 		    bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
-		    bfq_bfqq_budget_left(bfqq))
-			return bfqq;
+		    bfq_bfqq_budget_left(bfqq)) {
+			/*
+			 * Allow for only one large in-flight request
+			 * on non-rotational devices, for the
+			 * following reason. On non-rotationl drives,
+			 * large requests take much longer than
+			 * smaller requests to be served. In addition,
+			 * the drive prefers to serve large requests
+			 * w.r.t. to small ones, if it can choose. So,
+			 * having more than one large requests queued
+			 * in the drive may easily make the next first
+			 * request of the in-service queue wait for so
+			 * long to break bfqq's service guarantees. On
+			 * the bright side, large requests let the
+			 * drive reach a very high throughput, even if
+			 * there is only one in-flight large request
+			 * at a time.
+			 */
+			if (blk_queue_nonrot(bfqd->queue) &&
+			    blk_rq_sectors(bfqq->next_rq) >=
+			    BFQQ_SECT_THR_NONROT)
+				limit = min_t(unsigned int, 1, limit);
+			else
+				limit = in_serv_bfqq->inject_limit;
+
+			if (bfqd->rq_in_driver < limit) {
+				bfqd->rqs_injected = true;
+				return bfqq;
+			}
+		}
 
 	return NULL;
 }
@@ -3873,14 +4063,32 @@ check_queue:
 	 * for a new request, or has requests waiting for a completion and
 	 * may idle after their completion, then keep it anyway.
 	 *
-	 * Yet, to boost throughput, inject service from other queues if
-	 * possible.
+	 * Yet, inject service from other queues if it boosts
+	 * throughput and is possible.
 	 */
 	if (bfq_bfqq_wait_request(bfqq) ||
 	    (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) {
-		if (bfq_bfqq_injectable(bfqq) &&
-		    bfqq->injected_service * bfqq->inject_coeff <
-		    bfqq->entity.service * 10)
+		struct bfq_queue *async_bfqq =
+			bfqq->bic && bfqq->bic->bfqq[0] &&
+			bfq_bfqq_busy(bfqq->bic->bfqq[0]) ?
+			bfqq->bic->bfqq[0] : NULL;
+
+		/*
+		 * If the process associated with bfqq has also async
+		 * I/O pending, then inject it
+		 * unconditionally. Injecting I/O from the same
+		 * process can cause no harm to the process. On the
+		 * contrary, it can only increase bandwidth and reduce
+		 * latency for the process.
+		 */
+		if (async_bfqq &&
+		    icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&
+		    bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=
+		    bfq_bfqq_budget_left(async_bfqq))
+			bfqq = bfqq->bic->bfqq[0];
+		else if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&
+			 (bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 ||
+			  !bfq_bfqq_has_short_ttime(bfqq)))
 			bfqq = bfq_choose_bfqq_for_injection(bfqd);
 		else
 			bfqq = NULL;
@@ -3972,15 +4180,15 @@ static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
 
 	bfq_bfqq_served(bfqq, service_to_charge);
 
-	bfq_dispatch_remove(bfqd->queue, rq);
+	if (bfqq == bfqd->in_service_queue && bfqd->wait_dispatch) {
+		bfqd->wait_dispatch = false;
+		bfqd->waited_rq = rq;
+	}
 
-	if (bfqq != bfqd->in_service_queue) {
-		if (likely(bfqd->in_service_queue))
-			bfqd->in_service_queue->injected_service +=
-				bfq_serv_to_charge(rq, bfqq);
+	bfq_dispatch_remove(bfqd->queue, rq);
 
+	if (bfqq != bfqd->in_service_queue)
 		goto return_rq;
-	}
 
 	/*
 	 * If weight raising has to terminate for bfqq, then next
@@ -4411,13 +4619,6 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 			bfq_mark_bfqq_has_short_ttime(bfqq);
 		bfq_mark_bfqq_sync(bfqq);
 		bfq_mark_bfqq_just_created(bfqq);
-		/*
-		 * Aggressively inject a lot of service: up to 90%.
-		 * This coefficient remains constant during bfqq life,
-		 * but this behavior might be changed, after enough
-		 * testing and tuning.
-		 */
-		bfqq->inject_coeff = 1;
 	} else
 		bfq_clear_bfqq_sync(bfqq);
 
@@ -4976,6 +5177,147 @@ static void bfq_finish_requeue_request_body(struct bfq_queue *bfqq)
 	bfq_put_queue(bfqq);
 }
 
+/*
+ * The processes associated with bfqq may happen to generate their
+ * cumulative I/O at a lower rate than the rate at which the device
+ * could serve the same I/O. This is rather probable, e.g., if only
+ * one process is associated with bfqq and the device is an SSD. It
+ * results in bfqq becoming often empty while in service. In this
+ * respect, if BFQ is allowed to switch to another queue when bfqq
+ * remains empty, then the device goes on being fed with I/O requests,
+ * and the throughput is not affected. In contrast, if BFQ is not
+ * allowed to switch to another queue---because bfqq is sync and
+ * I/O-dispatch needs to be plugged while bfqq is temporarily
+ * empty---then, during the service of bfqq, there will be frequent
+ * "service holes", i.e., time intervals during which bfqq gets empty
+ * and the device can only consume the I/O already queued in its
+ * hardware queues. During service holes, the device may even get to
+ * remaining idle. In the end, during the service of bfqq, the device
+ * is driven at a lower speed than the one it can reach with the kind
+ * of I/O flowing through bfqq.
+ *
+ * To counter this loss of throughput, BFQ implements a "request
+ * injection mechanism", which tries to fill the above service holes
+ * with I/O requests taken from other queues. The hard part in this
+ * mechanism is finding the right amount of I/O to inject, so as to
+ * both boost throughput and not break bfqq's bandwidth and latency
+ * guarantees. In this respect, the mechanism maintains a per-queue
+ * inject limit, computed as below. While bfqq is empty, the injection
+ * mechanism dispatches extra I/O requests only until the total number
+ * of I/O requests in flight---i.e., already dispatched but not yet
+ * completed---remains lower than this limit.
+ *
+ * A first definition comes in handy to introduce the algorithm by
+ * which the inject limit is computed.  We define as first request for
+ * bfqq, an I/O request for bfqq that arrives while bfqq is in
+ * service, and causes bfqq to switch from empty to non-empty. The
+ * algorithm updates the limit as a function of the effect of
+ * injection on the service times of only the first requests of
+ * bfqq. The reason for this restriction is that these are the
+ * requests whose service time is affected most, because they are the
+ * first to arrive after injection possibly occurred.
+ *
+ * To evaluate the effect of injection, the algorithm measures the
+ * "total service time" of first requests. We define as total service
+ * time of an I/O request, the time that elapses since when the
+ * request is enqueued into bfqq, to when it is completed. This
+ * quantity allows the whole effect of injection to be measured. It is
+ * easy to see why. Suppose that some requests of other queues are
+ * actually injected while bfqq is empty, and that a new request R
+ * then arrives for bfqq. If the device does start to serve all or
+ * part of the injected requests during the service hole, then,
+ * because of this extra service, it may delay the next invocation of
+ * the dispatch hook of BFQ. Then, even after R gets eventually
+ * dispatched, the device may delay the actual service of R if it is
+ * still busy serving the extra requests, or if it decides to serve,
+ * before R, some extra request still present in its queues. As a
+ * conclusion, the cumulative extra delay caused by injection can be
+ * easily evaluated by just comparing the total service time of first
+ * requests with and without injection.
+ *
+ * The limit-update algorithm works as follows. On the arrival of a
+ * first request of bfqq, the algorithm measures the total time of the
+ * request only if one of the three cases below holds, and, for each
+ * case, it updates the limit as described below:
+ *
+ * (1) If there is no in-flight request. This gives a baseline for the
+ *     total service time of the requests of bfqq. If the baseline has
+ *     not been computed yet, then, after computing it, the limit is
+ *     set to 1, to start boosting throughput, and to prepare the
+ *     ground for the next case. If the baseline has already been
+ *     computed, then it is updated, in case it results to be lower
+ *     than the previous value.
+ *
+ * (2) If the limit is higher than 0 and there are in-flight
+ *     requests. By comparing the total service time in this case with
+ *     the above baseline, it is possible to know at which extent the
+ *     current value of the limit is inflating the total service
+ *     time. If the inflation is below a certain threshold, then bfqq
+ *     is assumed to be suffering from no perceivable loss of its
+ *     service guarantees, and the limit is even tentatively
+ *     increased. If the inflation is above the threshold, then the
+ *     limit is decreased. Due to the lack of any hysteresis, this
+ *     logic makes the limit oscillate even in steady workload
+ *     conditions. Yet we opted for it, because it is fast in reaching
+ *     the best value for the limit, as a function of the current I/O
+ *     workload. To reduce oscillations, this step is disabled for a
+ *     short time interval after the limit happens to be decreased.
+ *
+ * (3) Periodically, after resetting the limit, to make sure that the
+ *     limit eventually drops in case the workload changes. This is
+ *     needed because, after the limit has gone safely up for a
+ *     certain workload, it is impossible to guess whether the
+ *     baseline total service time may have changed, without measuring
+ *     it again without injection. A more effective version of this
+ *     step might be to just sample the baseline, by interrupting
+ *     injection only once, and then to reset/lower the limit only if
+ *     the total service time with the current limit does happen to be
+ *     too large.
+ *
+ * More details on each step are provided in the comments on the
+ * pieces of code that implement these steps: the branch handling the
+ * transition from empty to non empty in bfq_add_request(), the branch
+ * handling injection in bfq_select_queue(), and the function
+ * bfq_choose_bfqq_for_injection(). These comments also explain some
+ * exceptions, made by the injection mechanism in some special cases.
+ */
+static void bfq_update_inject_limit(struct bfq_data *bfqd,
+				    struct bfq_queue *bfqq)
+{
+	u64 tot_time_ns = ktime_get_ns() - bfqd->last_empty_occupied_ns;
+	unsigned int old_limit = bfqq->inject_limit;
+
+	if (bfqq->last_serv_time_ns > 0) {
+		u64 threshold = (bfqq->last_serv_time_ns * 3)>>1;
+
+		if (tot_time_ns >= threshold && old_limit > 0) {
+			bfqq->inject_limit--;
+			bfqq->decrease_time_jif = jiffies;
+		} else if (tot_time_ns < threshold &&
+			   old_limit < bfqd->max_rq_in_driver<<1)
+			bfqq->inject_limit++;
+	}
+
+	/*
+	 * Either we still have to compute the base value for the
+	 * total service time, and there seem to be the right
+	 * conditions to do it, or we can lower the last base value
+	 * computed.
+	 */
+	if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 0) ||
+	    tot_time_ns < bfqq->last_serv_time_ns) {
+		bfqq->last_serv_time_ns = tot_time_ns;
+		/*
+		 * Now we certainly have a base value: make sure we
+		 * start trying injection.
+		 */
+		bfqq->inject_limit = max_t(unsigned int, 1, old_limit);
+	}
+
+	/* update complete, not waiting for any request completion any longer */
+	bfqd->waited_rq = NULL;
+}
+
 /*
  * Handle either a requeue or a finish for rq. The things to do are
  * the same in both cases: all references to rq are to be dropped. In
@@ -5020,6 +5362,9 @@ static void bfq_finish_requeue_request(struct request *rq)
 
 		spin_lock_irqsave(&bfqd->lock, flags);
 
+		if (rq == bfqd->waited_rq)
+			bfq_update_inject_limit(bfqd, bfqq);
+
 		bfq_completed_request(bfqq, bfqd);
 		bfq_finish_requeue_request_body(bfqq);
 
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 81cabf51a87e..26869cfbbfa9 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -240,6 +240,13 @@ struct bfq_queue {
 	/* next ioprio and ioprio class if a change is in progress */
 	unsigned short new_ioprio, new_ioprio_class;
 
+	/* last total-service-time sample, see bfq_update_inject_limit() */
+	u64 last_serv_time_ns;
+	/* limit for request injection */
+	unsigned int inject_limit;
+	/* last time the inject limit has been decreased, in jiffies */
+	unsigned long decrease_time_jif;
+
 	/*
 	 * Shared bfq_queue if queue is cooperating with one or more
 	 * other queues.
@@ -357,29 +364,6 @@ struct bfq_queue {
 
 	/* max service rate measured so far */
 	u32 max_service_rate;
-	/*
-	 * Ratio between the service received by bfqq while it is in
-	 * service, and the cumulative service (of requests of other
-	 * queues) that may be injected while bfqq is empty but still
-	 * in service. To increase precision, the coefficient is
-	 * measured in tenths of unit. Here are some example of (1)
-	 * ratios, (2) resulting percentages of service injected
-	 * w.r.t. to the total service dispatched while bfqq is in
-	 * service, and (3) corresponding values of the coefficient:
-	 * 1 (50%) -> 10
-	 * 2 (33%) -> 20
-	 * 10 (9%) -> 100
-	 * 9.9 (9%) -> 99
-	 * 1.5 (40%) -> 15
-	 * 0.5 (66%) -> 5
-	 * 0.1 (90%) -> 1
-	 *
-	 * So, if the coefficient is lower than 10, then
-	 * injected service is more than bfqq service.
-	 */
-	unsigned int inject_coeff;
-	/* amount of service injected in current service slot */
-	unsigned int injected_service;
 };
 
 /**
@@ -544,6 +528,26 @@ struct bfq_data {
 	/* time of last request completion (ns) */
 	u64 last_completion;
 
+	/* time of last transition from empty to non-empty (ns) */
+	u64 last_empty_occupied_ns;
+
+	/*
+	 * Flag set to activate the sampling of the total service time
+	 * of a just-arrived first I/O request (see
+	 * bfq_update_inject_limit()). This will cause the setting of
+	 * waited_rq when the request is finally dispatched.
+	 */
+	bool wait_dispatch;
+	/*
+	 *  If set, then bfq_update_inject_limit() is invoked when
+	 *  waited_rq is eventually completed.
+	 */
+	struct request *waited_rq;
+	/*
+	 * True if some request has been injected during the last service hole.
+	 */
+	bool rqs_injected;
+
 	/* time of first rq dispatch in current observation interval (ns) */
 	u64 first_dispatch;
 	/* time of last rq dispatch in current observation interval (ns) */
@@ -553,6 +557,7 @@ struct bfq_data {
 	ktime_t last_budget_start;
 	/* beginning of the last idle slice */
 	ktime_t last_idling_start;
+	unsigned long last_idling_start_jiffies;
 
 	/* number of samples in current observation interval */
 	int peak_rate_samples;
-- 
cgit 


From 8cacc5ab3eacf5284bc9b0d7d5b85b748a338104 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@linaro.org>
Date: Tue, 12 Mar 2019 09:59:30 +0100
Subject: block, bfq: do not merge queues on flash storage with queueing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

To boost throughput with a set of processes doing interleaved I/O
(i.e., a set of processes whose individual I/O is random, but whose
merged cumulative I/O is sequential), BFQ merges the queues associated
with these processes, i.e., redirects the I/O of these processes into a
common, shared queue. In the shared queue, I/O requests are ordered by
their position on the medium, thus sequential I/O gets dispatched to
the device when the shared queue is served.

Queue merging costs execution time, because, to detect which queues to
merge, BFQ must maintain a list of the head I/O requests of active
queues, ordered by request positions. Measurements showed that this
costs about 10% of BFQ's total per-request processing time.

Request processing time becomes more and more critical as the speed of
the underlying storage device grows. Yet, fortunately, queue merging
is basically useless on the very devices that are so fast to make
request processing time critical. To reach a high throughput, these
devices must have many requests queued at the same time. But, in this
configuration, the internal scheduling algorithms of these devices do
also the job of queue merging: they reorder requests so as to obtain
as much as possible a sequential I/O pattern. As a consequence, with
processes doing interleaved I/O, the throughput reached by one such
device is likely to be the same, with and without queue merging.

In view of this fact, this commit disables queue merging, and all
related housekeeping, for non-rotational devices with internal
queueing. The total, single-lock-protected, per-request processing
time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(time measured with simple code instrumentation, and using the
throughput-sync.sh script of the S suite [1], in performance-profiling
mode). To put this result into context, the total,
single-lock-protected, per-request execution time of the lightest I/O
scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
~800 LOC, against ~10500 LOC for BFQ).

Disabling merging provides a further, remarkable benefit in terms of
throughput. Merging tends to make many workloads artificially more
uneven, mainly because of shared queues remaining non empty for
incomparably more time than normal queues. So, if, e.g., one of the
queues in a set of merged queues has a higher weight than a normal
queue, then the shared queue may inherit such a high weight and, by
staying almost always active, may force BFQ to perform I/O plugging
most of the time. This evidently makes it harder for BFQ to let the
device reach a high throughput.

As a practical example of this problem, and of the benefits of this
commit, we measured again the throughput in the nasty scenario
considered in previous commit messages: dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes. With
this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
of the other I/O schedulers. As such, this is also likely to be the
maximum possible throughput reachable with this workload on this
device, because I/O is mostly random, and the other schedulers
basically just pass I/O requests to the drive as fast as possible.

[1] https://github.com/Algodev-github/S

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Alessio Masola <alessio.masola@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-cgroup.c  |  3 ++-
 block/bfq-iosched.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++----
 block/bfq-iosched.h |  3 +++
 3 files changed, 73 insertions(+), 6 deletions(-)

(limited to 'block')

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c6113af31960..2a74a3f2a8f7 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -578,7 +578,8 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	bfqg_and_blkg_get(bfqg);
 
 	if (bfq_bfqq_busy(bfqq)) {
-		bfq_pos_tree_add_move(bfqd, bfqq);
+		if (unlikely(!bfqd->nonrot_with_queueing))
+			bfq_pos_tree_add_move(bfqd, bfqq);
 		bfq_activate_bfqq(bfqd, bfqq);
 	}
 
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index f59efee7a601..b957e9db87d8 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -595,7 +595,16 @@ static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
 				       bfq_merge_time_limit);
 }
 
-void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+/*
+ * The following function is not marked as __cold because it is
+ * actually cold, but for the same performance goal described in the
+ * comments on the likely() at the beginning of
+ * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
+ * execution time for the case where this function is not invoked, we
+ * had to add an unlikely() in each involved if().
+ */
+void __cold
+bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	struct rb_node **p, *parent;
 	struct bfq_queue *__bfqq;
@@ -1849,8 +1858,9 @@ static void bfq_add_request(struct request *rq)
 
 	/*
 	 * Adjust priority tree position, if next_rq changes.
+	 * See comments on bfq_pos_tree_add_move() for the unlikely().
 	 */
-	if (prev != bfqq->next_rq)
+	if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq))
 		bfq_pos_tree_add_move(bfqd, bfqq);
 
 	if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
@@ -1990,7 +2000,9 @@ static void bfq_remove_request(struct request_queue *q,
 			bfqq->pos_root = NULL;
 		}
 	} else {
-		bfq_pos_tree_add_move(bfqd, bfqq);
+		/* see comments on bfq_pos_tree_add_move() for the unlikely() */
+		if (unlikely(!bfqd->nonrot_with_queueing))
+			bfq_pos_tree_add_move(bfqd, bfqq);
 	}
 
 	if (rq->cmd_flags & REQ_META)
@@ -2075,7 +2087,12 @@ static void bfq_request_merged(struct request_queue *q, struct request *req,
 		 */
 		if (prev != bfqq->next_rq) {
 			bfq_updated_next_req(bfqd, bfqq);
-			bfq_pos_tree_add_move(bfqd, bfqq);
+			/*
+			 * See comments on bfq_pos_tree_add_move() for
+			 * the unlikely().
+			 */
+			if (unlikely(!bfqd->nonrot_with_queueing))
+				bfq_pos_tree_add_move(bfqd, bfqq);
 		}
 	}
 }
@@ -2357,6 +2374,46 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 {
 	struct bfq_queue *in_service_bfqq, *new_bfqq;
 
+	/*
+	 * Do not perform queue merging if the device is non
+	 * rotational and performs internal queueing. In fact, such a
+	 * device reaches a high speed through internal parallelism
+	 * and pipelining. This means that, to reach a high
+	 * throughput, it must have many requests enqueued at the same
+	 * time. But, in this configuration, the internal scheduling
+	 * algorithm of the device does exactly the job of queue
+	 * merging: it reorders requests so as to obtain as much as
+	 * possible a sequential I/O pattern. As a consequence, with
+	 * the workload generated by processes doing interleaved I/O,
+	 * the throughput reached by the device is likely to be the
+	 * same, with and without queue merging.
+	 *
+	 * Disabling merging also provides a remarkable benefit in
+	 * terms of throughput. Merging tends to make many workloads
+	 * artificially more uneven, because of shared queues
+	 * remaining non empty for incomparably more time than
+	 * non-merged queues. This may accentuate workload
+	 * asymmetries. For example, if one of the queues in a set of
+	 * merged queues has a higher weight than a normal queue, then
+	 * the shared queue may inherit such a high weight and, by
+	 * staying almost always active, may force BFQ to perform I/O
+	 * plugging most of the time. This evidently makes it harder
+	 * for BFQ to let the device reach a high throughput.
+	 *
+	 * Finally, the likely() macro below is not used because one
+	 * of the two branches is more likely than the other, but to
+	 * have the code path after the following if() executed as
+	 * fast as possible for the case of a non rotational device
+	 * with queueing. We want it because this is the fastest kind
+	 * of device. On the opposite end, the likely() may lengthen
+	 * the execution time of BFQ for the case of slower devices
+	 * (rotational or at least without queueing). But in this case
+	 * the execution time of BFQ matters very little, if not at
+	 * all.
+	 */
+	if (likely(bfqd->nonrot_with_queueing))
+		return NULL;
+
 	/*
 	 * Prevent bfqq from being merged if it has been created too
 	 * long ago. The idea is that true cooperating processes, and
@@ -2986,8 +3043,10 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 		bfq_requeue_bfqq(bfqd, bfqq, true);
 		/*
 		 * Resort priority tree of potential close cooperators.
+		 * See comments on bfq_pos_tree_add_move() for the unlikely().
 		 */
-		bfq_pos_tree_add_move(bfqd, bfqq);
+		if (unlikely(!bfqd->nonrot_with_queueing))
+			bfq_pos_tree_add_move(bfqd, bfqq);
 	}
 
 	/*
@@ -5051,6 +5110,9 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)
 	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
 	bfqd->max_rq_in_driver = 0;
 	bfqd->hw_tag_samples = 0;
+
+	bfqd->nonrot_with_queueing =
+		blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
 }
 
 static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
@@ -5882,6 +5944,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	INIT_HLIST_HEAD(&bfqd->burst_list);
 
 	bfqd->hw_tag = -1;
+	bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);
 
 	bfqd->bfq_max_budget = bfq_default_max_budget;
 
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 26869cfbbfa9..829730b96fb2 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -497,6 +497,9 @@ struct bfq_data {
 	/* number of requests dispatched and waiting for completion */
 	int rq_in_driver;
 
+	/* true if the device is non rotational and performs queueing */
+	bool nonrot_with_queueing;
+
 	/*
 	 * Maximum number of requests in driver in the last
 	 * @hw_tag_samples completed requests.
-- 
cgit 


From 7074f076ff153021f408229b0ce63063dde9a400 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@linaro.org>
Date: Tue, 12 Mar 2019 09:59:31 +0100
Subject: block, bfq: do not tag totally seeky queues as soft rt
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Sync random I/O is likely to be confused with soft real-time I/O,
because it is characterized by limited throughput and apparently
isochronous arrival pattern. To avoid false positives, this commits
prevents bfq_queues containing only random (seeky) I/O from being
tagged as soft real-time.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

(limited to 'block')

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b957e9db87d8..7044da0b1c52 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -242,6 +242,14 @@ static struct kmem_cache *bfq_pool;
 	  blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
 #define BFQQ_CLOSE_THR		(sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 19)
+/*
+ * Sync random I/O is likely to be confused with soft real-time I/O,
+ * because it is characterized by limited throughput and apparently
+ * isochronous arrival pattern. To avoid false positives, queues
+ * containing only random (seeky) I/O are prevented from being tagged
+ * as soft real-time.
+ */
+#define BFQQ_TOTALLY_SEEKY(bfqq)	(bfqq->seek_history & -1)
 
 /* Min number of samples required to perform peak-rate update */
 #define BFQ_RATE_MIN_SAMPLES	32
@@ -1622,6 +1630,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 	 */
 	in_burst = bfq_bfqq_in_large_burst(bfqq);
 	soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+		!BFQQ_TOTALLY_SEEKY(bfqq) &&
 		!in_burst &&
 		time_is_before_jiffies(bfqq->soft_rt_next_start) &&
 		bfqq->dispatched == 0;
@@ -4816,6 +4825,11 @@ bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 {
 	bfqq->seek_history <<= 1;
 	bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
+
+	if (bfqq->wr_coeff > 1 &&
+	    bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+	    BFQQ_TOTALLY_SEEKY(bfqq))
+		bfq_bfqq_end_wr(bfqq);
 }
 
 static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
-- 
cgit 


From 84a746891e1d8364485c0a37533fe6c1380270d4 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@linaro.org>
Date: Tue, 12 Mar 2019 09:59:32 +0100
Subject: block, bfq: always protect newly-created queues from existing active
 queues
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If many bfq_queues belonging to the same group happen to be created
shortly after each other, then the processes associated with these
queues have typically a common goal. In particular, bursts of queue
creations are usually caused by services or applications that spawn
many parallel threads/processes. Examples are systemd during boot, or
git grep. If there are no other active queues, then, to help these
processes get their job done as soon as possible, the best thing to do
is to reach a high throughput. To this goal, it is usually better to
not grant either weight-raising or device idling to the queues
associated with these processes. And this is exactly what BFQ
currently does.

There is however a drawback: if, in contrast, some other queues are
already active, then the newly created queues must be protected from
the I/O flowing through the already existing queues. In this case, the
best thing to do is the opposite as in the other case: it is much
better to grant weight-raising and device idling to the newly-created
queues, if they deserve it. This commit addresses this issue by doing
so if there are already other active queues.

This change also helps eliminating false positives, which occur when
the newly-created queues do not belong to an actual large burst of
creations, but some background task (e.g., a service) happens to
trigger the creation of new queues in the middle, i.e., very close to
when the victim queues are created. These false positive may cause
total loss of control on process latencies.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 64 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 51 insertions(+), 13 deletions(-)

(limited to 'block')

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7044da0b1c52..49bde428f7f2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1075,8 +1075,18 @@ static void bfq_reset_burst_list(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 	hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
 		hlist_del_init(&item->burst_list_node);
-	hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
-	bfqd->burst_size = 1;
+
+	/*
+	 * Start the creation of a new burst list only if there is no
+	 * active queue. See comments on the conditional invocation of
+	 * bfq_handle_burst().
+	 */
+	if (bfq_tot_busy_queues(bfqd) == 0) {
+		hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
+		bfqd->burst_size = 1;
+	} else
+		bfqd->burst_size = 0;
+
 	bfqd->burst_parent_entity = bfqq->entity.parent;
 }
 
@@ -1132,7 +1142,8 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  * many parallel threads/processes. Examples are systemd during boot,
  * or git grep. To help these processes get their job done as soon as
  * possible, it is usually better to not grant either weight-raising
- * or device idling to their queues.
+ * or device idling to their queues, unless these queues must be
+ * protected from the I/O flowing through other active queues.
  *
  * In this comment we describe, firstly, the reasons why this fact
  * holds, and, secondly, the next function, which implements the main
@@ -1144,7 +1155,10 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  * cumulatively served, the sooner the target job of these queues gets
  * completed. As a consequence, weight-raising any of these queues,
  * which also implies idling the device for it, is almost always
- * counterproductive. In most cases it just lowers throughput.
+ * counterproductive, unless there are other active queues to isolate
+ * these new queues from. If there no other active queues, then
+ * weight-raising these new queues just lowers throughput in most
+ * cases.
  *
  * On the other hand, a burst of queue creations may be caused also by
  * the start of an application that does not consist of a lot of
@@ -1178,14 +1192,16 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
  * are very rare. They typically occur if some service happens to
  * start doing I/O exactly when the interactive task starts.
  *
- * Turning back to the next function, it implements all the steps
- * needed to detect the occurrence of a large burst and to properly
- * mark all the queues belonging to it (so that they can then be
- * treated in a different way). This goal is achieved by maintaining a
- * "burst list" that holds, temporarily, the queues that belong to the
- * burst in progress. The list is then used to mark these queues as
- * belonging to a large burst if the burst does become large. The main
- * steps are the following.
+ * Turning back to the next function, it is invoked only if there are
+ * no active queues (apart from active queues that would belong to the
+ * same, possible burst bfqq would belong to), and it implements all
+ * the steps needed to detect the occurrence of a large burst and to
+ * properly mark all the queues belonging to it (so that they can then
+ * be treated in a different way). This goal is achieved by
+ * maintaining a "burst list" that holds, temporarily, the queues that
+ * belong to the burst in progress. The list is then used to mark
+ * these queues as belonging to a large burst if the burst does become
+ * large. The main steps are the following.
  *
  * . when the very first queue is created, the queue is inserted into the
  *   list (as it could be the first queue in a possible burst)
@@ -5695,7 +5711,29 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
 		}
 	}
 
-	if (unlikely(bfq_bfqq_just_created(bfqq)))
+	/*
+	 * Consider bfqq as possibly belonging to a burst of newly
+	 * created queues only if:
+	 * 1) A burst is actually happening (bfqd->burst_size > 0)
+	 * or
+	 * 2) There is no other active queue. In fact, if, in
+	 *    contrast, there are active queues not belonging to the
+	 *    possible burst bfqq may belong to, then there is no gain
+	 *    in considering bfqq as belonging to a burst, and
+	 *    therefore in not weight-raising bfqq. See comments on
+	 *    bfq_handle_burst().
+	 *
+	 * This filtering also helps eliminating false positives,
+	 * occurring when bfqq does not belong to an actual large
+	 * burst, but some background task (e.g., a service) happens
+	 * to trigger the creation of new queues very close to when
+	 * bfqq and its possible companion queues are created. See
+	 * comments on bfq_handle_burst() for further details also on
+	 * this issue.
+	 */
+	if (unlikely(bfq_bfqq_just_created(bfqq) &&
+		     (bfqd->burst_size > 0 ||
+		      bfq_tot_busy_queues(bfqd) == 0)))
 		bfq_handle_burst(bfqd, bfqq);
 
 	return bfqq;
-- 
cgit 


From 1e66413c4f68e2a61a210e4f5ff5df7a2ab86a5b Mon Sep 17 00:00:00 2001
From: Francesco Pollicino <fra.fra.800@gmail.com>
Date: Tue, 12 Mar 2019 09:59:33 +0100
Subject: block, bfq: print SHARED instead of pid for shared queues in logs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The function "bfq_log_bfqq" prints the pid of the process
associated with the queue passed as input.

Unfortunately, if the queue is shared, then more than one process
is associated with the queue. The pid that gets printed in this
case is the pid of one of the associated processes.
Which process gets printed depends on the exact sequence of merge
events the queue underwent. So printing such a pid is rather
useless and above all is often rather confusing because it
reports a random pid between those of the associated processes.

This commit addresses this issue by printing SHARED instead of a pid
if the queue is shared.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 10 ++++++++++
 block/bfq-iosched.h | 23 +++++++++++++++++++----
 2 files changed, 29 insertions(+), 4 deletions(-)

(limited to 'block')

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 49bde428f7f2..37cc8f127cf6 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2590,6 +2590,16 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
 	 *   assignment causes no harm).
 	 */
 	new_bfqq->bic = NULL;
+	/*
+	 * If the queue is shared, the pid is the pid of one of the associated
+	 * processes. Which pid depends on the exact sequence of merge events
+	 * the queue underwent. So printing such a pid is useless and confusing
+	 * because it reports a random pid between those of the associated
+	 * processes.
+	 * We mark such a queue with a pid -1, and then print SHARED instead of
+	 * a pid in logging messages.
+	 */
+	new_bfqq->pid = -1;
 	bfqq->bic = NULL;
 	/* release process reference to bfqq */
 	bfq_put_queue(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 829730b96fb2..67e63c276c7a 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -32,6 +32,8 @@
 #define BFQ_DEFAULT_GRP_IOPRIO	0
 #define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE
 
+#define MAX_PID_STR_LENGTH 12
+
 /*
  * Soft real-time applications are extremely more latency sensitive
  * than interactive ones. Over-raise the weight of the former to
@@ -1016,13 +1018,23 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 /* --------------- end of interface of B-WF2Q+ ---------------- */
 
 /* Logging facilities. */
+static inline void bfq_pid_to_str(int pid, char *str, int len)
+{
+	if (pid != -1)
+		snprintf(str, len, "%d", pid);
+	else
+		snprintf(str, len, "SHARED-");
+}
+
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	do {			\
+	char pid_str[MAX_PID_STR_LENGTH];	\
+	bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);	\
 	blk_add_cgroup_trace_msg((bfqd)->queue,				\
 			bfqg_to_blkg(bfqq_group(bfqq))->blkcg,		\
-			"bfq%d%c " fmt, (bfqq)->pid,			\
+			"bfq%s%c " fmt, pid_str,			\
 			bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args);	\
 } while (0)
 
@@ -1033,10 +1045,13 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #else /* CONFIG_BFQ_GROUP_IOSCHED */
 
-#define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	\
-	blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid,	\
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do {	\
+	char pid_str[MAX_PID_STR_LENGTH];	\
+	bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);	\
+	blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str,	\
 			bfq_bfqq_sync((bfqq)) ? 'S' : 'A',		\
-				##args)
+				##args);	\
+} while (0)
 #define bfq_log_bfqg(bfqd, bfqg, fmt, args...)		do {} while (0)
 
 #endif /* CONFIG_BFQ_GROUP_IOSCHED */
-- 
cgit 


From fffca087d587b03d0d0dca2e86bf8e688fbf2c18 Mon Sep 17 00:00:00 2001
From: Francesco Pollicino <fra.fra.800@gmail.com>
Date: Tue, 12 Mar 2019 09:59:34 +0100
Subject: block, bfq: save & resume weight on a queue merge/split
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

bfq saves the state of a queue each time a merge occurs, to be
able to resume such a state when the queue is associated again
with its original process, on a split.

Unfortunately bfq does not save & restore also the weight of the
queue. If the weight is not correctly resumed when the queue is
recycled, then the weight of the recycled queue could differ
from the weight of the original queue.

This commit adds the missing save & resume of the weight.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 2 ++
 block/bfq-iosched.h | 9 +++++++++
 2 files changed, 11 insertions(+)

(limited to 'block')

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 37cc8f127cf6..ceb06abd73df 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1028,6 +1028,7 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
 	else
 		bfq_clear_bfqq_IO_bound(bfqq);
 
+	bfqq->entity.new_weight = bic->saved_weight;
 	bfqq->ttime = bic->saved_ttime;
 	bfqq->wr_coeff = bic->saved_wr_coeff;
 	bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
@@ -2502,6 +2503,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
 	if (!bic)
 		return;
 
+	bic->saved_weight = bfqq->entity.orig_weight;
 	bic->saved_ttime = bfqq->ttime;
 	bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
 	bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 67e63c276c7a..60c148728cc5 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -404,6 +404,15 @@ struct bfq_io_cq {
 	 */
 	bool was_in_burst_list;
 
+	/*
+	 * Save the weight when a merge occurs, to be able
+	 * to restore it in case of split. If the weight is not
+	 * correctly resumed when the queue is recycled,
+	 * then the weight of the recycled queue could differ
+	 * from the weight of the original queue.
+	 */
+	unsigned int saved_weight;
+
 	/*
 	 * Similar to previous fields: save wr information.
 	 */
-- 
cgit 


From 0383ad4374f7ad7edd925a2ee4753035c3f5508a Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Fri, 29 Mar 2019 15:07:54 +0800
Subject: block: pass page to xen_biovec_phys_mergeable

xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec
for checking if the two bvecs can be merged, so pass page to
xen_biovec_phys_mergeable() directly.

No function change.

Cc: ris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: xen-devel@lists.xenproject.org
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/blk.h b/block/blk.h
index 5d636ee41663..e27fd1512e4b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -75,7 +75,7 @@ static inline bool biovec_phys_mergeable(struct request_queue *q,
 
 	if (addr1 + vec1->bv_len != addr2)
 		return false;
-	if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2))
+	if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2->bv_page))
 		return false;
 	if ((addr1 | mask) != ((addr2 + vec2->bv_len - 1) | mask))
 		return false;
-- 
cgit 


From db5ebd6edd2627d7e81a031643cf43587f63e66c Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sun, 17 Mar 2019 18:01:04 +0800
Subject: block: avoid to break XEN by multi-page bvec

XEN has special page merge requirement, see xen_biovec_phys_mergeable().
We can't merge pages into one bvec simply for XEN.

So move XEN's specific check on page merge into __bio_try_merge_page(),
then abvoid to break XEN by multi-page bvec.

Cc: ris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index b64cedc7f87c..b2423e7aae08 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -776,6 +776,8 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 
 		if (vec_end_addr + 1 != page_addr + off)
 			return false;
+		if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
+			return false;
 		if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
 			return false;
 
-- 
cgit 


From fd7d8d4232f08b0df623d9ea7e941f0350a26e14 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sun, 17 Mar 2019 18:01:05 +0800
Subject: block: don't merge adjacent bvecs to one segment in bio
 blk_queue_split

For normal filesystem IO, each page is added via blk_add_page(),
in which bvec(page) merge has been handled already, and basically
not possible to merge two adjacent bvecs in one bio.

So not try to merge two adjacent bvecs in blk_queue_split().

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c | 17 -----------------
 1 file changed, 17 deletions(-)

(limited to 'block')

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1c9d4f0f96ea..aa9164eb7187 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -267,23 +267,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 			goto split;
 		}
 
-		if (bvprvp) {
-			if (seg_size + bv.bv_len > queue_max_segment_size(q))
-				goto new_segment;
-			if (!biovec_phys_mergeable(q, bvprvp, &bv))
-				goto new_segment;
-
-			seg_size += bv.bv_len;
-			bvprv = bv;
-			bvprvp = &bvprv;
-			sectors += bv.bv_len >> 9;
-
-			if (nsegs == 1 && seg_size > front_seg_size)
-				front_seg_size = seg_size;
-
-			continue;
-		}
-new_segment:
 		if (nsegs == max_segs)
 			goto split;
 
-- 
cgit 


From 5a8ce240d4d302d27a58fd34499b2404b3a8df4f Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sun, 17 Mar 2019 18:01:06 +0800
Subject: block: cleanup bio_add_pc_page

REQ_PC is out of date, so replace it with passthrough IO.

Also remove the local variable of 'prev' since we can reuse
the top local variable of 'bvec'.

No function change.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index b2423e7aae08..cbd202250a32 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -648,7 +648,7 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 EXPORT_SYMBOL(bio_clone_fast);
 
 /**
- *	bio_add_pc_page	-	attempt to add page to bio
+ *	bio_add_pc_page	-	attempt to add page to passthrough bio
  *	@q: the target queue
  *	@bio: destination bio
  *	@page: page to add
@@ -660,7 +660,7 @@ EXPORT_SYMBOL(bio_clone_fast);
  *	limitations. The target block device must allow bio's up to PAGE_SIZE,
  *	so it is always possible to add a single page to an empty bio.
  *
- *	This should only be used by REQ_PC bios.
+ *	This should only be used by passthrough bios.
  */
 int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 		    *page, unsigned int len, unsigned int offset)
@@ -683,11 +683,11 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 	 * a consecutive offset.  Optimize this special case.
 	 */
 	if (bio->bi_vcnt > 0) {
-		struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
+		bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-		if (page == prev->bv_page &&
-		    offset == prev->bv_offset + prev->bv_len) {
-			prev->bv_len += len;
+		if (page == bvec->bv_page &&
+		    offset == bvec->bv_offset + bvec->bv_len) {
+			bvec->bv_len += len;
 			bio->bi_iter.bi_size += len;
 			goto done;
 		}
@@ -696,7 +696,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
-		if (bvec_gap_to_prev(q, prev, offset))
+		if (bvec_gap_to_prev(q, bvec, offset))
 			return 0;
 	}
 
-- 
cgit 


From 5919482e222908d40279a616b1fe6400549e32b4 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sun, 17 Mar 2019 18:01:07 +0800
Subject: block: check if page is mergeable in one helper

Now the check for deciding if one page is mergeable to current bvec
becomes a bit complicated, and we need to reuse the code before
adding pc page.

So move the check in one dedicated helper.

No function change.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 38 ++++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 14 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index cbd202250a32..7ab7060a0e6c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -647,6 +647,24 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
+static inline bool page_is_mergeable(const struct bio_vec *bv,
+		struct page *page, unsigned int len, unsigned int off,
+		bool same_page)
+{
+	phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
+		bv->bv_offset + bv->bv_len - 1;
+	phys_addr_t page_addr = page_to_phys(page);
+
+	if (vec_end_addr + 1 != page_addr + off)
+		return false;
+	if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
+		return false;
+	if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
+		return false;
+
+	return true;
+}
+
 /**
  *	bio_add_pc_page	-	attempt to add page to passthrough bio
  *	@q: the target queue
@@ -770,20 +788,12 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
-		phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
-			bv->bv_offset + bv->bv_len - 1;
-		phys_addr_t page_addr = page_to_phys(page);
-
-		if (vec_end_addr + 1 != page_addr + off)
-			return false;
-		if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
-			return false;
-		if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
-			return false;
-
-		bv->bv_len += len;
-		bio->bi_iter.bi_size += len;
-		return true;
+
+		if (page_is_mergeable(bv, page, len, off, same_page)) {
+			bv->bv_len += len;
+			bio->bi_iter.bi_size += len;
+			return true;
+		}
 	}
 	return false;
 }
-- 
cgit 


From 190470871ae28da7bdb3909f6124385c8472fc97 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sun, 17 Mar 2019 18:01:08 +0800
Subject: block: put the same page when adding it to bio

When the added page is merged to last same page in bio_add_pc_page(),
the user may need to put this page for avoiding page leak.

bio_map_user_iov() needs this kind of handling, and now it deals with
it by itself in hack style.

Moves the handling of put page into __bio_add_pc_page(), so
bio_map_user_iov() may be simplified a bit, and maybe more users
can benefit from this change.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index 7ab7060a0e6c..26853e072cd7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -666,12 +666,13 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 }
 
 /**
- *	bio_add_pc_page	-	attempt to add page to passthrough bio
+ *	__bio_add_pc_page	- attempt to add page to passthrough bio
  *	@q: the target queue
  *	@bio: destination bio
  *	@page: page to add
  *	@len: vec entry length
  *	@offset: vec entry offset
+ *	@put_same_page: put the page if it is same with last added page
  *
  *	Attempt to add a page to the bio_vec maplist. This can fail for a
  *	number of reasons, such as the bio being full or target block device
@@ -680,8 +681,9 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
  *
  *	This should only be used by passthrough bios.
  */
-int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
-		    *page, unsigned int len, unsigned int offset)
+int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
+		struct page *page, unsigned int len, unsigned int offset,
+		bool put_same_page)
 {
 	int retried_segments = 0;
 	struct bio_vec *bvec;
@@ -705,6 +707,8 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 
 		if (page == bvec->bv_page &&
 		    offset == bvec->bv_offset + bvec->bv_len) {
+			if (put_same_page)
+				put_page(page);
 			bvec->bv_len += len;
 			bio->bi_iter.bi_size += len;
 			goto done;
@@ -763,6 +767,13 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 	blk_recount_segments(q, bio);
 	return 0;
 }
+EXPORT_SYMBOL(__bio_add_pc_page);
+
+int bio_add_pc_page(struct request_queue *q, struct bio *bio,
+		struct page *page, unsigned int len, unsigned int offset)
+{
+	return __bio_add_pc_page(q, bio, page, len, offset, false);
+}
 EXPORT_SYMBOL(bio_add_pc_page);
 
 /**
@@ -1397,21 +1408,14 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 			for (j = 0; j < npages; j++) {
 				struct page *page = pages[j];
 				unsigned int n = PAGE_SIZE - offs;
-				unsigned short prev_bi_vcnt = bio->bi_vcnt;
 
 				if (n > bytes)
 					n = bytes;
 
-				if (!bio_add_pc_page(q, bio, page, n, offs))
+				if (!__bio_add_pc_page(q, bio, page, n, offs,
+							true))
 					break;
 
-				/*
-				 * check if vector was merged with previous
-				 * drop page reference if needed
-				 */
-				if (bio->bi_vcnt == prev_bi_vcnt)
-					put_page(page);
-
 				added += n;
 				bytes -= n;
 				offs = 0;
-- 
cgit 


From 489fbbcb51d0249569d863f9220de69cb31f1922 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Fri, 29 Mar 2019 15:08:00 +0800
Subject: block: enable multi-page bvec for passthrough IO

Now block IO stack is basically ready for supporting multi-page bvec,
however it isn't enabled on passthrough IO.

One reason is that passthrough IO is dispatched to LLD directly and bio
split is bypassed, so the bio has to be built correctly for dispatch to
LLD from the beginning.

Implement multi-page support for passthrough IO by limitting each bvec
as block device's segment and applying all kinds of queue limit in
blk_add_pc_page(). Then we don't need to calculate segments any more for
passthrough IO any more, turns out code is simplified much.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 60 +++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 29 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index 26853e072cd7..8d516d508ae3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -665,6 +665,27 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 	return true;
 }
 
+/*
+ * Check if the @page can be added to the current segment(@bv), and make
+ * sure to call it only if page_is_mergeable(@bv, @page) is true
+ */
+static bool can_add_page_to_seg(struct request_queue *q,
+		struct bio_vec *bv, struct page *page, unsigned len,
+		unsigned offset)
+{
+	unsigned long mask = queue_segment_boundary(q);
+	phys_addr_t addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
+	phys_addr_t addr2 = page_to_phys(page) + offset + len - 1;
+
+	if ((addr1 | mask) != (addr2 | mask))
+		return false;
+
+	if (bv->bv_len + len > queue_max_segment_size(q))
+		return false;
+
+	return true;
+}
+
 /**
  *	__bio_add_pc_page	- attempt to add page to passthrough bio
  *	@q: the target queue
@@ -685,7 +706,6 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		bool put_same_page)
 {
-	int retried_segments = 0;
 	struct bio_vec *bvec;
 
 	/*
@@ -709,6 +729,7 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		    offset == bvec->bv_offset + bvec->bv_len) {
 			if (put_same_page)
 				put_page(page);
+ bvec_merge:
 			bvec->bv_len += len;
 			bio->bi_iter.bi_size += len;
 			goto done;
@@ -720,11 +741,18 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		 */
 		if (bvec_gap_to_prev(q, bvec, offset))
 			return 0;
+
+		if (page_is_mergeable(bvec, page, len, offset, false) &&
+				can_add_page_to_seg(q, bvec, page, len, offset))
+			goto bvec_merge;
 	}
 
 	if (bio_full(bio))
 		return 0;
 
+	if (bio->bi_phys_segments >= queue_max_segments(q))
+		return 0;
+
 	/*
 	 * setup the new entry, we might clear it again later if we
 	 * cannot add the page
@@ -734,38 +762,12 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	bvec->bv_len = len;
 	bvec->bv_offset = offset;
 	bio->bi_vcnt++;
-	bio->bi_phys_segments++;
 	bio->bi_iter.bi_size += len;
 
-	/*
-	 * Perform a recount if the number of segments is greater
-	 * than queue_max_segments(q).
-	 */
-
-	while (bio->bi_phys_segments > queue_max_segments(q)) {
-
-		if (retried_segments)
-			goto failed;
-
-		retried_segments = 1;
-		blk_recount_segments(q, bio);
-	}
-
-	/* If we may be able to merge these biovecs, force a recount */
-	if (bio->bi_vcnt > 1 && biovec_phys_mergeable(q, bvec - 1, bvec))
-		bio_clear_flag(bio, BIO_SEG_VALID);
-
  done:
+	bio->bi_phys_segments = bio->bi_vcnt;
+	bio_set_flag(bio, BIO_SEG_VALID);
 	return len;
-
- failed:
-	bvec->bv_page = NULL;
-	bvec->bv_len = 0;
-	bvec->bv_offset = 0;
-	bio->bi_vcnt--;
-	bio->bi_iter.bi_size -= len;
-	blk_recount_segments(q, bio);
-	return 0;
 }
 EXPORT_SYMBOL(__bio_add_pc_page);
 
-- 
cgit 


From cae6c2e54cc10514fec26e333f63c5cded9d2383 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sun, 17 Mar 2019 18:01:10 +0800
Subject: block: remove argument of 'request_queue' from __blk_bvec_map_sg

The argument of 'request_queue' isn't used by __blk_bvec_map_sg(),
so remove it.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(limited to 'block')

diff --git a/block/blk-merge.c b/block/blk-merge.c
index aa9164eb7187..9ec704bb58ec 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -520,7 +520,7 @@ new_segment:
 	*bvprv = *bvec;
 }
 
-static inline int __blk_bvec_map_sg(struct request_queue *q, struct bio_vec bv,
+static inline int __blk_bvec_map_sg(struct bio_vec bv,
 		struct scatterlist *sglist, struct scatterlist **sg)
 {
 	*sg = sglist;
@@ -555,9 +555,9 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 	int nsegs = 0;
 
 	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
-		nsegs = __blk_bvec_map_sg(q, rq->special_vec, sglist, &sg);
+		nsegs = __blk_bvec_map_sg(rq->special_vec, sglist, &sg);
 	else if (rq->bio && bio_op(rq->bio) == REQ_OP_WRITE_SAME)
-		nsegs = __blk_bvec_map_sg(q, bio_iovec(rq->bio), sglist, &sg);
+		nsegs = __blk_bvec_map_sg(bio_iovec(rq->bio), sglist, &sg);
 	else if (rq->bio)
 		nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg);
 
-- 
cgit 


From 16e3e4187758d8936d358b26149de785b7d5a9b7 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sun, 17 Mar 2019 18:01:11 +0800
Subject: block: reuse __blk_bvec_map_sg() for mapping page sized bvec

Inside __blk_segment_map_sg(), page sized bvec mapping is optimized
a bit with one standalone branch.

So reuse __blk_bvec_map_sg() to do that.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

(limited to 'block')

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 9ec704bb58ec..3e934ee9a907 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -493,6 +493,14 @@ static unsigned blk_bvec_map_sg(struct request_queue *q,
 	return nsegs;
 }
 
+static inline int __blk_bvec_map_sg(struct bio_vec bv,
+		struct scatterlist *sglist, struct scatterlist **sg)
+{
+	*sg = blk_next_sg(sg, sglist);
+	sg_set_page(*sg, bv.bv_page, bv.bv_len, bv.bv_offset);
+	return 1;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 		     struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -511,23 +519,13 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 	} else {
 new_segment:
 		if (bvec->bv_offset + bvec->bv_len <= PAGE_SIZE) {
-			*sg = blk_next_sg(sg, sglist);
-			sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-			(*nsegs) += 1;
+			(*nsegs) += __blk_bvec_map_sg(*bvec, sglist, sg);
 		} else
 			(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
 	}
 	*bvprv = *bvec;
 }
 
-static inline int __blk_bvec_map_sg(struct bio_vec bv,
-		struct scatterlist *sglist, struct scatterlist **sg)
-{
-	*sg = sglist;
-	sg_set_page(*sg, bv.bv_page, bv.bv_len, bv.bv_offset);
-	return 1;
-}
-
 static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 			     struct scatterlist *sglist,
 			     struct scatterlist **sg)
-- 
cgit 


From f6970f83ef79503cb24ca8324e5cfa1188674f85 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Sun, 17 Mar 2019 18:01:12 +0800
Subject: block: don't check if adjacent bvecs in one bio can be mergeable

Now both passthrough and FS IO have supported multi-page bvec, and
bvec merging has been handled actually when adding page to bio, then
adjacent bvecs won't be mergeable any more if they belong to same bio.

So only try to merge bvecs if they are from different bios.

Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c | 69 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 42 insertions(+), 27 deletions(-)

(limited to 'block')

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3e934ee9a907..8f96d683b577 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -354,11 +354,11 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio)
 {
 	struct bio_vec bv, bvprv = { NULL };
-	int prev = 0;
 	unsigned int seg_size, nr_phys_segs;
 	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
+	bool new_bio = false;
 
 	if (!bio)
 		return 0;
@@ -379,7 +379,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	nr_phys_segs = 0;
 	for_each_bio(bio) {
 		bio_for_each_bvec(bv, bio, iter) {
-			if (prev) {
+			if (new_bio) {
 				if (seg_size + bv.bv_len
 				    > queue_max_segment_size(q))
 					goto new_segment;
@@ -387,7 +387,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					goto new_segment;
 
 				seg_size += bv.bv_len;
-				bvprv = bv;
 
 				if (nr_phys_segs == 1 && seg_size >
 						front_seg_size)
@@ -396,12 +395,13 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 				continue;
 			}
 new_segment:
-			bvprv = bv;
-			prev = 1;
 			bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
 					&front_seg_size, NULL, UINT_MAX);
+			new_bio = false;
 		}
 		bbio = bio;
+		bvprv = bv;
+		new_bio = true;
 	}
 
 	fbio->bi_seg_front_size = front_seg_size;
@@ -501,29 +501,26 @@ static inline int __blk_bvec_map_sg(struct bio_vec bv,
 	return 1;
 }
 
-static inline void
-__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
-		     struct scatterlist *sglist, struct bio_vec *bvprv,
-		     struct scatterlist **sg, int *nsegs)
+/* only try to merge bvecs into one sg if they are from two bios */
+static inline bool
+__blk_segment_map_sg_merge(struct request_queue *q, struct bio_vec *bvec,
+			   struct bio_vec *bvprv, struct scatterlist **sg)
 {
 
 	int nbytes = bvec->bv_len;
 
-	if (*sg) {
-		if ((*sg)->length + nbytes > queue_max_segment_size(q))
-			goto new_segment;
-		if (!biovec_phys_mergeable(q, bvprv, bvec))
-			goto new_segment;
+	if (!*sg)
+		return false;
 
-		(*sg)->length += nbytes;
-	} else {
-new_segment:
-		if (bvec->bv_offset + bvec->bv_len <= PAGE_SIZE) {
-			(*nsegs) += __blk_bvec_map_sg(*bvec, sglist, sg);
-		} else
-			(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
-	}
-	*bvprv = *bvec;
+	if ((*sg)->length + nbytes > queue_max_segment_size(q))
+		return false;
+
+	if (!biovec_phys_mergeable(q, bvprv, bvec))
+		return false;
+
+	(*sg)->length += nbytes;
+
+	return true;
 }
 
 static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
@@ -533,11 +530,29 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 	struct bio_vec bvec, bvprv = { NULL };
 	struct bvec_iter iter;
 	int nsegs = 0;
+	bool new_bio = false;
 
-	for_each_bio(bio)
-		bio_for_each_bvec(bvec, bio, iter)
-			__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
-					     &nsegs);
+	for_each_bio(bio) {
+		bio_for_each_bvec(bvec, bio, iter) {
+			/*
+			 * Only try to merge bvecs from two bios given we
+			 * have done bio internal merge when adding pages
+			 * to bio
+			 */
+			if (new_bio &&
+			    __blk_segment_map_sg_merge(q, &bvec, &bvprv, sg))
+				goto next_bvec;
+
+			if (bvec.bv_offset + bvec.bv_len <= PAGE_SIZE)
+				nsegs += __blk_bvec_map_sg(bvec, sglist, sg);
+			else
+				nsegs += blk_bvec_map_sg(q, &bvec, sglist, sg);
+ next_bvec:
+			new_bio = false;
+		}
+		bvprv = bvec;
+		new_bio = true;
+	}
 
 	return nsegs;
 }
-- 
cgit 


From 2b24e6f63ac9e817630424c6d8f008256348dfc4 Mon Sep 17 00:00:00 2001
From: Johannes Thumshirn <jthumshirn@suse.de>
Date: Wed, 3 Apr 2019 11:15:19 +0200
Subject: block: bio: ensure newly added bio flags don't override BVEC_POOL_IDX

With the introduction of BIO_NO_PAGE_REF we've used up all available bits
in bio::bi_flags.

Convert the defines of the flags to an enum and add a BUILD_BUG_ON() call
to make sure no-one adds a new one and thus overrides the BVEC_POOL_IDX
causing crashes.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index 8d516d508ae3..c2592c5d70b9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2218,6 +2218,9 @@ static int __init init_bio(void)
 	bio_slab_nr = 0;
 	bio_slabs = kcalloc(bio_slab_max, sizeof(struct bio_slab),
 			    GFP_KERNEL);
+
+	BUILD_BUG_ON(BIO_FLAG_LAST > BVEC_POOL_OFFSET);
+
 	if (!bio_slabs)
 		panic("bio: can't allocate bios\n");
 
-- 
cgit 


From 72deb455b5ec619ff043c30bc90025aa3de3cdda Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 5 Apr 2019 18:08:59 +0200
Subject: block: remove CONFIG_LBDAF

Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures.  These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time.  Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/Kconfig | 24 ------------------------
 1 file changed, 24 deletions(-)

(limited to 'block')

diff --git a/block/Kconfig b/block/Kconfig
index 028bc085dac8..1b220101a9cb 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -26,30 +26,6 @@ menuconfig BLOCK
 
 if BLOCK
 
-config LBDAF
-	bool "Support for large (2TB+) block devices and files"
-	depends on !64BIT
-	default y
-	help
-	  Enable block devices or files of size 2TB and larger.
-
-	  This option is required to support the full capacity of large
-	  (2TB+) block devices, including RAID, disk, Network Block Device,
-	  Logical Volume Manager (LVM) and loopback.
-	
-	  This option also enables support for single files larger than
-	  2TB.
-
-	  The ext4 filesystem requires that this feature be enabled in
-	  order to support filesystems that have the huge_file feature
-	  enabled.  Otherwise, it will refuse to mount in the read-write
-	  mode any filesystems that use the huge_file feature, which is
-	  enabled by default by mke2fs.ext4.
-
-	  The GFS2 filesystem also requires this feature.
-
-	  If unsure, say Y.
-
 config BLK_SCSI_REQUEST
 	bool
 
-- 
cgit 


From 78bf47353b0041865564deeed257a54f047c2fdc Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:15:53 +0100
Subject: block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR

The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value
opal_mbr_data.enable_disable incorrectly: enable_disable is expected
to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable
was passed directly to set_mbr_done and set_mbr_enable_disable where
is was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end
result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE
actually disabled the shadow MBR and vice versa.

This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to
OPAL_FALSE/TRUE. The change affects existing programs using
IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when
setting up an Opal drive.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index e0de4dd448b3..119640897293 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -2095,13 +2095,16 @@ static int opal_erase_locking_range(struct opal_dev *dev,
 static int opal_enable_disable_shadow_mbr(struct opal_dev *dev,
 					  struct opal_mbr_data *opal_mbr)
 {
+	u8 enable_disable = opal_mbr->enable_disable == OPAL_MBR_ENABLE ?
+		OPAL_TRUE : OPAL_FALSE;
+
 	const struct opal_step mbr_steps[] = {
 		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, &opal_mbr->key },
-		{ set_mbr_done, &opal_mbr->enable_disable },
+		{ set_mbr_done, &enable_disable },
 		{ end_opal_session, },
 		{ start_admin1LSP_opal_session, &opal_mbr->key },
-		{ set_mbr_enable_disable, &opal_mbr->enable_disable },
+		{ set_mbr_enable_disable, &enable_disable },
 		{ end_opal_session, },
 		{ NULL, }
 	};
@@ -2221,7 +2224,7 @@ static int __opal_lock_unlock(struct opal_dev *dev,
 
 static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key)
 {
-	u8 mbr_done_tf = 1;
+	u8 mbr_done_tf = OPAL_TRUE;
 	const struct opal_step mbrdone_step [] = {
 		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, key },
-- 
cgit 


From 1e815b33c5ccd3936b71292b5ffb84e97e1df9e0 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:15:54 +0100
Subject: block: sed-opal: fix typos and formatting

This should make no change in functionality.
The formatting changes were triggered by checkpatch.pl.

Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 119640897293..d12a910e06cb 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -157,7 +157,7 @@ static const u8 opaluid[][OPAL_UID_LENGTH] = {
 
 	/* C_PIN_TABLE object ID's */
 
-        [OPAL_C_PIN_MSID] =
+	[OPAL_C_PIN_MSID] =
 		{ 0x00, 0x00, 0x00, 0x0B, 0x00, 0x00, 0x84, 0x02},
 	[OPAL_C_PIN_SID] =
 		{ 0x00, 0x00, 0x00, 0x0B, 0x00, 0x00, 0x00, 0x01},
@@ -551,7 +551,6 @@ static void add_medium_atom_header(struct opal_dev *cmd, bool bytestring,
 
 static void add_token_u64(int *err, struct opal_dev *cmd, u64 number)
 {
-
 	size_t len;
 	int msb;
 
@@ -623,7 +622,7 @@ static int build_locking_range(u8 *buffer, size_t length, u8 lr)
 static int build_locking_user(u8 *buffer, size_t length, u8 lr)
 {
 	if (length > OPAL_UID_LENGTH) {
-		pr_debug("Can't build locking range user, Length OOB\n");
+		pr_debug("Can't build locking range user. Length OOB\n");
 		return -ERANGE;
 	}
 
@@ -1324,6 +1323,7 @@ static int start_SIDASP_opal_session(struct opal_dev *dev, void *data)
 
 	if (!key) {
 		const struct opal_key *okey = data;
+
 		ret = start_generic_opal_session(dev, OPAL_SID_UID,
 						 OPAL_ADMINSP_UID,
 						 okey->key,
@@ -1341,6 +1341,7 @@ static int start_SIDASP_opal_session(struct opal_dev *dev, void *data)
 static int start_admin1LSP_opal_session(struct opal_dev *dev, void *data)
 {
 	struct opal_key *key = data;
+
 	return start_generic_opal_session(dev, OPAL_ADMIN1_UID,
 					  OPAL_LOCKINGSP_UID,
 					  key->key, key->key_len);
@@ -1714,7 +1715,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
 		write_locked = 0;
 		break;
 	case OPAL_LK:
-		/* vars are initalized to locked */
+		/* vars are initialized to locked */
 		break;
 	default:
 		pr_debug("Tried to set an invalid locking state... returning to uland\n");
@@ -1775,7 +1776,7 @@ static int lock_unlock_locking_range_sum(struct opal_dev *dev, void *data)
 		write_locked = 0;
 		break;
 	case OPAL_LK:
-		/* vars are initalized to locked */
+		/* vars are initialized to locked */
 		break;
 	default:
 		pr_debug("Tried to set an invalid locking state.\n");
@@ -1854,7 +1855,7 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev)
 		return error;
 
 	lc_status = response_get_u64(&dev->parsed, 4);
-	/* 0x08 is Manufacured Inactive */
+	/* 0x08 is Manufactured Inactive */
 	/* 0x09 is Manufactured */
 	if (lc_status != OPAL_MANUFACTURED_INACTIVE) {
 		pr_debug("Couldn't determine the status of the Lifecycle state\n");
@@ -2225,7 +2226,7 @@ static int __opal_lock_unlock(struct opal_dev *dev,
 static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key)
 {
 	u8 mbr_done_tf = OPAL_TRUE;
-	const struct opal_step mbrdone_step [] = {
+	const struct opal_step mbrdone_step[] = {
 		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, key },
 		{ set_mbr_done, &mbr_done_tf },
@@ -2276,7 +2277,8 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal)
 	return ret;
 }
 
-static int opal_activate_lsp(struct opal_dev *dev, struct opal_lr_act *opal_lr_act)
+static int opal_activate_lsp(struct opal_dev *dev,
+			     struct opal_lr_act *opal_lr_act)
 {
 	const struct opal_step active_steps[] = {
 		{ opal_discovery0, },
-- 
cgit 


From 1b6b75b0137fd2b5af533eceba5f5db62b1c45b0 Mon Sep 17 00:00:00 2001
From: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Date: Thu, 14 Feb 2019 01:15:55 +0100
Subject: block: sed-opal: use correct macro for method length

Also the values of OPAL_UID_LENGTH and OPAL_METHOD_LENGTH are the same,
it is weird to use OPAL_UID_LENGTH for the definition of the methods.

Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index d12a910e06cb..e59ae364f1ef 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -181,7 +181,7 @@ static const u8 opaluid[][OPAL_UID_LENGTH] = {
  * Derived from: TCG_Storage_Architecture_Core_Spec_v2.01_r1.00
  * Section: 6.3 Assigned UIDs
  */
-static const u8 opalmethod[][OPAL_UID_LENGTH] = {
+static const u8 opalmethod[][OPAL_METHOD_LENGTH] = {
 	[OPAL_PROPERTIES] =
 		{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x01 },
 	[OPAL_STARTSESSION] =
-- 
cgit 


From e2821a50b17c1b760e7d597777de61241f22fd55 Mon Sep 17 00:00:00 2001
From: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Date: Thu, 14 Feb 2019 01:15:56 +0100
Subject: block: sed-opal: unify space check in add_token_*

All add_token_* functions have a common set of conditions that have to
be checked. Use a common function for those checks in order to avoid
different behaviour as well as code duplication.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index e59ae364f1ef..d285bd4b2b9b 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -510,15 +510,24 @@ static int opal_discovery0(struct opal_dev *dev, void *data)
 	return opal_discovery0_end(dev);
 }
 
-static void add_token_u8(int *err, struct opal_dev *cmd, u8 tok)
+static bool can_add(int *err, struct opal_dev *cmd, size_t len)
 {
 	if (*err)
-		return;
-	if (cmd->pos >= IO_BUFFER_LENGTH - 1) {
-		pr_debug("Error adding u8: end of buffer.\n");
+		return false;
+
+	if (len > IO_BUFFER_LENGTH || cmd->pos > IO_BUFFER_LENGTH - len) {
+		pr_debug("Error adding %zu bytes: end of buffer.\n", len);
 		*err = -ERANGE;
-		return;
+		return false;
 	}
+
+	return true;
+}
+
+static void add_token_u8(int *err, struct opal_dev *cmd, u8 tok)
+{
+	if (!can_add(err, cmd, 1))
+		return;
 	cmd->cmd[cmd->pos++] = tok;
 }
 
@@ -562,9 +571,8 @@ static void add_token_u64(int *err, struct opal_dev *cmd, u64 number)
 	msb = fls64(number);
 	len = DIV_ROUND_UP(msb, 8);
 
-	if (cmd->pos >= IO_BUFFER_LENGTH - len - 1) {
+	if (!can_add(err, cmd, len + 1)) {
 		pr_debug("Error adding u64: end of buffer.\n");
-		*err = -ERANGE;
 		return;
 	}
 	add_short_atom_header(cmd, false, false, len);
@@ -586,9 +594,8 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd,
 		is_short_atom = false;
 	}
 
-	if (len >= IO_BUFFER_LENGTH - cmd->pos - header_len) {
+	if (!can_add(err, cmd, header_len + len)) {
 		pr_debug("Error adding bytestring: end of buffer.\n");
-		*err = -ERANGE;
 		return;
 	}
 
-- 
cgit 


From 78d584ca31efbadfd8db105dec09c362d75b97b9 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:15:57 +0100
Subject: block: sed-opal: close parameter list in cmd_finalize

Every step ends by calling cmd_finalize (via finalize_and_send)
yet every step adds the token OPAL_ENDLIST on its own. Moving
this into cmd_finalize decreases code duplication.

Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 25 +++----------------------
 1 file changed, 3 insertions(+), 22 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index d285bd4b2b9b..c5dff6199bd6 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -655,6 +655,9 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn)
 	struct opal_header *hdr;
 	int err = 0;
 
+	/* close parameter list */
+	add_token_u8(&err, cmd, OPAL_ENDLIST);
+
 	add_token_u8(&err, cmd, OPAL_ENDOFDATA);
 	add_token_u8(&err, cmd, OPAL_STARTLIST);
 	add_token_u8(&err, cmd, 0);
@@ -1073,7 +1076,6 @@ static int gen_key(struct opal_dev *dev, void *data)
 	add_token_bytestring(&err, dev, opalmethod[OPAL_GENKEY],
 			     OPAL_UID_LENGTH);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
 		pr_debug("Error building gen key command\n");
@@ -1136,7 +1138,6 @@ static int get_active_key(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, 10); /* ActiveKey */
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 	if (err) {
 		pr_debug("Error building get active key command\n");
 		return err;
@@ -1182,7 +1183,6 @@ static int generic_lr_enable_disable(struct opal_dev *dev,
 
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 	return err;
 }
 
@@ -1248,8 +1248,6 @@ static int setup_locking_range(struct opal_dev *dev, void *data)
 
 		add_token_u8(&err, dev, OPAL_ENDLIST);
 		add_token_u8(&err, dev, OPAL_ENDNAME);
-		add_token_u8(&err, dev, OPAL_ENDLIST);
-
 	}
 	if (err) {
 		pr_debug("Error building Setup Locking range command.\n");
@@ -1289,7 +1287,6 @@ static int start_generic_opal_session(struct opal_dev *dev,
 
 	switch (auth) {
 	case OPAL_ANYBODY_UID:
-		add_token_u8(&err, dev, OPAL_ENDLIST);
 		break;
 	case OPAL_ADMIN1_UID:
 	case OPAL_SID_UID:
@@ -1302,7 +1299,6 @@ static int start_generic_opal_session(struct opal_dev *dev,
 		add_token_bytestring(&err, dev, opaluid[auth],
 				     OPAL_UID_LENGTH);
 		add_token_u8(&err, dev, OPAL_ENDNAME);
-		add_token_u8(&err, dev, OPAL_ENDLIST);
 		break;
 	default:
 		pr_debug("Cannot start Admin SP session with auth %d\n", auth);
@@ -1400,7 +1396,6 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, 3);
 	add_token_bytestring(&err, dev, lk_ul_user, OPAL_UID_LENGTH);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
 		pr_debug("Error building STARTSESSION command.\n");
@@ -1423,7 +1418,6 @@ static int revert_tper(struct opal_dev *dev, void *data)
 	add_token_bytestring(&err, dev, opalmethod[OPAL_REVERT],
 			     OPAL_UID_LENGTH);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 	if (err) {
 		pr_debug("Error building REVERT TPER command.\n");
 		return err;
@@ -1457,7 +1451,6 @@ static int internal_activate_user(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
 		pr_debug("Error building Activate UserN command.\n");
@@ -1484,7 +1477,6 @@ static int erase_locking_range(struct opal_dev *dev, void *data)
 	add_token_bytestring(&err, dev, opalmethod[OPAL_ERASE],
 			     OPAL_UID_LENGTH);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
 		pr_debug("Error building Erase Locking Range Command.\n");
@@ -1515,7 +1507,6 @@ static int set_mbr_done(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
 		pr_debug("Error Building set MBR Done command\n");
@@ -1547,7 +1538,6 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
 		pr_debug("Error Building set MBR done command\n");
@@ -1579,7 +1569,6 @@ static int generic_pw_cmd(u8 *key, size_t key_len, u8 *cpin_uid,
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	return err;
 }
@@ -1688,7 +1677,6 @@ static int add_user_to_lr(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
 		pr_debug("Error building add user to locking range command.\n");
@@ -1749,7 +1737,6 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
 
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
 		pr_debug("Error building SET command.\n");
@@ -1837,11 +1824,8 @@ static int activate_lsp(struct opal_dev *dev, void *data)
 		}
 		add_token_u8(&err, dev, OPAL_ENDLIST);
 		add_token_u8(&err, dev, OPAL_ENDNAME);
-		add_token_u8(&err, dev, OPAL_ENDLIST);
-
 	} else {
 		add_token_u8(&err, dev, OPAL_STARTLIST);
-		add_token_u8(&err, dev, OPAL_ENDLIST);
 	}
 
 	if (err) {
@@ -1898,7 +1882,6 @@ static int get_lsp_lifecycle(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, 6); /* Lifecycle Column */
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
@@ -1958,8 +1941,6 @@ static int get_msid_cpin_pin(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, 4); /* End Column */
 	add_token_u8(&err, dev, 3); /* Lifecycle Column */
 	add_token_u8(&err, dev, OPAL_ENDNAME);
-
-	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-- 
cgit 


From e8b2922459cf15140ab8cc1f92b6861674fff1a3 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:15:58 +0100
Subject: block: sed-opal: unify cmd start

Every step starts with resetting the cmd buffer as well as the comid and
constructs the appropriate OPAL_CALL command. Consequently, those
actions may be combined into one generic function. On should take care
that the opening and closing tokens for the argument list are already
emitted by cmd_start and cmd_finalize respectively and thus must not be
additionally added.

Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 228 +++++++++++++++++--------------------------------------
 1 file changed, 69 insertions(+), 159 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index c5dff6199bd6..0348fb896a5d 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -655,7 +655,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn)
 	struct opal_header *hdr;
 	int err = 0;
 
-	/* close parameter list */
+	/* close the parameter list opened from cmd_start */
 	add_token_u8(&err, cmd, OPAL_ENDLIST);
 
 	add_token_u8(&err, cmd, OPAL_ENDOFDATA);
@@ -1000,6 +1000,27 @@ static void clear_opal_cmd(struct opal_dev *dev)
 	memset(dev->cmd, 0, IO_BUFFER_LENGTH);
 }
 
+static int cmd_start(struct opal_dev *dev, const u8 *uid, const u8 *method)
+{
+	int err = 0;
+
+	clear_opal_cmd(dev);
+	set_comid(dev, dev->comid);
+
+	add_token_u8(&err, dev, OPAL_CALL);
+	add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH);
+	add_token_bytestring(&err, dev, method, OPAL_METHOD_LENGTH);
+
+	/*
+	 * Every method call is followed by its parameters enclosed within
+	 * OPAL_STARTLIST and OPAL_ENDLIST tokens. We automatically open the
+	 * parameter list here and close it later in cmd_finalize.
+	 */
+	add_token_u8(&err, dev, OPAL_STARTLIST);
+
+	return err;
+}
+
 static int start_opal_session_cont(struct opal_dev *dev)
 {
 	u32 hsn, tsn;
@@ -1062,20 +1083,13 @@ static int finalize_and_send(struct opal_dev *dev, cont_fn cont)
 static int gen_key(struct opal_dev *dev, void *data)
 {
 	u8 uid[OPAL_UID_LENGTH];
-	int err = 0;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	int err;
 
 	memcpy(uid, dev->prev_data, min(sizeof(uid), dev->prev_d_len));
 	kfree(dev->prev_data);
 	dev->prev_data = NULL;
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_GENKEY],
-			     OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
+	err = cmd_start(dev, uid, opalmethod[OPAL_GENKEY]);
 
 	if (err) {
 		pr_debug("Error building gen key command\n");
@@ -1113,21 +1127,14 @@ static int get_active_key_cont(struct opal_dev *dev)
 static int get_active_key(struct opal_dev *dev, void *data)
 {
 	u8 uid[OPAL_UID_LENGTH];
-	int err = 0;
+	int err;
 	u8 *lr = data;
 
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
-
 	err = build_locking_range(uid, sizeof(uid), *lr);
 	if (err)
 		return err;
 
-	err = 0;
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_GET], OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
+	err = cmd_start(dev, uid, opalmethod[OPAL_GET]);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, 3); /* startCloumn */
@@ -1150,13 +1157,10 @@ static int generic_lr_enable_disable(struct opal_dev *dev,
 				     u8 *uid, bool rle, bool wle,
 				     bool rl, bool wl)
 {
-	int err = 0;
+	int err;
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH);
+	err = cmd_start(dev, uid, opalmethod[OPAL_SET]);
 
-	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
@@ -1203,10 +1207,7 @@ static int setup_locking_range(struct opal_dev *dev, void *data)
 	u8 uid[OPAL_UID_LENGTH];
 	struct opal_user_lr_setup *setup = data;
 	u8 lr;
-	int err = 0;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	int err;
 
 	lr = setup->session.opal_key.lr;
 	err = build_locking_range(uid, sizeof(uid), lr);
@@ -1216,12 +1217,8 @@ static int setup_locking_range(struct opal_dev *dev, void *data)
 	if (lr == 0)
 		err = enable_global_lr(dev, uid, setup);
 	else {
-		add_token_u8(&err, dev, OPAL_CALL);
-		add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH);
-		add_token_bytestring(&err, dev, opalmethod[OPAL_SET],
-				     OPAL_UID_LENGTH);
+		err = cmd_start(dev, uid, opalmethod[OPAL_SET]);
 
-		add_token_u8(&err, dev, OPAL_STARTLIST);
 		add_token_u8(&err, dev, OPAL_STARTNAME);
 		add_token_u8(&err, dev, OPAL_VALUES);
 		add_token_u8(&err, dev, OPAL_STARTLIST);
@@ -1265,22 +1262,15 @@ static int start_generic_opal_session(struct opal_dev *dev,
 				      u8 key_len)
 {
 	u32 hsn;
-	int err = 0;
+	int err;
 
 	if (key == NULL && auth != OPAL_ANYBODY_UID)
 		return OPAL_INVAL_PARAM;
 
-	clear_opal_cmd(dev);
-
-	set_comid(dev, dev->comid);
 	hsn = GENERIC_HOST_SESSION_NUM;
+	err = cmd_start(dev, opaluid[OPAL_SMUID_UID],
+			opalmethod[OPAL_STARTSESSION]);
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, opaluid[OPAL_SMUID_UID],
-			     OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_STARTSESSION],
-			     OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u64(&err, dev, hsn);
 	add_token_bytestring(&err, dev, opaluid[sp_type], OPAL_UID_LENGTH);
 	add_token_u8(&err, dev, 1);
@@ -1360,30 +1350,21 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data)
 	u8 *key = session->opal_key.key;
 	u32 hsn = GENERIC_HOST_SESSION_NUM;
 
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
-
-	if (session->sum) {
+	if (session->sum)
 		err = build_locking_user(lk_ul_user, sizeof(lk_ul_user),
 					 session->opal_key.lr);
-		if (err)
-			return err;
-
-	} else if (session->who != OPAL_ADMIN1 && !session->sum) {
+	else if (session->who != OPAL_ADMIN1 && !session->sum)
 		err = build_locking_user(lk_ul_user, sizeof(lk_ul_user),
 					 session->who - 1);
-		if (err)
-			return err;
-	} else
+	else
 		memcpy(lk_ul_user, opaluid[OPAL_ADMIN1_UID], OPAL_UID_LENGTH);
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, opaluid[OPAL_SMUID_UID],
-			     OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_STARTSESSION],
-			     OPAL_UID_LENGTH);
+	if (err)
+		return err;
+
+	err = cmd_start(dev, opaluid[OPAL_SMUID_UID],
+			opalmethod[OPAL_STARTSESSION]);
 
-	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u64(&err, dev, hsn);
 	add_token_bytestring(&err, dev, opaluid[OPAL_LOCKINGSP_UID],
 			     OPAL_UID_LENGTH);
@@ -1407,17 +1388,10 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data)
 
 static int revert_tper(struct opal_dev *dev, void *data)
 {
-	int err = 0;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	int err;
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, opaluid[OPAL_ADMINSP_UID],
-			     OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_REVERT],
-			     OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
+	err = cmd_start(dev, opaluid[OPAL_ADMINSP_UID],
+			opalmethod[OPAL_REVERT]);
 	if (err) {
 		pr_debug("Error building REVERT TPER command.\n");
 		return err;
@@ -1430,18 +1404,12 @@ static int internal_activate_user(struct opal_dev *dev, void *data)
 {
 	struct opal_session_info *session = data;
 	u8 uid[OPAL_UID_LENGTH];
-	int err = 0;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	int err;
 
 	memcpy(uid, opaluid[OPAL_USER1_UID], OPAL_UID_LENGTH);
 	uid[7] = session->who;
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
+	err = cmd_start(dev, uid, opalmethod[OPAL_SET]);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
@@ -1464,19 +1432,12 @@ static int erase_locking_range(struct opal_dev *dev, void *data)
 {
 	struct opal_session_info *session = data;
 	u8 uid[OPAL_UID_LENGTH];
-	int err = 0;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	int err;
 
 	if (build_locking_range(uid, sizeof(uid), session->opal_key.lr) < 0)
 		return -ERANGE;
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_ERASE],
-			     OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
+	err = cmd_start(dev, uid, opalmethod[OPAL_ERASE]);
 
 	if (err) {
 		pr_debug("Error building Erase Locking Range Command.\n");
@@ -1488,16 +1449,11 @@ static int erase_locking_range(struct opal_dev *dev, void *data)
 static int set_mbr_done(struct opal_dev *dev, void *data)
 {
 	u8 *mbr_done_tf = data;
-	int err = 0;
+	int err;
 
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	err = cmd_start(dev, opaluid[OPAL_MBRCONTROL],
+			opalmethod[OPAL_SET]);
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, opaluid[OPAL_MBRCONTROL],
-			     OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
@@ -1519,16 +1475,11 @@ static int set_mbr_done(struct opal_dev *dev, void *data)
 static int set_mbr_enable_disable(struct opal_dev *dev, void *data)
 {
 	u8 *mbr_en_dis = data;
-	int err = 0;
+	int err;
 
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	err = cmd_start(dev, opaluid[OPAL_MBRCONTROL],
+			opalmethod[OPAL_SET]);
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, opaluid[OPAL_MBRCONTROL],
-			     OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
@@ -1550,16 +1501,10 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data)
 static int generic_pw_cmd(u8 *key, size_t key_len, u8 *cpin_uid,
 			  struct opal_dev *dev)
 {
-	int err = 0;
+	int err;
 
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	err = cmd_start(dev, cpin_uid, opalmethod[OPAL_SET]);
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, cpin_uid, OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_SET],
-			     OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
@@ -1616,10 +1561,7 @@ static int add_user_to_lr(struct opal_dev *dev, void *data)
 	u8 lr_buffer[OPAL_UID_LENGTH];
 	u8 user_uid[OPAL_UID_LENGTH];
 	struct opal_lock_unlock *lkul = data;
-	int err = 0;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	int err;
 
 	memcpy(lr_buffer, opaluid[OPAL_LOCKINGRANGE_ACE_RDLOCKED],
 	       OPAL_UID_LENGTH);
@@ -1634,12 +1576,8 @@ static int add_user_to_lr(struct opal_dev *dev, void *data)
 
 	user_uid[7] = lkul->session.who;
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, lr_buffer, OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_SET],
-			     OPAL_UID_LENGTH);
+	err = cmd_start(dev, lr_buffer, opalmethod[OPAL_SET]);
 
-	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, OPAL_VALUES);
 
@@ -1693,9 +1631,6 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
 	u8 read_locked = 1, write_locked = 1;
 	int err = 0;
 
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
-
 	if (build_locking_range(lr_buffer, sizeof(lr_buffer),
 				lkul->session.opal_key.lr) < 0)
 		return -ERANGE;
@@ -1717,10 +1652,8 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
 		return OPAL_INVAL_PARAM;
 	}
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, lr_buffer, OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
+	err = cmd_start(dev, lr_buffer, opalmethod[OPAL_SET]);
+
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
@@ -1791,17 +1724,10 @@ static int activate_lsp(struct opal_dev *dev, void *data)
 	struct opal_lr_act *opal_act = data;
 	u8 user_lr[OPAL_UID_LENGTH];
 	u8 uint_3 = 0x83;
-	int err = 0, i;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
-
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, opaluid[OPAL_LOCKINGSP_UID],
-			     OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_ACTIVATE],
-			     OPAL_UID_LENGTH);
+	int err, i;
 
+	err = cmd_start(dev, opaluid[OPAL_LOCKINGSP_UID],
+			opalmethod[OPAL_ACTIVATE]);
 
 	if (opal_act->sum) {
 		err = build_locking_range(user_lr, sizeof(user_lr),
@@ -1809,7 +1735,6 @@ static int activate_lsp(struct opal_dev *dev, void *data)
 		if (err)
 			return err;
 
-		add_token_u8(&err, dev, OPAL_STARTLIST);
 		add_token_u8(&err, dev, OPAL_STARTNAME);
 		add_token_u8(&err, dev, uint_3);
 		add_token_u8(&err, dev, 6);
@@ -1824,8 +1749,6 @@ static int activate_lsp(struct opal_dev *dev, void *data)
 		}
 		add_token_u8(&err, dev, OPAL_ENDLIST);
 		add_token_u8(&err, dev, OPAL_ENDNAME);
-	} else {
-		add_token_u8(&err, dev, OPAL_STARTLIST);
 	}
 
 	if (err) {
@@ -1859,17 +1782,11 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev)
 /* Determine if we're in the Manufactured Inactive or Active state */
 static int get_lsp_lifecycle(struct opal_dev *dev, void *data)
 {
-	int err = 0;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	int err;
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, opaluid[OPAL_LOCKINGSP_UID],
-			     OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_GET], OPAL_UID_LENGTH);
+	err = cmd_start(dev, opaluid[OPAL_LOCKINGSP_UID],
+			opalmethod[OPAL_GET]);
 
-	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
 
 	add_token_u8(&err, dev, OPAL_STARTNAME);
@@ -1919,19 +1836,12 @@ static int get_msid_cpin_pin_cont(struct opal_dev *dev)
 
 static int get_msid_cpin_pin(struct opal_dev *dev, void *data)
 {
-	int err = 0;
-
-	clear_opal_cmd(dev);
-	set_comid(dev, dev->comid);
+	int err;
 
-	add_token_u8(&err, dev, OPAL_CALL);
-	add_token_bytestring(&err, dev, opaluid[OPAL_C_PIN_MSID],
-			     OPAL_UID_LENGTH);
-	add_token_bytestring(&err, dev, opalmethod[OPAL_GET], OPAL_UID_LENGTH);
+	err = cmd_start(dev, opaluid[OPAL_C_PIN_MSID],
+			opalmethod[OPAL_GET]);
 
 	add_token_u8(&err, dev, OPAL_STARTLIST);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
-
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, 3); /* Start Column */
 	add_token_u8(&err, dev, 3); /* PIN */
-- 
cgit 


From 7d9b62ae2a7db6dfa218999b7dd65517a6f9cfb7 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:15:59 +0100
Subject: block: sed-opal: unify error handling of responses

response_get_{string,u64} include error handling for argument resp being
NULL but response_get_token does not handle this.

Make all three of response_get_{string,u64,token} handle NULL resp in
the same way.

Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 0348fb896a5d..3f368b14efd9 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -696,6 +696,11 @@ static const struct opal_resp_tok *response_get_token(
 {
 	const struct opal_resp_tok *tok;
 
+	if (!resp) {
+		pr_debug("Response is NULL\n");
+		return ERR_PTR(-EINVAL);
+	}
+
 	if (n >= resp->num) {
 		pr_debug("Token number doesn't exist: %d, resp: %d\n",
 			 n, resp->num);
-- 
cgit 


From b68f09ecdeaa7d59397429cb95aa83f02b9bd107 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:16:00 +0100
Subject: block: sed-opal: reuse response_get_token to decrease code
 duplication

response_get_token had already been in place, its functionality had
been duplicated within response_get_{u64,bytestring} with the same error
handling. Unify the handling by reusing response_get_token within the
other functions.

Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 46 +++++++++++++++-------------------------------
 1 file changed, 15 insertions(+), 31 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 3f368b14efd9..5cb8034b53c8 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -883,27 +883,19 @@ static size_t response_get_string(const struct parsed_resp *resp, int n,
 				  const char **store)
 {
 	u8 skip;
-	const struct opal_resp_tok *token;
+	const struct opal_resp_tok *tok;
 
 	*store = NULL;
-	if (!resp) {
-		pr_debug("Response is NULL\n");
-		return 0;
-	}
-
-	if (n >= resp->num) {
-		pr_debug("Response has %d tokens. Can't access %d\n",
-			 resp->num, n);
+	tok = response_get_token(resp, n);
+	if (IS_ERR(tok))
 		return 0;
-	}
 
-	token = &resp->toks[n];
-	if (token->type != OPAL_DTA_TOKENID_BYTESTRING) {
+	if (tok->type != OPAL_DTA_TOKENID_BYTESTRING) {
 		pr_debug("Token is not a byte string!\n");
 		return 0;
 	}
 
-	switch (token->width) {
+	switch (tok->width) {
 	case OPAL_WIDTH_TINY:
 	case OPAL_WIDTH_SHORT:
 		skip = 1;
@@ -919,37 +911,29 @@ static size_t response_get_string(const struct parsed_resp *resp, int n,
 		return 0;
 	}
 
-	*store = token->pos + skip;
-	return token->len - skip;
+	*store = tok->pos + skip;
+	return tok->len - skip;
 }
 
 static u64 response_get_u64(const struct parsed_resp *resp, int n)
 {
-	if (!resp) {
-		pr_debug("Response is NULL\n");
-		return 0;
-	}
+	const struct opal_resp_tok *tok;
 
-	if (n >= resp->num) {
-		pr_debug("Response has %d tokens. Can't access %d\n",
-			 resp->num, n);
+	tok = response_get_token(resp, n);
+	if (IS_ERR(tok))
 		return 0;
-	}
 
-	if (resp->toks[n].type != OPAL_DTA_TOKENID_UINT) {
-		pr_debug("Token is not unsigned it: %d\n",
-			 resp->toks[n].type);
+	if (tok->type != OPAL_DTA_TOKENID_UINT) {
+		pr_debug("Token is not unsigned int: %d\n", tok->type);
 		return 0;
 	}
 
-	if (!(resp->toks[n].width == OPAL_WIDTH_TINY ||
-	      resp->toks[n].width == OPAL_WIDTH_SHORT)) {
-		pr_debug("Atom is not short or tiny: %d\n",
-			 resp->toks[n].width);
+	if (tok->width != OPAL_WIDTH_TINY && tok->width != OPAL_WIDTH_SHORT) {
+		pr_debug("Atom is not short or tiny: %d\n", tok->width);
 		return 0;
 	}
 
-	return resp->toks[n].stored.u;
+	return tok->stored.u;
 }
 
 static bool response_token_matches(const struct opal_resp_tok *token, u8 match)
-- 
cgit 


From b2f9c6eb3f5f44d2fded05856f69050d7170eeff Mon Sep 17 00:00:00 2001
From: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Date: Thu, 14 Feb 2019 01:16:01 +0100
Subject: block: sed-opal: print failed function address

Add function address (and if available its symbol) to the message if a
step function fails.

Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 5cb8034b53c8..1f246200b574 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -394,8 +394,8 @@ static int next(struct opal_dev *dev)
 
 		error = step->fn(dev, step->data);
 		if (error) {
-			pr_debug("Error on step function: %d with error %d: %s\n",
-				 state, error,
+			pr_debug("Step %d (%pS) failed with error %d: %s\n",
+				 state, step->fn, error,
 				 opal_error_to_human(error));
 
 			/* For each OPAL command we do a discovery0 then we
-- 
cgit 


From 285599590e2e1f067d56a5855244e95f6303b28f Mon Sep 17 00:00:00 2001
From: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Date: Thu, 14 Feb 2019 01:16:02 +0100
Subject: block: sed-opal: split generation of bytestring header and content

Split the header generation from the (normal) memcpy part if a
bytestring is copied into the command buffer. This allows in-place
generation of the bytestring content. For example, copy_from_user may be
used without an intermediate buffer.

Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 1f246200b574..ad66d1dc725a 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -580,15 +580,11 @@ static void add_token_u64(int *err, struct opal_dev *cmd, u64 number)
 		add_token_u8(err, cmd, number >> (len * 8));
 }
 
-static void add_token_bytestring(int *err, struct opal_dev *cmd,
-				 const u8 *bytestring, size_t len)
+static u8 *add_bytestring_header(int *err, struct opal_dev *cmd, size_t len)
 {
 	size_t header_len = 1;
 	bool is_short_atom = true;
 
-	if (*err)
-		return;
-
 	if (len & ~SHORT_ATOM_LEN_MASK) {
 		header_len = 2;
 		is_short_atom = false;
@@ -596,7 +592,7 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd,
 
 	if (!can_add(err, cmd, header_len + len)) {
 		pr_debug("Error adding bytestring: end of buffer.\n");
-		return;
+		return NULL;
 	}
 
 	if (is_short_atom)
@@ -604,9 +600,19 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd,
 	else
 		add_medium_atom_header(cmd, true, false, len);
 
-	memcpy(&cmd->cmd[cmd->pos], bytestring, len);
-	cmd->pos += len;
+	return &cmd->cmd[cmd->pos];
+}
 
+static void add_token_bytestring(int *err, struct opal_dev *cmd,
+				 const u8 *bytestring, size_t len)
+{
+	u8 *start;
+
+	start = add_bytestring_header(err, cmd, len);
+	if (!start)
+		return;
+	memcpy(start, bytestring, len);
+	cmd->pos += len;
 }
 
 static int build_locking_range(u8 *buffer, size_t length, u8 lr)
-- 
cgit 


From a4ddbd1b7b2cf5d18f34fdf8ddbb539f4c307564 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:16:03 +0100
Subject: block: sed-opal: add token for OPAL_LIFECYCLE

Define OPAL_LIFECYCLE token and use it instead of literals in
get_lsp_lifecycle.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/opal_proto.h | 2 ++
 block/sed-opal.c   | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/opal_proto.h b/block/opal_proto.h
index e20be8258854..b6e352cfe982 100644
--- a/block/opal_proto.h
+++ b/block/opal_proto.h
@@ -170,6 +170,8 @@ enum opal_token {
 	OPAL_READLOCKED = 0x07,
 	OPAL_WRITELOCKED = 0x08,
 	OPAL_ACTIVEKEY = 0x0A,
+	/* lockingsp table */
+	OPAL_LIFECYCLE = 0x06,
 	/* locking info table */
 	OPAL_MAXRANGES = 0x04,
 	 /* mbr control */
diff --git a/block/sed-opal.c b/block/sed-opal.c
index ad66d1dc725a..7f02e50e2bce 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -1786,12 +1786,12 @@ static int get_lsp_lifecycle(struct opal_dev *dev, void *data)
 
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, 3); /* Start Column */
-	add_token_u8(&err, dev, 6); /* Lifecycle Column */
+	add_token_u8(&err, dev, OPAL_LIFECYCLE);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 
 	add_token_u8(&err, dev, OPAL_STARTNAME);
 	add_token_u8(&err, dev, 4); /* End Column */
-	add_token_u8(&err, dev, 6); /* Lifecycle Column */
+	add_token_u8(&err, dev, OPAL_LIFECYCLE);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 
 	add_token_u8(&err, dev, OPAL_ENDLIST);
-- 
cgit 


From 3fff234b851cf7cd638efea658e9cbcf33c3a691 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:16:04 +0100
Subject: block: sed-opal: unify retrieval of table columns

Instead of having multiple places defining the same argument list to get
a specific column of a sed-opal table, provide a generic version and
call it from those functions.

Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 130 ++++++++++++++++++++-----------------------------------
 1 file changed, 47 insertions(+), 83 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 7f02e50e2bce..5395ab1c5248 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -1075,6 +1075,37 @@ static int finalize_and_send(struct opal_dev *dev, cont_fn cont)
 	return opal_send_recv(dev, cont);
 }
 
+/*
+ * request @column from table @table on device @dev. On success, the column
+ * data will be available in dev->resp->tok[4]
+ */
+static int generic_get_column(struct opal_dev *dev, const u8 *table,
+			      u64 column)
+{
+	int err;
+
+	err = cmd_start(dev, table, opalmethod[OPAL_GET]);
+
+	add_token_u8(&err, dev, OPAL_STARTLIST);
+
+	add_token_u8(&err, dev, OPAL_STARTNAME);
+	add_token_u8(&err, dev, OPAL_STARTCOLUMN);
+	add_token_u64(&err, dev, column);
+	add_token_u8(&err, dev, OPAL_ENDNAME);
+
+	add_token_u8(&err, dev, OPAL_STARTNAME);
+	add_token_u8(&err, dev, OPAL_ENDCOLUMN);
+	add_token_u64(&err, dev, column);
+	add_token_u8(&err, dev, OPAL_ENDNAME);
+
+	add_token_u8(&err, dev, OPAL_ENDLIST);
+
+	if (err)
+		return err;
+
+	return finalize_and_send(dev, parse_and_check_status);
+}
+
 static int gen_key(struct opal_dev *dev, void *data)
 {
 	u8 uid[OPAL_UID_LENGTH];
@@ -1129,23 +1160,11 @@ static int get_active_key(struct opal_dev *dev, void *data)
 	if (err)
 		return err;
 
-	err = cmd_start(dev, uid, opalmethod[OPAL_GET]);
-	add_token_u8(&err, dev, OPAL_STARTLIST);
-	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 3); /* startCloumn */
-	add_token_u8(&err, dev, 10); /* ActiveKey */
-	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 4); /* endColumn */
-	add_token_u8(&err, dev, 10); /* ActiveKey */
-	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
-	if (err) {
-		pr_debug("Error building get active key command\n");
+	err = generic_get_column(dev, uid, OPAL_ACTIVEKEY);
+	if (err)
 		return err;
-	}
 
-	return finalize_and_send(dev, get_active_key_cont);
+	return get_active_key_cont(dev);
 }
 
 static int generic_lr_enable_disable(struct opal_dev *dev,
@@ -1754,14 +1773,16 @@ static int activate_lsp(struct opal_dev *dev, void *data)
 	return finalize_and_send(dev, parse_and_check_status);
 }
 
-static int get_lsp_lifecycle_cont(struct opal_dev *dev)
+/* Determine if we're in the Manufactured Inactive or Active state */
+static int get_lsp_lifecycle(struct opal_dev *dev, void *data)
 {
 	u8 lc_status;
-	int error = 0;
+	int err;
 
-	error = parse_and_check_status(dev);
-	if (error)
-		return error;
+	err = generic_get_column(dev, opaluid[OPAL_LOCKINGSP_UID],
+				 OPAL_LIFECYCLE);
+	if (err)
+		return err;
 
 	lc_status = response_get_u64(&dev->parsed, 4);
 	/* 0x08 is Manufactured Inactive */
@@ -1774,49 +1795,19 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev)
 	return 0;
 }
 
-/* Determine if we're in the Manufactured Inactive or Active state */
-static int get_lsp_lifecycle(struct opal_dev *dev, void *data)
-{
-	int err;
-
-	err = cmd_start(dev, opaluid[OPAL_LOCKINGSP_UID],
-			opalmethod[OPAL_GET]);
-
-	add_token_u8(&err, dev, OPAL_STARTLIST);
-
-	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 3); /* Start Column */
-	add_token_u8(&err, dev, OPAL_LIFECYCLE);
-	add_token_u8(&err, dev, OPAL_ENDNAME);
-
-	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 4); /* End Column */
-	add_token_u8(&err, dev, OPAL_LIFECYCLE);
-	add_token_u8(&err, dev, OPAL_ENDNAME);
-
-	add_token_u8(&err, dev, OPAL_ENDLIST);
-
-	if (err) {
-		pr_debug("Error Building GET Lifecycle Status command\n");
-		return err;
-	}
-
-	return finalize_and_send(dev, get_lsp_lifecycle_cont);
-}
-
-static int get_msid_cpin_pin_cont(struct opal_dev *dev)
+static int get_msid_cpin_pin(struct opal_dev *dev, void *data)
 {
 	const char *msid_pin;
 	size_t strlen;
-	int error = 0;
+	int err;
 
-	error = parse_and_check_status(dev);
-	if (error)
-		return error;
+	err = generic_get_column(dev, opaluid[OPAL_C_PIN_MSID], OPAL_PIN);
+	if (err)
+		return err;
 
 	strlen = response_get_string(&dev->parsed, 4, &msid_pin);
 	if (!msid_pin) {
-		pr_debug("%s: Couldn't extract PIN from response\n", __func__);
+		pr_debug("Couldn't extract MSID_CPIN from response\n");
 		return OPAL_INVAL_PARAM;
 	}
 
@@ -1829,33 +1820,6 @@ static int get_msid_cpin_pin_cont(struct opal_dev *dev)
 	return 0;
 }
 
-static int get_msid_cpin_pin(struct opal_dev *dev, void *data)
-{
-	int err;
-
-	err = cmd_start(dev, opaluid[OPAL_C_PIN_MSID],
-			opalmethod[OPAL_GET]);
-
-	add_token_u8(&err, dev, OPAL_STARTLIST);
-	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 3); /* Start Column */
-	add_token_u8(&err, dev, 3); /* PIN */
-	add_token_u8(&err, dev, OPAL_ENDNAME);
-
-	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 4); /* End Column */
-	add_token_u8(&err, dev, 3); /* Lifecycle Column */
-	add_token_u8(&err, dev, OPAL_ENDNAME);
-	add_token_u8(&err, dev, OPAL_ENDLIST);
-
-	if (err) {
-		pr_debug("Error building Get MSID CPIN PIN command.\n");
-		return err;
-	}
-
-	return finalize_and_send(dev, get_msid_cpin_pin_cont);
-}
-
 static int end_opal_session(struct opal_dev *dev, void *data)
 {
 	int err = 0;
-- 
cgit 


From 372be408447506c43cc1bede4261324ef66d8fb4 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:16:05 +0100
Subject: block: sed-opal: use named Opal tokens instead of integer literals

Replace integer literals by Opal tokens defined in opal_proto.h where
possible.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 5395ab1c5248..be0d633783c8 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -1180,12 +1180,12 @@ static int generic_lr_enable_disable(struct opal_dev *dev,
 	add_token_u8(&err, dev, OPAL_STARTLIST);
 
 	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 5); /* ReadLockEnabled */
+	add_token_u8(&err, dev, OPAL_READLOCKENABLED);
 	add_token_u8(&err, dev, rle);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 
 	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 6); /* WriteLockEnabled */
+	add_token_u8(&err, dev, OPAL_WRITELOCKENABLED);
 	add_token_u8(&err, dev, wle);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 
@@ -1238,22 +1238,22 @@ static int setup_locking_range(struct opal_dev *dev, void *data)
 		add_token_u8(&err, dev, OPAL_STARTLIST);
 
 		add_token_u8(&err, dev, OPAL_STARTNAME);
-		add_token_u8(&err, dev, 3); /* Ranges Start */
+		add_token_u8(&err, dev, OPAL_RANGESTART);
 		add_token_u64(&err, dev, setup->range_start);
 		add_token_u8(&err, dev, OPAL_ENDNAME);
 
 		add_token_u8(&err, dev, OPAL_STARTNAME);
-		add_token_u8(&err, dev, 4); /* Ranges length */
+		add_token_u8(&err, dev, OPAL_RANGELENGTH);
 		add_token_u64(&err, dev, setup->range_length);
 		add_token_u8(&err, dev, OPAL_ENDNAME);
 
 		add_token_u8(&err, dev, OPAL_STARTNAME);
-		add_token_u8(&err, dev, 5); /*ReadLockEnabled */
+		add_token_u8(&err, dev, OPAL_READLOCKENABLED);
 		add_token_u64(&err, dev, !!setup->RLE);
 		add_token_u8(&err, dev, OPAL_ENDNAME);
 
 		add_token_u8(&err, dev, OPAL_STARTNAME);
-		add_token_u8(&err, dev, 6); /*WriteLockEnabled*/
+		add_token_u8(&err, dev, OPAL_WRITELOCKENABLED);
 		add_token_u64(&err, dev, !!setup->WLE);
 		add_token_u8(&err, dev, OPAL_ENDNAME);
 
@@ -1472,7 +1472,7 @@ static int set_mbr_done(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 2); /* Done */
+	add_token_u8(&err, dev, OPAL_MBRDONE);
 	add_token_u8(&err, dev, *mbr_done_tf); /* Done T or F */
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
@@ -1498,7 +1498,7 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 1);
+	add_token_u8(&err, dev, OPAL_MBRENABLE);
 	add_token_u8(&err, dev, *mbr_en_dis);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
@@ -1523,7 +1523,7 @@ static int generic_pw_cmd(u8 *key, size_t key_len, u8 *cpin_uid,
 	add_token_u8(&err, dev, OPAL_VALUES);
 	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_STARTNAME);
-	add_token_u8(&err, dev, 3); /* PIN */
+	add_token_u8(&err, dev, OPAL_PIN);
 	add_token_bytestring(&err, dev, key, key_len);
 	add_token_u8(&err, dev, OPAL_ENDNAME);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
-- 
cgit 


From 3db87236cfb29f143028b91293a8aee01cf932e7 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:16:06 +0100
Subject: block: sed-opal: pass steps via argument rather than via opal_dev

The steps argument is only read by the next function, so it can
be passed directly as an argument rather than via opal_dev.

Normally, the steps is an array on the stack, so the pointer stops
being valid then the function that set opal_dev.steps returns.
If opal_dev.steps was not set to NULL before return it would become
a dangling pointer. When the steps are passed as argument this
becomes easier to see and more difficult to misuse.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 158 ++++++++++++++++++++++++-------------------------------
 1 file changed, 69 insertions(+), 89 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index be0d633783c8..f027c0cb682e 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -85,7 +85,6 @@ struct opal_dev {
 	void *data;
 	sec_send_recv *send_recv;
 
-	const struct opal_step *steps;
 	struct mutex dev_lock;
 	u16 comid;
 	u32 hsn;
@@ -382,37 +381,34 @@ static void check_geometry(struct opal_dev *dev, const void *data)
 	dev->lowest_lba = geo->lowest_aligned_lba;
 }
 
-static int next(struct opal_dev *dev)
+static int next(struct opal_dev *dev, const struct opal_step *steps,
+		size_t n_steps)
 {
 	const struct opal_step *step;
-	int state = 0, error = 0;
+	size_t state;
+	int error = 0;
 
-	do {
-		step = &dev->steps[state];
-		if (!step->fn)
-			break;
+	for (state = 0; state < n_steps; state++) {
+		step = &steps[state];
 
 		error = step->fn(dev, step->data);
-		if (error) {
-			pr_debug("Step %d (%pS) failed with error %d: %s\n",
-				 state, step->fn, error,
-				 opal_error_to_human(error));
-
-			/* For each OPAL command we do a discovery0 then we
-			 * start some sort of session.
-			 * If we haven't passed state 1 then there was an error
-			 * on discovery0 or during the attempt to start a
-			 * session. Therefore we shouldn't attempt to terminate
-			 * a session, as one has not yet been created.
-			 */
-			if (state > 1) {
-				end_opal_session_error(dev);
-				return error;
-			}
+		if (error)
+			goto out_error;
+	}
 
-		}
-		state++;
-	} while (!error);
+	return 0;
+
+out_error:
+	/*
+	 * For each OPAL command the first step in steps does a discovery0
+	 * and the second step starts some sort of session. If an error occurred
+	 * in the first two steps (and thus stopping the loop with state <= 1)
+	 * then there was an error before or during the attempt to start a
+	 * session. Therefore we shouldn't attempt to terminate a session, as
+	 * one has not yet been created.
+	 */
+	if (state > 1)
+		end_opal_session_error(dev);
 
 	return error;
 }
@@ -1836,17 +1832,13 @@ static int end_opal_session(struct opal_dev *dev, void *data)
 static int end_opal_session_error(struct opal_dev *dev)
 {
 	const struct opal_step error_end_session[] = {
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
-	dev->steps = error_end_session;
-	return next(dev);
+	return next(dev, error_end_session, ARRAY_SIZE(error_end_session));
 }
 
-static inline void setup_opal_dev(struct opal_dev *dev,
-				  const struct opal_step *steps)
+static inline void setup_opal_dev(struct opal_dev *dev)
 {
-	dev->steps = steps;
 	dev->tsn = 0;
 	dev->hsn = 0;
 	dev->prev_data = NULL;
@@ -1855,14 +1847,13 @@ static inline void setup_opal_dev(struct opal_dev *dev,
 static int check_opal_support(struct opal_dev *dev)
 {
 	const struct opal_step steps[] = {
-		{ opal_discovery0, },
-		{ NULL, }
+		{ opal_discovery0, }
 	};
 	int ret;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, steps, ARRAY_SIZE(steps));
 	dev->supported = !ret;
 	mutex_unlock(&dev->dev_lock);
 	return ret;
@@ -1919,14 +1910,13 @@ static int opal_secure_erase_locking_range(struct opal_dev *dev,
 		{ start_auth_opal_session, opal_session },
 		{ get_active_key, &opal_session->opal_key.lr },
 		{ gen_key, },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	int ret;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, erase_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, erase_steps, ARRAY_SIZE(erase_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -1938,14 +1928,13 @@ static int opal_erase_locking_range(struct opal_dev *dev,
 		{ opal_discovery0, },
 		{ start_auth_opal_session, opal_session },
 		{ erase_locking_range, opal_session },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	int ret;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, erase_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, erase_steps, ARRAY_SIZE(erase_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -1963,8 +1952,7 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev,
 		{ end_opal_session, },
 		{ start_admin1LSP_opal_session, &opal_mbr->key },
 		{ set_mbr_enable_disable, &enable_disable },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	int ret;
 
@@ -1973,8 +1961,8 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev,
 		return -EINVAL;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, mbr_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, mbr_steps, ARRAY_SIZE(mbr_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -1991,7 +1979,7 @@ static int opal_save(struct opal_dev *dev, struct opal_lock_unlock *lk_unlk)
 	suspend->lr = lk_unlk->session.opal_key.lr;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, NULL);
+	setup_opal_dev(dev);
 	add_suspend_info(dev, suspend);
 	mutex_unlock(&dev->dev_lock);
 	return 0;
@@ -2004,8 +1992,7 @@ static int opal_add_user_to_lr(struct opal_dev *dev,
 		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, &lk_unlk->session.opal_key },
 		{ add_user_to_lr, lk_unlk },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	int ret;
 
@@ -2027,8 +2014,8 @@ static int opal_add_user_to_lr(struct opal_dev *dev,
 	}
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, steps, ARRAY_SIZE(steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2038,14 +2025,13 @@ static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal)
 	const struct opal_step revert_steps[] = {
 		{ opal_discovery0, },
 		{ start_SIDASP_opal_session, opal },
-		{ revert_tper, }, /* controller will terminate session */
-		{ NULL, }
+		{ revert_tper, } /* controller will terminate session */
 	};
 	int ret;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, revert_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, revert_steps, ARRAY_SIZE(revert_steps));
 	mutex_unlock(&dev->dev_lock);
 
 	/*
@@ -2065,19 +2051,20 @@ static int __opal_lock_unlock(struct opal_dev *dev,
 		{ opal_discovery0, },
 		{ start_auth_opal_session, &lk_unlk->session },
 		{ lock_unlock_locking_range, lk_unlk },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	const struct opal_step unlock_sum_steps[] = {
 		{ opal_discovery0, },
 		{ start_auth_opal_session, &lk_unlk->session },
 		{ lock_unlock_locking_range_sum, lk_unlk },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 
-	dev->steps = lk_unlk->session.sum ? unlock_sum_steps : unlock_steps;
-	return next(dev);
+	if (lk_unlk->session.sum)
+		return next(dev, unlock_sum_steps,
+			    ARRAY_SIZE(unlock_sum_steps));
+	else
+		return next(dev, unlock_steps, ARRAY_SIZE(unlock_steps));
 }
 
 static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key)
@@ -2087,12 +2074,10 @@ static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key)
 		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, key },
 		{ set_mbr_done, &mbr_done_tf },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 
-	dev->steps = mbrdone_step;
-	return next(dev);
+	return next(dev, mbrdone_step, ARRAY_SIZE(mbrdone_step));
 }
 
 static int opal_lock_unlock(struct opal_dev *dev,
@@ -2119,8 +2104,7 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal)
 		{ end_opal_session, },
 		{ start_SIDASP_opal_session, opal },
 		{ set_sid_cpin_pin, opal },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	int ret;
 
@@ -2128,8 +2112,8 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal)
 		return -ENODEV;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, owner_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, owner_steps, ARRAY_SIZE(owner_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2142,8 +2126,7 @@ static int opal_activate_lsp(struct opal_dev *dev,
 		{ start_SIDASP_opal_session, &opal_lr_act->key },
 		{ get_lsp_lifecycle, },
 		{ activate_lsp, opal_lr_act },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	int ret;
 
@@ -2151,8 +2134,8 @@ static int opal_activate_lsp(struct opal_dev *dev,
 		return -EINVAL;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, active_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, active_steps, ARRAY_SIZE(active_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2164,14 +2147,13 @@ static int opal_setup_locking_range(struct opal_dev *dev,
 		{ opal_discovery0, },
 		{ start_auth_opal_session, &opal_lrs->session },
 		{ setup_locking_range, opal_lrs },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	int ret;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, lr_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, lr_steps, ARRAY_SIZE(lr_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2182,8 +2164,7 @@ static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw)
 		{ opal_discovery0, },
 		{ start_auth_opal_session, &opal_pw->session },
 		{ set_new_pw, &opal_pw->new_user_pw },
-		{ end_opal_session, },
-		{ NULL }
+		{ end_opal_session, }
 	};
 	int ret;
 
@@ -2194,8 +2175,8 @@ static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw)
 		return -EINVAL;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, pw_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, pw_steps, ARRAY_SIZE(pw_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2207,8 +2188,7 @@ static int opal_activate_user(struct opal_dev *dev,
 		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, &opal_session->opal_key },
 		{ internal_activate_user, opal_session },
-		{ end_opal_session, },
-		{ NULL, }
+		{ end_opal_session, }
 	};
 	int ret;
 
@@ -2220,8 +2200,8 @@ static int opal_activate_user(struct opal_dev *dev,
 	}
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, act_steps);
-	ret = next(dev);
+	setup_opal_dev(dev);
+	ret = next(dev, act_steps, ARRAY_SIZE(act_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2238,7 +2218,7 @@ bool opal_unlock_from_suspend(struct opal_dev *dev)
 		return false;
 
 	mutex_lock(&dev->dev_lock);
-	setup_opal_dev(dev, NULL);
+	setup_opal_dev(dev);
 
 	list_for_each_entry(suspend, &dev->unlk_lst, node) {
 		dev->tsn = 0;
-- 
cgit 


From 0af2648ec30cf811b835f01a8501b4747f3a9022 Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:16:07 +0100
Subject: block: sed-opal: don't repeat opal_discovery0 in each steps array

Originally each of the opal functions that call next include
opal_discovery0 in the array of steps. This is superfluous and
can be done always inside next.

Acked-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 75 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 42 insertions(+), 33 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index f027c0cb682e..b947efd6d4d9 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -216,6 +216,7 @@ static const u8 opalmethod[][OPAL_METHOD_LENGTH] = {
 };
 
 static int end_opal_session_error(struct opal_dev *dev);
+static int opal_discovery0_step(struct opal_dev *dev);
 
 struct opal_suspend_data {
 	struct opal_lock_unlock unlk;
@@ -381,17 +382,33 @@ static void check_geometry(struct opal_dev *dev, const void *data)
 	dev->lowest_lba = geo->lowest_aligned_lba;
 }
 
+static int execute_step(struct opal_dev *dev,
+			const struct opal_step *step, size_t stepIndex)
+{
+	int error = step->fn(dev, step->data);
+
+	if (error) {
+		pr_debug("Step %zu (%pS) failed with error %d: %s\n",
+			 stepIndex, step->fn, error,
+			 opal_error_to_human(error));
+	}
+
+	return error;
+}
+
 static int next(struct opal_dev *dev, const struct opal_step *steps,
 		size_t n_steps)
 {
-	const struct opal_step *step;
-	size_t state;
-	int error = 0;
+	size_t state = 0;
+	int error;
 
-	for (state = 0; state < n_steps; state++) {
-		step = &steps[state];
+	/* first do a discovery0 */
+	error = opal_discovery0_step(dev);
+	if (error)
+		return error;
 
-		error = step->fn(dev, step->data);
+	for (state = 0; state < n_steps; state++) {
+		error = execute_step(dev, &steps[state], state);
 		if (error)
 			goto out_error;
 	}
@@ -400,14 +417,14 @@ static int next(struct opal_dev *dev, const struct opal_step *steps,
 
 out_error:
 	/*
-	 * For each OPAL command the first step in steps does a discovery0
-	 * and the second step starts some sort of session. If an error occurred
-	 * in the first two steps (and thus stopping the loop with state <= 1)
-	 * then there was an error before or during the attempt to start a
-	 * session. Therefore we shouldn't attempt to terminate a session, as
-	 * one has not yet been created.
+	 * For each OPAL command the first step in steps starts some sort of
+	 * session. If an error occurred in the initial discovery0 or if an
+	 * error occurred in the first step (and thus stopping the loop with
+	 * state == 0) then there was an error before or during the attempt to
+	 * start a session. Therefore we shouldn't attempt to terminate a
+	 * session, as one has not yet been created.
 	 */
-	if (state > 1)
+	if (state > 0)
 		end_opal_session_error(dev);
 
 	return error;
@@ -506,6 +523,14 @@ static int opal_discovery0(struct opal_dev *dev, void *data)
 	return opal_discovery0_end(dev);
 }
 
+static int opal_discovery0_step(struct opal_dev *dev)
+{
+	const struct opal_step discovery0_step = {
+		opal_discovery0,
+	};
+	return execute_step(dev, &discovery0_step, 0);
+}
+
 static bool can_add(int *err, struct opal_dev *cmd, size_t len)
 {
 	if (*err)
@@ -1831,10 +1856,10 @@ static int end_opal_session(struct opal_dev *dev, void *data)
 
 static int end_opal_session_error(struct opal_dev *dev)
 {
-	const struct opal_step error_end_session[] = {
-		{ end_opal_session, }
+	const struct opal_step error_end_session = {
+		end_opal_session,
 	};
-	return next(dev, error_end_session, ARRAY_SIZE(error_end_session));
+	return execute_step(dev, &error_end_session, 0);
 }
 
 static inline void setup_opal_dev(struct opal_dev *dev)
@@ -1846,14 +1871,11 @@ static inline void setup_opal_dev(struct opal_dev *dev)
 
 static int check_opal_support(struct opal_dev *dev)
 {
-	const struct opal_step steps[] = {
-		{ opal_discovery0, }
-	};
 	int ret;
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, steps, ARRAY_SIZE(steps));
+	ret = opal_discovery0_step(dev);
 	dev->supported = !ret;
 	mutex_unlock(&dev->dev_lock);
 	return ret;
@@ -1906,7 +1928,6 @@ static int opal_secure_erase_locking_range(struct opal_dev *dev,
 					   struct opal_session_info *opal_session)
 {
 	const struct opal_step erase_steps[] = {
-		{ opal_discovery0, },
 		{ start_auth_opal_session, opal_session },
 		{ get_active_key, &opal_session->opal_key.lr },
 		{ gen_key, },
@@ -1925,7 +1946,6 @@ static int opal_erase_locking_range(struct opal_dev *dev,
 				    struct opal_session_info *opal_session)
 {
 	const struct opal_step erase_steps[] = {
-		{ opal_discovery0, },
 		{ start_auth_opal_session, opal_session },
 		{ erase_locking_range, opal_session },
 		{ end_opal_session, }
@@ -1946,7 +1966,6 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev,
 		OPAL_TRUE : OPAL_FALSE;
 
 	const struct opal_step mbr_steps[] = {
-		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, &opal_mbr->key },
 		{ set_mbr_done, &enable_disable },
 		{ end_opal_session, },
@@ -1989,7 +2008,6 @@ static int opal_add_user_to_lr(struct opal_dev *dev,
 			       struct opal_lock_unlock *lk_unlk)
 {
 	const struct opal_step steps[] = {
-		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, &lk_unlk->session.opal_key },
 		{ add_user_to_lr, lk_unlk },
 		{ end_opal_session, }
@@ -2023,7 +2041,6 @@ static int opal_add_user_to_lr(struct opal_dev *dev,
 static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal)
 {
 	const struct opal_step revert_steps[] = {
-		{ opal_discovery0, },
 		{ start_SIDASP_opal_session, opal },
 		{ revert_tper, } /* controller will terminate session */
 	};
@@ -2048,13 +2065,11 @@ static int __opal_lock_unlock(struct opal_dev *dev,
 			      struct opal_lock_unlock *lk_unlk)
 {
 	const struct opal_step unlock_steps[] = {
-		{ opal_discovery0, },
 		{ start_auth_opal_session, &lk_unlk->session },
 		{ lock_unlock_locking_range, lk_unlk },
 		{ end_opal_session, }
 	};
 	const struct opal_step unlock_sum_steps[] = {
-		{ opal_discovery0, },
 		{ start_auth_opal_session, &lk_unlk->session },
 		{ lock_unlock_locking_range_sum, lk_unlk },
 		{ end_opal_session, }
@@ -2071,7 +2086,6 @@ static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key)
 {
 	u8 mbr_done_tf = OPAL_TRUE;
 	const struct opal_step mbrdone_step[] = {
-		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, key },
 		{ set_mbr_done, &mbr_done_tf },
 		{ end_opal_session, }
@@ -2098,7 +2112,6 @@ static int opal_lock_unlock(struct opal_dev *dev,
 static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal)
 {
 	const struct opal_step owner_steps[] = {
-		{ opal_discovery0, },
 		{ start_anybodyASP_opal_session, },
 		{ get_msid_cpin_pin, },
 		{ end_opal_session, },
@@ -2122,7 +2135,6 @@ static int opal_activate_lsp(struct opal_dev *dev,
 			     struct opal_lr_act *opal_lr_act)
 {
 	const struct opal_step active_steps[] = {
-		{ opal_discovery0, },
 		{ start_SIDASP_opal_session, &opal_lr_act->key },
 		{ get_lsp_lifecycle, },
 		{ activate_lsp, opal_lr_act },
@@ -2144,7 +2156,6 @@ static int opal_setup_locking_range(struct opal_dev *dev,
 				    struct opal_user_lr_setup *opal_lrs)
 {
 	const struct opal_step lr_steps[] = {
-		{ opal_discovery0, },
 		{ start_auth_opal_session, &opal_lrs->session },
 		{ setup_locking_range, opal_lrs },
 		{ end_opal_session, }
@@ -2161,7 +2172,6 @@ static int opal_setup_locking_range(struct opal_dev *dev,
 static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw)
 {
 	const struct opal_step pw_steps[] = {
-		{ opal_discovery0, },
 		{ start_auth_opal_session, &opal_pw->session },
 		{ set_new_pw, &opal_pw->new_user_pw },
 		{ end_opal_session, }
@@ -2185,7 +2195,6 @@ static int opal_activate_user(struct opal_dev *dev,
 			      struct opal_session_info *opal_session)
 {
 	const struct opal_step act_steps[] = {
-		{ opal_discovery0, },
 		{ start_admin1LSP_opal_session, &opal_session->opal_key },
 		{ internal_activate_user, opal_session },
 		{ end_opal_session, }
-- 
cgit 


From a80f36cc64f09956686f3729bf3eb65b7abfc32e Mon Sep 17 00:00:00 2001
From: David Kozub <zub@linux.fjfi.cvut.cz>
Date: Thu, 14 Feb 2019 01:16:08 +0100
Subject: block: sed-opal: rename next to execute_steps

As the function is responsible for executing the individual steps supplied
in the steps argument, execute_steps is a more descriptive name than the
rather generic next.

Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/sed-opal.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

(limited to 'block')

diff --git a/block/sed-opal.c b/block/sed-opal.c
index b947efd6d4d9..b1aa0cc25803 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -396,8 +396,8 @@ static int execute_step(struct opal_dev *dev,
 	return error;
 }
 
-static int next(struct opal_dev *dev, const struct opal_step *steps,
-		size_t n_steps)
+static int execute_steps(struct opal_dev *dev,
+			 const struct opal_step *steps, size_t n_steps)
 {
 	size_t state = 0;
 	int error;
@@ -1937,7 +1937,7 @@ static int opal_secure_erase_locking_range(struct opal_dev *dev,
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, erase_steps, ARRAY_SIZE(erase_steps));
+	ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -1954,7 +1954,7 @@ static int opal_erase_locking_range(struct opal_dev *dev,
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, erase_steps, ARRAY_SIZE(erase_steps));
+	ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -1981,7 +1981,7 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev,
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, mbr_steps, ARRAY_SIZE(mbr_steps));
+	ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2033,7 +2033,7 @@ static int opal_add_user_to_lr(struct opal_dev *dev,
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, steps, ARRAY_SIZE(steps));
+	ret = execute_steps(dev, steps, ARRAY_SIZE(steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2048,7 +2048,7 @@ static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal)
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, revert_steps, ARRAY_SIZE(revert_steps));
+	ret = execute_steps(dev, revert_steps, ARRAY_SIZE(revert_steps));
 	mutex_unlock(&dev->dev_lock);
 
 	/*
@@ -2076,10 +2076,11 @@ static int __opal_lock_unlock(struct opal_dev *dev,
 	};
 
 	if (lk_unlk->session.sum)
-		return next(dev, unlock_sum_steps,
-			    ARRAY_SIZE(unlock_sum_steps));
+		return execute_steps(dev, unlock_sum_steps,
+				     ARRAY_SIZE(unlock_sum_steps));
 	else
-		return next(dev, unlock_steps, ARRAY_SIZE(unlock_steps));
+		return execute_steps(dev, unlock_steps,
+				     ARRAY_SIZE(unlock_steps));
 }
 
 static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key)
@@ -2091,7 +2092,7 @@ static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key)
 		{ end_opal_session, }
 	};
 
-	return next(dev, mbrdone_step, ARRAY_SIZE(mbrdone_step));
+	return execute_steps(dev, mbrdone_step, ARRAY_SIZE(mbrdone_step));
 }
 
 static int opal_lock_unlock(struct opal_dev *dev,
@@ -2126,7 +2127,7 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal)
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, owner_steps, ARRAY_SIZE(owner_steps));
+	ret = execute_steps(dev, owner_steps, ARRAY_SIZE(owner_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2147,7 +2148,7 @@ static int opal_activate_lsp(struct opal_dev *dev,
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, active_steps, ARRAY_SIZE(active_steps));
+	ret = execute_steps(dev, active_steps, ARRAY_SIZE(active_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2164,7 +2165,7 @@ static int opal_setup_locking_range(struct opal_dev *dev,
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, lr_steps, ARRAY_SIZE(lr_steps));
+	ret = execute_steps(dev, lr_steps, ARRAY_SIZE(lr_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2186,7 +2187,7 @@ static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw)
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, pw_steps, ARRAY_SIZE(pw_steps));
+	ret = execute_steps(dev, pw_steps, ARRAY_SIZE(pw_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
@@ -2210,7 +2211,7 @@ static int opal_activate_user(struct opal_dev *dev,
 
 	mutex_lock(&dev->dev_lock);
 	setup_opal_dev(dev);
-	ret = next(dev, act_steps, ARRAY_SIZE(act_steps));
+	ret = execute_steps(dev, act_steps, ARRAY_SIZE(act_steps));
 	mutex_unlock(&dev->dev_lock);
 	return ret;
 }
-- 
cgit 


From d0b0a81acbd809228b57fb27a89028ecd0fc542a Mon Sep 17 00:00:00 2001
From: Hisao Tanabe <xtanabe@gmail.com>
Date: Mon, 8 Apr 2019 00:27:42 +0900
Subject: block: remove unused variable 'def'

The 'def' local variable became unused after commit f382fb0bcef4 ("block: remove
legacy IO schedulers"), let's remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hisao Tanabe <xtanabe@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/elevator.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

(limited to 'block')

diff --git a/block/elevator.c b/block/elevator.c
index d6d835a08de6..2e5399d9f40f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -509,8 +509,6 @@ void elv_unregister_queue(struct request_queue *q)
 
 int elv_register(struct elevator_type *e)
 {
-	char *def = "";
-
 	/* create icq_cache if requested */
 	if (e->icq_size) {
 		if (WARN_ON(e->icq_size < sizeof(struct io_cq)) ||
@@ -535,8 +533,8 @@ int elv_register(struct elevator_type *e)
 	list_add_tail(&e->list, &elv_list);
 	spin_unlock(&elv_list_lock);
 
-	printk(KERN_INFO "io scheduler %s registered%s\n", e->elevator_name,
-								def);
+	printk(KERN_INFO "io scheduler %s registered\n", e->elevator_name);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(elv_register);
-- 
cgit 


From 636b8fe86bede8c9f797365986b8406ff2183f13 Mon Sep 17 00:00:00 2001
From: Angelo Ruocco <angelo.ruocco.90@gmail.com>
Date: Mon, 8 Apr 2019 17:35:34 +0200
Subject: block, bfq: fix some typos in comments

Some of the comments in the bfq files had typos. This patch fixes them.

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-cgroup.c  |  2 +-
 block/bfq-iosched.c | 16 ++++++++--------
 block/bfq-iosched.h |  4 ++--
 block/bfq-wf2q.c    | 10 +++++-----
 4 files changed, 16 insertions(+), 16 deletions(-)

(limited to 'block')

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 2a74a3f2a8f7..793c027ca60e 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -1103,7 +1103,7 @@ struct cftype bfq_blkcg_legacy_files[] = {
 	},
 #endif /* CONFIG_DEBUG_BLK_CGROUP */
 
-	/* the same statictics which cover the bfqg and its descendants */
+	/* the same statistics which cover the bfqg and its descendants */
 	{
 		.name = "bfq.io_service_bytes_recursive",
 		.private = (unsigned long)&blkcg_policy_bfq,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ceb06abd73df..bcaa4bde3961 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -189,7 +189,7 @@ static const int bfq_default_max_budget = 16 * 1024;
 /*
  * When a sync request is dispatched, the queue that contains that
  * request, and all the ancestor entities of that queue, are charged
- * with the number of sectors of the request. In constrast, if the
+ * with the number of sectors of the request. In contrast, if the
  * request is async, then the queue and its ancestor entities are
  * charged with the number of sectors of the request, multiplied by
  * the factor below. This throttles the bandwidth for async I/O,
@@ -217,7 +217,7 @@ const int bfq_timeout = HZ / 8;
  * queue merging.
  *
  * As can be deduced from the low time limit below, queue merging, if
- * successful, happens at the very beggining of the I/O of the involved
+ * successful, happens at the very beginning of the I/O of the involved
  * cooperating processes, as a consequence of the arrival of the very
  * first requests from each cooperator.  After that, there is very
  * little chance to find cooperators.
@@ -441,7 +441,7 @@ void bfq_schedule_dispatch(struct bfq_data *bfqd)
 
 /*
  * Lifted from AS - choose which of rq1 and rq2 that is best served now.
- * We choose the request that is closesr to the head right now.  Distance
+ * We choose the request that is closer to the head right now.  Distance
  * behind the head is penalized and only allowed to a certain extent.
  */
 static struct request *bfq_choose_req(struct bfq_data *bfqd,
@@ -989,7 +989,7 @@ static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
 	 *   of several files
 	 * mplayer took 23 seconds to start, if constantly weight-raised.
 	 *
-	 * As for higher values than that accomodating the above bad
+	 * As for higher values than that accommodating the above bad
 	 * scenario, tests show that higher values would often yield
 	 * the opposite of the desired result, i.e., would worsen
 	 * responsiveness by allowing non-interactive applications to
@@ -2636,8 +2636,8 @@ static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
 		/*
 		 * bic still points to bfqq, then it has not yet been
 		 * redirected to some other bfq_queue, and a queue
-		 * merge beween bfqq and new_bfqq can be safely
-		 * fulfillled, i.e., bic can be redirected to new_bfqq
+		 * merge between bfqq and new_bfqq can be safely
+		 * fulfilled, i.e., bic can be redirected to new_bfqq
 		 * and bfqq can be put.
 		 */
 		bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq,
@@ -3089,7 +3089,7 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	/*
 	 * All in-service entities must have been properly deactivated
 	 * or requeued before executing the next function, which
-	 * resets all in-service entites as no more in service.
+	 * resets all in-service entities as no more in service.
 	 */
 	__bfq_bfqd_reset_in_service(bfqd);
 }
@@ -5632,7 +5632,7 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio)
  * preparation is that, after the prepare_request hook is invoked for
  * rq, rq may still be transformed into a request with no icq, i.e., a
  * request not associated with any queue. No bfq hook is invoked to
- * signal this tranformation. As a consequence, should these
+ * signal this transformation. As a consequence, should these
  * preparation operations be performed when the prepare_request hook
  * is invoked, and should rq be transformed one moment later, bfq
  * would end up in an inconsistent state, because it would have
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 60c148728cc5..e7dc07cf9a57 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -91,7 +91,7 @@ struct bfq_service_tree {
  * expiration. This peculiar definition allows for the following
  * optimization, not yet exploited: while a given entity is still in
  * service, we already know which is the best candidate for next
- * service among the other active entitities in the same parent
+ * service among the other active entities in the same parent
  * entity. We can then quickly compare the timestamps of the
  * in-service entity with those of such best candidate.
  *
@@ -142,7 +142,7 @@ struct bfq_weight_counter {
  *
  * Unless cgroups are used, the weight value is calculated from the
  * ioprio to export the same interface as CFQ.  When dealing with
- * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * "well-behaved" queues (i.e., queues that do not spend too much
  * time to consume their budget and have true sequential behavior, and
  * when there are no external factors breaking anticipation) the
  * relative weights at each level of the cgroups hierarchy should be
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 51ef1f00df80..d2ea98ef26a3 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -59,7 +59,7 @@ static bool bfq_update_parent_budget(struct bfq_entity *next_in_service);
  * bfq_update_next_in_service - update sd->next_in_service
  * @sd: sched_data for which to perform the update.
  * @new_entity: if not NULL, pointer to the entity whose activation,
- *		requeueing or repositionig triggered the invocation of
+ *		requeueing or repositioning triggered the invocation of
  *		this function.
  * @expiration: id true, this function is being invoked after the
  *             expiration of the in-service entity
@@ -90,7 +90,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 
 	/*
 	 * If this update is triggered by the activation, requeueing
-	 * or repositiong of an entity that does not coincide with
+	 * or repositioning of an entity that does not coincide with
 	 * sd->next_in_service, then a full lookup in the active tree
 	 * can be avoided. In fact, it is enough to check whether the
 	 * just-modified entity has the same priority as
@@ -1396,7 +1396,7 @@ left:
  * In this first case, update the virtual time in @st too (see the
  * comments on this update inside the function).
  *
- * In constrast, if there is an in-service entity, then return the
+ * In contrast, if there is an in-service entity, then return the
  * entity that would be set in service if not only the above
  * conditions, but also the next one held true: the currently
  * in-service entity, on expiration,
@@ -1479,12 +1479,12 @@ static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
 		 * is being invoked as a part of the expiration path
 		 * of the in-service queue. In this case, even if
 		 * sd->in_service_entity is not NULL,
-		 * sd->in_service_entiy at this point is actually not
+		 * sd->in_service_entity at this point is actually not
 		 * in service any more, and, if needed, has already
 		 * been properly queued or requeued into the right
 		 * tree. The reason why sd->in_service_entity is still
 		 * not NULL here, even if expiration is true, is that
-		 * sd->in_service_entiy is reset as a last step in the
+		 * sd->in_service_entity is reset as a last step in the
 		 * expiration path. So, if expiration is true, tell
 		 * __bfq_lookup_next_entity that there is no
 		 * sd->in_service_entity.
-- 
cgit 


From b21e11c5c8311b8bf6923ff29d57f2a5f997e939 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 2 Apr 2019 10:26:44 +0800
Subject: block: fix build warning in merging bvecs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can
be mergeable") changes bvec merge by only considering two bvecs from
different bios. However, if the former bio doesn't inlcude any io bvec,
then the following warning may be triggered:

 warning: ‘bvec.bv_offset’ may be used uninitialized in this function [-Wmaybe-uninitialized]

In practice, it shouldn't be triggered.

Fixes it by adding check on former bio, the check shouldn't add any cost
given 'bio->bi_iter' can be hit in cache.

Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

(limited to 'block')

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8f96d683b577..895795cdb145 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -353,7 +353,7 @@ EXPORT_SYMBOL(blk_queue_split);
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio)
 {
-	struct bio_vec bv, bvprv = { NULL };
+	struct bio_vec uninitialized_var(bv), bvprv = { NULL };
 	unsigned int seg_size, nr_phys_segs;
 	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
@@ -400,8 +400,10 @@ new_segment:
 			new_bio = false;
 		}
 		bbio = bio;
-		bvprv = bv;
-		new_bio = true;
+		if (likely(bio->bi_iter.bi_size)) {
+			bvprv = bv;
+			new_bio = true;
+		}
 	}
 
 	fbio->bi_seg_front_size = front_seg_size;
@@ -527,7 +529,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 			     struct scatterlist *sglist,
 			     struct scatterlist **sg)
 {
-	struct bio_vec bvec, bvprv = { NULL };
+	struct bio_vec uninitialized_var(bvec), bvprv = { NULL };
 	struct bvec_iter iter;
 	int nsegs = 0;
 	bool new_bio = false;
@@ -550,8 +552,10 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
  next_bvec:
 			new_bio = false;
 		}
-		bvprv = bvec;
-		new_bio = true;
+		if (likely(bio->bi_iter.bi_size)) {
+			bvprv = bvec;
+			new_bio = true;
+		}
 	}
 
 	return nsegs;
-- 
cgit 


From 8a96a0e408102fb7aa73d8aa0b5e2219cfd51e55 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 11 Apr 2019 08:23:27 +0200
Subject: block: rewrite blk_bvec_map_sg to avoid a nth_page call

The offset in scatterlists is allowed to be larger than the page size,
so don't go to great length to avoid that case and simplify the
arithmetics.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

(limited to 'block')

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 895795cdb145..247b17f2a0f6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -469,26 +469,17 @@ static unsigned blk_bvec_map_sg(struct request_queue *q,
 		struct scatterlist **sg)
 {
 	unsigned nbytes = bvec->bv_len;
-	unsigned nsegs = 0, total = 0, offset = 0;
+	unsigned nsegs = 0, total = 0;
 
 	while (nbytes > 0) {
-		unsigned seg_size;
-		struct page *pg;
-		unsigned idx;
+		unsigned offset = bvec->bv_offset + total;
+		unsigned len = min(get_max_segment_size(q, offset), nbytes);
 
 		*sg = blk_next_sg(sg, sglist);
+		sg_set_page(*sg, bvec->bv_page, len, offset);
 
-		seg_size = get_max_segment_size(q, bvec->bv_offset + total);
-		seg_size = min(nbytes, seg_size);
-
-		offset = (total + bvec->bv_offset) % PAGE_SIZE;
-		idx = (total + bvec->bv_offset) / PAGE_SIZE;
-		pg = bvec_nth_page(bvec->bv_page, idx);
-
-		sg_set_page(*sg, pg, seg_size, offset);
-
-		total += seg_size;
-		nbytes -= seg_size;
+		total += len;
+		nbytes -= len;
 		nsegs++;
 	}
 
-- 
cgit 


From a10584c3cda9cbb1a1ccd072783bfd625f99e40d Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 11 Apr 2019 08:23:28 +0200
Subject: block: refactor __bio_iov_bvec_add_pages

Return early on error, and add an unlikely annotation for that case.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index c2592c5d70b9..ad346082a971 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -873,20 +873,19 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
 	size = bio_add_page(bio, bv->bv_page, len,
 				bv->bv_offset + iter->iov_offset);
-	if (size == len) {
-		if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
-			struct page *page;
-			int i;
+	if (unlikely(size != len))
+		return -EINVAL;
 
-			mp_bvec_for_each_page(page, bv, i)
-				get_page(page);
-		}
+	if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
+		struct page *page;
+		int i;
 
-		iov_iter_advance(iter, size);
-		return 0;
+		mp_bvec_for_each_page(page, bv, i)
+			get_page(page);
 	}
 
-	return -EINVAL;
+	iov_iter_advance(iter, size);
+	return 0;
 }
 
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
-- 
cgit 


From 14eacf12dbc75352fa746dfd9e24de3170ba5ff5 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 11 Apr 2019 08:23:29 +0200
Subject: block: don't allow multiple bio_iov_iter_get_pages calls per bio

No caller uses bio_iov_iter_get_pages multiple times on a given bio,
and that funtionality isn't all that useful.  Removing it will make
some future changes a little easier and also simplifies the function
a bit.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index ad346082a971..c2a389b1509a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -958,7 +958,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	const bool is_bvec = iov_iter_is_bvec(iter);
-	unsigned short orig_vcnt = bio->bi_vcnt;
+	int ret;
+
+	if (WARN_ON_ONCE(bio->bi_vcnt))
+		return -EINVAL;
 
 	/*
 	 * If this is a BVEC iter, then the pages are kernel pages. Don't
@@ -968,19 +971,13 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		bio_set_flag(bio, BIO_NO_PAGE_REF);
 
 	do {
-		int ret;
-
 		if (is_bvec)
 			ret = __bio_iov_bvec_add_pages(bio, iter);
 		else
 			ret = __bio_iov_iter_get_pages(bio, iter);
+	} while (!ret && iov_iter_count(iter) && !bio_full(bio));
 
-		if (unlikely(ret))
-			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
-
-	} while (iov_iter_count(iter) && !bio_full(bio));
-
-	return 0;
+	return bio->bi_vcnt ? 0 : ret;
 }
 
 static void submit_bio_wait_endio(struct bio *bio)
-- 
cgit 


From 7321ecbfc7cf85211460a1dc6bb0ccfc3dcf9df0 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 11 Apr 2019 08:23:30 +0200
Subject: block: change how we get page references in bio_iov_iter_get_pages

Instead of needing a special macro to iterate over all pages in
a bvec just do a second passs over the whole bio.  This also matches
what we do on the release side.  The release side helper is moved
up to where we need the get helper to clearly express the symmetry.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 51 +++++++++++++++++++++++++--------------------------
 1 file changed, 25 insertions(+), 26 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index c2a389b1509a..d3490aeb1a7e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -861,6 +861,26 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
+static void bio_get_pages(struct bio *bio)
+{
+	struct bvec_iter_all iter_all;
+	struct bio_vec *bvec;
+	int i;
+
+	bio_for_each_segment_all(bvec, bio, i, iter_all)
+		get_page(bvec->bv_page);
+}
+
+static void bio_release_pages(struct bio *bio)
+{
+	struct bvec_iter_all iter_all;
+	struct bio_vec *bvec;
+	int i;
+
+	bio_for_each_segment_all(bvec, bio, i, iter_all)
+		put_page(bvec->bv_page);
+}
+
 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 {
 	const struct bio_vec *bv = iter->bvec;
@@ -875,15 +895,6 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 				bv->bv_offset + iter->iov_offset);
 	if (unlikely(size != len))
 		return -EINVAL;
-
-	if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
-		struct page *page;
-		int i;
-
-		mp_bvec_for_each_page(page, bv, i)
-			get_page(page);
-	}
-
 	iov_iter_advance(iter, size);
 	return 0;
 }
@@ -963,13 +974,6 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	if (WARN_ON_ONCE(bio->bi_vcnt))
 		return -EINVAL;
 
-	/*
-	 * If this is a BVEC iter, then the pages are kernel pages. Don't
-	 * release them on IO completion, if the caller asked us to.
-	 */
-	if (is_bvec && iov_iter_bvec_no_ref(iter))
-		bio_set_flag(bio, BIO_NO_PAGE_REF);
-
 	do {
 		if (is_bvec)
 			ret = __bio_iov_bvec_add_pages(bio, iter);
@@ -977,6 +981,11 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 			ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio));
 
+	if (iov_iter_bvec_no_ref(iter))
+		bio_set_flag(bio, BIO_NO_PAGE_REF);
+	else
+		bio_get_pages(bio);
+
 	return bio->bi_vcnt ? 0 : ret;
 }
 
@@ -1670,16 +1679,6 @@ void bio_set_pages_dirty(struct bio *bio)
 	}
 }
 
-static void bio_release_pages(struct bio *bio)
-{
-	struct bio_vec *bvec;
-	int i;
-	struct bvec_iter_all iter_all;
-
-	bio_for_each_segment_all(bvec, bio, i, iter_all)
-		put_page(bvec->bv_page);
-}
-
 /*
  * bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
  * If they are, then fine.  If, however, some pages are clean then they must
-- 
cgit 


From 52d52d1c98a90cfe860b83498e4b6074aad95c15 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 11 Apr 2019 08:23:31 +0200
Subject: block: only allow contiguous page structs in a bio_vec

We currently have to call nth_page when iterating over pages inside a
bio_vec.  Jens complained a while ago that this is fairly expensive.
To mitigate this we can check that that the actual page structures
are contiguous when adding them to the bio, and just do check pointer
arithmetics later on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index d3490aeb1a7e..8adc2a20d57d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -659,8 +659,13 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 		return false;
 	if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
 		return false;
-	if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
-		return false;
+
+	if ((vec_end_addr & PAGE_MASK) != page_addr) {
+		if (same_page)
+			return false;
+		if (pfn_to_page(PFN_DOWN(vec_end_addr)) + 1 != page)
+			return false;
+	}
 
 	return true;
 }
-- 
cgit 


From 673387a930059fc4ad8060847a1d46f94e702281 Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@suse.com>
Date: Wed, 27 Mar 2019 14:51:01 +0100
Subject: block: genhd: remove async_events field

The async_events field, intended to be used for drivers that support
asynchronous notifications about disk events (aka media change events),
isn't currently used by any driver, and apparently that has been that
way for a long time (if not forever). Remove it.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/genhd.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

(limited to 'block')

diff --git a/block/genhd.c b/block/genhd.c
index 703267865f14..ee76de0fb4cc 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1628,12 +1628,11 @@ static unsigned long disk_events_poll_jiffies(struct gendisk *disk)
 
 	/*
 	 * If device-specific poll interval is set, always use it.  If
-	 * the default is being used, poll iff there are events which
-	 * can't be monitored asynchronously.
+	 * the default is being used, poll if the POLL flag is set.
 	 */
 	if (ev->poll_msecs >= 0)
 		intv_msecs = ev->poll_msecs;
-	else if (disk->events & ~disk->async_events)
+	else if (disk->events)
 		intv_msecs = disk_events_dfl_poll_msecs;
 
 	return msecs_to_jiffies(intv_msecs);
@@ -1860,6 +1859,7 @@ static void disk_check_events(struct disk_events *ev,
  *
  * events		: list of all supported events
  * events_async		: list of events which can be detected w/o polling
+ *			  (always empty, only for backwards compatibility)
  * events_poll_msecs	: polling interval, 0: disable, -1: system default
  */
 static ssize_t __disk_events_show(unsigned int events, char *buf)
@@ -1890,9 +1890,7 @@ static ssize_t disk_events_show(struct device *dev,
 static ssize_t disk_events_async_show(struct device *dev,
 				      struct device_attribute *attr, char *buf)
 {
-	struct gendisk *disk = dev_to_disk(dev);
-
-	return __disk_events_show(disk->async_events, buf);
+	return 0;
 }
 
 static ssize_t disk_events_poll_msecs_show(struct device *dev,
-- 
cgit 


From c92e2f04b35938da23eb9a7f7101cbdd5ac7cdc4 Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@suse.com>
Date: Wed, 27 Mar 2019 14:51:02 +0100
Subject: block: disk_events: introduce event flags

Currently, an empty disk->events field tells the block layer not to
forward media change events to user space. This was done in commit
7c88a168da80 ("block: don't propagate unlisted DISK_EVENTs to userland")
in order to avoid events from "fringe" drivers to be forwarded to user
space. By doing so, the block layer lost the information which events
were supported by a particular block device, and most importantly,
whether or not a given device supports media change events at all.

Prepare for not interpreting the "events" field this way in the future
any more. This is done by adding an additional field "event_flags" to
struct gendisk, and two flag bits that can be set to have the device
treated like one that had the "events" field set to a non-zero value
before. This applies only to the sd and sr drivers, which are changed to
set the new flags.

The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
blocklayer to generate udev events from kernel events.

In order to add the event_flags field to struct gendisk, the events
field is converted to an "unsigned short"; it doesn't need to hold
values bigger than 2 anyway.

This patch doesn't change behavior.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/genhd.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

(limited to 'block')

diff --git a/block/genhd.c b/block/genhd.c
index ee76de0fb4cc..5375be39e8a5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1632,7 +1632,7 @@ static unsigned long disk_events_poll_jiffies(struct gendisk *disk)
 	 */
 	if (ev->poll_msecs >= 0)
 		intv_msecs = ev->poll_msecs;
-	else if (disk->events)
+	else if (disk->event_flags & DISK_EVENT_FLAG_POLL)
 		intv_msecs = disk_events_dfl_poll_msecs;
 
 	return msecs_to_jiffies(intv_msecs);
@@ -1842,11 +1842,13 @@ static void disk_check_events(struct disk_events *ev,
 
 	/*
 	 * Tell userland about new events.  Only the events listed in
-	 * @disk->events are reported.  Unlisted events are processed the
-	 * same internally but never get reported to userland.
+	 * @disk->events are reported, and only if DISK_EVENT_FLAG_UEVENT
+	 * is set. Otherwise, events are processed internally but never
+	 * get reported to userland.
 	 */
 	for (i = 0; i < ARRAY_SIZE(disk_uevents); i++)
-		if (events & disk->events & (1 << i))
+		if ((events & disk->events & (1 << i)) &&
+		    (disk->event_flags & DISK_EVENT_FLAG_UEVENT))
 			envp[nr_events++] = disk_uevents[i];
 
 	if (nr_events)
@@ -1884,6 +1886,9 @@ static ssize_t disk_events_show(struct device *dev,
 {
 	struct gendisk *disk = dev_to_disk(dev);
 
+	if (!(disk->event_flags & DISK_EVENT_FLAG_UEVENT))
+		return 0;
+
 	return __disk_events_show(disk->events, buf);
 }
 
-- 
cgit 


From cdf3e3deb747d5e193dee617ed37c83060eb576f Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@suse.com>
Date: Wed, 27 Mar 2019 14:51:05 +0100
Subject: block: check_events: don't bother with events if unsupported

Drivers now report to the block layer if they support media change
events. If this is not the case, there's no need to allocate the event
structure, and all event handling code can effectively be skipped. This
simplifies code flow in particular for non-removable sd devices.

This effectively reverts commit 75e3f3ee3c64 ("block: always allocate
genhd->ev if check_events is implemented").

The sysfs files for the events are kept in place even if no events are
supported, as user space may rely on them being present. The only
difference is that an error code is now returned if the user tries to
set poll_msecs.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/genhd.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

(limited to 'block')

diff --git a/block/genhd.c b/block/genhd.c
index 5375be39e8a5..1d0d25f7b0fe 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1904,6 +1904,9 @@ static ssize_t disk_events_poll_msecs_show(struct device *dev,
 {
 	struct gendisk *disk = dev_to_disk(dev);
 
+	if (!disk->ev)
+		return sprintf(buf, "-1\n");
+
 	return sprintf(buf, "%ld\n", disk->ev->poll_msecs);
 }
 
@@ -1920,6 +1923,9 @@ static ssize_t disk_events_poll_msecs_store(struct device *dev,
 	if (intv < 0 && intv != -1)
 		return -EINVAL;
 
+	if (!disk->ev)
+		return -ENODEV;
+
 	disk_block_events(disk);
 	disk->ev->poll_msecs = intv;
 	__disk_unblock_events(disk, true);
@@ -1984,7 +1990,7 @@ static void disk_alloc_events(struct gendisk *disk)
 {
 	struct disk_events *ev;
 
-	if (!disk->fops->check_events)
+	if (!disk->fops->check_events || !disk->events)
 		return;
 
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
@@ -2006,14 +2012,14 @@ static void disk_alloc_events(struct gendisk *disk)
 
 static void disk_add_events(struct gendisk *disk)
 {
-	if (!disk->ev)
-		return;
-
 	/* FIXME: error handling */
 	if (sysfs_create_files(&disk_to_dev(disk)->kobj, disk_events_attrs) < 0)
 		pr_warn("%s: failed to create sysfs files for events\n",
 			disk->disk_name);
 
+	if (!disk->ev)
+		return;
+
 	mutex_lock(&disk_events_mutex);
 	list_add_tail(&disk->ev->node, &disk_events);
 	mutex_unlock(&disk_events_mutex);
@@ -2027,14 +2033,13 @@ static void disk_add_events(struct gendisk *disk)
 
 static void disk_del_events(struct gendisk *disk)
 {
-	if (!disk->ev)
-		return;
-
-	disk_block_events(disk);
+	if (disk->ev) {
+		disk_block_events(disk);
 
-	mutex_lock(&disk_events_mutex);
-	list_del_init(&disk->ev->node);
-	mutex_unlock(&disk_events_mutex);
+		mutex_lock(&disk_events_mutex);
+		list_del_init(&disk->ev->node);
+		mutex_unlock(&disk_events_mutex);
+	}
 
 	sysfs_remove_files(&disk_to_dev(disk)->kobj, disk_events_attrs);
 }
-- 
cgit 


From 6fcc44d1d77fea3c7230e4d109b37f6977aa675a Mon Sep 17 00:00:00 2001
From: Yufen Yu <yuyufen@huawei.com>
Date: Tue, 2 Apr 2019 20:06:34 +0800
Subject: block: fix use-after-free on gendisk

commit 2da78092dda "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.

However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:

Process1		Worker			Process2
md_free
						blkdev_open
del_gendisk
  add delete_partition_work_fn() to wq
  						__blkdev_get
						get_gendisk
put_disk
  disk_release
    kfree(disk)
    						find part from ext_devt_idr
						get_disk_and_module(disk)
    					  	cause use after free

    			delete_partition_work_fn
			put_device(part)
    		  	part_release
		    	remove part from ext_devt_idr

Before <devt, hd_struct pointer> is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().

We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.

Thanks to Jan Kara for providing the solution and more clear comments
for the code.

Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/genhd.c             | 19 +++++++++++++++++++
 block/partition-generic.c |  7 +++++++
 2 files changed, 26 insertions(+)

(limited to 'block')

diff --git a/block/genhd.c b/block/genhd.c
index 1d0d25f7b0fe..83f5c33d1e80 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -531,6 +531,18 @@ void blk_free_devt(dev_t devt)
 	}
 }
 
+/**
+ *	We invalidate devt by assigning NULL pointer for devt in idr.
+ */
+void blk_invalidate_devt(dev_t devt)
+{
+	if (MAJOR(devt) == BLOCK_EXT_MAJOR) {
+		spin_lock_bh(&ext_devt_lock);
+		idr_replace(&ext_devt_idr, NULL, blk_mangle_minor(MINOR(devt)));
+		spin_unlock_bh(&ext_devt_lock);
+	}
+}
+
 static char *bdevt_str(dev_t devt, char *buf)
 {
 	if (MAJOR(devt) <= 0xff && MINOR(devt) <= 0xff) {
@@ -793,6 +805,13 @@ void del_gendisk(struct gendisk *disk)
 
 	if (!(disk->flags & GENHD_FL_HIDDEN))
 		blk_unregister_region(disk_devt(disk), disk->minors);
+	/*
+	 * Remove gendisk pointer from idr so that it cannot be looked up
+	 * while RCU period before freeing gendisk is running to prevent
+	 * use-after-free issues. Note that the device number stays
+	 * "in-use" until we really free the gendisk.
+	 */
+	blk_invalidate_devt(disk_devt(disk));
 
 	kobject_put(disk->part0.holder_dir);
 	kobject_put(disk->slave_dir);
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 8e596a8dff32..aee643ce13d1 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -285,6 +285,13 @@ void delete_partition(struct gendisk *disk, int partno)
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));
 
+	/*
+	 * Remove gendisk pointer from idr so that it cannot be looked up
+	 * while RCU period before freeing gendisk is running to prevent
+	 * use-after-free issues. Note that the device number stays
+	 * "in-use" until we really free the gendisk.
+	 */
+	blk_invalidate_devt(part_devt(part));
 	hd_struct_kill(part);
 }
 
-- 
cgit 


From f9f76879bc4521019697970bad3bc1dd0bec211f Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 19 Apr 2019 08:56:24 +0200
Subject: block: avoid scatterlist offsets > PAGE_SIZE

While we generally allow scatterlists to have offsets larger than page
size for an entry, and other subsystems like the crypto code make use of
that, the block layer isn't quite ready for that.  Flip the switch back
to avoid them for now, and revisit that decision early in a merge window
once the known offenders are fixed.

Fixes: 8a96a0e40810 ("block: rewrite blk_bvec_map_sg to avoid a nth_page call")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 247b17f2a0f6..21e87a714a73 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -474,9 +474,21 @@ static unsigned blk_bvec_map_sg(struct request_queue *q,
 	while (nbytes > 0) {
 		unsigned offset = bvec->bv_offset + total;
 		unsigned len = min(get_max_segment_size(q, offset), nbytes);
+		struct page *page = bvec->bv_page;
+
+		/*
+		 * Unfortunately a fair number of drivers barf on scatterlists
+		 * that have an offset larger than PAGE_SIZE, despite other
+		 * subsystems dealing with that invariant just fine.  For now
+		 * stick to the legacy format where we never present those from
+		 * the block layer, but the code below should be removed once
+		 * these offenders (mostly MMC/SD drivers) are fixed.
+		 */
+		page += (offset >> PAGE_SHIFT);
+		offset &= ~PAGE_MASK;
 
 		*sg = blk_next_sg(sg, sglist);
-		sg_set_page(*sg, bvec->bv_page, len, offset);
+		sg_set_page(*sg, page, len, offset);
 
 		total += len;
 		nbytes -= len;
-- 
cgit 


From 4d25339e32a1b6e1f490bb78b1e5b0fa9eb3e073 Mon Sep 17 00:00:00 2001
From: Weiping Zhang <zhangweiping@didiglobal.com>
Date: Tue, 2 Apr 2019 21:14:30 +0800
Subject: block: don't show io_timeout if driver has no timeout handler

If the low level driver has no timeout handler, the
/sys/block/<disk>/queue/io_timeout will not be displayed.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-sysfs.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 422327089e0f..a16a02c52a85 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -728,7 +728,7 @@ static struct queue_sysfs_entry throtl_sample_time_entry = {
 };
 #endif
 
-static struct attribute *default_attrs[] = {
+static struct attribute *queue_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
@@ -770,6 +770,25 @@ static struct attribute *default_attrs[] = {
 	NULL,
 };
 
+static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr,
+				int n)
+{
+	struct request_queue *q =
+		container_of(kobj, struct request_queue, kobj);
+
+	if (attr == &queue_io_timeout_entry.attr &&
+		(!q->mq_ops || !q->mq_ops->timeout))
+			return 0;
+
+	return attr->mode;
+}
+
+static struct attribute_group queue_attr_group = {
+	.attrs = queue_attrs,
+	.is_visible = queue_attr_visible,
+};
+
+
 #define to_queue(atr) container_of((atr), struct queue_sysfs_entry, attr)
 
 static ssize_t
@@ -890,7 +909,6 @@ static const struct sysfs_ops queue_sysfs_ops = {
 
 struct kobj_type blk_queue_ktype = {
 	.sysfs_ops	= &queue_sysfs_ops,
-	.default_attrs	= default_attrs,
 	.release	= blk_release_queue,
 };
 
@@ -939,6 +957,14 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	ret = sysfs_create_group(&q->kobj, &queue_attr_group);
+	if (ret) {
+		blk_trace_remove_sysfs(dev);
+		kobject_del(&q->kobj);
+		kobject_put(&dev->kobj);
+		goto unlock;
+	}
+
 	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
-- 
cgit 


From 551879a48f01826fd86568d7bd1e774cb0de3295 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 23 Apr 2019 10:51:04 +0800
Subject: block: clarify that bio_add_page() and related helpers can add multi
 pages

bio_add_page() and __bio_add_page() are capable of adding pages into
bio, and now we have at least two such usages alreay:

	- __bio_iov_bvec_add_pages()
	- nvmet_bdev_execute_rw().

So update comments on these two helpers.

The thing is a bit special for __bio_try_merge_page(), given the caller
needs to know if the new added page is same with the last added page,
then it isn't safe to pass multi-page in case that 'same_page' is true,
so adds warning on potential misuse, and updates comment on
__bio_try_merge_page().

Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index 5959141d4e46..c81ed2dfee53 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -667,6 +667,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 			return false;
 	}
 
+	WARN_ON_ONCE(same_page && (len + off) > PAGE_SIZE);
+
 	return true;
 }
 
@@ -786,9 +788,9 @@ EXPORT_SYMBOL(bio_add_pc_page);
 /**
  * __bio_try_merge_page - try appending data to an existing bvec.
  * @bio: destination bio
- * @page: page to add
+ * @page: start page to add
  * @len: length of the data to add
- * @off: offset of the data in @page
+ * @off: offset of the data relative to @page
  * @same_page: if %true only merge if the new data is in the same physical
  *		page as the last segment of the bio.
  *
@@ -796,6 +798,8 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * a useful optimisation for file systems with a block size smaller than the
  * page size.
  *
+ * Warn if (@len, @off) crosses pages in case that @same_page is true.
+ *
  * Return %true on success or %false on failure.
  */
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
@@ -818,11 +822,11 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 EXPORT_SYMBOL_GPL(__bio_try_merge_page);
 
 /**
- * __bio_add_page - add page to a bio in a new segment
+ * __bio_add_page - add page(s) to a bio in a new segment
  * @bio: destination bio
- * @page: page to add
- * @len: length of the data to add
- * @off: offset of the data in @page
+ * @page: start page to add
+ * @len: length of the data to add, may cross pages
+ * @off: offset of the data relative to @page, may cross pages
  *
  * Add the data at @page + @off to @bio as a new bvec.  The caller must ensure
  * that @bio has space for another bvec.
@@ -845,13 +849,13 @@ void __bio_add_page(struct bio *bio, struct page *page,
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
 /**
- *	bio_add_page	-	attempt to add page to bio
+ *	bio_add_page	-	attempt to add page(s) to bio
  *	@bio: destination bio
- *	@page: page to add
- *	@len: vec entry length
- *	@offset: vec entry offset
+ *	@page: start page to add
+ *	@len: vec entry length, may cross pages
+ *	@offset: vec entry offset relative to @page, may cross pages
  *
- *	Attempt to add a page to the bio_vec maplist. This will only fail
+ *	Attempt to add page(s) to the bio_vec maplist. This will only fail
  *	if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
  */
 int bio_add_page(struct bio *bio, struct page *page,
-- 
cgit 


From 0257c0ed5ea3de3e32cb322852c4c40bc09d1b97 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Wed, 24 Apr 2019 19:01:46 +0800
Subject: block: don't run get_page() on pages from non-bvec iov iter

The refcount has been increased for pages retrieved from non-bvec iov iter
via __bio_iov_iter_get_pages(), so don't need to do that again.

Otherwise, IO pages are leaked easily.

Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Fixes: 7321ecbfc7cf ("block: change how we get page references in bio_iov_iter_get_pages")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index c81ed2dfee53..662d45752ec5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -992,7 +992,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	if (iov_iter_bvec_no_ref(iter))
 		bio_set_flag(bio, BIO_NO_PAGE_REF);
-	else
+	else if (is_bvec)
 		bio_get_pages(bio);
 
 	return bio->bi_vcnt ? 0 : ret;
-- 
cgit 


From 2b070cfe582b8e99fec6ada57d2e59e194aae202 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 25 Apr 2019 09:03:00 +0200
Subject: block: remove the i argument to bio_for_each_segment_all

We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c    | 29 ++++++++++-------------------
 block/bounce.c |  3 +--
 2 files changed, 11 insertions(+), 21 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index 662d45752ec5..9ad0d00cdc9b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -874,9 +874,8 @@ static void bio_get_pages(struct bio *bio)
 {
 	struct bvec_iter_all iter_all;
 	struct bio_vec *bvec;
-	int i;
 
-	bio_for_each_segment_all(bvec, bio, i, iter_all)
+	bio_for_each_segment_all(bvec, bio, iter_all)
 		get_page(bvec->bv_page);
 }
 
@@ -884,9 +883,8 @@ static void bio_release_pages(struct bio *bio)
 {
 	struct bvec_iter_all iter_all;
 	struct bio_vec *bvec;
-	int i;
 
-	bio_for_each_segment_all(bvec, bio, i, iter_all)
+	bio_for_each_segment_all(bvec, bio, iter_all)
 		put_page(bvec->bv_page);
 }
 
@@ -1166,11 +1164,10 @@ static struct bio_map_data *bio_alloc_map_data(struct iov_iter *data,
  */
 static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
 {
-	int i;
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter_all) {
 		ssize_t ret;
 
 		ret = copy_page_from_iter(bvec->bv_page,
@@ -1198,11 +1195,10 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
  */
 static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 {
-	int i;
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter_all) {
 		ssize_t ret;
 
 		ret = copy_page_to_iter(bvec->bv_page,
@@ -1223,10 +1219,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 void bio_free_pages(struct bio *bio)
 {
 	struct bio_vec *bvec;
-	int i;
 	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i, iter_all)
+	bio_for_each_segment_all(bvec, bio, iter_all)
 		__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1464,7 +1459,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 	return bio;
 
  out_unmap:
-	bio_for_each_segment_all(bvec, bio, j, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter_all) {
 		put_page(bvec->bv_page);
 	}
 	bio_put(bio);
@@ -1474,13 +1469,12 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 static void __bio_unmap_user(struct bio *bio)
 {
 	struct bio_vec *bvec;
-	int i;
 	struct bvec_iter_all iter_all;
 
 	/*
 	 * make sure we dirty pages we wrote to
 	 */
-	bio_for_each_segment_all(bvec, bio, i, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (bio_data_dir(bio) == READ)
 			set_page_dirty_lock(bvec->bv_page);
 
@@ -1571,10 +1565,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
 {
 	char *p = bio->bi_private;
 	struct bio_vec *bvec;
-	int i;
 	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter_all) {
 		memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
 		p += bvec->bv_len;
 	}
@@ -1682,10 +1675,9 @@ cleanup:
 void bio_set_pages_dirty(struct bio *bio)
 {
 	struct bio_vec *bvec;
-	int i;
 	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (!PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
 	}
@@ -1734,10 +1726,9 @@ void bio_check_pages_dirty(struct bio *bio)
 {
 	struct bio_vec *bvec;
 	unsigned long flags;
-	int i;
 	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, i, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
 			goto defer;
 	}
diff --git a/block/bounce.c b/block/bounce.c
index 47eb7e936e22..f8ed677a1bf7 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -163,14 +163,13 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool)
 {
 	struct bio *bio_orig = bio->bi_private;
 	struct bio_vec *bvec, orig_vec;
-	int i;
 	struct bvec_iter orig_iter = bio_orig->bi_iter;
 	struct bvec_iter_all iter_all;
 
 	/*
 	 * free up bounce indirect pages used
 	 */
-	bio_for_each_segment_all(bvec, bio, i, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter_all) {
 		orig_vec = bio_iter_iovec(bio_orig, orig_iter);
 		if (bvec->bv_page != orig_vec.bv_page) {
 			dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
-- 
cgit 


From 4713839dfe8269d27d83a33d1e39f9c2970eb31a Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 25 Apr 2019 09:04:33 +0200
Subject: block: remove the __bio_add_pc_page export

The same page optimization is a rather odd corner case, which is not
used outside bio.c and which really should not be used outside of bio.c
either - we have better highlevel helpers like the rq/bio mapping
helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index 9ad0d00cdc9b..e717b303e1fb 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -709,7 +709,7 @@ static bool can_add_page_to_seg(struct request_queue *q,
  *
  *	This should only be used by passthrough bios.
  */
-int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
+static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		bool put_same_page)
 {
@@ -776,7 +776,6 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	bio_set_flag(bio, BIO_SEG_VALID);
 	return len;
 }
-EXPORT_SYMBOL(__bio_add_pc_page);
 
 int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset)
-- 
cgit 


From 6601e44efd20efddc183c85131216200e90c5728 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 25 Apr 2019 09:04:34 +0200
Subject: block: remove bogus comments in __bio_add_pc_page

We are never called with file system pages by defintions for the
passthrough interface, and we also never undo any addition later
these days.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 9 ---------
 1 file changed, 9 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index e717b303e1fb..de26dc18bceb 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -724,11 +724,6 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
 		return 0;
 
-	/*
-	 * For filesystems with a blocksize smaller than the pagesize
-	 * we will often be called with the same page as last time and
-	 * a consecutive offset.  Optimize this special case.
-	 */
 	if (bio->bi_vcnt > 0) {
 		bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
@@ -760,10 +755,6 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	if (bio->bi_phys_segments >= queue_max_segments(q))
 		return 0;
 
-	/*
-	 * setup the new entry, we might clear it again later if we
-	 * cannot add the page
-	 */
 	bvec = &bio->bi_io_vec[bio->bi_vcnt];
 	bvec->bv_page = page;
 	bvec->bv_len = len;
-- 
cgit 


From dcdca753c152efe8d86ec7a15423307807a516a7 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 25 Apr 2019 09:04:35 +0200
Subject: block: clean up __bio_add_pc_page a bit

Share the bi_size update by moving the done label up, and duplicate
the bv_len update in the two callers to get rid of the bvec_merge
label.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

(limited to 'block')

diff --git a/block/bio.c b/block/bio.c
index de26dc18bceb..029afb121a48 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -731,9 +731,7 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		    offset == bvec->bv_offset + bvec->bv_len) {
 			if (put_same_page)
 				put_page(page);
- bvec_merge:
 			bvec->bv_len += len;
-			bio->bi_iter.bi_size += len;
 			goto done;
 		}
 
@@ -745,8 +743,10 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 			return 0;
 
 		if (page_is_mergeable(bvec, page, len, offset, false) &&
-				can_add_page_to_seg(q, bvec, page, len, offset))
-			goto bvec_merge;
+		    can_add_page_to_seg(q, bvec, page, len, offset)) {
+			bvec->bv_len += len;
+			goto done;
+		}
 	}
 
 	if (bio_full(bio))
@@ -760,9 +760,8 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	bvec->bv_len = len;
 	bvec->bv_offset = offset;
 	bio->bi_vcnt++;
-	bio->bi_iter.bi_size += len;
-
  done:
+	bio->bi_iter.bi_size += len;
 	bio->bi_phys_segments = bio->bi_vcnt;
 	bio_set_flag(bio, BIO_SEG_VALID);
 	return len;
-- 
cgit 


From 8c16567d867ed3185a67d8560e051090486d3ff1 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 30 Apr 2019 14:42:39 -0400
Subject: block: switch all files cleared marked as GPLv2 to SPDX tags

All these files have some form of the usual GPLv2 boilerplate.  Switch
them to use SPDX tags instead.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/badblocks.c        | 10 +---------
 block/bio-integrity.c    | 16 +---------------
 block/bio.c              | 15 +--------------
 block/blk-flush.c        |  3 +--
 block/blk-integrity.c    | 16 +---------------
 block/blk-mq-debugfs.c   | 13 +------------
 block/blk-mq-pci.c       | 10 +---------
 block/blk-mq-rdma.c      | 10 +---------
 block/blk-mq-virtio.c    | 10 +---------
 block/bsg.c              |  9 +--------
 block/kyber-iosched.c    | 13 +------------
 block/opal_proto.h       | 10 +---------
 block/partitions/acorn.c |  7 +------
 block/scsi_ioctl.c       | 16 +---------------
 block/sed-opal.c         | 10 +---------
 block/t10-pi.c           | 19 +------------------
 16 files changed, 16 insertions(+), 171 deletions(-)

(limited to 'block')

diff --git a/block/badblocks.c b/block/badblocks.c
index 91f7bcf979d3..2e5f5697db35 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -1,18 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Bad block management
  *
  * - Heavily based on MD badblocks code from Neil Brown
  *
  * Copyright (c) 2015, Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
  */
 
 #include <linux/badblocks.h>
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 1b633a3526d4..42536674020a 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -1,23 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * bio-integrity.c - bio data integrity extensions
  *
  * Copyright (C) 2007, 2008, 2009 Oracle Corporation
  * Written by: Martin K. Petersen <martin.petersen@oracle.com>
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License version
- * 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; see the file COPYING.  If not, write to
- * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
- * USA.
- *
  */
 
 #include <linux/blkdev.h>
diff --git a/block/bio.c b/block/bio.c
index 029afb121a48..683cbb40f051 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1,19 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2001 Jens Axboe <axboe@kernel.dk>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public Licens
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-
- *
  */
 #include <linux/mm.h>
 #include <linux/swap.h>
diff --git a/block/blk-flush.c b/block/blk-flush.c
index d95f94892015..aedd9320e605 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,11 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Functions to sequence PREFLUSH and FUA writes.
  *
  * Copyright (C) 2011		Max Planck Institute for Gravitational Physics
  * Copyright (C) 2011		Tejun Heo <tj@kernel.org>
  *
- * This file is released under the GPLv2.
- *
  * REQ_{PREFLUSH|FUA} requests are decomposed to sequences consisted of three
  * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
  * properties and hardware capability.
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index d1ab089e0919..7f302f7b9d84 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -1,23 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * blk-integrity.c - Block layer data integrity extensions
  *
  * Copyright (C) 2007, 2008 Oracle Corporation
  * Written by: Martin K. Petersen <martin.petersen@oracle.com>
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License version
- * 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; see the file COPYING.  If not, write to
- * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
- * USA.
- *
  */
 
 #include <linux/blkdev.h>
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ec1d18cb643c..6aea0ebc3a73 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -1,17 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2017 Facebook
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program.  If not, see <https://www.gnu.org/licenses/>.
  */
 
 #include <linux/kernel.h>
diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index 1dce18553984..ad4545a2a98b 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -1,14 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (c) 2016 Christoph Hellwig.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
  */
 #include <linux/kobject.h>
 #include <linux/blkdev.h>
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 45030a81a1ed..cc921e6ba709 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -1,14 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (c) 2017 Sagi Grimberg.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
  */
 #include <linux/blk-mq.h>
 #include <linux/blk-mq-rdma.h>
diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
index 370827163835..75a52c18a8f6 100644
--- a/block/blk-mq-virtio.c
+++ b/block/blk-mq-virtio.c
@@ -1,14 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (c) 2016 Christoph Hellwig.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
  */
 #include <linux/device.h>
 #include <linux/blk-mq.h>
diff --git a/block/bsg.c b/block/bsg.c
index f306853c6b08..833c44b3d458 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -1,13 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * bsg.c - block layer implementation of the sg v4 interface
- *
- * Copyright (C) 2004 Jens Axboe <axboe@suse.de> SUSE Labs
- * Copyright (C) 2004 Peter M. Jones <pjones@redhat.com>
- *
- *  This file is subject to the terms and conditions of the GNU General Public
- *  License version 2.  See the file "COPYING" in the main directory of this
- *  archive for more details.
- *
  */
 #include <linux/module.h>
 #include <linux/init.h>
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index ec6a04e01bc1..c3b05119cebd 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -1,20 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * The Kyber I/O scheduler. Controls latency by throttling queue depths using
  * scalable techniques.
  *
  * Copyright (C) 2017 Facebook
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program.  If not, see <https://www.gnu.org/licenses/>.
  */
 
 #include <linux/kernel.h>
diff --git a/block/opal_proto.h b/block/opal_proto.h
index b6e352cfe982..d9a05ad02eb5 100644
--- a/block/opal_proto.h
+++ b/block/opal_proto.h
@@ -1,18 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  * Copyright © 2016 Intel Corporation
  *
  * Authors:
  *    Rafael Antognolli <rafael.antognolli@intel.com>
  *    Scott  Bauer      <scott.bauer@intel.com>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
  */
 #include <linux/types.h>
 
diff --git a/block/partitions/acorn.c b/block/partitions/acorn.c
index fbeb697374d5..7587700fad4a 100644
--- a/block/partitions/acorn.c
+++ b/block/partitions/acorn.c
@@ -1,12 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
- *  linux/fs/partitions/acorn.c
- *
  *  Copyright (c) 1996-2000 Russell King.
  *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
  *  Scan ADFS partitions on hard disk drives.  Unfortunately, there
  *  isn't a standard for partitioning drives on Acorn machines, so
  *  every single manufacturer of SCSI and IDE cards created their own
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 533f4aee8567..f5e0ad65e86a 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -1,20 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2001 Jens Axboe <axboe@suse.de>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- *
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public Licens
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-
- *
  */
 #include <linux/kernel.h>
 #include <linux/errno.h>
diff --git a/block/sed-opal.c b/block/sed-opal.c
index b1aa0cc25803..a46e8d13e16d 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -1,18 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright © 2016 Intel Corporation
  *
  * Authors:
  *    Scott  Bauer      <scott.bauer@intel.com>
  *    Rafael Antognolli <rafael.antognolli@intel.com>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- * more details.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ":OPAL: " fmt
diff --git a/block/t10-pi.c b/block/t10-pi.c
index 62aed77d0bb9..0c0094609dd6 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -1,24 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * t10_pi.c - Functions for generating and verifying T10 Protection
  *	      Information.
- *
- * Copyright (C) 2007, 2008, 2014 Oracle Corporation
- * Written by: Martin K. Petersen <martin.petersen@oracle.com>
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License version
- * 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; see the file COPYING.  If not, write to
- * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
- * USA.
- *
  */
 
 #include <linux/t10-pi.h>
-- 
cgit 


From a497ee34a45d58e9b978d0fa5c4b25d4813eb350 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 30 Apr 2019 14:42:40 -0400
Subject: block: switch all files cleared marked as GPLv2 or later to SPDX tags

All these files have some form of the usual GPLv2 or later boilerplate.
Switch them to use SPDX tags instead.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-cgroup.c     | 11 +----------
 block/bfq-iosched.c    | 11 +----------
 block/bfq-iosched.h    | 11 +----------
 block/bfq-wf2q.c       | 11 +----------
 block/bsg-lib.c        | 16 +---------------
 block/partitions/efi.c | 16 +---------------
 block/partitions/efi.h | 16 +---------------
 block/partitions/ldm.c | 16 +---------------
 block/partitions/ldm.h | 16 +---------------
 9 files changed, 9 insertions(+), 115 deletions(-)

(limited to 'block')

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 793c027ca60e..b3796a40a61a 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -1,15 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
 /*
  * cgroups support for the BFQ I/O scheduler.
- *
- *  This program is free software; you can redistribute it and/or
- *  modify it under the terms of the GNU General Public License as
- *  published by the Free Software Foundation; either version 2 of the
- *  License, or (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- *  General Public License for more details.
  */
 #include <linux/module.h>
 #include <linux/slab.h>
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b85a4ab8b9db..f8d430f88d25 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
 /*
  * Budget Fair Queueing (BFQ) I/O scheduler.
  *
@@ -12,16 +13,6 @@
  *
  * Copyright (C) 2017 Paolo Valente <paolo.valente@linaro.org>
  *
- *  This program is free software; you can redistribute it and/or
- *  modify it under the terms of the GNU General Public License as
- *  published by the Free Software Foundation; either version 2 of the
- *  License, or (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- *  General Public License for more details.
- *
  * BFQ is a proportional-share I/O scheduler, with some extra
  * low-latency capabilities. BFQ also supports full hierarchical
  * scheduling through cgroups. Next paragraphs provide an introduction
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index eba7cd449ab4..c2faa77824f8 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -1,16 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
 /*
  * Header file for the BFQ I/O scheduler: data structures and
  * prototypes of interface functions among BFQ components.
- *
- *  This program is free software; you can redistribute it and/or
- *  modify it under the terms of the GNU General Public License as
- *  published by the Free Software Foundation; either version 2 of the
- *  License, or (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- *  General Public License for more details.
  */
 #ifndef _BFQ_H
 #define _BFQ_H
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 48d899cfbe03..c9ba225081ce 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1,19 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
 /*
  * Hierarchical Budget Worst-case Fair Weighted Fair Queueing
  * (B-WF2Q+): hierarchical scheduling algorithm by which the BFQ I/O
  * scheduler schedules generic entities. The latter can represent
  * either single bfq queues (associated with processes) or groups of
  * bfq queues (associated with cgroups).
- *
- *  This program is free software; you can redistribute it and/or
- *  modify it under the terms of the GNU General Public License as
- *  published by the Free Software Foundation; either version 2 of the
- *  License, or (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- *  General Public License for more details.
  */
 #include "bfq-iosched.h"
 
diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index 005e2b75d775..b898a1cdf872 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -1,24 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
 /*
  *  BSG helper library
  *
  *  Copyright (C) 2008   James Smart, Emulex Corporation
  *  Copyright (C) 2011   Red Hat, Inc.  All rights reserved.
  *  Copyright (C) 2011   Mike Christie
- *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License as published by
- *  the Free Software Foundation; either version 2 of the License, or
- *  (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- *  GNU General Public License for more details.
- *
- *  You should have received a copy of the GNU General Public License
- *  along with this program; if not, write to the Free Software
- *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
- *
  */
 #include <linux/slab.h>
 #include <linux/blk-mq.h>
diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 39f70d968754..db2fef7dfc47 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
 /************************************************************
  * EFI GUID Partition Table handling
  *
@@ -7,21 +8,6 @@
  * efi.[ch] by Matt Domsch <Matt_Domsch@dell.com>
  *   Copyright 2000,2001,2002,2004 Dell Inc.
  *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License as published by
- *  the Free Software Foundation; either version 2 of the License, or
- *  (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- *  GNU General Public License for more details.
- *
- *  You should have received a copy of the GNU General Public License
- *  along with this program; if not, write to the Free Software
- *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
- *
- *
  * TODO:
  *
  * Changelog:
diff --git a/block/partitions/efi.h b/block/partitions/efi.h
index abd0b19288a6..3e8576157575 100644
--- a/block/partitions/efi.h
+++ b/block/partitions/efi.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
 /************************************************************
  * EFI GUID Partition Table
  * Per Intel EFI Specification v1.02
@@ -5,21 +6,6 @@
  *
  * By Matt Domsch <Matt_Domsch@dell.com>  Fri Sep 22 22:15:56 CDT 2000  
  *   Copyright 2000,2001 Dell Inc.
- *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License as published by
- *  the Free Software Foundation; either version 2 of the License, or
- *  (at your option) any later version.
- * 
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- *  GNU General Public License for more details.
- *
- *  You should have received a copy of the GNU General Public License
- *  along with this program; if not, write to the Free Software
- *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
- * 
  ************************************************************/
 
 #ifndef FS_PART_EFI_H_INCLUDED
diff --git a/block/partitions/ldm.c b/block/partitions/ldm.c
index 16766f267559..6db573f33219 100644
--- a/block/partitions/ldm.c
+++ b/block/partitions/ldm.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
 /**
  * ldm - Support for Windows Logical Disk Manager (Dynamic Disks)
  *
@@ -6,21 +7,6 @@
  * Copyright (C) 2001,2002 Jakob Kemi <jakob.kemi@telia.com>
  *
  * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads 
- *
- * This program is free software; you can redistribute it and/or modify it under
- * the terms of the GNU General Public License as published by the Free Software
- * Foundation; either version 2 of the License, or (at your option) any later
- * version.
- *
- * This program is distributed in the hope that it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
- * FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program (in the main directory of the source in the file COPYING); if
- * not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330,
- * Boston, MA  02111-1307  USA
  */
 
 #include <linux/slab.h>
diff --git a/block/partitions/ldm.h b/block/partitions/ldm.h
index f4c6055df956..1ca63e97bccc 100644
--- a/block/partitions/ldm.h
+++ b/block/partitions/ldm.h
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
 /**
  * ldm - Part of the Linux-NTFS project.
  *
@@ -6,21 +7,6 @@
  * Copyright (C) 2001,2002 Jakob Kemi <jakob.kemi@telia.com>
  *
  * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads 
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or (at your option)
- * any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program (in the main directory of the Linux-NTFS source
- * in the file COPYING); if not, write to the Free Software Foundation,
- * Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
 
 #ifndef _FS_PT_LDM_H_
-- 
cgit 


From 3dcf60bcb603f56361abb364a4cd2f69677453f0 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 30 Apr 2019 14:42:43 -0400
Subject: block: add SPDX tags to block layer files missing licensing
 information

Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-cgroup.c        | 1 +
 block/blk-core.c          | 1 +
 block/blk-exec.c          | 1 +
 block/blk-iolatency.c     | 1 +
 block/blk-mq-cpumap.c     | 1 +
 block/blk-mq-sched.c      | 1 +
 block/blk-mq-sysfs.c      | 1 +
 block/blk-mq-tag.c        | 1 +
 block/blk-mq.c            | 1 +
 block/blk-rq-qos.c        | 2 ++
 block/blk-rq-qos.h        | 1 +
 block/blk-settings.c      | 1 +
 block/blk-stat.c          | 1 +
 block/blk-timeout.c       | 1 +
 block/blk-wbt.c           | 1 +
 block/blk-zoned.c         | 1 +
 block/elevator.c          | 1 +
 block/genhd.c             | 1 +
 block/ioctl.c             | 1 +
 block/ioprio.c            | 1 +
 block/mq-deadline.c       | 1 +
 block/partitions/aix.h    | 1 +
 block/partitions/amiga.h  | 1 +
 block/partitions/ibm.h    | 1 +
 block/partitions/karma.h  | 1 +
 block/partitions/msdos.h  | 1 +
 block/partitions/osf.h    | 1 +
 block/partitions/sgi.h    | 1 +
 block/partitions/sun.h    | 1 +
 block/partitions/sysv68.h | 1 +
 block/partitions/ultrix.h | 1 +
 31 files changed, 32 insertions(+)

(limited to 'block')

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 617a2b3f7582..b97b479e4f64 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Common Block IO controller cgroup interface
  *
diff --git a/block/blk-core.c b/block/blk-core.c
index a55389ba8779..b044829135c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 1991, 1992 Linus Torvalds
  * Copyright (C) 1994,      Karl Keyte: Added support for disk statistics
diff --git a/block/blk-exec.c b/block/blk-exec.c
index a34b7d918742..1db44ca0f4a6 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Functions related to setting various queue properties from drivers
  */
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 507212d75ee2..d22e61bced86 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Block rq-qos base io controller
  *
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 03a534820271..48bebf00a5f3 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * CPU <-> hardware queue mapping helpers
  *
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index aa6bc5c02643..f6e3b10b52eb 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * blk-mq scheduling framework
  *
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3f9c3f4ac44c..61efc2a29e58 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/backing-dev.h>
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a4931fc7be8a..7513c8eaabee 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Tag allocation using scalable bitmaps. Uses active queue tracking to support
  * fairer distribution of tags between multiple submitters when a shared tag map
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc60ed7e940e..4f15adfbab29 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Block multiqueue core code
  *
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index d169d7188fa6..3f55b56f24bc 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+
 #include "blk-rq-qos.h"
 
 /*
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 564851889550..2300e038b9fa 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 #ifndef RQ_QOS_H
 #define RQ_QOS_H
 
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6375afaedcec..ec150f88db09 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Functions related to setting various queue properties from drivers
  */
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 696a04176e4d..940f15d600f8 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Block stat tracking code
  *
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 124c26128bf6..8aa68fae96ad 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Functions related to generic timeout handling of requests.
  */
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index fd166fbb0f65..313f45a37e9d 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * buffered writeback throttling. loosely based on CoDel. We can't drop
  * packets for IO scheduling, so the logic is something like this:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 2d98803faec2..ae7e91bd0618 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Zoned block device handling
  *
diff --git a/block/elevator.c b/block/elevator.c
index 2e5399d9f40f..ec55d5fc0b3e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  *  Block device elevator/IO-scheduler.
  *
diff --git a/block/genhd.c b/block/genhd.c
index 83f5c33d1e80..ad6826628e79 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  *  gendisk handling
  */
diff --git a/block/ioctl.c b/block/ioctl.c
index 4825c78a6baa..15a0eb80ada9 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 #include <linux/capability.h>
 #include <linux/blkdev.h>
 #include <linux/export.h>
diff --git a/block/ioprio.c b/block/ioprio.c
index f9821080c92c..2e0559f157c8 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * fs/ioprio.c
  *
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 14288f864e94..1876f5712bfd 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  *  MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
  *  for the blk-mq scheduling framework
diff --git a/block/partitions/aix.h b/block/partitions/aix.h
index e0c66a987523..b4449f0b9f2b 100644
--- a/block/partitions/aix.h
+++ b/block/partitions/aix.h
@@ -1 +1,2 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 extern int aix_partition(struct parsed_partitions *state);
diff --git a/block/partitions/amiga.h b/block/partitions/amiga.h
index d094585cadaa..7e63f4d9d969 100644
--- a/block/partitions/amiga.h
+++ b/block/partitions/amiga.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  *  fs/partitions/amiga.h
  */
diff --git a/block/partitions/ibm.h b/block/partitions/ibm.h
index 08fb0804a812..8bf13febb2b6 100644
--- a/block/partitions/ibm.h
+++ b/block/partitions/ibm.h
@@ -1 +1,2 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 int ibm_partition(struct parsed_partitions *);
diff --git a/block/partitions/karma.h b/block/partitions/karma.h
index c764b2e9df21..48e074d417fb 100644
--- a/block/partitions/karma.h
+++ b/block/partitions/karma.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  *  fs/partitions/karma.h
  */
diff --git a/block/partitions/msdos.h b/block/partitions/msdos.h
index 38c781c490b3..fcacfc486092 100644
--- a/block/partitions/msdos.h
+++ b/block/partitions/msdos.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  *  fs/partitions/msdos.h
  */
diff --git a/block/partitions/osf.h b/block/partitions/osf.h
index 20ed2315ec16..4d8088e7ea8c 100644
--- a/block/partitions/osf.h
+++ b/block/partitions/osf.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  *  fs/partitions/osf.h
  */
diff --git a/block/partitions/sgi.h b/block/partitions/sgi.h
index b9553ebdd5a9..a5b77c3987cf 100644
--- a/block/partitions/sgi.h
+++ b/block/partitions/sgi.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  *  fs/partitions/sgi.h
  */
diff --git a/block/partitions/sun.h b/block/partitions/sun.h
index 2424baa8319f..ae1b9eed3fd7 100644
--- a/block/partitions/sun.h
+++ b/block/partitions/sun.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  *  fs/partitions/sun.h
  */
diff --git a/block/partitions/sysv68.h b/block/partitions/sysv68.h
index bf2f5ffa97ac..4fb6b8ec78ae 100644
--- a/block/partitions/sysv68.h
+++ b/block/partitions/sysv68.h
@@ -1 +1,2 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 extern int sysv68_partition(struct parsed_partitions *state);
diff --git a/block/partitions/ultrix.h b/block/partitions/ultrix.h
index a3cc00b2bded..9f676cead222 100644
--- a/block/partitions/ultrix.h
+++ b/block/partitions/ultrix.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  *  fs/partitions/ultrix.h
  */
-- 
cgit 


From 12adb7a013e318de553ccee4a006a718667972b3 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 30 Apr 2019 13:56:16 -0400
Subject: block: remove the unused blk_queue_dma_pad function

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-settings.c | 16 ----------------
 1 file changed, 16 deletions(-)

(limited to 'block')

diff --git a/block/blk-settings.c b/block/blk-settings.c
index ec150f88db09..3facc41476be 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -663,22 +663,6 @@ void disk_stack_limits(struct gendisk *disk, struct block_device *bdev,
 }
 EXPORT_SYMBOL(disk_stack_limits);
 
-/**
- * blk_queue_dma_pad - set pad mask
- * @q:     the request queue for the device
- * @mask:  pad mask
- *
- * Set dma pad mask.
- *
- * Appending pad buffer to a request modifies the last entry of a
- * scatter list such that it includes the pad buffer.
- **/
-void blk_queue_dma_pad(struct request_queue *q, unsigned int mask)
-{
-	q->dma_pad_mask = mask;
-}
-EXPORT_SYMBOL(blk_queue_dma_pad);
-
 /**
  * blk_queue_update_dma_pad - update pad mask
  * @q:     the request queue for the device
-- 
cgit 


From 273938bf7ae92112e646f9a46b39aa74b64be4e8 Mon Sep 17 00:00:00 2001
From: Raul E Rangel <rrangel@chromium.org>
Date: Thu, 2 May 2019 13:48:11 -0600
Subject: block: fix function name in comment

The comment was out of date.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Raul E Rangel <rrangel@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4f15adfbab29..c9bf9b92d2db 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2063,7 +2063,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		list_del_init(&page->lru);
 		/*
 		 * Remove kmemleak object previously allocated in
-		 * blk_mq_init_rq_map().
+		 * blk_mq_alloc_rqs().
 		 */
 		kmemleak_free(page_address(page));
 		__free_pages(page, page->private);
-- 
cgit 


From e87eb301bee183d82bb3d04bd71b6660889a2588 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 30 Apr 2019 09:52:23 +0800
Subject: blk-mq: grab .q_usage_counter when queuing request from plug code
 path

Just like aio/io_uring, we need to grab 2 refcount for queuing one
request, one is for submission, another is for completion.

If the request isn't queued from plug code path, the refcount grabbed
in generic_make_request() serves for submission. In theroy, this
refcount should have been released after the sumission(async run queue)
is done. blk_freeze_queue() works with blk_sync_queue() together
for avoiding race between cleanup queue and IO submission, given async
run queue activities are canceled because hctx->run_work is scheduled with
the refcount held, so it is fine to not hold the refcount when
running the run queue work function for dispatch IO.

However, if request is staggered into plug list, and finally queued
from plug code path, the refcount in submission side is actually missed.
And we may start to run queue after queue is removed because the queue's
kobject refcount isn't guaranteed to be grabbed in flushing plug list
context, then kernel oops is triggered, see the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of concurrent run queue, all requests inserted above may be
completed before calling the above blk_mq_run_hw_queue. Then queue can
be freed during the above blk_mq_run_hw_queue().

Fixes the issue by grab .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
safe because the queue is absolutely alive before inserting request.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-sched.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index f6e3b10b52eb..74c6bb871f7e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -414,6 +414,14 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 				  struct list_head *list, bool run_queue_async)
 {
 	struct elevator_queue *e;
+	struct request_queue *q = hctx->queue;
+
+	/*
+	 * blk_mq_sched_insert_requests() is called from flush plug
+	 * context only, and hold one usage counter to prevent queue
+	 * from being released.
+	 */
+	percpu_ref_get(&q->q_usage_counter);
 
 	e = hctx->queue->elevator;
 	if (e && e->type->ops.insert_requests)
@@ -427,12 +435,14 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 		if (!hctx->dispatch_busy && !e && !run_queue_async) {
 			blk_mq_try_issue_list_directly(hctx, list);
 			if (list_empty(list))
-				return;
+				goto out;
 		}
 		blk_mq_insert_requests(hctx, ctx, list);
 	}
 
 	blk_mq_run_hw_queue(hctx, run_queue_async);
+ out:
+	percpu_ref_put(&q->q_usage_counter);
 }
 
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
-- 
cgit 


From fbc2a15e3433058582e5635aabe48a3011a644a8 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 30 Apr 2019 09:52:24 +0800
Subject: blk-mq: move cancel of requeue_work into blk_mq_release

With holding queue's kobject refcount, it is safe for driver
to schedule requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activities, then requeue work may not be completed when
freeing queue, and kernel oops is triggered.

So moving the cancel of requeue_work into blk_mq_release() for
avoiding race between requeue and freeing queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index b044829135c9..2af1040b2fa6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -238,7 +238,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c9bf9b92d2db..741cf8d55e9c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2635,6 +2635,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
cgit 


From c7e2d94b3d1634988a95ac4d77a72dc7487ece06 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 30 Apr 2019 09:52:25 +0800
Subject: blk-mq: free hw queue's resource in hctx's release handler

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways for addressing this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But still can't cover all cases, recently James reports another such
kind of issue:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.

Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.

This approach follows typical design pattern wrt. kobject's release handler.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c     | 2 +-
 block/blk-mq-sysfs.c | 6 ++++++
 block/blk-mq.c       | 8 ++------
 block/blk-mq.h       | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 2af1040b2fa6..81d209568a26 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -375,7 +375,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	blk_exit_queue(q);
 
 	if (queue_is_mq(q))
-		blk_mq_free_queue(q);
+		blk_mq_exit_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 61efc2a29e58..7593c4c78975 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/smp.h>
 
 #include <linux/blk-mq.h>
+#include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
@@ -34,6 +35,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 {
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
+
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(hctx->srcu);
+	blk_free_flush_queue(hctx->fq);
+	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
 	kfree(hctx->ctxs);
 	kfree(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 741cf8d55e9c..1fdb8de92a10 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2268,12 +2268,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
-
 	blk_mq_remove_cpuhp(hctx);
-	blk_free_flush_queue(hctx->fq);
-	sbitmap_free(&hctx->ctx_map);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2908,7 +2903,8 @@ err_exit:
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
 
-void blk_mq_free_queue(struct request_queue *q)
+/* tags can _not_ be used after returning from blk_mq_exit_queue */
+void blk_mq_exit_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 423ea88ab6fb..633a5a77ee8b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,7 +37,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-void blk_mq_free_queue(struct request_queue *q);
+void blk_mq_exit_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
-- 
cgit 


From 7c6c5b7c9186e3fb5b10afb8e5f710ae661144c6 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 30 Apr 2019 09:52:26 +0800
Subject: blk-mq: split blk_mq_alloc_and_init_hctx into two parts

Split blk_mq_alloc_and_init_hctx into two parts, and one is
blk_mq_alloc_hctx() for allocating all hctx resources, another
is blk_mq_init_hctx() for initializing hctx, which serves as
counter-part of blk_mq_exit_hctx().

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c | 139 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 75 insertions(+), 64 deletions(-)

(limited to 'block')

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fdb8de92a10..17e63d80b6d6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2285,15 +2285,65 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
+static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
+{
+	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
+
+	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
+			   __alignof__(struct blk_mq_hw_ctx)) !=
+		     sizeof(struct blk_mq_hw_ctx));
+
+	if (tag_set->flags & BLK_MQ_F_BLOCKING)
+		hw_ctx_size += sizeof(struct srcu_struct);
+
+	return hw_ctx_size;
+}
+
 static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
 {
-	int node;
+	hctx->queue_num = hctx_idx;
+
+	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
+
+	hctx->tags = set->tags[hctx_idx];
+
+	if (set->ops->init_hctx &&
+	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
+		goto unregister_cpu_notifier;
 
-	node = hctx->numa_node;
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
+				hctx->numa_node))
+		goto exit_hctx;
+	return 0;
+
+ exit_hctx:
+	if (set->ops->exit_hctx)
+		set->ops->exit_hctx(hctx, hctx_idx);
+ unregister_cpu_notifier:
+	blk_mq_remove_cpuhp(hctx);
+	return -1;
+}
+
+static struct blk_mq_hw_ctx *
+blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
+		int node)
+{
+	struct blk_mq_hw_ctx *hctx;
+	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
+
+	hctx = kzalloc_node(blk_mq_hw_ctx_size(set), gfp, node);
+	if (!hctx)
+		goto fail_alloc_hctx;
+
+	if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
+		goto free_hctx;
+
+	atomic_set(&hctx->nr_active, 0);
 	if (node == NUMA_NO_NODE)
-		node = hctx->numa_node = set->numa_node;
+		node = set->numa_node;
+	hctx->numa_node = node;
 
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -2301,58 +2351,45 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
-	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
-
-	hctx->tags = set->tags[hctx_idx];
-
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
 	 */
 	hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
-			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
+			gfp, node);
 	if (!hctx->ctxs)
-		goto unregister_cpu_notifier;
+		goto free_cpumask;
 
 	if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
-				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
+				gfp, node))
 		goto free_ctxs;
-
 	hctx->nr_ctx = 0;
 
 	spin_lock_init(&hctx->dispatch_wait_lock);
 	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
 	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
 
-	if (set->ops->init_hctx &&
-	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto free_bitmap;
-
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
-			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
+			gfp);
 	if (!hctx->fq)
-		goto exit_hctx;
-
-	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
-		goto free_fq;
+		goto free_bitmap;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
+	blk_mq_hctx_kobj_init(hctx);
 
-	return 0;
+	return hctx;
 
- free_fq:
-	blk_free_flush_queue(hctx->fq);
- exit_hctx:
-	if (set->ops->exit_hctx)
-		set->ops->exit_hctx(hctx, hctx_idx);
  free_bitmap:
 	sbitmap_free(&hctx->ctx_map);
  free_ctxs:
 	kfree(hctx->ctxs);
- unregister_cpu_notifier:
-	blk_mq_remove_cpuhp(hctx);
-	return -1;
+ free_cpumask:
+	free_cpumask_var(hctx->cpumask);
+ free_hctx:
+	kfree(hctx);
+ fail_alloc_hctx:
+	return NULL;
 }
 
 static void blk_mq_init_cpu_queues(struct request_queue *q,
@@ -2698,51 +2735,25 @@ struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_sq_queue);
 
-static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
-{
-	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
-
-	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
-			   __alignof__(struct blk_mq_hw_ctx)) !=
-		     sizeof(struct blk_mq_hw_ctx));
-
-	if (tag_set->flags & BLK_MQ_F_BLOCKING)
-		hw_ctx_size += sizeof(struct srcu_struct);
-
-	return hw_ctx_size;
-}
-
 static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
 	struct blk_mq_hw_ctx *hctx;
 
-	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
-			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-			node);
+	hctx = blk_mq_alloc_hctx(q, set, node);
 	if (!hctx)
-		return NULL;
-
-	if (!zalloc_cpumask_var_node(&hctx->cpumask,
-				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-				node)) {
-		kfree(hctx);
-		return NULL;
-	}
-
-	atomic_set(&hctx->nr_active, 0);
-	hctx->numa_node = node;
-	hctx->queue_num = hctx_idx;
+		goto fail;
 
-	if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
-		free_cpumask_var(hctx->cpumask);
-		kfree(hctx);
-		return NULL;
-	}
-	blk_mq_hctx_kobj_init(hctx);
+	if (blk_mq_init_hctx(q, set, hctx, hctx_idx))
+		goto free_hctx;
 
 	return hctx;
+
+ free_hctx:
+	kobject_put(&hctx->kobj);
+ fail:
+	return NULL;
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-- 
cgit 


From 2f8f1336a48bd5186de3476da0a3e2ec06d0533a Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 30 Apr 2019 09:52:27 +0800
Subject: blk-mq: always free hctx after request queue is freed

In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.

Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c | 46 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 13 deletions(-)

(limited to 'block')

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 17e63d80b6d6..08a6248d8536 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2269,6 +2269,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		set->ops->exit_hctx(hctx, hctx_idx);
 
 	blk_mq_remove_cpuhp(hctx);
+
+	spin_lock(&q->unused_hctx_lock);
+	list_add(&hctx->hctx_list, &q->unused_hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2351,6 +2355,8 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	INIT_LIST_HEAD(&hctx->hctx_list);
+
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2664,15 +2670,17 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
  */
 void blk_mq_release(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
+	struct blk_mq_hw_ctx *hctx, *next;
+	int i;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
-	/* hctx kobj stays in hctx */
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx)
-			continue;
+	queue_for_each_hw_ctx(q, hctx, i)
+		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
+
+	/* all hctx are in .unused_hctx_list now */
+	list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
+		list_del_init(&hctx->hctx_list);
 		kobject_put(&hctx->kobj);
 	}
 
@@ -2739,9 +2747,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
 
-	hctx = blk_mq_alloc_hctx(q, set, node);
+	/* reuse dead hctx first */
+	spin_lock(&q->unused_hctx_lock);
+	list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
+		if (tmp->numa_node == node) {
+			hctx = tmp;
+			break;
+		}
+	}
+	if (hctx)
+		list_del_init(&hctx->hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
+
+	if (!hctx)
+		hctx = blk_mq_alloc_hctx(q, set, node);
 	if (!hctx)
 		goto fail;
 
@@ -2779,10 +2800,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
 		if (hctx) {
-			if (hctxs[i]) {
+			if (hctxs[i])
 				blk_mq_exit_hctx(q, set, hctxs[i], i);
-				kobject_put(&hctxs[i]->kobj);
-			}
 			hctxs[i] = hctx;
 		} else {
 			if (hctxs[i])
@@ -2813,9 +2832,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			if (hctx->tags)
 				blk_mq_free_map_and_requests(set, j);
 			blk_mq_exit_hctx(q, set, hctx, j);
-			kobject_put(&hctx->kobj);
 			hctxs[j] = NULL;
-
 		}
 	}
 	mutex_unlock(&q->sysfs_lock);
@@ -2858,6 +2875,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_sys_init;
 
+	INIT_LIST_HEAD(&q->unused_hctx_list);
+	spin_lock_init(&q->unused_hctx_lock);
+
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
-- 
cgit 


From 1b97871b501f1bac0fd39a073c4c8473ee457a55 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 30 Apr 2019 09:52:28 +0800
Subject: blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release

hctx is always released after requeue is freed.

With holding queue's kobject refcount, it is safe for driver to run queue,
so one run queue might be scheduled after blk_sync_queue() is done.

So moving the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
for avoiding run released queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c     | 8 --------
 block/blk-mq-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 8 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 81d209568a26..6722b24a1182 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -233,14 +233,6 @@ void blk_sync_queue(struct request_queue *q)
 {
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->timeout_work);
-
-	if (queue_is_mq(q)) {
-		struct blk_mq_hw_ctx *hctx;
-		int i;
-
-		queue_for_each_hw_ctx(q, hctx, i)
-			cancel_delayed_work_sync(&hctx->run_work);
-	}
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 7593c4c78975..2280d3cca965 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -36,6 +36,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
 
+	cancel_delayed_work_sync(&hctx->run_work);
+
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
-- 
cgit 


From 662156641bc409a28fa313fca1a755105425d278 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 30 Apr 2019 09:52:29 +0800
Subject: block: don't drain in-progress dispatch in blk_cleanup_queue()

Now freeing hw queue resource is moved to hctx's release handler,
we don't need to worry about the race between blk_cleanup_queue and
run queue any more.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically revert of c2856ae2f315 ("blk-mq: quiesce queue before
freeing queue").

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c | 12 ------------
 1 file changed, 12 deletions(-)

(limited to 'block')

diff --git a/block/blk-core.c b/block/blk-core.c
index 6722b24a1182..419d600e6637 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -339,18 +339,6 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/*
-	 * make sure all in-progress dispatch are completed because
-	 * blk_freeze_queue() can only complete all requests, and
-	 * dispatch may still be in-progress since we dispatch requests
-	 * from more than one contexts.
-	 *
-	 * We rely on driver to deal with the race in case that queue
-	 * initialization isn't done.
-	 */
-	if (queue_is_mq(q) && blk_queue_init_done(q))
-		blk_mq_quiesce_queue(q);
-
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
cgit