staging/lustre/ptlrpc: make ptlrpcd threads cpt-aware

On NUMA systems, the placement of worker threads relative to the memory they use greatly affects performance. The CPT mechanism can be used to constrain a number of Lustre thread types, and this change makes it possible to configure the placement of ptlrpcd threads in a similar manner. To simplify the code changes, the global structures used to manage ptlrpcd threads are changed to one per CPT. In particular this means there will be one ptlrpcd recovery thread per CPT. To prevent ptlrpcd threads from wandering all over the system, all ptlrpcd thread are bound to a CPT. Note that some CPT configuration is always created, but the defaults are not likely to be correct for a NUMA system. After discussing the options with Liang Zhen we decided that we would not bind ptlrpcd threads to specific CPUs, and rather trust the kernel scheduler to migrate ptlrpcd threads. With all ptlrpcd threads bound to a CPT, but not to specific CPUs, the load policy mechanism can be radically simplified: - PDL_POLICY_LOCAL and PDL_POLICY_ROUND are currently identical. - PDL_POLICY_ROUND, if fully implemented, would cost us the locality we are trying to achieve, so most or all calls using this policy would have to be changed to PDL_POLICY_LOCAL. - PDL_POLICY_PREFERRED is not used, and cannot be implemented without binding ptlrpcd threads to individual CPUs. - PDL_POLICY_SAME is rarely used, and cannot be implemented without binding ptlrpcd threads to individual CPUs. The partner mechanism is also updated, because now all ptlrpcd threads are "bound" threads. The only difference between the various bind policies, PDB_POLICY_NONE, PDB_POLICY_FULL, PDB_POLICY_PAIR, and PDB_POLICY_NEIGHBOR, is the number of partner threads. The bind policy is replaced with a tunable that directly specifies the size of the groups of ptlrpcd partner threads. Ensure that the ptlrpc_request_set for a ptlrpcd thread is created on the same CPT that the thread will work on. When threads are bound to specific nodes and/or CPUs in a NUMA system, it pays to ensure that the datastructures used by these threads are also on the same node. Visible changes: * ptlrpcd thread names include the CPT number, for example "ptlrpcd_02_07". In this case the "07" is relative to the CPT, and not a CPU number. Tunables added: * ptlrpcd_cpts (string): A CPT string describing the CPU partitions that ptlrpcd threads should run on. Used to make ptlrpcd threads run on a subset of all CPTs. * ptlrpcd_per_cpt_max (int): The maximum number of ptlrpcd threads to run in a CPT. * ptlrpcd_partner_group_size (int): The desired number of threads in each ptlrpcd partner thread group. Default is 2, corresponding to the old PDB_POLICY_PAIR. A negative value makes all ptlrpcd threads in a CPT partners of each other. Tunables obsoleted: * max_ptlrpcds: The new ptlrcpd_per_cpt_max can be used to obtain the same effect. * ptlrpcd_bind_policy: The new ptlrpcd_partner_group_size can be used to obtain the same effect. Internal interface changes: * pdb_policy_t and related code have been removed. Groups of partner ptlrpcd threads are still created, and all threads in a partner group are bound on the same CPT. The ptlrpcd threads bound to a CPT are typically divided into several partner groups. The partner groups on a CPT all have an equal number of ptlrpcd threads. * pdl_policy_t and related code have been removed. Since ptlrpcd threads are not bound to a specific CPU, all the code that avoids scheduling on the current CPU (or attempts to do so) has been removed as non-functional. A simplified form of PDL_POLICY_LOCAL is kept as the only load policy. * LIOD_BIND and related code have been removed. All ptlrpcd threads are now bound to a CPT, and no additional binding policy is implemented. * ptlrpc_prep_set(): Changed to allocate a ptlrpc_request_set on the current CPT. * ptlrpcd(): If an error is encountered before entering the main loop store the error in pc_error before exiting. * ptlrpcd_start(): Check pc_error to verify that the ptlrpcd thread has successfully entered its main loop. * ptlrpcd_init(): Initialize the struct ptlrpcd_ctl for all threads for a CPT before starting any of them. This closes a race during startup where a partner thread could reference a non-initialized struct ptlrpcd_ctl. Signed-off-by: Olaf Weber <olaf@sgi.com> Reviewed-on: http://review.whamcloud.com/13972 Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6325 Reviewed-by: Grégoire Pichon <gregoire.pichon@bull.net> Reviewed-by: Stephen Champion <schamp@sgi.com> Reviewed-by: James Simmons <uja.ornl@yahoo.com> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Oleg Drokin <oleg.drokin@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
author: Olaf Weber <olaf@sgi.com> 2015-09-14 18:41:35 -0400
committer: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2015-09-15 06:26:55 -0700
commit: c5c4c6fae0f754aa17180b411ad04c196f9798ee (patch)
tree: 363b29ac71c2215cf1c5a9b2b25592bf081b7189 /drivers/staging/lustre/lustre/osc
parent: 69456a0377555c52bf0b69d9555a2a605b2aa1d3 (diff)
4 files changed, 30 insertions, 43 deletions
diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
index c72035e048aa..62da06157665 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cache.c
+++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
@@ -1934,7 +1934,7 @@ static int get_write_extents(struct osc_object *obj, struct list_head *rpclist)
 
 static int
 osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,
-		   struct osc_object *osc, pdl_policy_t pol)
+		   struct osc_object *osc)
 {
 	LIST_HEAD(rpclist);
 	struct osc_extent *ext;
@@ -1986,7 +1986,7 @@ osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,
 
 	if (!list_empty(&rpclist)) {
 		LASSERT(page_count > 0);
-		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_WRITE, pol);
+		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_WRITE);
 		LASSERT(list_empty(&rpclist));
 	}
 
@@ -2006,7 +2006,7 @@ osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,
  */
 static int
 osc_send_read_rpc(const struct lu_env *env, struct client_obd *cli,
-		  struct osc_object *osc, pdl_policy_t pol)
+		  struct osc_object *osc)
 {
 	struct osc_extent *ext;
 	struct osc_extent *next;
@@ -2033,7 +2033,7 @@ osc_send_read_rpc(const struct lu_env *env, struct client_obd *cli,
 		osc_object_unlock(osc);
 
 		LASSERT(page_count > 0);
-		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_READ, pol);
+		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_READ);
 		LASSERT(list_empty(&rpclist));
 
 		osc_object_lock(osc);
@@ -2079,8 +2079,7 @@ static struct osc_object *osc_next_obj(struct client_obd *cli)
 }
 
 /* called with the loi list lock held */
-static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
-			   pdl_policy_t pol)
+static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli)
 {
 	struct osc_object *osc;
 	int rc = 0;
@@ -2109,7 +2108,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 		 * do io on writes while there are cache waiters */
 		osc_object_lock(osc);
 		if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) {
-			rc = osc_send_write_rpc(env, cli, osc, pol);
+			rc = osc_send_write_rpc(env, cli, osc);
 			if (rc < 0) {
 				CERROR("Write request failed with %d\n", rc);
 
@@ -2133,7 +2132,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 			}
 		}
 		if (osc_makes_rpc(cli, osc, OBD_BRW_READ)) {
-			rc = osc_send_read_rpc(env, cli, osc, pol);
+			rc = osc_send_read_rpc(env, cli, osc);
 			if (rc < 0)
 				CERROR("Read request failed with %d\n", rc);
 		}
@@ -2149,7 +2148,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 }
 
 static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
-			  struct osc_object *osc, pdl_policy_t pol, int async)
+			  struct osc_object *osc, int async)
 {
 	int rc = 0;
 
@@ -2161,7 +2160,7 @@ static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
 		 * potential stack overrun problem. LU-2859 */
 		atomic_inc(&cli->cl_lru_shrinkers);
 		client_obd_list_lock(&cli->cl_loi_list_lock);
-		osc_check_rpcs(env, cli, pol);
+		osc_check_rpcs(env, cli);
 		client_obd_list_unlock(&cli->cl_loi_list_lock);
 		atomic_dec(&cli->cl_lru_shrinkers);
 	} else {
@@ -2175,14 +2174,13 @@ static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
 static int osc_io_unplug_async(const struct lu_env *env,
 			       struct client_obd *cli, struct osc_object *osc)
 {
-	/* XXX: policy is no use actually. */
-	return osc_io_unplug0(env, cli, osc, PDL_POLICY_ROUND, 1);
+	return osc_io_unplug0(env, cli, osc, 1);
 }
 
 void osc_io_unplug(const struct lu_env *env, struct client_obd *cli,
-		   struct osc_object *osc, pdl_policy_t pol)
+		   struct osc_object *osc)
 {
-	(void)osc_io_unplug0(env, cli, osc, pol, 0);
+	(void)osc_io_unplug0(env, cli, osc, 0);
 }
 
 int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
@@ -2922,7 +2920,7 @@ int osc_cache_writeback_range(const struct lu_env *env, struct osc_object *obj,
 	}
 
 	if (unplug)
-		osc_io_unplug(env, osc_cli(obj), obj, PDL_POLICY_ROUND);
+		osc_io_unplug(env, osc_cli(obj), obj);
 
 	if (hp || discard) {
 		int rc;
diff --git a/drivers/staging/lustre/lustre/osc/osc_cl_internal.h b/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
index 365b2787b3c8..75bfda6e4ad3 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
+++ b/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
@@ -454,7 +454,7 @@ int osc_cache_writeback_range(const struct lu_env *env, struct osc_object *obj,
 int osc_cache_wait_range(const struct lu_env *env, struct osc_object *obj,
 			 pgoff_t start, pgoff_t end);
 void osc_io_unplug(const struct lu_env *env, struct client_obd *cli,
-		   struct osc_object *osc, pdl_policy_t pol);
+		   struct osc_object *osc);
 
 void osc_object_set_contended  (struct osc_object *obj);
 void osc_object_clear_contended(struct osc_object *obj);
diff --git a/drivers/staging/lustre/lustre/osc/osc_internal.h b/drivers/staging/lustre/lustre/osc/osc_internal.h
index 7d0a3e2db289..448fdf4b8939 100644
--- a/drivers/staging/lustre/lustre/osc/osc_internal.h
+++ b/drivers/staging/lustre/lustre/osc/osc_internal.h
@@ -132,7 +132,7 @@ int osc_sync_base(struct obd_export *exp, struct obd_info *oinfo,
 
 int osc_process_config_base(struct obd_device *obd, struct lustre_cfg *cfg);
 int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
-		  struct list_head *ext_list, int cmd, pdl_policy_t p);
+		  struct list_head *ext_list, int cmd);
 int osc_lru_shrink(struct client_obd *cli, int target);
 
 extern spinlock_t osc_ast_guard;
diff --git a/drivers/staging/lustre/lustre/osc/osc_request.c b/drivers/staging/lustre/lustre/osc/osc_request.c
index f41f762573f5..9f5362759f2f 100644
--- a/drivers/staging/lustre/lustre/osc/osc_request.c
+++ b/drivers/staging/lustre/lustre/osc/osc_request.c
@@ -437,7 +437,7 @@ int osc_setattr_async_base(struct obd_export *exp, struct obd_info *oinfo,
 	/* do mds to ost setattr asynchronously */
 	if (!rqset) {
 		/* Do not wait for response. */
-		ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+		ptlrpcd_add_req(req);
 	} else {
 		req->rq_interpret_reply =
 			(ptlrpc_interpterer_t)osc_setattr_interpret;
@@ -449,7 +449,7 @@ int osc_setattr_async_base(struct obd_export *exp, struct obd_info *oinfo,
 		sa->sa_cookie = cookie;
 
 		if (rqset == PTLRPCD_SET)
-			ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+			ptlrpcd_add_req(req);
 		else
 			ptlrpc_set_add_req(rqset, req);
 	}
@@ -590,7 +590,7 @@ int osc_punch_base(struct obd_export *exp, struct obd_info *oinfo,
 	sa->sa_upcall = upcall;
 	sa->sa_cookie = cookie;
 	if (rqset == PTLRPCD_SET)
-		ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+		ptlrpcd_add_req(req);
 	else
 		ptlrpc_set_add_req(rqset, req);
 
@@ -657,7 +657,7 @@ int osc_sync_base(struct obd_export *exp, struct obd_info *oinfo,
 	fa->fa_cookie = cookie;
 
 	if (rqset == PTLRPCD_SET)
-		ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+		ptlrpcd_add_req(req);
 	else
 		ptlrpc_set_add_req(rqset, req);
 
@@ -826,7 +826,7 @@ static int osc_destroy(const struct lu_env *env, struct obd_export *exp,
 	}
 
 	/* Do not wait for response */
-	ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+	ptlrpcd_add_req(req);
 	return 0;
 }
 
@@ -1718,7 +1718,7 @@ static int osc_brw_redo_request(struct ptlrpc_request *request,
 	 * to add a series of BRW RPCs into a self-defined ptlrpc_request_set
 	 * and wait for all of them to be finished. We should inherit request
 	 * set from old request. */
-	ptlrpcd_add_req(new_req, PDL_POLICY_SAME, -1);
+	ptlrpcd_add_req(new_req);
 
 	DEBUG_REQ(D_INFO, new_req, "new request");
 	return 0;
@@ -1859,7 +1859,7 @@ static int brw_interpret(const struct lu_env *env,
 	osc_wake_cache_waiters(cli);
 	client_obd_list_unlock(&cli->cl_loi_list_lock);
 
-	osc_io_unplug(env, cli, NULL, PDL_POLICY_SAME);
+	osc_io_unplug(env, cli, NULL);
 	return rc;
 }
 
@@ -1869,7 +1869,7 @@ static int brw_interpret(const struct lu_env *env,
  * Extents in the list must be in OES_RPC state.
  */
 int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
-		  struct list_head *ext_list, int cmd, pdl_policy_t pol)
+		  struct list_head *ext_list, int cmd)
 {
 	struct ptlrpc_request *req = NULL;
 	struct osc_extent *ext;
@@ -2043,19 +2043,7 @@ int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
 		  page_count, aa, cli->cl_r_in_flight,
 		  cli->cl_w_in_flight);
 
-	/* XXX: Maybe the caller can check the RPC bulk descriptor to
-	 * see which CPU/NUMA node the majority of pages were allocated
-	 * on, and try to assign the async RPC to the CPU core
-	 * (PDL_POLICY_PREFERRED) to reduce cross-CPU memory traffic.
-	 *
-	 * But on the other hand, we expect that multiple ptlrpcd
-	 * threads and the initial write sponsor can run in parallel,
-	 * especially when data checksum is enabled, which is CPU-bound
-	 * operation and single ptlrpcd thread cannot process in time.
-	 * So more ptlrpcd threads sharing BRW load
-	 * (with PDL_POLICY_ROUND) seems better.
-	 */
-	ptlrpcd_add_req(req, pol, -1);
+	ptlrpcd_add_req(req);
 	rc = 0;
 
 out:
@@ -2382,7 +2370,7 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 			req->rq_interpret_reply =
 				(ptlrpc_interpterer_t)osc_enqueue_interpret;
 			if (rqset == PTLRPCD_SET)
-				ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+				ptlrpcd_add_req(req);
 			else
 				ptlrpc_set_add_req(rqset, req);
 		} else if (intent) {
@@ -2997,8 +2985,9 @@ static int osc_set_info_async(const struct lu_env *env, struct obd_export *exp,
 		LASSERT(set != NULL);
 		ptlrpc_set_add_req(set, req);
 		ptlrpc_check_set(NULL, set);
-	} else
-		ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+	} else {
+		ptlrpcd_add_req(req);
+	}
 
 	return 0;
 }
@@ -3090,7 +3079,7 @@ static int osc_import_event(struct obd_device *obd,
 			cli = &obd->u.cli;
 			/* all pages go to failing rpcs due to the invalid
 			 * import */
-			osc_io_unplug(env, cli, NULL, PDL_POLICY_ROUND);
+			osc_io_unplug(env, cli, NULL);
 
 			ldlm_namespace_cleanup(ns, LDLM_FL_LOCAL_ONLY);
 			cl_env_put(env, &refcheck);
@@ -3162,7 +3151,7 @@ static int brw_queue_work(const struct lu_env *env, void *data)
 
 	CDEBUG(D_CACHE, "Run writeback work for client obd %p.\n", cli);
 
-	osc_io_unplug(env, cli, NULL, PDL_POLICY_SAME);
+	osc_io_unplug(env, cli, NULL);
 	return 0;
 }
author	Olaf Weber <olaf@sgi.com>	2015-09-14 18:41:35 -0400
committer	Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-09-15 06:26:55 -0700
commit	c5c4c6fae0f754aa17180b411ad04c196f9798ee (patch)
tree	363b29ac71c2215cf1c5a9b2b25592bf081b7189 /drivers/staging/lustre/lustre/osc
parent	69456a0377555c52bf0b69d9555a2a605b2aa1d3 (diff)