summaryrefslogtreecommitdiff
path: root/Documentation/arch/x86/resctrl.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/arch/x86/resctrl.rst')
-rw-r--r--Documentation/arch/x86/resctrl.rst53
1 files changed, 48 insertions, 5 deletions
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..6768fc1fad16 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -45,7 +45,7 @@ mount options are:
Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
- bandwidth in MBps
+ bandwidth in MiBps
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
@@ -375,11 +375,25 @@ When monitoring is enabled all MON groups will also contain:
all tasks in the group. In CTRL_MON groups these files provide
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.
+ On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
+When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
+
+"mba_MBps_event":
+ Reading this file shows which memory bandwidth event is used
+ as input to the software feedback loop that keeps memory bandwidth
+ below the value specified in the schemata file. Writing the
+ name of one of the supported memory bandwidth events found in
+ /sys/fs/resctrl/info/L3_MON/mon_features changes the input
+ event.
+
Resource allocation rules
-------------------------
@@ -446,6 +460,12 @@ during mkdir.
max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.
+The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
+for a subset of RMID that are not immediately available for allocation.
+This can't be relied on to produce output every second, it may be necessary
+to attempt to create an empty monitor group to force an update. Output may
+only be produced if creation of a control or monitor group fails.
+
Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
@@ -478,6 +498,29 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
+Memory bandwidth allocation is still performed at the L3 cache
+level. I.e. throttling controls are applied to all SNC nodes.
+
+L3 cache allocation bitmaps also apply to all SNC nodes. But note that
+the amount of L3 cache represented by each bit is divided by the number
+of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
+allocation masks each bit normally represents 10MB. With SNC mode enabled
+with two SNC nodes per L3 cache, each bit only represents 5MB.
+
Memory bandwidth Allocation and monitoring
==========================================
@@ -526,7 +569,7 @@ threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.
In order to mitigate this and make the interface more user friendly,
-resctrl added support for specifying the bandwidth in MBps as well. The
+resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjust the memory bandwidth percentages to ensure::
@@ -573,13 +616,13 @@ Memory b/w domain is L3 cache.
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
-Memory bandwidth Allocation specified in MBps
----------------------------------------------
+Memory bandwidth Allocation specified in MiBps
+----------------------------------------------
Memory bandwidth domain is L3 cache.
::
- MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------