.. SPDX-License-Identifier: GPL-2.0

======================
BIOS/EFI Configuration
======================

BIOS and EFI are largely responsible for configuring static information about
devices (or potential future devices) such that Linux can build the appropriate
logical representations of these devices.

At a high level, this is what occurs during this phase of configuration.

* The bootloader starts the BIOS/EFI.

* BIOS/EFI performs early device probing to determine static configuration.

* BIOS/EFI creates ACPI Tables that describe static configuration for the OS.

* BIOS/EFI creates the system memory map (EFI Memory Map, E820, etc).

* BIOS/EFI calls :code:`start_kernel` and begins the Linux Early Boot process.

Much of what this section is concerned with is ACPI Table production and
static memory map configuration. More detail on these tables can be found
at :doc:`ACPI Tables <acpi>`.

.. note::
   Platform Vendors should read carefully, as this section has recommendations
   on physical memory region size and alignment, memory holes, HDM interleave,
   and what Linux expects of HDM decoders trying to work with these features.

UEFI Settings
=============
If your platform supports it, the :code:`uefisettings` command can be used to
read/write EFI settings. Changes will be reflected on the next reboot. Kexec
is not a sufficient reboot.

One notable configuration here is the EFI_MEMORY_SP (Specific Purpose) bit.
When enabled, this bit tells Linux to defer management of a memory region to
a driver (in this case, the CXL driver). Otherwise, the memory is treated as
"normal memory", and is exposed to the page allocator during :code:`__init`.

uefisettings examples
---------------------

:code:`uefisettings identify` ::

        uefisettings identify

        bios_vendor: xxx
        bios_version: xxx
        bios_release: xxx
        bios_date: xxx
        product_name: xxx
        product_family: xxx
        product_version: xxx

On some AMD platforms, the :code:`EFI_MEMORY_SP` bit is set via the :code:`CXL
Memory Attribute` field.  This may be called something else on your platform.

:code:`uefisettings get "CXL Memory Attribute"` ::

        selector: xxx
        ...
        question: Question {
            name: "CXL Memory Attribute",
            answer: "Enabled",
            ...
        }

Physical Memory Map
===================

Physical Address Region Alignment
---------------------------------

As of Linux v6.14, the hotplug memory system requires memory regions to be
uniform in size and alignment.  While the CXL specification allows for memory
regions as small as 256MB, the supported memory block size and alignment for
hotplugged memory is architecture-defined.

Linux memory blocks may be as small as 128MB and increase in powers of two.

* On ARM, the default block size and alignment is either 128MB or 256MB.

* On x86, the default block size is 256MB, and increases to 2GB as the
  capacity of the system increases up to 64GB.

For best support across versions, platform vendors should place CXL memory at
a 2GB aligned base address, and region sizes should be a multiple of 2GB.  This
also helps prevent the creation of thousands of memory devices (one per block).
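
The recommendation above can be sketched as a simple check.  This is an
illustrative sketch only, not kernel code: the helper names and the example
addresses are assumptions made for the example.  It shows the 2GB base and
size test, and how the block size affects the number of memory devices.

```python
GIB = 1 << 30
ALIGNMENT = 2 * GIB  # recommended alignment from the text above


def is_aligned_region(base: int, size: int, alignment: int = ALIGNMENT) -> bool:
    """Return True if base and size are both multiples of alignment (size > 0)."""
    return size > 0 and base % alignment == 0 and size % alignment == 0


def block_count(size: int, block_size: int) -> int:
    """Number of memory blocks (one memory device each) covering size bytes."""
    return size // block_size


# A hypothetical 4GB region based at 4GB meets the recommendation.
print(is_aligned_region(0x100000000, 4 * GIB))
# A hypothetical region starting 256MB later does not.
print(is_aligned_region(0x110000000, 4 * GIB))
# 1TB of capacity with 128MB blocks versus 2GB blocks.
print(block_count(1 << 40, 128 << 20), block_count(1 << 40, 2 * GIB))
```

With 128MB blocks, 1TB of capacity is split into 8192 memory devices; with 2GB
blocks it is split into 512.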

Memory Holes
------------

Holes in the memory map are tricky.  Consider a 4GB device located at base
address 0x100000000, but with the following memory map ::

  ---------------------
  |    0x100000000    |
  |        CXL        |
  |    0x1BFFFFFFF    |
  ---------------------
  |    0x1C0000000    |
  |    MEMORY HOLE    |
  |    0x1FFFFFFFF    |
  ---------------------
  |    0x200000000    |
  |     CXL CONT.     |
  |    0x23FFFFFFF    |
  ---------------------

There are two issues to consider:

* decoder programming, and
* memory block alignment.

If your architecture requires 2GB uniform size and aligned memory blocks, the
only capacity Linux is capable of mapping (as of v6.14) would be the capacity
from `0x100000000-0x180000000`.  The remaining capacity will be stranded, as
it is not of 2GB aligned length.

Assuming your architecture and memory configuration allow 1GB memory blocks,
this memory map is supported.  It should be presented as multiple CFMWS in the
CEDT that describe each side of the memory hole separately, along with
matching decoders.
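
The stranded-capacity arithmetic above can be reproduced with a short sketch.
This is not how the kernel computes it; the function names are hypothetical,
and the chunk list is simply the example memory map with exclusive end
addresses.  It counts only the block-aligned, block-sized ranges that fit
entirely inside each contiguous chunk.

```python
GIB = 1 << 30

# (start, end) pairs, end exclusive, taken from the example memory map.
CXL_CHUNKS = [(0x100000000, 0x1C0000000), (0x200000000, 0x240000000)]


def mappable_blocks(start: int, end: int, block_size: int) -> list:
    """Return the block-aligned, block-sized ranges fully inside [start, end)."""
    first = -(-start // block_size) * block_size  # round start up
    last = (end // block_size) * block_size       # round end down
    return [(b, b + block_size) for b in range(first, last, block_size)]


def stranded_bytes(chunks: list, block_size: int) -> int:
    """Capacity in chunks that cannot be covered by whole aligned blocks."""
    total = sum(end - start for start, end in chunks)
    mapped = sum(
        len(mappable_blocks(start, end, block_size)) * block_size
        for start, end in chunks
    )
    return total - mapped


for block_size in (2 * GIB, 1 * GIB):
    for start, end in CXL_CHUNKS:
        ranges = [(hex(b), hex(e)) for b, e in mappable_blocks(start, end, block_size)]
        print(block_size // GIB, "GB blocks:", ranges)
    print("stranded:", stranded_bytes(CXL_CHUNKS, block_size) // GIB, "GB")
```

With 2GB blocks only `0x100000000-0x180000000` is mappable and 2GB is stranded;
with 1GB blocks all 4GB is mappable.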

Multiple decoders can (and should) be used to manage such a memory hole (see
below), but each chunk of a memory hole should be aligned to a reasonable block
size (larger alignment is always better).  If you intend to have memory holes
in the memory map, expect to use one decoder per contiguous chunk of host
physical memory.

As of v6.14, Linux does not provide support for memory hotplug of multiple
physical memory regions separated by a memory hole described by a single
HDM decoder.


Decoder Programming
===================
If BIOS/EFI intends to program the decoders to be statically configured,
there are a few things to consider to avoid major pitfalls that will
prevent Linux compatibility.  Some of these recommendations are not
required "per the specification", but Linux makes no guarantees of support
otherwise.


Translation Point
-----------------
Per the specification, the only decoders which **TRANSLATE** Host Physical
Address (HPA) to Device Physical Address (DPA) are the **Endpoint Decoders**.
All other decoders in the fabric are intended to route accesses without
translating the addresses.

This is heavily implied by the specification, see: ::

  CXL Specification 3.1
  8.2.4.20: CXL HDM Decoder Capability Structure
  - Implementation Note: CXL Host Bridge and Upstream Switch Port Decoder Flow
  - Implementation Note: Device Decoder Logic

Given this, Linux makes a strong assumption that decoders between CPU and
endpoint will all be programmed with address ranges that are subsets of
their parent decoder.
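
The subset assumption can be illustrated with a minimal sketch.  The
:code:`Decoder` class and helper below are hypothetical and exist only for
this example; they do not mirror the kernel's data structures.  The check
walks from an endpoint decoder up to the root and confirms that every
decoder's HPA range lies within its parent's range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decoder:
    """Hypothetical model of a decoder's HPA window (illustration only)."""
    name: str
    base: int
    size: int
    parent: Optional["Decoder"] = None

    @property
    def end(self) -> int:
        # Exclusive end of the HPA range.
        return self.base + self.size


def chain_is_consistent(leaf: Decoder) -> bool:
    """Return True if each decoder's range is a subset of its parent's range."""
    node = leaf
    while node.parent is not None:
        parent = node.parent
        if node.base < parent.base or node.end > parent.end:
            return False
        node = parent
    return True


root = Decoder("root-decoder-0", 0x100000000, 0xC0000000)
hb = Decoder("HB-decoder-0", 0x100000000, 0xC0000000, parent=root)
ep = Decoder("ep-decoder-0", 0x100000000, 0xC0000000, parent=hb)
print(chain_is_consistent(ep))

# A host bridge decoder programmed outside its root window breaks the assumption.
bad_hb = Decoder("HB-decoder-0", 0x0, 0xC0000000, parent=root)
bad_ep = Decoder("ep-decoder-0", 0x0, 0xC0000000, parent=bad_hb)
print(chain_is_consistent(bad_ep))
```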

Due to some ambiguity in how Architecture, ACPI, PCI, and CXL specifications
"hand off" responsibility between domains, some early adopting platforms
attempted to do translation at the originating memory controller or host
bridge.  This configuration requires a platform specific extension to the
driver and is not officially endorsed - despite being supported.

It is *highly recommended* **NOT** to do this; otherwise, you are on your own
to implement driver support for your platform.

Interleave and Configuration Flexibility
----------------------------------------
If providing cross-host-bridge interleave, a CFMWS entry in the :doc:`CEDT
<acpi/cedt>` must be presented with target host-bridges for the interleaved
device sets (there may be multiple behind each host bridge).

If providing intra-host-bridge interleaving, only 1 CFMWS entry in the CEDT is
required for that host bridge - if it covers the entire capacity of the devices
behind the host bridge.

If intending to provide users flexibility in programming decoders beyond the
root, you may want to provide multiple CFMWS entries in the CEDT intended for
different purposes.  For example, you may want to consider adding:

1) A CFMWS entry to cover all interleavable host bridges.
2) A CFMWS entry to cover all devices on a single host bridge.
3) A CFMWS entry to cover each device.

A platform may choose to add all of these, or change the mode based on a BIOS
setting.  For each CFMWS entry, Linux expects descriptions of the described
memory regions in the :doc:`SRAT <acpi/srat>` to determine the number of
NUMA nodes it should reserve during early boot / init.

As of v6.14, Linux will create a NUMA node for each CEDT CFMWS entry, even if
a matching SRAT entry does not exist; however, this is not guaranteed in the
future and such a configuration should be avoided.

Memory Holes
------------
If your platform includes memory holes interspersed between your CXL memory, it
is recommended to utilize multiple decoders to cover these regions of memory,
rather than try to program the decoders to accept the entire range and expect
Linux to manage the overlap.

For example, consider the Memory Hole described above ::

  ---------------------
  |    0x100000000    |
  |        CXL        |
  |    0x1BFFFFFFF    |
  ---------------------
  |    0x1C0000000    |
  |    MEMORY HOLE    |
  |    0x1FFFFFFFF    |
  ---------------------
  |    0x200000000    |
  |     CXL CONT.     |
  |    0x23FFFFFFF    |
  ---------------------

Assuming this is provided by a single device attached directly to a host bridge,
Linux would expect the following decoder programming ::

     -----------------------   -----------------------
     | root-decoder-0      |   | root-decoder-1      |
     |   base: 0x100000000 |   |   base: 0x200000000 |
     |   size:  0xC0000000 |   |   size:  0x40000000 |
     -----------------------   -----------------------
                |                         |
     -----------------------   -----------------------
     | HB-decoder-0        |   | HB-decoder-1        |
     |   base: 0x100000000 |   |   base: 0x200000000 |
     |   size:  0xC0000000 |   |   size:  0x40000000 |
     -----------------------   -----------------------
                |                         |
     -----------------------   -----------------------
     | ep-decoder-0        |   | ep-decoder-1        |
     |   base: 0x100000000 |   |   base: 0x200000000 |
     |   size:  0xC0000000 |   |   size:  0x40000000 |
     -----------------------   -----------------------

This corresponds to a CEDT configuration with two CFMWS entries, one describing
each of the above root decoders.

Linux makes no guarantee of support for strange memory hole situations.
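
The base and size values in the decoder diagram follow directly from the
memory map.  The sketch below derives them, assuming one root, host bridge,
and endpoint decoder per contiguous chunk.  The function name and tuple layout
are hypothetical, chosen only for illustration.

```python
# (first, last) inclusive addresses of each contiguous CXL chunk in the map.
CXL_CHUNKS = [(0x100000000, 0x1BFFFFFFF), (0x200000000, 0x23FFFFFFF)]

DECODER_KINDS = ("root-decoder", "HB-decoder", "ep-decoder")


def decoder_programming(chunks: list) -> list:
    """Return (name, base, size) for one decoder chain per contiguous chunk."""
    plan = []
    for index, (first, last) in enumerate(chunks):
        size = last - first + 1  # inclusive range, so add one
        for kind in DECODER_KINDS:
            # Every decoder in the chain carries the same HPA range.
            plan.append((f"{kind}-{index}", first, size))
    return plan


for name, base, size in decoder_programming(CXL_CHUNKS):
    print(f"{name}: base={base:#x} size={size:#x}")
```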

Multi-Media Devices
-------------------
Each CFMWS entry in the CEDT has restriction bits which describe whether the
described memory region allows volatile or persistent memory (or both). A
platform may intend to support either:

1) A device with multiple media types, or
2) Using a persistent memory device as normal memory

In that case, the platform may wish to create multiple CEDT CFMWS entries to
describe the same memory, with the intent of allowing the end user flexibility
in how that memory is configured. Linux does not presently have strong
requirements in this area.