diff options
Diffstat (limited to 'Documentation/admin-guide/blockdev/zram.rst')
| -rw-r--r-- | Documentation/admin-guide/blockdev/zram.rst | 248 |
1 files changed, 197 insertions, 51 deletions
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 27c77d853028..3e273c1bb749 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -47,12 +47,14 @@ The list of possible return codes: -ENOMEM zram was not able to allocate enough memory to fulfil your needs. -EINVAL invalid input has been provided. +-EAGAIN re-try operation later (e.g. when attempting to run recompress + and writeback simultaneously). ======== ============================================================= If you use 'echo', the returned value is set by the 'echo' utility, and, in general case, something like:: - echo 3 > /sys/block/zram0/max_comp_streams + echo foo > /sys/block/zram0/comp_algorithm if [ $? -ne 0 ]; then handle_error fi @@ -71,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3} num_devices parameter is optional and tells zram how many devices should be pre-created. Default: 1. -2) Set max number of compression streams -======================================== - -Regardless of the value passed to this attribute, ZRAM will always -allocate multiple compression streams - one per online CPU - thus -allowing several concurrent compression operations. The number of -allocated compression streams goes down when some of the CPUs -become offline. There is no single-compression-stream mode anymore, -unless you are running a UP system or have only 1 CPU online. - -To find out how many streams are currently available:: - - cat /sys/block/zram0/max_comp_streams - -3) Select compression algorithm +2) Select compression algorithm =============================== Using comp_algorithm device attribute one can see available and @@ -102,15 +90,39 @@ Examples:: #select lzo compression algorithm echo lzo > /sys/block/zram0/comp_algorithm -For the time being, the `comp_algorithm` content does not necessarily -show every compression algorithm supported by the kernel. We keep this -list primarily to simplify device configuration and one can configure -a new device with a compression algorithm that is not listed in -`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API -and, if some of the algorithms were built as modules, it's impossible -to list all of them using, for instance, /proc/crypto or any other -method. This, however, has an advantage of permitting the usage of -custom crypto compression modules (implementing S/W or H/W compression). +For the time being, the `comp_algorithm` content shows only compression +algorithms that are supported by zram. + +3) Set compression algorithm parameters: Optional +================================================= + +Compression algorithms may support specific parameters which can be +tweaked for particular dataset. ZRAM has an `algorithm_params` device +attribute which provides a per-algorithm params configuration. + +For example, several compression algorithms support `level` parameter. +In addition, certain compression algorithms support pre-trained dictionaries, +which significantly change algorithms' characteristics. In order to configure +compression algorithm to use external pre-trained dictionary, pass full +path to the `dict` along with other parameters:: + + #pass path to pre-trained zstd dictionary + echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params + + #same, but using algorithm priority + echo "priority=1 dict=/etc/dictionary" > \ + /sys/block/zram0/algorithm_params + + #pass path to pre-trained zstd dictionary and compression level + echo "algo=zstd level=8 dict=/etc/dictionary" > \ + /sys/block/zram0/algorithm_params + +Parameters are algorithm specific: not all algorithms support pre-trained +dictionaries, not all algorithms support `level`. Furthermore, for certain +algorithms `level` controls the compression level (the higher the value the +better the compression ratio, it even can take negatives values for some +algorithms), for other algorithms `level` is acceleration level (the higher +the value the lower the compression ratio). 4) Set Disksize =============== @@ -202,9 +214,8 @@ mem_limit WO specifies the maximum amount of memory ZRAM can writeback_limit WO specifies the maximum amount of write IO zram can write out to backing device as 4KB unit writeback_limit_enable RW show and set writeback_limit feature -max_comp_streams RW the number of possible concurrent compress - operations comp_algorithm RW show and change the compression algorithm +algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction debug_stat RO this file is used for zram debugging purposes backing_dev RW set up backend storage for zram to write out @@ -251,8 +262,6 @@ line of text and contains the following stats separated by whitespace: ================ ============================================================= orig_data_size uncompressed size of data stored in this disk. - This excludes same-element-filled pages (same_pages) since - no memory is allocated for them. Unit: bytes compr_data_size compressed size of data stored in this disk mem_used_total the amount of memory allocated for this disk. This @@ -268,6 +277,7 @@ line of text and contains the following stats separated by whitespace: No memory is allocated for such pages. pages_compacted the number of pages freed during compaction huge_pages the number of incompressible pages + huge_pages_since the number of incompressible pages since zram set up ================ ============================================================= File /sys/block/zram<id>/bd_stat @@ -285,7 +295,7 @@ a single line of text and contains the following stats separated by whitespace: ============== ============================================================= 9) Deactivate -============= +============== :: @@ -307,6 +317,26 @@ a single line of text and contains the following stats separated by whitespace: Optional Feature ================ +IDLE pages tracking +------------------- + +zram has built-in support for idle pages tracking (that is, allocated but +not used pages). This feature is useful for e.g. zram writeback and +recompression. In order to mark pages as idle, execute the following command:: + + echo all > /sys/block/zramX/idle + +This will mark all allocated zram pages as idle. The idle mark will be +removed only when the page (block) is accessed (e.g. overwritten or freed). +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be +marked as idle based on how many seconds have passed since the last access to +a particular zram page:: + + echo 86400 > /sys/block/zramX/idle + +In this example, all pages which haven't been accessed in more than 86400 +seconds (one day) will be marked idle. + writeback --------- @@ -316,25 +346,48 @@ To use the feature, admin should set up backing device via:: echo /dev/sda5 > /sys/block/zramX/backing_dev -before disksize setting. It supports only partition at this moment. -If admin wants to use incompressible page writeback, they could do via:: +before disksize setting. It supports only partitions at this moment. +If admin wants to use incompressible page writeback, they could do it via:: echo huge > /sys/block/zramX/writeback -To use idle page writeback, first, user need to declare zram pages -as idle:: +Admin can request writeback of idle pages at right timing via:: - echo all > /sys/block/zramX/idle + echo idle > /sys/block/zramX/writeback -From now on, any pages on zram are idle pages. The idle mark -will be removed until someone requests access of the block. -IOW, unless there is access request, those pages are still idle pages. +With the command, zram will writeback idle pages from memory to the storage. -Admin can request writeback of those idle pages at right timing via:: +Additionally, if a user choose to writeback only huge and idle pages +this can be accomplished with:: - echo idle > /sys/block/zramX/writeback + echo huge_idle > /sys/block/zramX/writeback + +If a user chooses to writeback only incompressible pages (pages that none of +algorithms can compress) this can be accomplished with:: + + echo incompressible > /sys/block/zramX/writeback + +If an admin wants to write a specific page in zram device to the backing device, +they could write a page index into the interface:: -With the command, zram writeback idle pages from memory to the storage. + echo "page_index=1251" > /sys/block/zramX/writeback + +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_indexes` was introduced, which specify +`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the +number of syscalls, but more importantly this enables optimal post-processing +target selection strategy. Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_indexes=1-100 page_indexes=200-300" > \ + /sys/block/zramX/writeback + +We also now permit multiple page_index params per call and a mix of +single pages and page ranges:: + + echo page_index=42 page_index=99 page_indexes=100-200 \ + page_indexes=500-700 > /sys/block/zramX/writeback If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation @@ -342,7 +395,7 @@ to guarantee storage health for entire product life. To overcome the concern, zram supports "writeback_limit" feature. The "writeback_limit_enable"'s default value is 0 so that it doesn't limit -any writeback. IOW, if admin wants to apply writeback budget, he should +any writeback. IOW, if admin wants to apply writeback budget, they should enable writeback_limit_enable via:: $ echo 1 > /sys/block/zramX/writeback_limit_enable @@ -353,7 +406,7 @@ until admin sets the budget via /sys/block/zramX/writeback_limit. (If admin doesn't enable writeback_limit_enable, writeback_limit's value assigned via /sys/block/zramX/writeback_limit is meaningless.) -If admin want to limit writeback as per-day 400M, he could do it +If admin wants to limit writeback as per-day 400M, they could do it like below:: $ MB_SHIFT=20 @@ -362,17 +415,17 @@ like below:: /sys/block/zram0/writeback_limit. $ echo 1 > /sys/block/zram0/writeback_limit_enable -If admins want to allow further write again once the bugdet is exhausted, -he could do it like below:: +If admins want to allow further write again once the budget is exhausted, +they could do it like below:: $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ /sys/block/zram0/writeback_limit -If admin wants to see remaining writeback budget since last set:: +If an admin wants to see the remaining writeback budget since last set:: $ cat /sys/block/zramX/writeback_limit -If admin want to disable writeback limit, he could do:: +If an admin wants to disable writeback limit, they could do:: $ echo 0 > /sys/block/zramX/writeback_limit_enable @@ -381,9 +434,96 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback happened until you reset the zram to allocate extra writeback budget in next setting is user's job. -If admin wants to measure writeback count in a certain period, he could +If admin wants to measure writeback count in a certain period, they could know it via /sys/block/zram0/bd_stat's 3rd column. +recompression +------------- + +With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative +(secondary) compression algorithms. The basic idea is that alternative +compression algorithm can provide better compression ratio at a price of +(potentially) slower compression/decompression speeds. Alternative compression +algorithm can, for example, be more successful compressing huge pages (those +that default algorithm failed to compress). Another application is idle pages +recompression - pages that are cold and sit in the memory can be recompressed +using more effective algorithm and, hence, reduce zsmalloc memory usage. + +With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms: +one primary and up to 3 secondary ones. Primary zram compressor is explained +in "3) Select compression algorithm", secondary algorithms are configured +using recomp_algorithm device attribute. + +Example::: + + #show supported recompression algorithms + cat /sys/block/zramX/recomp_algorithm + #1: lzo lzo-rle lz4 lz4hc [zstd] + #2: lzo lzo-rle lz4 [lz4hc] zstd + +Alternative compression algorithms are sorted by priority. In the example +above, zstd is used as the first alternative algorithm, which has priority +of 1, while lz4hc is configured as a compression algorithm with priority 2. +Alternative compression algorithm's priority is provided during algorithms +configuration::: + + #select zstd recompression algorithm, priority 1 + echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm + + #select deflate recompression algorithm, priority 2 + echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm + +Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress, +which controls recompression. + +Examples::: + + #IDLE pages recompression is activated by `idle` mode + echo "type=idle" > /sys/block/zramX/recompress + + #HUGE pages recompression is activated by `huge` mode + echo "type=huge" > /sys/block/zram0/recompress + + #HUGE_IDLE pages recompression is activated by `huge_idle` mode + echo "type=huge_idle" > /sys/block/zramX/recompress + +The number of idle pages can be significant, so user-space can pass a size +threshold (in bytes) to the recompress knob: zram will recompress only pages +of equal or greater size::: + + #recompress all pages larger than 3000 bytes + echo "threshold=3000" > /sys/block/zramX/recompress + + #recompress idle pages larger than 2000 bytes + echo "type=idle threshold=2000" > /sys/block/zramX/recompress + +It is also possible to limit the number of pages zram re-compression will +attempt to recompress::: + + echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress + +During re-compression for every page, that matches re-compression criteria, +ZRAM iterates the list of registered alternative compression algorithms in +order of their priorities. ZRAM stops either when re-compression was +successful (re-compressed object is smaller in size than the original one) +and matches re-compression criteria (e.g. size threshold) or when there are +no secondary algorithms left to try. If none of the secondary algorithms can +successfully re-compressed the page such a page is marked as incompressible, +so ZRAM will not attempt to re-compress it in the future. + +This re-compression behaviour, when it iterates through the list of +registered compression algorithms, increases our chances of finding the +algorithm that successfully compresses a particular page. Sometimes, however, +it is convenient (and sometimes even necessary) to limit recompression to +only one particular algorithm so that it will not try any other algorithms. +This can be achieved by providing a `algo` or `priority` parameter::: + + #use zstd algorithm only (if registered) + echo "type=huge algo=zstd" > /sys/block/zramX/recompress + + #use zstd algorithm only (if zstd was registered under priority 1) + echo "type=huge priority=1" > /sys/block/zramX/recompress + memory tracking =============== @@ -394,9 +534,11 @@ pages of the process with*pagemap. If you enable the feature, you could see block state via /sys/kernel/debug/zram/zram0/block_state". The output is as follows:: - 300 75.033841 .wh. - 301 63.806904 s... - 302 63.806919 ..hi + 300 75.033841 .wh... + 301 63.806904 s..... + 302 63.806919 ..hi.. + 303 62.801919 ....r. + 304 146.781902 ..hi.n First column zram's block index. @@ -413,6 +555,10 @@ Third column huge page i: idle page + r: + recompressed page (secondary compression algorithm) + n: + none (including secondary) of algorithms could compress it First line of above example says 300th block is accessed at 75.033841sec and the block's state is huge so it is written back to the backing |
