From 34e75cf4beb1a88a61b7c76b5fdc99c43cff8594 Mon Sep 17 00:00:00 2001 From: "Daniel W. S. Almeida" Date: Wed, 29 Jan 2020 01:49:13 -0300 Subject: Documentation: nfs: convert pnfs.txt to ReST Convert pnfs.txt to ReST. Content remains mostly unchanged. Signed-off-by: Daniel W. S. Almeida Link: https://lore.kernel.org/r/20200129044917.566906-2-dwlsalmeida@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/nfs/index.rst | 9 ++++ Documentation/filesystems/nfs/pnfs.rst | 78 +++++++++++++++++++++++++++++++++ Documentation/filesystems/nfs/pnfs.txt | 73 ------------------------------ 4 files changed, 88 insertions(+), 73 deletions(-) create mode 100644 Documentation/filesystems/nfs/index.rst create mode 100644 Documentation/filesystems/nfs/pnfs.rst delete mode 100644 Documentation/filesystems/nfs/pnfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 386eaad008b2..45d791905e91 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -51,3 +51,4 @@ Documentation for filesystem implementations. overlayfs virtiofs vfat + nfs/index diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst new file mode 100644 index 000000000000..d19ba592779a --- /dev/null +++ b/Documentation/filesystems/nfs/index.rst @@ -0,0 +1,9 @@ +=============================== +NFS +=============================== + + +.. toctree:: + :maxdepth: 1 + + pnfs diff --git a/Documentation/filesystems/nfs/pnfs.rst b/Documentation/filesystems/nfs/pnfs.rst new file mode 100644 index 000000000000..7c470ecdc3a9 --- /dev/null +++ b/Documentation/filesystems/nfs/pnfs.rst @@ -0,0 +1,78 @@ +========================== +Reference counting in pnfs +========================== + +The are several inter-related caches. We have layouts which can +reference multiple devices, each of which can reference multiple data servers. 
+Each data server can be referenced by multiple devices. Each device +can be referenced by multiple layouts. To keep all of this straight, +we need to reference count. + + +struct pnfs_layout_hdr +====================== + +The on-the-wire command LAYOUTGET corresponds to struct +pnfs_layout_segment, usually referred to by the variable name lseg. +Each nfs_inode may hold a pointer to a cache of these layout +segments in nfsi->layout, of type struct pnfs_layout_hdr. + +We reference the header for the inode pointing to it, across each +outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN, +LAYOUTCOMMIT), and for each lseg held within. + +Each header is also (when non-empty) put on a list associated with +struct nfs_client (cl_layouts). Being put on this list does not bump +the reference count, as the layout is kept around by the lseg that +keeps it in the list. + +deviceid_cache +============== + +lsegs reference device ids, which are resolved per nfs_client and +layout driver type. The device ids are held in a RCU cache (struct +nfs4_deviceid_cache). The cache itself is referenced across each +mount. The entries (struct nfs4_deviceid) themselves are held across +the lifetime of each lseg referencing them. + +RCU is used because the deviceid is basically a write once, read many +data structure. The hlist size of 32 buckets needs better +justification, but seems reasonable given that we can have multiple +deviceid's per filesystem, and multiple filesystems per nfs_client. + +The hash code is copied from the nfsd code base. A discussion of +hashing and variations of this algorithm can be found `here. +<http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809>`_ + +data server cache +================= + +file driver devices refer to data servers, which are kept in a module +level cache. Its reference is held over the lifetime of the deviceid +pointing to it. + +lseg +==== + +lseg maintains an extra reference corresponding to the NFS_LSEG_VALID +bit which holds it in the pnfs_layout_hdr's list. 
When the final lseg +is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED +bit is set, preventing any new lsegs from being added. + +layout drivers +============== + +PNFS utilizes what is called layout drivers. The STD defines 4 basic +layout types: "files", "objects", "blocks", and "flexfiles". For each +of these types there is a layout-driver with a common function-vectors +table which are called by the nfs-client pnfs-core to implement the +different layout types. + +Files-layout-driver code is in: fs/nfs/filelayout/.. directory +Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory +Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory + +blocks-layout setup +=================== + +TODO: Document the setup needs of the blocks layout driver diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt deleted file mode 100644 index 80dc0bdc302a..000000000000 --- a/Documentation/filesystems/nfs/pnfs.txt +++ /dev/null @@ -1,73 +0,0 @@ -Reference counting in pnfs: -========================== - -The are several inter-related caches. We have layouts which can -reference multiple devices, each of which can reference multiple data servers. -Each data server can be referenced by multiple devices. Each device -can be referenced by multiple layouts. To keep all of this straight, -we need to reference count. - - -struct pnfs_layout_hdr ----------------------- -The on-the-wire command LAYOUTGET corresponds to struct -pnfs_layout_segment, usually referred to by the variable name lseg. -Each nfs_inode may hold a pointer to a cache of these layout -segments in nfsi->layout, of type struct pnfs_layout_hdr. - -We reference the header for the inode pointing to it, across each -outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN, -LAYOUTCOMMIT), and for each lseg held within. - -Each header is also (when non-empty) put on a list associated with -struct nfs_client (cl_layouts). 
Being put on this list does not bump -the reference count, as the layout is kept around by the lseg that -keeps it in the list. - -deviceid_cache --------------- -lsegs reference device ids, which are resolved per nfs_client and -layout driver type. The device ids are held in a RCU cache (struct -nfs4_deviceid_cache). The cache itself is referenced across each -mount. The entries (struct nfs4_deviceid) themselves are held across -the lifetime of each lseg referencing them. - -RCU is used because the deviceid is basically a write once, read many -data structure. The hlist size of 32 buckets needs better -justification, but seems reasonable given that we can have multiple -deviceid's per filesystem, and multiple filesystems per nfs_client. - -The hash code is copied from the nfsd code base. A discussion of -hashing and variations of this algorithm can be found at: -http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809 - -data server cache ------------------ -file driver devices refer to data servers, which are kept in a module -level cache. Its reference is held over the lifetime of the deviceid -pointing to it. - -lseg ----- -lseg maintains an extra reference corresponding to the NFS_LSEG_VALID -bit which holds it in the pnfs_layout_hdr's list. When the final lseg -is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED -bit is set, preventing any new lsegs from being added. - -layout drivers --------------- - -PNFS utilizes what is called layout drivers. The STD defines 4 basic -layout types: "files", "objects", "blocks", and "flexfiles". For each -of these types there is a layout-driver with a common function-vectors -table which are called by the nfs-client pnfs-core to implement the -different layout types. - -Files-layout-driver code is in: fs/nfs/filelayout/.. directory -Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory -Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. 
directory - -blocks-layout setup -------------------- - -TODO: Document the setup needs of the blocks layout driver -- cgit From f0bf8a988b26e75cc6fc28a44a745cb354a2b5a6 Mon Sep 17 00:00:00 2001 From: "Daniel W. S. Almeida" Date: Wed, 29 Jan 2020 01:49:14 -0300 Subject: Documentation: nfs: rpc-cache: convert to ReST Convert rpc-cache.txt to ReST. Changes aim to improve presentation but the content itself remains mostly the same. Signed-off-by: Daniel W. S. Almeida Link: https://lore.kernel.org/r/20200129044917.566906-3-dwlsalmeida@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/nfs/index.rst | 1 + Documentation/filesystems/nfs/rpc-cache.rst | 220 ++++++++++++++++++++++++++++ Documentation/filesystems/nfs/rpc-cache.txt | 202 ------------------------- 3 files changed, 221 insertions(+), 202 deletions(-) create mode 100644 Documentation/filesystems/nfs/rpc-cache.rst delete mode 100644 Documentation/filesystems/nfs/rpc-cache.txt diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst index d19ba592779a..52f4956e7770 100644 --- a/Documentation/filesystems/nfs/index.rst +++ b/Documentation/filesystems/nfs/index.rst @@ -7,3 +7,4 @@ NFS :maxdepth: 1 pnfs + rpc-cache diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst new file mode 100644 index 000000000000..bb164eea969b --- /dev/null +++ b/Documentation/filesystems/nfs/rpc-cache.rst @@ -0,0 +1,220 @@ +========= +RPC Cache +========= + +This document gives a brief introduction to the caching +mechanisms in the sunrpc layer that is used, in particular, +for NFS authentication. + +Caches +====== + +The caching replaces the old exports table and allows for +a wide variety of values to be caches. + +There are a number of caches that are similar in structure though +quite possibly very different in content and use. There is a corpus +of common code for managing these caches. 
+ +Examples of caches that are likely to be needed are: + + - mapping from IP address to client name + - mapping from client name and filesystem to export options + - mapping from UID to list of GIDs, to work around NFS's limitation + of 16 gids. + - mappings between local UID/GID and remote UID/GID for sites that + do not have uniform uid assignment + - mapping from network identify to public key for crypto authentication. + +The common code handles such things as: + + - general cache lookup with correct locking + - supporting 'NEGATIVE' as well as positive entries + - allowing an EXPIRED time on cache items, and removing + items after they expire, and are no longer in-use. + - making requests to user-space to fill in cache entries + - allowing user-space to directly set entries in the cache + - delaying RPC requests that depend on as-yet incomplete + cache entries, and replaying those requests when the cache entry + is complete. + - clean out old entries as they expire. + +Creating a Cache +---------------- + +- A cache needs a datum to store. This is in the form of a + structure definition that must contain a struct cache_head + as an element, usually the first. + It will also contain a key and some content. + Each cache element is reference counted and contains + expiry and update times for use in cache management. +- A cache needs a "cache_detail" structure that + describes the cache. This stores the hash table, some + parameters for cache management, and some operations detailing how + to work with particular cache items. + + The operations are: + + struct cache_head \*alloc(void) + This simply allocates appropriate memory and returns + a pointer to the cache_detail embedded within the + structure + + void cache_put(struct kref \*) + This is called when the last reference to an item is + dropped. The pointer passed is to the 'ref' field + in the cache_head. 
cache_put should release any + references create by 'cache_init' and, if CACHE_VALID + is set, any references created by cache_update. + It should then release the memory allocated by + 'alloc'. + + int match(struct cache_head \*orig, struct cache_head \*new) + test if the keys in the two structures match. Return + 1 if they do, 0 if they don't. + + void init(struct cache_head \*orig, struct cache_head \*new) + Set the 'key' fields in 'new' from 'orig'. This may + include taking references to shared objects. + + void update(struct cache_head \*orig, struct cache_head \*new) + Set the 'content' fileds in 'new' from 'orig'. + + int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h) + Optional. Used to provide a /proc file that lists the + contents of a cache. This should show one item, + usually on just one line. + + int cache_request(struct cache_detail \*cd, struct cache_head \*h, char \*\*bpp, int \*blen) + Format a request to be send to user-space for an item + to be instantiated. \*bpp is a buffer of size \*blen. + bpp should be moved forward over the encoded message, + and \*blen should be reduced to show how much free + space remains. Return 0 on success or <0 if not + enough room or other problem. + + int cache_parse(struct cache_detail \*cd, char \*buf, int len) + A message from user space has arrived to fill out a + cache entry. It is in 'buf' of length 'len'. + cache_parse should parse this, find the item in the + cache with sunrpc_cache_lookup_rcu, and update the item + with sunrpc_cache_update. + + +- A cache needs to be registered using cache_register(). This + includes it on a list of caches that will be regularly + cleaned to discard old data. + +Using a cache +------------- + +To find a value in a cache, call sunrpc_cache_lookup_rcu passing a pointer +to the cache_head in a sample item with the 'key' fields filled in. +This will be passed to ->match to identify the target entry. 
If no +entry is found, a new entry will be create, added to the cache, and +marked as not containing valid data. + +The item returned is typically passed to cache_check which will check +if the data is valid, and may initiate an up-call to get fresh data. +cache_check will return -ENOENT in the entry is negative or if an up +call is needed but not possible, -EAGAIN if an upcall is pending, +or 0 if the data is valid; + +cache_check can be passed a "struct cache_req\*". This structure is +typically embedded in the actual request and can be used to create a +deferred copy of the request (struct cache_deferred_req). This is +done when the found cache item is not uptodate, but the is reason to +believe that userspace might provide information soon. When the cache +item does become valid, the deferred copy of the request will be +revisited (->revisit). It is expected that this method will +reschedule the request for processing. + +The value returned by sunrpc_cache_lookup_rcu can also be passed to +sunrpc_cache_update to set the content for the item. A second item is +passed which should hold the content. If the item found by _lookup +has valid data, then it is discarded and a new item is created. This +saves any user of an item from worrying about content changing while +it is being inspected. If the item found by _lookup does not contain +valid data, then the content is copied across and CACHE_VALID is set. + +Populating a cache +------------------ + +Each cache has a name, and when the cache is registered, a directory +with that name is created in /proc/net/rpc + +This directory contains a file called 'channel' which is a channel +for communicating between kernel and user for populating the cache. +This directory may later contain other files of interacting +with the cache. + +The 'channel' works a bit like a datagram socket. Each 'write' is +passed as a whole to the cache for parsing and interpretation. 
+Each cache can treat the write requests differently, but it is +expected that a message written will contain: + + - a key + - an expiry time + - a content. + +with the intention that an item in the cache with the give key +should be create or updated to have the given content, and the +expiry time should be set on that item. + +Reading from a channel is a bit more interesting. When a cache +lookup fails, or when it succeeds but finds an entry that may soon +expire, a request is lodged for that cache item to be updated by +user-space. These requests appear in the channel file. + +Successive reads will return successive requests. +If there are no more requests to return, read will return EOF, but a +select or poll for read will block waiting for another request to be +added. + +Thus a user-space helper is likely to:: + + open the channel. + select for readable + read a request + write a response + loop. + +If it dies and needs to be restarted, any requests that have not been +answered will still appear in the file and will be read by the new +instance of the helper. + +Each cache should define a "cache_parse" method which takes a message +written from user-space and processes it. It should return an error +(which propagates back to the write syscall) or 0. + +Each cache should also define a "cache_request" method which +takes a cache item and encodes a request into the buffer +provided. + +.. note:: + If a cache has no active readers on the channel, and has had not + active readers for more than 60 seconds, further requests will not be + added to the channel but instead all lookups that do not find a valid + entry will fail. This is partly for backward compatibility: The + previous nfs exports table was deemed to be authoritative and a + failed lookup meant a definite 'no'. 
+ +request/response format +----------------------- + +While each cache is free to use its own format for requests +and responses over channel, the following is recommended as +appropriate and support routines are available to help: +Each request or response record should be printable ASCII +with precisely one newline character which should be at the end. +Fields within the record should be separated by spaces, normally one. +If spaces, newlines, or nul characters are needed in a field they +much be quoted. two mechanisms are available: + +- If a field begins '\x' then it must contain an even number of + hex digits, and pairs of these digits provide the bytes in the + field. +- otherwise a \ in the field must be followed by 3 octal digits + which give the code for a byte. Other characters are treated + as them selves. At the very least, space, newline, nul, and + '\' must be quoted in this way. diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt deleted file mode 100644 index c4dac829db0f..000000000000 --- a/Documentation/filesystems/nfs/rpc-cache.txt +++ /dev/null @@ -1,202 +0,0 @@ - This document gives a brief introduction to the caching -mechanisms in the sunrpc layer that is used, in particular, -for NFS authentication. - -CACHES -====== -The caching replaces the old exports table and allows for -a wide variety of values to be caches. - -There are a number of caches that are similar in structure though -quite possibly very different in content and use. There is a corpus -of common code for managing these caches. - -Examples of caches that are likely to be needed are: - - mapping from IP address to client name - - mapping from client name and filesystem to export options - - mapping from UID to list of GIDs, to work around NFS's limitation - of 16 gids. 
- - mappings between local UID/GID and remote UID/GID for sites that - do not have uniform uid assignment - - mapping from network identify to public key for crypto authentication. - -The common code handles such things as: - - general cache lookup with correct locking - - supporting 'NEGATIVE' as well as positive entries - - allowing an EXPIRED time on cache items, and removing - items after they expire, and are no longer in-use. - - making requests to user-space to fill in cache entries - - allowing user-space to directly set entries in the cache - - delaying RPC requests that depend on as-yet incomplete - cache entries, and replaying those requests when the cache entry - is complete. - - clean out old entries as they expire. - -Creating a Cache ----------------- - -1/ A cache needs a datum to store. This is in the form of a - structure definition that must contain a - struct cache_head - as an element, usually the first. - It will also contain a key and some content. - Each cache element is reference counted and contains - expiry and update times for use in cache management. -2/ A cache needs a "cache_detail" structure that - describes the cache. This stores the hash table, some - parameters for cache management, and some operations detailing how - to work with particular cache items. - The operations requires are: - struct cache_head *alloc(void) - This simply allocates appropriate memory and returns - a pointer to the cache_detail embedded within the - structure - void cache_put(struct kref *) - This is called when the last reference to an item is - dropped. The pointer passed is to the 'ref' field - in the cache_head. cache_put should release any - references create by 'cache_init' and, if CACHE_VALID - is set, any references created by cache_update. - It should then release the memory allocated by - 'alloc'. - int match(struct cache_head *orig, struct cache_head *new) - test if the keys in the two structures match. Return - 1 if they do, 0 if they don't. 
- void init(struct cache_head *orig, struct cache_head *new) - Set the 'key' fields in 'new' from 'orig'. This may - include taking references to shared objects. - void update(struct cache_head *orig, struct cache_head *new) - Set the 'content' fileds in 'new' from 'orig'. - int cache_show(struct seq_file *m, struct cache_detail *cd, - struct cache_head *h) - Optional. Used to provide a /proc file that lists the - contents of a cache. This should show one item, - usually on just one line. - int cache_request(struct cache_detail *cd, struct cache_head *h, - char **bpp, int *blen) - Format a request to be send to user-space for an item - to be instantiated. *bpp is a buffer of size *blen. - bpp should be moved forward over the encoded message, - and *blen should be reduced to show how much free - space remains. Return 0 on success or <0 if not - enough room or other problem. - int cache_parse(struct cache_detail *cd, char *buf, int len) - A message from user space has arrived to fill out a - cache entry. It is in 'buf' of length 'len'. - cache_parse should parse this, find the item in the - cache with sunrpc_cache_lookup_rcu, and update the item - with sunrpc_cache_update. - - -3/ A cache needs to be registered using cache_register(). This - includes it on a list of caches that will be regularly - cleaned to discard old data. - -Using a cache -------------- - -To find a value in a cache, call sunrpc_cache_lookup_rcu passing a pointer -to the cache_head in a sample item with the 'key' fields filled in. -This will be passed to ->match to identify the target entry. If no -entry is found, a new entry will be create, added to the cache, and -marked as not containing valid data. - -The item returned is typically passed to cache_check which will check -if the data is valid, and may initiate an up-call to get fresh data. 
-cache_check will return -ENOENT in the entry is negative or if an up -call is needed but not possible, -EAGAIN if an upcall is pending, -or 0 if the data is valid; - -cache_check can be passed a "struct cache_req *". This structure is -typically embedded in the actual request and can be used to create a -deferred copy of the request (struct cache_deferred_req). This is -done when the found cache item is not uptodate, but the is reason to -believe that userspace might provide information soon. When the cache -item does become valid, the deferred copy of the request will be -revisited (->revisit). It is expected that this method will -reschedule the request for processing. - -The value returned by sunrpc_cache_lookup_rcu can also be passed to -sunrpc_cache_update to set the content for the item. A second item is -passed which should hold the content. If the item found by _lookup -has valid data, then it is discarded and a new item is created. This -saves any user of an item from worrying about content changing while -it is being inspected. If the item found by _lookup does not contain -valid data, then the content is copied across and CACHE_VALID is set. - -Populating a cache ------------------- - -Each cache has a name, and when the cache is registered, a directory -with that name is created in /proc/net/rpc - -This directory contains a file called 'channel' which is a channel -for communicating between kernel and user for populating the cache. -This directory may later contain other files of interacting -with the cache. - -The 'channel' works a bit like a datagram socket. Each 'write' is -passed as a whole to the cache for parsing and interpretation. -Each cache can treat the write requests differently, but it is -expected that a message written will contain: - - a key - - an expiry time - - a content. 
-with the intention that an item in the cache with the give key -should be create or updated to have the given content, and the -expiry time should be set on that item. - -Reading from a channel is a bit more interesting. When a cache -lookup fails, or when it succeeds but finds an entry that may soon -expire, a request is lodged for that cache item to be updated by -user-space. These requests appear in the channel file. - -Successive reads will return successive requests. -If there are no more requests to return, read will return EOF, but a -select or poll for read will block waiting for another request to be -added. - -Thus a user-space helper is likely to: - open the channel. - select for readable - read a request - write a response - loop. - -If it dies and needs to be restarted, any requests that have not been -answered will still appear in the file and will be read by the new -instance of the helper. - -Each cache should define a "cache_parse" method which takes a message -written from user-space and processes it. It should return an error -(which propagates back to the write syscall) or 0. - -Each cache should also define a "cache_request" method which -takes a cache item and encodes a request into the buffer -provided. - -Note: If a cache has no active readers on the channel, and has had not -active readers for more than 60 seconds, further requests will not be -added to the channel but instead all lookups that do not find a valid -entry will fail. This is partly for backward compatibility: The -previous nfs exports table was deemed to be authoritative and a -failed lookup meant a definite 'no'. - -request/response format ------------------------ - -While each cache is free to use its own format for requests -and responses over channel, the following is recommended as -appropriate and support routines are available to help: -Each request or response record should be printable ASCII -with precisely one newline character which should be at the end. 
-Fields within the record should be separated by spaces, normally one. -If spaces, newlines, or nul characters are needed in a field they -much be quoted. two mechanisms are available: -1/ If a field begins '\x' then it must contain an even number of - hex digits, and pairs of these digits provide the bytes in the - field. -2/ otherwise a \ in the field must be followed by 3 octal digits - which give the code for a byte. Other characters are treated - as them selves. At the very least, space, newline, nul, and - '\' must be quoted in this way. -- cgit From 250baf06aacf4eafb5641c86c91f2b1df4cf7d86 Mon Sep 17 00:00:00 2001 From: "Daniel W. S. Almeida" Date: Wed, 29 Jan 2020 01:49:15 -0300 Subject: Documentation: nfs: rpc-server-gss: convert to ReST Convert rpc-server-gss.txt to ReST. Content remains mostly unchanged. Signed-off-by: Daniel W. S. Almeida Link: https://lore.kernel.org/r/20200129044917.566906-4-dwlsalmeida@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/nfs/index.rst | 1 + Documentation/filesystems/nfs/rpc-server-gss.rst | 94 ++++++++++++++++++++++++ Documentation/filesystems/nfs/rpc-server-gss.txt | 91 ----------------------- 3 files changed, 95 insertions(+), 91 deletions(-) create mode 100644 Documentation/filesystems/nfs/rpc-server-gss.rst delete mode 100644 Documentation/filesystems/nfs/rpc-server-gss.txt diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst index 52f4956e7770..9d5365cbe2c3 100644 --- a/Documentation/filesystems/nfs/index.rst +++ b/Documentation/filesystems/nfs/index.rst @@ -8,3 +8,4 @@ NFS pnfs rpc-cache + rpc-server-gss diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst new file mode 100644 index 000000000000..812754576845 --- /dev/null +++ b/Documentation/filesystems/nfs/rpc-server-gss.rst @@ -0,0 +1,94 @@ +========================================= +rpcsec_gss support for kernel RPC servers 
+========================================= + +This document gives references to the standards and protocols used to +implement RPCGSS authentication in kernel RPC servers such as the NFS +server and the NFS client's NFSv4.0 callback server. (But note that +NFSv4.1 and higher don't require the client to act as a server for the +purposes of authentication.) + +RPCGSS is specified in a few IETF documents: + + - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt + - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt + +and there is a 3rd version being proposed: + + - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt + (At draft n. 02 at the time of writing) + +Background +========== + +The RPCGSS Authentication method describes a way to perform GSSAPI +Authentication for NFS. Although GSSAPI is itself completely mechanism +agnostic, in many cases only the KRB5 mechanism is supported by NFS +implementations. + +The Linux kernel, at the moment, supports only the KRB5 mechanism, and +depends on GSSAPI extensions that are KRB5 specific. + +GSSAPI is a complex library, and implementing it completely in kernel is +unwarranted. However GSSAPI operations are fundementally separable in 2 +parts: + +- initial context establishment +- integrity/privacy protection (signing and encrypting of individual + packets) + +The former is more complex and policy-independent, but less +performance-sensitive. The latter is simpler and needs to be very fast. + +Therefore, we perform per-packet integrity and privacy protection in the +kernel, but leave the initial context establishment to userspace. We +need upcalls to request userspace to perform context establishment. + +NFS Server Legacy Upcall Mechanism +================================== + +The classic upcall mechanism uses a custom text based upcall mechanism +to talk to a custom daemon called rpc.svcgssd that is provide by the +nfs-utils package. 
+ +This upcall mechanism has 2 limitations: + +A) It can handle tokens that are no bigger than 2KiB + +In some Kerberos deployment GSSAPI tokens can be quite big, up and +beyond 64KiB in size due to various authorization extensions attacked to +the Kerberos tickets, that needs to be sent through the GSS layer in +order to perform context establishment. + +B) It does not properly handle creds where the user is member of more +than a few thousand groups (the current hard limit in the kernel is 65K +groups) due to limitation on the size of the buffer that can be send +back to the kernel (4KiB). + +NFS Server New RPC Upcall Mechanism +=================================== + +The newer upcall mechanism uses RPC over a unix socket to a daemon +called gss-proxy, implemented by a userspace program called Gssproxy. + +The gss_proxy RPC protocol is currently documented `here +<https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation>`_. + +This upcall mechanism uses the kernel rpc client and connects to the gssproxy +userspace program over a regular unix socket. The gssproxy protocol does not +suffer from the size limitations of the legacy protocol. + +Negotiating Upcall Mechanisms +============================= + +To provide backward compatibility, the kernel defaults to using the +legacy mechanism. To switch to the new mechanism, gss-proxy must bind +to /var/run/gssproxy.sock and then write "1" to +/proc/net/rpc/use-gss-proxy. If gss-proxy dies, it must repeat both +steps. + +Once the upcall mechanism is chosen, it cannot be changed. To prevent +locking into the legacy mechanisms, the above steps must be performed +before starting nfsd. Whoever starts nfsd can guarantee this by reading +from /proc/net/rpc/use-gss-proxy and checking that it contains a +"1"--the read will block until gss-proxy has done its write to the file. 
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.txt b/Documentation/filesystems/nfs/rpc-server-gss.txt deleted file mode 100644 index 310bbbaf9080..000000000000 --- a/Documentation/filesystems/nfs/rpc-server-gss.txt +++ /dev/null @@ -1,91 +0,0 @@ - -rpcsec_gss support for kernel RPC servers -========================================= - -This document gives references to the standards and protocols used to -implement RPCGSS authentication in kernel RPC servers such as the NFS -server and the NFS client's NFSv4.0 callback server. (But note that -NFSv4.1 and higher don't require the client to act as a server for the -purposes of authentication.) - -RPCGSS is specified in a few IETF documents: - - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt - - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt -and there is a 3rd version being proposed: - - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt - (At draft n. 02 at the time of writing) - -Background ----------- - -The RPCGSS Authentication method describes a way to perform GSSAPI -Authentication for NFS. Although GSSAPI is itself completely mechanism -agnostic, in many cases only the KRB5 mechanism is supported by NFS -implementations. - -The Linux kernel, at the moment, supports only the KRB5 mechanism, and -depends on GSSAPI extensions that are KRB5 specific. - -GSSAPI is a complex library, and implementing it completely in kernel is -unwarranted. However GSSAPI operations are fundementally separable in 2 -parts: -- initial context establishment -- integrity/privacy protection (signing and encrypting of individual - packets) - -The former is more complex and policy-independent, but less -performance-sensitive. The latter is simpler and needs to be very fast. - -Therefore, we perform per-packet integrity and privacy protection in the -kernel, but leave the initial context establishment to userspace. We -need upcalls to request userspace to perform context establishment. 
- -NFS Server Legacy Upcall Mechanism ----------------------------------- - -The classic upcall mechanism uses a custom text based upcall mechanism -to talk to a custom daemon called rpc.svcgssd that is provide by the -nfs-utils package. - -This upcall mechanism has 2 limitations: - -A) It can handle tokens that are no bigger than 2KiB - -In some Kerberos deployment GSSAPI tokens can be quite big, up and -beyond 64KiB in size due to various authorization extensions attacked to -the Kerberos tickets, that needs to be sent through the GSS layer in -order to perform context establishment. - -B) It does not properly handle creds where the user is member of more -than a few thousand groups (the current hard limit in the kernel is 65K -groups) due to limitation on the size of the buffer that can be send -back to the kernel (4KiB). - -NFS Server New RPC Upcall Mechanism ------------------------------------ - -The newer upcall mechanism uses RPC over a unix socket to a daemon -called gss-proxy, implemented by a userspace program called Gssproxy. - -The gss_proxy RPC protocol is currently documented here: - - https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation - -This upcall mechanism uses the kernel rpc client and connects to the gssproxy -userspace program over a regular unix socket. The gssproxy protocol does not -suffer from the size limitations of the legacy protocol. - -Negotiating Upcall Mechanisms ------------------------------ - -To provide backward compatibility, the kernel defaults to using the -legacy mechanism. To switch to the new mechanism, gss-proxy must bind -to /var/run/gssproxy.sock and then write "1" to -/proc/net/rpc/use-gss-proxy. If gss-proxy dies, it must repeat both -steps. - -Once the upcall mechanism is chosen, it cannot be changed. To prevent -locking into the legacy mechanisms, the above steps must be performed -before starting nfsd. 
Whoever starts nfsd can guarantee this by reading -from /proc/net/rpc/use-gss-proxy and checking that it contains a -"1"--the read will block until gss-proxy has done its write to the file. -- cgit From 04f81fb08d067f79c59fe132929a9c81eb9cb74b Mon Sep 17 00:00:00 2001 From: "Daniel W. S. Almeida" Date: Wed, 29 Jan 2020 01:49:16 -0300 Subject: Documentation: nfs: nfs41-server: convert to ReST Convert nfs41-server.txt to ReST. ASCII tables were converted to ReST grid table format. Signed-off-by: Daniel W. S. Almeida Link: https://lore.kernel.org/r/20200129044917.566906-5-dwlsalmeida@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/nfs/index.rst | 1 + Documentation/filesystems/nfs/nfs41-server.rst | 256 +++++++++++++++++++++++++ Documentation/filesystems/nfs/nfs41-server.txt | 173 ----------------- 3 files changed, 257 insertions(+), 173 deletions(-) create mode 100644 Documentation/filesystems/nfs/nfs41-server.rst delete mode 100644 Documentation/filesystems/nfs/nfs41-server.txt diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst index 9d5365cbe2c3..a0a678af921b 100644 --- a/Documentation/filesystems/nfs/index.rst +++ b/Documentation/filesystems/nfs/index.rst @@ -9,3 +9,4 @@ NFS pnfs rpc-cache rpc-server-gss + nfs41-server diff --git a/Documentation/filesystems/nfs/nfs41-server.rst b/Documentation/filesystems/nfs/nfs41-server.rst new file mode 100644 index 000000000000..16b5f02f81c3 --- /dev/null +++ b/Documentation/filesystems/nfs/nfs41-server.rst @@ -0,0 +1,256 @@ +============================= +NFSv4.1 Server Implementation +============================= + +Server support for minorversion 1 can be controlled using the +/proc/fs/nfsd/versions control file. The string output returned +by reading this file will contain either "+4.1" or "-4.1" +correspondingly. + +Currently, server support for minorversion 1 is enabled by default. 
+It can be disabled at run time by writing the string "-4.1" to +the /proc/fs/nfsd/versions control file. Note that to write this +control file, the nfsd service must be taken down. You can use rpc.nfsd +for this; see rpc.nfsd(8). + +(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and +"-4", respectively. Therefore, code meant to work on both new and old +kernels must turn 4.1 on or off *before* turning support for version 4 +on or off; rpc.nfsd does this correctly.) + +The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based +on RFC 5661. + +Of the many new features in NFSv4.1, the current implementation +focuses on the mandatory-to-implement NFSv4.1 Sessions, providing +"exactly once" semantics and better control and throttling of the +resources allocated for each client. + +The table below, taken from the NFSv4.1 document, lists +the operations that are mandatory to implement (REQ), optional +(OPT), and NFSv4.0 operations that are mandatory to not implement (MNI) +in minor version 1. The first column indicates the status of each +operation in the Linux server implementation. + +The OPTIONAL features identified and their abbreviations are as follows: + +- **pNFS** Parallel NFS +- **FDELG** File Delegations +- **DDELG** Directory Delegations + +The following abbreviations indicate the Linux server implementation status. + +- **I** Implemented NFSv4.1 operations. +- **NS** Not Supported. +- **NS\*** Unimplemented optional feature.
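As a small illustration of reading the /proc/fs/nfsd/versions control file mentioned earlier, the sketch below looks for the "+4.1" or "-4.1" string in its output. The helper name `minorversion1_enabled` is hypothetical, and treating the output as whitespace-separated tokens is an assumption; the text only states that the output contains one of the two strings.

```python
def minorversion1_enabled(versions_text):
    """Report NFSv4.1 status from the versions control file output.

    Assumption: the output is a list of whitespace-separated tokens
    such as "+4.1" or "-4.1". Returns True, False, or None when
    neither token is present.
    """
    tokens = versions_text.split()
    if "+4.1" in tokens:
        return True
    if "-4.1" in tokens:
        return False
    return None
```

For instance, with a made-up sample string, `minorversion1_enabled("+3 +4 -4.1")` reports that minorversion 1 is disabled.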
+ +Operations +========== + ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| Implementation status | Operation | REQ,REC, OPT or MNI | Feature (REQ, REC or OPT) | Definition | ++=======================+======================+=====================+===========================+================+ +| | ACCESS | REQ | | Section 18.1 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | BACKCHANNEL_CTL | REQ | | Section 18.33 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | CLOSE | REQ | | Section 18.2 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | COMMIT | REQ | | Section 18.3 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | CREATE | REQ | | Section 18.4 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | CREATE_SESSION | REQ | | Section 18.36 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | DELEGRETURN | OPT | FDELG, | Section 18.6 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | DDELG, pNFS | | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | (REQ) | |
++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | DESTROY_CLIENTID | REQ | | Section 18.50 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | DESTROY_SESSION | REQ | | Section 18.37 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | EXCHANGE_ID | REQ | | Section 18.35 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | FREE_STATEID | REQ | | Section 18.38 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | GETATTR | REQ | | Section 18.7 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | GETFH | REQ | | Section 18.8 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTRETURN | OPT | pNFS 
(REQ) | Section 18.44 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LINK | OPT | | Section 18.9 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCK | REQ | | Section 18.10 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCKT | REQ | | Section 18.11 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCKU | REQ | | Section 18.12 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOOKUP | REQ | | Section 18.13 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOOKUPP | REQ | | Section 18.14 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | NVERIFY | REQ | | Section 18.15 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN | REQ | | Section 18.16 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | OPENATTR | OPT | | Section 18.17 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN_CONFIRM | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN_DOWNGRADE | REQ | | Section 18.18 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTFH | REQ | | Section 18.19 | 
++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTPUBFH | REQ | | Section 18.20 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTROOTFH | REQ | | Section 18.21 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READ | REQ | | Section 18.22 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READDIR | REQ | | Section 18.23 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READLINK | OPT | | Section 18.24 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RECLAIM_COMPLETE | REQ | | Section 18.51 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RELEASE_LOCKOWNER | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | REMOVE | REQ | | Section 18.25 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RENAME | REQ | | Section 18.26 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RENEW | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RESTOREFH | REQ | | Section 18.27 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SAVEFH | REQ | | Section 18.28 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SECINFO 
| REQ | | Section 18.29 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | layout (REQ) | Section 13.12 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | SEQUENCE | REQ | | Section 18.46 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETATTR | REQ | | Section 18.30 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETCLIENTID | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETCLIENTID_CONFIRM | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS | SET_SSV | REQ | | Section 18.47 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | TEST_STATEID | REQ | | Section 18.48 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | VERIFY | REQ | | Section 18.31 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | WRITE | REQ | | Section 18.32 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ + + +Callback Operations +=================== 
++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| Implementation status | Operation | REQ,REC, OPT or MNI | Feature (REQ, REC or OPT) | Definition | ++=======================+=========================+=====================+===========================+===============+ +| | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY_LOCK | OPT | | Section 20.11 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | CB_RECALL | OPT | FDELG, | Section 20.2 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS | CB_RECALL_SLOT | REQ | | Section 20.8 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ + + +Implementation notes: +===================== + +SSV: + The spec claims this is mandatory, but we don't actually know of any + implementations, so we're ignoring it for now. 
The server returns + NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof. + +GSS on the backchannel: + Again, theoretically required but not widely implemented (in + particular, the current Linux client doesn't request it). We return + NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION. + +DELEGPURGE: + mandatory only for servers that support CLAIM_DELEGATE_PREV and/or + CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that + persist across client reboots). Thus we need not implement this for + now. + +EXCHANGE_ID: + implementation ids are ignored + +CREATE_SESSION: + backchannel attributes are ignored + +SEQUENCE: + no support for dynamic slot table renegotiation (optional) + +Nonstandard compound limitations: + No support for a sessions fore channel RPC compound that requires both a + ca_maxrequestsize request and a ca_maxresponsesize reply, so we may + fail to live up to the promise we made in CREATE_SESSION fore channel + negotiation. + +See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt deleted file mode 100644 index 682a59fabe3f..000000000000 --- a/Documentation/filesystems/nfs/nfs41-server.txt +++ /dev/null @@ -1,173 +0,0 @@ -NFSv4.1 Server Implementation - -Server support for minorversion 1 can be controlled using the -/proc/fs/nfsd/versions control file. The string output returned -by reading this file will contain either "+4.1" or "-4.1" -correspondingly. - -Currently, server support for minorversion 1 is enabled by default. -It can be disabled at run time by writing the string "-4.1" to -the /proc/fs/nfsd/versions control file. Note that to write this -control file, the nfsd service must be taken down. You can use rpc.nfsd -for this; see rpc.nfsd(8). - -(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and -"-4", respectively. 
Therefore, code meant to work on both new and old -kernels must turn 4.1 on or off *before* turning support for version 4 -on or off; rpc.nfsd does this correctly.) - -The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based -on RFC 5661. - -From the many new features in NFSv4.1 the current implementation -focuses on the mandatory-to-implement NFSv4.1 Sessions, providing -"exactly once" semantics and better control and throttling of the -resources allocated for each client. - -The table below, taken from the NFSv4.1 document, lists -the operations that are mandatory to implement (REQ), optional -(OPT), and NFSv4.0 operations that are required not to implement (MNI) -in minor version 1. The first column indicates the operations that -are not supported yet by the linux server implementation. - -The OPTIONAL features identified and their abbreviations are as follows: - pNFS Parallel NFS - FDELG File Delegations - DDELG Directory Delegations - -The following abbreviations indicate the linux server implementation status. - I Implemented NFSv4.1 operations. - NS Not Supported. - NS* Unimplemented optional feature. 
- -Operations - - +----------------------+------------+--------------+----------------+ - | Operation | REQ, REC, | Feature | Definition | - | | OPT, or | (REQ, REC, | | - | | MNI | or OPT) | | - +----------------------+------------+--------------+----------------+ - | ACCESS | REQ | | Section 18.1 | -I | BACKCHANNEL_CTL | REQ | | Section 18.33 | -I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | - | CLOSE | REQ | | Section 18.2 | - | COMMIT | REQ | | Section 18.3 | - | CREATE | REQ | | Section 18.4 | -I | CREATE_SESSION | REQ | | Section 18.36 | -NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | - | DELEGRETURN | OPT | FDELG, | Section 18.6 | - | | | DDELG, pNFS | | - | | | (REQ) | | -I | DESTROY_CLIENTID | REQ | | Section 18.50 | -I | DESTROY_SESSION | REQ | | Section 18.37 | -I | EXCHANGE_ID | REQ | | Section 18.35 | -I | FREE_STATEID | REQ | | Section 18.38 | - | GETATTR | REQ | | Section 18.7 | -I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | -NS*| GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | - | GETFH | REQ | | Section 18.8 | -NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | -I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | -I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | -I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | - | LINK | OPT | | Section 18.9 | - | LOCK | REQ | | Section 18.10 | - | LOCKT | REQ | | Section 18.11 | - | LOCKU | REQ | | Section 18.12 | - | LOOKUP | REQ | | Section 18.13 | - | LOOKUPP | REQ | | Section 18.14 | - | NVERIFY | REQ | | Section 18.15 | - | OPEN | REQ | | Section 18.16 | -NS*| OPENATTR | OPT | | Section 18.17 | - | OPEN_CONFIRM | MNI | | N/A | - | OPEN_DOWNGRADE | REQ | | Section 18.18 | - | PUTFH | REQ | | Section 18.19 | - | PUTPUBFH | REQ | | Section 18.20 | - | PUTROOTFH | REQ | | Section 18.21 | - | READ | REQ | | Section 18.22 | - | READDIR | REQ | | Section 18.23 | - | READLINK | OPT | | Section 18.24 | - | RECLAIM_COMPLETE | REQ | | Section 18.51 | - | RELEASE_LOCKOWNER | MNI 
| | N/A | - | REMOVE | REQ | | Section 18.25 | - | RENAME | REQ | | Section 18.26 | - | RENEW | MNI | | N/A | - | RESTOREFH | REQ | | Section 18.27 | - | SAVEFH | REQ | | Section 18.28 | - | SECINFO | REQ | | Section 18.29 | -I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | - | | | layout (REQ) | Section 13.12 | -I | SEQUENCE | REQ | | Section 18.46 | - | SETATTR | REQ | | Section 18.30 | - | SETCLIENTID | MNI | | N/A | - | SETCLIENTID_CONFIRM | MNI | | N/A | -NS | SET_SSV | REQ | | Section 18.47 | -I | TEST_STATEID | REQ | | Section 18.48 | - | VERIFY | REQ | | Section 18.31 | -NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | - | WRITE | REQ | | Section 18.32 | - -Callback Operations - - +-------------------------+-----------+-------------+---------------+ - | Operation | REQ, REC, | Feature | Definition | - | | OPT, or | (REQ, REC, | | - | | MNI | or OPT) | | - +-------------------------+-----------+-------------+---------------+ - | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | -I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | -NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | -NS*| CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | -NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 | -NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | - | CB_RECALL | OPT | FDELG, | Section 20.2 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS | CB_RECALL_SLOT | REQ | | Section 20.8 | -NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | - | | | (REQ) | | -I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | - | | | DDELG, pNFS | | - | | | (REQ) | | - +-------------------------+-----------+-------------+---------------+ - -Implementation notes: - -SSV: -* The spec claims this is mandatory, but we don't actually know of any - implementations, so 
we're ignoring it for now. The server returns - NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof. - -GSS on the backchannel: -* Again, theoretically required but not widely implemented (in - particular, the current Linux client doesn't request it). We return - NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION. - -DELEGPURGE: -* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or - CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that - persist across client reboots). Thus we need not implement this for - now. - -EXCHANGE_ID: -* implementation ids are ignored - -CREATE_SESSION: -* backchannel attributes are ignored - -SEQUENCE: -* no support for dynamic slot table renegotiation (optional) - -Nonstandard compound limitations: -* No support for a sessions fore channel RPC compound that requires both a - ca_maxrequestsize request and a ca_maxresponsesize reply, so we may - fail to live up to the promise we made in CREATE_SESSION fore channel - negotiation. - -See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. -- cgit From cb63032b1233e03ac20fc2b60820a50d605b9bc0 Mon Sep 17 00:00:00 2001 From: "Daniel W. S. Almeida" Date: Wed, 29 Jan 2020 01:49:17 -0300 Subject: Documentation: nfs: knfsd-stats: convert to ReST Convert knfsd-stats.txt to ReST. Content remains mostly the same. Signed-off-by: Daniel W. S. 
Almeida Link: https://lore.kernel.org/r/20200129044917.566906-6-dwlsalmeida@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/nfs/index.rst | 1 + Documentation/filesystems/nfs/knfsd-stats.rst | 122 +++++++++++++++++++++++++ Documentation/filesystems/nfs/knfsd-stats.txt | 123 -------------------------- 3 files changed, 123 insertions(+), 123 deletions(-) create mode 100644 Documentation/filesystems/nfs/knfsd-stats.rst delete mode 100644 Documentation/filesystems/nfs/knfsd-stats.txt diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst index a0a678af921b..65805624e39b 100644 --- a/Documentation/filesystems/nfs/index.rst +++ b/Documentation/filesystems/nfs/index.rst @@ -10,3 +10,4 @@ NFS rpc-cache rpc-server-gss nfs41-server + knfsd-stats diff --git a/Documentation/filesystems/nfs/knfsd-stats.rst b/Documentation/filesystems/nfs/knfsd-stats.rst new file mode 100644 index 000000000000..80bcf13550de --- /dev/null +++ b/Documentation/filesystems/nfs/knfsd-stats.rst @@ -0,0 +1,122 @@ +============================ +Kernel NFS Server Statistics +============================ + +:Authors: Greg Banks - 26 Mar 2009 + +This document describes the format and semantics of the statistics +which the kernel NFS server makes available to userspace. These +statistics are available in several text form pseudo files, each of +which is described separately below. + +In most cases you don't need to know these formats, as the nfsstat(8) +program from the nfs-utils distribution provides a helpful command-line +interface for extracting and printing them. + +All the files described here are formatted as a sequence of text lines, +separated by newline '\n' characters. Lines beginning with a hash +'#' character are comments intended for humans and should be ignored +by parsing routines. All other lines contain a sequence of fields +separated by whitespace. 
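The general line format described above (text lines, '#' comment lines to be ignored, whitespace-separated fields) could be handled with a small parser like the one below. This is a minimal sketch under those stated rules only; `parse_stats_lines` is a hypothetical name and is not part of nfsstat or nfs-utils.

```python
def parse_stats_lines(text):
    """Split statistics file text into rows of string fields.

    Lines beginning with '#' are comments and are skipped, as are
    empty lines. Every other line is split on whitespace.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(line.split())
    return rows
```

Fields are left as strings so the caller can decide how to convert them; for the numeric counters a caller would typically apply `int` to each field.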
+ +/proc/fs/nfsd/pool_stats +======================== + +This file is available in kernels from 2.6.30 onwards, if the +/proc/fs/nfsd filesystem is mounted (it almost always should be). + +The first line is a comment which describes the fields present in +all the other lines. The other lines present the following data as +a sequence of unsigned decimal numeric fields. One line is shown +for each NFS thread pool. + +All counters are 64 bits wide and wrap naturally. There is no way +to zero these counters, instead applications should do their own +rate conversion. + +pool + The id number of the NFS thread pool to which this line applies. + This number does not change. + + Thread pool ids are a contiguous set of small integers starting + at zero. The maximum value depends on the thread pool mode, but + currently cannot be larger than the number of CPUs in the system. + Note that in the default case there will be a single thread pool + which contains all the nfsd threads and all the CPUs in the system, + and thus this file will have a single line with a pool id of "0". + +packets-arrived + Counts how many NFS packets have arrived. More precisely, this + is the number of times that the network stack has notified the + sunrpc server layer that new data may be available on a transport + (e.g. an NFS or UDP socket or an NFS/RDMA endpoint). + + Depending on the NFS workload patterns and various network stack + effects (such as Large Receive Offload) which can combine packets + on the wire, this may be either more or less than the number + of NFS calls received (which statistic is available elsewhere). + However this is a more accurate and less workload-dependent measure + of how much CPU load is being placed on the sunrpc server layer + due to NFS network traffic. + +sockets-enqueued + Counts how many times an NFS transport is enqueued to wait for + an nfsd thread to service it, i.e. no nfsd thread was considered + available. 
+ + The circumstance this statistic tracks indicates that there was NFS + network-facing work to be done but it couldn't be done immediately, + thus introducing a small delay in servicing NFS calls. The ideal + rate of change for this counter is zero; significantly non-zero + values may indicate a performance limitation. + + This can happen because there are too few nfsd threads in the thread + pool for the NFS workload (the workload is thread-limited), in which + case configuring more nfsd threads will probably improve the + performance of the NFS workload. + +threads-woken + Counts how many times an idle nfsd thread is woken to try to + receive some data from an NFS transport. + + This statistic tracks the circumstance where incoming + network-facing NFS work is being handled quickly, which is a good + thing. The ideal rate of change for this counter will be close + to but less than the rate of change of the packets-arrived counter. + +threads-timedout + Counts how many times an nfsd thread triggered an idle timeout, + i.e. was not woken to handle any incoming network packets for + some time. + + This statistic counts a circumstance where there are more nfsd + threads configured than can be used by the NFS workload. This is + a clue that the number of nfsd threads can be reduced without + affecting performance. Unfortunately, it's only a clue and not + a strong indication, for a couple of reasons: + + - Currently the rate at which the counter is incremented is quite + slow; the idle timeout is 60 minutes. Unless the NFS workload + remains constant for hours at a time, this counter is unlikely + to be providing information that is still useful. + + - It is usually a wise policy to provide some slack, + i.e. configure a few more nfsds than are currently needed, + to allow for future spikes in load. + + +Note that incoming packets on NFS transports will be dealt with in +one of three ways. 
An nfsd thread can be woken (threads-woken counts +this case), or the transport can be enqueued for later attention +(sockets-enqueued counts this case), or the packet can be temporarily +deferred because the transport is currently being used by an nfsd +thread. This last case is not very interesting and is not explicitly +counted, but can be inferred from the other counters thus:: + + packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) + + +More +==== + +Descriptions of the other statistics file should go here. diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt deleted file mode 100644 index 1a5d82180b84..000000000000 --- a/Documentation/filesystems/nfs/knfsd-stats.txt +++ /dev/null @@ -1,123 +0,0 @@ - -Kernel NFS Server Statistics -============================ - -This document describes the format and semantics of the statistics -which the kernel NFS server makes available to userspace. These -statistics are available in several text form pseudo files, each of -which is described separately below. - -In most cases you don't need to know these formats, as the nfsstat(8) -program from the nfs-utils distribution provides a helpful command-line -interface for extracting and printing them. - -All the files described here are formatted as a sequence of text lines, -separated by newline '\n' characters. Lines beginning with a hash -'#' character are comments intended for humans and should be ignored -by parsing routines. All other lines contain a sequence of fields -separated by whitespace. - -/proc/fs/nfsd/pool_stats ------------------------- - -This file is available in kernels from 2.6.30 onwards, if the -/proc/fs/nfsd filesystem is mounted (it almost always should be). - -The first line is a comment which describes the fields present in -all the other lines. The other lines present the following data as -a sequence of unsigned decimal numeric fields. 
One line is shown -for each NFS thread pool. - -All counters are 64 bits wide and wrap naturally. There is no way -to zero these counters, instead applications should do their own -rate conversion. - -pool - The id number of the NFS thread pool to which this line applies. - This number does not change. - - Thread pool ids are a contiguous set of small integers starting - at zero. The maximum value depends on the thread pool mode, but - currently cannot be larger than the number of CPUs in the system. - Note that in the default case there will be a single thread pool - which contains all the nfsd threads and all the CPUs in the system, - and thus this file will have a single line with a pool id of "0". - -packets-arrived - Counts how many NFS packets have arrived. More precisely, this - is the number of times that the network stack has notified the - sunrpc server layer that new data may be available on a transport - (e.g. an NFS or UDP socket or an NFS/RDMA endpoint). - - Depending on the NFS workload patterns and various network stack - effects (such as Large Receive Offload) which can combine packets - on the wire, this may be either more or less than the number - of NFS calls received (which statistic is available elsewhere). - However this is a more accurate and less workload-dependent measure - of how much CPU load is being placed on the sunrpc server layer - due to NFS network traffic. - -sockets-enqueued - Counts how many times an NFS transport is enqueued to wait for - an nfsd thread to service it, i.e. no nfsd thread was considered - available. - - The circumstance this statistic tracks indicates that there was NFS - network-facing work to be done but it couldn't be done immediately, - thus introducing a small delay in servicing NFS calls. The ideal - rate of change for this counter is zero; significantly non-zero - values may indicate a performance limitation. 
- - This can happen because there are too few nfsd threads in the thread - pool for the NFS workload (the workload is thread-limited), in which - case configuring more nfsd threads will probably improve the - performance of the NFS workload. - -threads-woken - Counts how many times an idle nfsd thread is woken to try to - receive some data from an NFS transport. - - This statistic tracks the circumstance where incoming - network-facing NFS work is being handled quickly, which is a good - thing. The ideal rate of change for this counter will be close - to but less than the rate of change of the packets-arrived counter. - -threads-timedout - Counts how many times an nfsd thread triggered an idle timeout, - i.e. was not woken to handle any incoming network packets for - some time. - - This statistic counts a circumstance where there are more nfsd - threads configured than can be used by the NFS workload. This is - a clue that the number of nfsd threads can be reduced without - affecting performance. Unfortunately, it's only a clue and not - a strong indication, for a couple of reasons: - - - Currently the rate at which the counter is incremented is quite - slow; the idle timeout is 60 minutes. Unless the NFS workload - remains constant for hours at a time, this counter is unlikely - to be providing information that is still useful. - - - It is usually a wise policy to provide some slack, - i.e. configure a few more nfsds than are currently needed, - to allow for future spikes in load. - - -Note that incoming packets on NFS transports will be dealt with in -one of three ways. An nfsd thread can be woken (threads-woken counts -this case), or the transport can be enqueued for later attention -(sockets-enqueued counts this case), or the packet can be temporarily -deferred because the transport is currently being used by an nfsd -thread. 
This last case is not very interesting and is not explicitly -counted, but can be inferred from the other counters thus: - -packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) - - -More ----- -Descriptions of the other statistics file should go here. - - -Greg Banks -26 Mar 2009 -- cgit From 56e6b3b0b381abd0484802828764d01552ff76ab Mon Sep 17 00:00:00 2001 From: Yue Hu Date: Thu, 6 Feb 2020 19:10:31 +0800 Subject: Documentation: zram: fix the description about orig_data_size of mm_stat orig_data_size counted the same_pages by commit 51f9f82c855d ("zram: count same page write as page_stored"), so let's fix it. Signed-off-by: Yue Hu Link: https://lore.kernel.org/r/20200206111031.9524-1-zbestahu@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/blockdev/zram.rst | 2 -- 1 file changed, 2 deletions(-) diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 27c77d853028..a6fd1f9b5faf 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -251,8 +251,6 @@ line of text and contains the following stats separated by whitespace: ================ ============================================================= orig_data_size uncompressed size of data stored in this disk. - This excludes same-element-filled pages (same_pages) since - no memory is allocated for them. Unit: bytes compr_data_size compressed size of data stored in this disk mem_used_total the amount of memory allocated for this disk. This -- cgit From 895f2c20a88a343d12c387dab9d785ff665cb4ac Mon Sep 17 00:00:00 2001 From: "d.hatayama@fujitsu.com" Date: Thu, 13 Feb 2020 02:51:49 +0000 Subject: docs: admin-guide: Add description of %c corename format There is somehow no description of %c corename format specifier for /proc/sys/kernel/core_pattern. The %c corename format specifier is used by user-space application such as systemd-coredump, so it should be documented. 
To find where %c is handled in the kernel source code, look at function format_corename() in fs/coredump.c. Signed-off-by: HATAYAMA Daisuke Link: https://lore.kernel.org/r/TYAPR01MB4014714BB2ACE425BB6EC6B7951A0@TYAPR01MB4014.jpnprd01.prod.outlook.com Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index def074807cee..b08ba4e63291 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -213,6 +213,7 @@ core_pattern is used to specify a core dumpfile pattern name. %h hostname %e executable filename (may be shortened) %E executable path + %c maximum size of core file by resource limit RLIMIT_CORE % both are dropped * If the first character of the pattern is a '|', the kernel will treat -- cgit From 3b82a112ce594889742164b242d8f213938a443f Mon Sep 17 00:00:00 2001 From: Wang Long Date: Fri, 7 Feb 2020 21:42:10 +0800 Subject: Documentation/ABI: move sysfs-kernel-uids to removed directory commit 7c9414385ebf ("sched: Remove USER_SCHED") deleted the USER_SCHED feature. so move the ABI doc to removed directory. 
Signed-off-by: Wang Long Link: https://lore.kernel.org/r/1581082930-30441-1-git-send-email-w@laoqinren.net Signed-off-by: Jonathan Corbet --- Documentation/ABI/removed/sysfs-kernel-uids | 14 ++++++++++++++ Documentation/ABI/testing/sysfs-kernel-uids | 14 -------------- 2 files changed, 14 insertions(+), 14 deletions(-) create mode 100644 Documentation/ABI/removed/sysfs-kernel-uids delete mode 100644 Documentation/ABI/testing/sysfs-kernel-uids diff --git a/Documentation/ABI/removed/sysfs-kernel-uids b/Documentation/ABI/removed/sysfs-kernel-uids new file mode 100644 index 000000000000..dc4463f190a7 --- /dev/null +++ b/Documentation/ABI/removed/sysfs-kernel-uids @@ -0,0 +1,14 @@ +What: /sys/kernel/uids//cpu_shares +Date: December 2007, finally removed in kernel v2.6.34-rc1 +Contact: Dhaval Giani + Srivatsa Vaddagiri +Description: + The /sys/kernel/uids//cpu_shares tunable is used + to set the cpu bandwidth a user is allowed. This is a + propotional value. What that means is that if there + are two users logged in, each with an equal number of + shares, then they will get equal CPU bandwidth. Another + example would be, if User A has shares = 1024 and user + B has shares = 2048, User B will get twice the CPU + bandwidth user A will. For more details refer + Documentation/scheduler/sched-design-CFS.rst diff --git a/Documentation/ABI/testing/sysfs-kernel-uids b/Documentation/ABI/testing/sysfs-kernel-uids deleted file mode 100644 index 4182b7061816..000000000000 --- a/Documentation/ABI/testing/sysfs-kernel-uids +++ /dev/null @@ -1,14 +0,0 @@ -What: /sys/kernel/uids//cpu_shares -Date: December 2007 -Contact: Dhaval Giani - Srivatsa Vaddagiri -Description: - The /sys/kernel/uids//cpu_shares tunable is used - to set the cpu bandwidth a user is allowed. This is a - propotional value. What that means is that if there - are two users logged in, each with an equal number of - shares, then they will get equal CPU bandwidth. 
Another - example would be, if User A has shares = 1024 and user - B has shares = 2048, User B will get twice the CPU - bandwidth user A will. For more details refer - Documentation/scheduler/sched-design-CFS.rst -- cgit From 473da2f0d80aa7240dd0a2be5015fdfd93543ca2 Mon Sep 17 00:00:00 2001 From: Alexandre Belloni Date: Sun, 9 Feb 2020 21:33:04 +0100 Subject: docs: userspace: ioctl-number: remove mc146818rtc conflict In 2.3.43pre2, the RTC ioctls definitions were actually moved from linux/mc146818rtc.h to linux/rtc.h Signed-off-by: Alexandre Belloni Link: https://lore.kernel.org/r/20200209203304.66004-1-alexandre.belloni@bootlin.com Signed-off-by: Jonathan Corbet --- Documentation/userspace-api/ioctl/ioctl-number.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 2e91370dc159..f759edafd938 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -266,7 +266,6 @@ Code Seq# Include File Comments 'o' 01-A1 `linux/dvb/*.h` DVB 'p' 00-0F linux/phantom.h conflict! (OpenHaptics needs this) 'p' 00-1F linux/rtc.h conflict! -'p' 00-3F linux/mc146818rtc.h conflict! 'p' 40-7F linux/nvram.h 'p' 80-9F linux/ppdev.h user-space parport -- cgit From 2e5b1886e9bab6c29c5e5c3ce4e373bb9e9eaa8b Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Sun, 9 Feb 2020 19:53:17 -0800 Subject: Documentation: bootconfig: fix Sphinx block warning Fix Sphinx format warning: lnx-56-rc1/Documentation/admin-guide/bootconfig.rst:26: WARNING: Literal block expected; none found. 
Signed-off-by: Randy Dunlap Cc: Steven Rostedt Acked-by: Masami Hiramatsu Link: https://lore.kernel.org/r/07b3e31f-9b1e-1876-aa60-4436e4dd6da0@infradead.org Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/bootconfig.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst index b342a6796392..e603ebb5bdda 100644 --- a/Documentation/admin-guide/bootconfig.rst +++ b/Documentation/admin-guide/bootconfig.rst @@ -23,7 +23,7 @@ of dot-connected-words, and key and value are connected by ``=``. The value has to be terminated by semi-colon (``;``) or newline (``\n``). For array value, array entries are separated by comma (``,``). :: -KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] + KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] Unlike the kernel command line syntax, spaces are OK around the comma and ``=``. -- cgit From 874ddbce487f077c46957e44e4115b3d82f62c92 Mon Sep 17 00:00:00 2001 From: Alexandre Ghiti Date: Wed, 19 Feb 2020 01:59:53 -0500 Subject: documentation: vm: Advertise support for pte_special in riscv Risc-V architecture has actually supported pte_special since its merge upstream, simply add this info to the documentation. 
Signed-off-by: Alexandre Ghiti Signed-off-by: Jonathan Corbet --- Documentation/features/vm/pte_special/arch-support.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/features/vm/pte_special/arch-support.txt b/Documentation/features/vm/pte_special/arch-support.txt index 2dc5df6a1cf5..3d492a34c8ee 100644 --- a/Documentation/features/vm/pte_special/arch-support.txt +++ b/Documentation/features/vm/pte_special/arch-support.txt @@ -23,7 +23,7 @@ | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | riscv: | TODO | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | -- cgit From 2d5dfb5911cb0eed0a9a91ea404ad963f18e5aaf Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Tue, 18 Feb 2020 17:38:25 +0100 Subject: docs: arm: tcm: Fix a few typos MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Jonathan Neuschäfer Signed-off-by: Jonathan Corbet --- Documentation/arm/tcm.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/arm/tcm.rst b/Documentation/arm/tcm.rst index effd9c7bc968..b256f9783883 100644 --- a/Documentation/arm/tcm.rst +++ b/Documentation/arm/tcm.rst @@ -4,18 +4,18 @@ ARM TCM (Tightly-Coupled Memory) handling in Linux Written by Linus Walleij -Some ARM SoC:s have a so-called TCM (Tightly-Coupled Memory). +Some ARM SoCs have a so-called TCM (Tightly-Coupled Memory). This is usually just a few (4-64) KiB of RAM inside the ARM processor. -Due to being embedded inside the CPU The TCM has a +Due to being embedded inside the CPU, the TCM has a Harvard-architecture, so there is an ITCM (instruction TCM) and a DTCM (data TCM). The DTCM can not contain any instructions, but the ITCM can actually contain data. The size of DTCM or ITCM is minimum 4KiB so the typical minimum configuration is 4KiB ITCM and 4KiB DTCM. 
-ARM CPU:s have special registers to read out status, physical +ARM CPUs have special registers to read out status, physical location and size of TCM memories. arch/arm/include/asm/cputype.h defines a CPUID_TCM register that you can read out from the system control coprocessor. Documentation from ARM can be found -- cgit From fb2511247dc4061fd122d0195838278a4a0b7b59 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Tue, 18 Feb 2020 16:02:19 +0100 Subject: docs: Fix path to MTD command line partition parser MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit cmdlinepart.c has been moved to drivers/mtd/parsers/. Fixes: a3f12a35c91d ("mtd: parsers: Move CMDLINE parser") Signed-off-by: Jonathan Neuschäfer Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/kernel-parameters.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index dbc22d684627..47cd55e339a5 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2791,7 +2791,7 @@ ,[,,,,] mtdparts= [MTD] - See drivers/mtd/cmdlinepart.c. 
+ See drivers/mtd/parsers/cmdlinepart.c multitce=off [PPC] This parameter disables the use of the pSeries firmware feature for updating multiple TCE entries -- cgit From a3cb66a508528e9082cba8303b4f31767e7743a2 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Tue, 18 Feb 2020 13:59:16 +0100 Subject: docs: pretty up sysctl/kernel.rst This updates sysctl/kernel.rst to use ReStructured Text more fully: * the list of files is now the table of contents (old entries with no corresponding sections are added as empty sections for now); * code references and commands are formatted as code, except for function names which end up linked to the appropriate documentation; * links are used to point to other documentation and other sections; * tables are used to make lists of values more readable (as already done for some sections); * in heavily-reworked paragraphs, sentences are wrapped individually, to make future diffs easier to read. The first mention of the kernel version is dropped. The second mention, saying that the document is accurate for 2.2, is preserved for now; I will update that once the document really is accurate for a current kernel release. Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 987 ++++++++++++++-------------- 1 file changed, 492 insertions(+), 495 deletions(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index b08ba4e63291..4872610cc491 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -2,263 +2,187 @@ Documentation for /proc/sys/kernel/ =================================== -kernel version 2.2.10 - Copyright (c) 1998, 1999, Rik van Riel Copyright (c) 2009, Shen Feng -For general info and legal blurb, please look in index.rst. +For general info and legal blurb, please look in :doc:`index`. 
------------------------------------------------------------------------------ This file contains documentation for the sysctl files in -/proc/sys/kernel/ and is valid for Linux kernel version 2.2. +``/proc/sys/kernel/`` and is valid for Linux kernel version 2.2. The files in this directory can be used to tune and monitor miscellaneous and general things in the operation of the Linux -kernel. Since some of the files _can_ be used to screw up your +kernel. Since some of the files *can* be used to screw up your system, it is advisable to read both documentation and source before actually making adjustments. Currently, these files might (depending on your configuration) -show up in /proc/sys/kernel: - -- acct -- acpi_video_flags -- auto_msgmni -- bootloader_type [ X86 only ] -- bootloader_version [ X86 only ] -- cap_last_cap -- core_pattern -- core_pipe_limit -- core_uses_pid -- ctrl-alt-del -- dmesg_restrict -- domainname -- hostname -- hotplug -- hardlockup_all_cpu_backtrace -- hardlockup_panic -- hung_task_panic -- hung_task_check_count -- hung_task_timeout_secs -- hung_task_check_interval_secs -- hung_task_warnings -- hyperv_record_panic_msg -- kexec_load_disabled -- kptr_restrict -- l2cr [ PPC only ] -- modprobe ==> Documentation/debugging-modules.txt -- modules_disabled -- msg_next_id [ sysv ipc ] -- msgmax -- msgmnb -- msgmni -- nmi_watchdog -- osrelease -- ostype -- overflowgid -- overflowuid -- panic -- panic_on_oops -- panic_on_stackoverflow -- panic_on_unrecovered_nmi -- panic_on_warn -- panic_print -- panic_on_rcu_stall -- perf_cpu_time_max_percent -- perf_event_paranoid -- perf_event_max_stack -- perf_event_mlock_kb -- perf_event_max_contexts_per_stack -- pid_max -- powersave-nap [ PPC only ] -- printk -- printk_delay -- printk_ratelimit -- printk_ratelimit_burst -- pty ==> Documentation/filesystems/devpts.txt -- randomize_va_space -- real-root-dev ==> Documentation/admin-guide/initrd.rst -- reboot-cmd [ SPARC only ] -- rtsig-max -- rtsig-nr -- 
sched_energy_aware -- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst -- sem -- sem_next_id [ sysv ipc ] -- sg-big-buff [ generic SCSI device (sg) ] -- shm_next_id [ sysv ipc ] -- shm_rmid_forced -- shmall -- shmmax [ sysv ipc ] -- shmmni -- softlockup_all_cpu_backtrace -- soft_watchdog -- stack_erasing -- stop-a [ SPARC only ] -- sysrq ==> Documentation/admin-guide/sysrq.rst -- sysctl_writes_strict -- tainted ==> Documentation/admin-guide/tainted-kernels.rst -- threads-max -- unknown_nmi_panic -- watchdog -- watchdog_thresh -- version - - -acct: -===== +show up in ``/proc/sys/kernel``: + +.. contents:: :local: + + +acct +==== + +:: -highwater lowwater frequency + highwater lowwater frequency If BSD-style process accounting is enabled these values control its behaviour. If free space on filesystem where the log lives -goes below <lowwater>% accounting suspends. If free space gets -above <highwater>% accounting resumes. <frequency> determines +goes below ``lowwater``% accounting suspends. If free space gets +above ``highwater``% accounting resumes. ``frequency`` determines how often do we check the amount of free space (value is in seconds). Default: -4 2 30 -That is, suspend accounting if there left <= 2% free; resume it -if we got >=4%; consider information about amount of free space -valid for 30 seconds. +:: -acpi_video_flags: -================= + 4 2 30 -flags +That is, suspend accounting if free space drops below 2%; resume it +if it increases to at least 4%; consider information about amount of +free space valid for 30 seconds. -See Doc*/kernel/power/video.txt, it allows mode of video boot to be -set during run time. +acpi_video_flags +================ + +See Documentation/kernel/power/video.txt, it allows mode of video boot +to be set during run time. -auto_msgmni: -============ + +auto_msgmni +=========== This variable has no effect and may be removed in future kernel releases. Reading it always returns 0.
-Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni -upon memory add/remove or upon ipc namespace creation/removal. +Up to Linux 3.17, it enabled/disabled automatic recomputing of +`msgmni`_ +upon memory add/remove or upon IPC namespace creation/removal. Echoing "1" into this file enabled msgmni automatic recomputing. -Echoing "0" turned it off. auto_msgmni default value was 1. - +Echoing "0" turned it off. The default value was 1. -bootloader_type: -================ -x86 bootloader identification +bootloader_type (x86 only) +========================== This gives the bootloader type number as indicated by the bootloader, shifted left by 4, and OR'd with the low four bits of the bootloader version. The reason for this encoding is that this used to match the -type_of_loader field in the kernel header; the encoding is kept for +``type_of_loader`` field in the kernel header; the encoding is kept for backwards compatibility. That is, if the full bootloader type number is 0x15 and the full version number is 0x234, this file will contain the value 340 = 0x154. -See the type_of_loader and ext_loader_type fields in -Documentation/x86/boot.rst for additional information. - +See the ``type_of_loader`` and ``ext_loader_type`` fields in +:doc:`/x86/boot` for additional information. -bootloader_version: -=================== -x86 bootloader version +bootloader_version (x86 only) +============================= The complete bootloader version number. In the example above, this file will contain the value 564 = 0x234. -See the type_of_loader and ext_loader_ver fields in -Documentation/x86/boot.rst for additional information. +See the ``type_of_loader`` and ``ext_loader_ver`` fields in +:doc:`/x86/boot` for additional information. -cap_last_cap: -============= +cap_last_cap +============ Highest valid capability of the running kernel. Exports -CAP_LAST_CAP from the kernel. +``CAP_LAST_CAP`` from the kernel. 
-core_pattern: -============= +core_pattern +============ -core_pattern is used to specify a core dumpfile pattern name. +``core_pattern`` is used to specify a core dumpfile pattern name. * max length 127 characters; default value is "core" -* core_pattern is used as a pattern template for the output filename; - certain string patterns (beginning with '%') are substituted with - their actual values. -* backward compatibility with core_uses_pid: +* ``core_pattern`` is used as a pattern template for the output + filename; certain string patterns (beginning with '%') are + substituted with their actual values. +* backward compatibility with ``core_uses_pid``: - If core_pattern does not include "%p" (default does not) - and core_uses_pid is set, then .PID will be appended to + If ``core_pattern`` does not include "%p" (default does not) + and ``core_uses_pid`` is set, then .PID will be appended to the filename. -* corename format specifiers:: - - % '%' is dropped - %% output one '%' - %p pid - %P global pid (init PID namespace) - %i tid - %I global tid (init PID namespace) - %u uid (in initial user namespace) - %g gid (in initial user namespace) - %d dump mode, matches PR_SET_DUMPABLE and - /proc/sys/fs/suid_dumpable - %s signal number - %t UNIX time of dump - %h hostname - %e executable filename (may be shortened) - %E executable path - %c maximum size of core file by resource limit RLIMIT_CORE - % both are dropped +* corename format specifiers + + ======== ========================================== + % '%' is dropped + %% output one '%' + %p pid + %P global pid (init PID namespace) + %i tid + %I global tid (init PID namespace) + %u uid (in initial user namespace) + %g gid (in initial user namespace) + %d dump mode, matches ``PR_SET_DUMPABLE`` and + ``/proc/sys/fs/suid_dumpable`` + %s signal number + %t UNIX time of dump + %h hostname + %e executable filename (may be shortened) + %E executable path + %c maximum size of core file by resource limit RLIMIT_CORE + % both 
are dropped + ======== ========================================== * If the first character of the pattern is a '|', the kernel will treat the rest of the pattern as a command to run. The core dump will be written to the standard input of that program instead of to a file. -core_pipe_limit: -================ +core_pipe_limit +=============== -This sysctl is only applicable when core_pattern is configured to pipe -core files to a user space helper (when the first character of -core_pattern is a '|', see above). When collecting cores via a pipe -to an application, it is occasionally useful for the collecting -application to gather data about the crashing process from its -/proc/pid directory. In order to do this safely, the kernel must wait -for the collecting process to exit, so as not to remove the crashing -processes proc files prematurely. This in turn creates the -possibility that a misbehaving userspace collecting process can block -the reaping of a crashed process simply by never exiting. This sysctl -defends against that. It defines how many concurrent crashing -processes may be piped to user space applications in parallel. If -this value is exceeded, then those crashing processes above that value -are noted via the kernel log and their cores are skipped. 0 is a -special value, indicating that unlimited processes may be captured in -parallel, but that no waiting will take place (i.e. the collecting -process is not guaranteed access to /proc//). This -value defaults to 0. - - -core_uses_pid: -============== +This sysctl is only applicable when `core_pattern`_ is configured to +pipe core files to a user space helper (when the first character of +``core_pattern`` is a '|', see above). +When collecting cores via a pipe to an application, it is occasionally +useful for the collecting application to gather data about the +crashing process from its ``/proc/pid`` directory. 
+In order to do this safely, the kernel must wait for the collecting +process to exit, so as not to remove the crashing processes proc files +prematurely. +This in turn creates the possibility that a misbehaving userspace +collecting process can block the reaping of a crashed process simply +by never exiting. +This sysctl defends against that. +It defines how many concurrent crashing processes may be piped to user +space applications in parallel. +If this value is exceeded, then those crashing processes above that +value are noted via the kernel log and their cores are skipped. +0 is a special value, indicating that unlimited processes may be +captured in parallel, but that no waiting will take place (i.e. the +collecting process is not guaranteed access to ``/proc//``). +This value defaults to 0. + + +core_uses_pid +============= The default coredump filename is "core". By setting -core_uses_pid to 1, the coredump filename becomes core.PID. -If core_pattern does not include "%p" (default does not) -and core_uses_pid is set, then .PID will be appended to +``core_uses_pid`` to 1, the coredump filename becomes core.PID. +If `core_pattern`_ does not include "%p" (default does not) +and ``core_uses_pid`` is set, then .PID will be appended to the filename. -ctrl-alt-del: -============= +ctrl-alt-del +============ When the value in this file is 0, ctrl-alt-del is trapped and -sent to the init(1) program to handle a graceful restart. +sent to the ``init(1)`` program to handle a graceful restart. When, however, the value is > 0, Linux's reaction to a Vulcan Nerve Pinch (tm) will be an immediate reboot, without even syncing its dirty buffers. @@ -270,21 +194,22 @@ Note: to decide what to do with it. -dmesg_restrict: -=============== +dmesg_restrict +============== This toggle indicates whether unprivileged users are prevented -from using dmesg(8) to view messages from the kernel's log buffer. -When dmesg_restrict is set to (0) there are no restrictions. 
When -dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use -dmesg(8). +from using ``dmesg(8)`` to view messages from the kernel's log +buffer. +When ``dmesg_restrict`` is set to 0 there are no restrictions. +When ``dmesg_restrict`` is set to 1, users must have +``CAP_SYSLOG`` to use ``dmesg(8)``. -The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the -default value of dmesg_restrict. +The kernel config option ``CONFIG_SECURITY_DMESG_RESTRICT`` sets the +default value of ``dmesg_restrict``. -domainname & hostname: -====================== +domainname & hostname +===================== These files can be used to set the NIS/YP domainname and the hostname of your box in exactly the same way as the commands @@ -303,167 +228,192 @@ hostname "darkstar" and DNS (Internet Domain Name Server) domainname "frop.org", not to be confused with the NIS (Network Information Service) or YP (Yellow Pages) domainname. These two domain names are in general different. For a detailed discussion -see the hostname(1) man page. +see the ``hostname(1)`` man page. -hardlockup_all_cpu_backtrace: -============================= +hardlockup_all_cpu_backtrace +============================ This value controls the hard lockup detector behavior when a hard lockup condition is detected as to whether or not to gather further debug information. If enabled, arch-specific all-CPU stack dumping will be initiated. -0: do nothing. This is the default behavior. - -1: on detection capture more debug information. += ============================================ +0 Do nothing. This is the default behavior. +1 On detection capture more debug information. += ============================================ -hardlockup_panic: -================= +hardlockup_panic +================ This parameter can be used to control whether the kernel panics when a hard lockup is detected.
- 0 - don't panic on hard lockup - 1 - panic on hard lockup += =========================== +0 Don't panic on hard lockup. +1 Panic on hard lockup. += =========================== -See Documentation/admin-guide/lockup-watchdogs.rst for more information. This can -also be set using the nmi_watchdog kernel parameter. +See :doc:`/admin-guide/lockup-watchdogs` for more information. +This can also be set using the nmi_watchdog kernel parameter. -hotplug: -======== +hotplug +======= Path for the hotplug policy agent. -Default value is "/sbin/hotplug". +Default value is "``/sbin/hotplug``". -hung_task_panic: -================ +hung_task_panic +=============== Controls the kernel's behavior when a hung task is detected. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -0: continue operation. This is the default behavior. += ================================================= +0 Continue operation. This is the default behavior. +1 Panic immediately. += ================================================= -1: panic immediately. - -hung_task_check_count: -====================== +hung_task_check_count +===================== The upper bound on the number of tasks that are checked. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -hung_task_timeout_secs: -======================= +hung_task_timeout_secs +====================== When a task in D state did not get scheduled for more than this value report a warning. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -0: means infinite timeout - no checking done. +0 means infinite timeout, no checking is done. -Possible values to set are in range {0..LONG_MAX/HZ}. +Possible values to set are in range {0:``LONG_MAX``/``HZ``}. 
-hung_task_check_interval_secs: -============================== +hung_task_check_interval_secs +============================= Hung task check interval. If hung task checking is enabled -(see hung_task_timeout_secs), the check is done every -hung_task_check_interval_secs seconds. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +(see `hung_task_timeout_secs`_), the check is done every +``hung_task_check_interval_secs`` seconds. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -0 (default): means use hung_task_timeout_secs as checking interval. -Possible values to set are in range {0..LONG_MAX/HZ}. +0 (default) means use ``hung_task_timeout_secs`` as checking +interval. +Possible values to set are in range {0:``LONG_MAX``/``HZ``}. -hung_task_warnings: -=================== + +hung_task_warnings +================== The maximum number of warnings to report. During a check interval if a hung task is detected, this value is decreased by 1. When this value reaches 0, no more warnings will be reported. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -1: report an infinite number of warnings. -hyperv_record_panic_msg: -======================== +hyperv_record_panic_msg +======================= Controls whether the panic kmsg data should be reported to Hyper-V. -0: do not report panic kmsg data. += ========================================================= +0 Do not report panic kmsg data. +1 Report the panic kmsg data. This is the default behavior. += ========================================================= -1: report the panic kmsg data. This is the default behavior. +kexec_load_disabled +=================== -kexec_load_disabled: -==================== - -A toggle indicating if the kexec_load syscall has been disabled. This -value defaults to 0 (false: kexec_load enabled), but can be set to 1 -(true: kexec_load disabled). 
Once true, kexec can no longer be used, and -the toggle cannot be set back to false. This allows a kexec image to be -loaded before disabling the syscall, allowing a system to set up (and -later use) an image without it being altered. Generally used together -with the "modules_disabled" sysctl. +A toggle indicating if the ``kexec_load`` syscall has been disabled. +This value defaults to 0 (false: ``kexec_load`` enabled), but can be +set to 1 (true: ``kexec_load`` disabled). +Once true, kexec can no longer be used, and the toggle cannot be set +back to false. +This allows a kexec image to be loaded before disabling the syscall, +allowing a system to set up (and later use) an image without it being +altered. +Generally used together with the `modules_disabled`_ sysctl. -kptr_restrict: -============== +kptr_restrict +============= This toggle indicates whether restrictions are placed on -exposing kernel addresses via /proc and other interfaces. - -When kptr_restrict is set to 0 (the default) the address is hashed before -printing. (This is the equivalent to %p.) +exposing kernel addresses via ``/proc`` and other interfaces. + +When ``kptr_restrict`` is set to 0 (the default) the address is hashed +before printing. +(This is the equivalent to %p.) + +When ``kptr_restrict`` is set to 1, kernel pointers printed using the +%pK format specifier will be replaced with 0s unless the user has +``CAP_SYSLOG`` and effective user and group ids are equal to the real +ids. +This is because %pK checks are done at read() time rather than open() +time, so if permissions are elevated between the open() and the read() +(e.g via a setuid binary) then %pK will not leak kernel pointers to +unprivileged users. +Note, this is a temporary solution only. +The correct long-term solution is to do the permission checks at +open() time. 
+Consider removing world read permissions from files that use %pK, and +using `dmesg_restrict`_ to protect against uses of %pK in ``dmesg(8)`` +if leaking kernel pointer values to unprivileged users is a concern. + +When ``kptr_restrict`` is set to 2, kernel pointers printed using +%pK will be replaced with 0s regardless of privileges. + + +l2cr (PPC only) +=============== -When kptr_restrict is set to (1), kernel pointers printed using the %pK -format specifier will be replaced with 0's unless the user has CAP_SYSLOG -and effective user and group ids are equal to the real ids. This is -because %pK checks are done at read() time rather than open() time, so -if permissions are elevated between the open() and the read() (e.g via -a setuid binary) then %pK will not leak kernel pointers to unprivileged -users. Note, this is a temporary solution only. The correct long-term -solution is to do the permission checks at open() time. Consider removing -world read permissions from files that use %pK, and using dmesg_restrict -to protect against uses of %pK in dmesg(8) if leaking kernel pointer -values to unprivileged users is a concern. +This flag controls the L2 cache of G3 processor boards. If +0, the cache is disabled. Enabled if nonzero. -When kptr_restrict is set to (2), kernel pointers printed using -%pK will be replaced with 0's regardless of privileges. +modprobe +======== -l2cr: (PPC only) -================ - -This flag controls the L2 cache of G3 processor boards. If -0, the cache is disabled. Enabled if nonzero. +See Documentation/debugging-modules.txt. -modules_disabled: -================= +modules_disabled +================ A toggle value indicating if modules are allowed to be loaded in an otherwise modular kernel. This toggle defaults to off (0), but can be set true (1). Once true, modules can be neither loaded nor unloaded, and the toggle cannot be set back -to false. Generally used with the "kexec_load_disabled" toggle. +to false. 
Generally used with the `kexec_load_disabled`_ toggle. + +.. _msgmni: -msg_next_id, sem_next_id, and shm_next_id: -========================================== +msgmax, msgmnb, and msgmni +========================== + + +msg_next_id, sem_next_id, and shm_next_id (System V IPC) +======================================================== These three toggles allows to specify desired id for next allocated IPC object: message, semaphore or shared memory respectively. By default they are equal to -1, which means generic allocation logic. -Possible values to set are in range {0..INT_MAX}. +Possible values to set are in range {0:``INT_MAX``}. Notes: 1) kernel doesn't guarantee, that new object will have desired id. So, @@ -473,15 +423,16 @@ Notes: fails, it is undefined if the value remains unmodified or is reset to -1. -nmi_watchdog: -============= +nmi_watchdog +============ This parameter can be used to control the NMI watchdog (i.e. the hard lockup detector) on x86 systems. -0 - disable the hard lockup detector - -1 - enable the hard lockup detector += ================================= +0 Disable the hard lockup detector. +1 Enable the hard lockup detector. += ================================= The hard lockup detector monitors each CPU for its ability to respond to timer interrupts. The mechanism utilizes CPU performance counter registers @@ -493,11 +444,11 @@ in a KVM virtual machine. This default can be overridden by adding:: nmi_watchdog=1 -to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). +to the guest kernel command line (see :doc:`/admin-guide/kernel-parameters`). -numa_balancing: -=============== +numa_balancing +============== Enables/disables automatic page fault based NUMA memory balancing. Memory is moved automatically to nodes @@ -515,9 +466,10 @@ ideally is offset by improved memory locality but there is no universal guarantee. If the target workload is already bound to NUMA nodes then this feature should be disabled. 
Otherwise, if the system overhead from the feature is too high then the rate the kernel samples for NUMA hinting -faults may be controlled by the numa_balancing_scan_period_min_ms, +faults may be controlled by the `numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, -numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. +numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls. + numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb =============================================================================================================================== @@ -543,23 +495,23 @@ workload pattern changes and minimises performance impact due to remote memory accesses. These sysctls control the thresholds for scan delays and the number of pages scanned. -numa_balancing_scan_period_min_ms is the minimum time in milliseconds to +``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to scan a tasks virtual memory. It effectively controls the maximum scanning rate for each task. -numa_balancing_scan_delay_ms is the starting "scan delay" used for a task +``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task when it initially forks. -numa_balancing_scan_period_max_ms is the maximum time in milliseconds to +``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to scan a tasks virtual memory. It effectively controls the minimum scanning rate for each task. -numa_balancing_scan_size_mb is how many megabytes worth of pages are +``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are scanned for a given scan. 
-osrelease, ostype & version: -============================ +osrelease, ostype & version +=========================== :: @@ -570,15 +522,16 @@ osrelease, ostype & version: # cat version #5 Wed Feb 25 21:49:24 MET 1998 -The files osrelease and ostype should be clear enough. Version +The files ``osrelease`` and ``ostype`` should be clear enough. +``version`` needs a little more clarification however. The '#5' means that this is the fifth kernel built from this source base and the date behind it indicates the time the kernel was built. The only way to tune these values is to rebuild the kernel :-) -overflowgid & overflowuid: -========================== +overflowgid & overflowuid +========================= if your architecture did not always support 32-bit UIDs (i.e. arm, i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to @@ -589,108 +542,113 @@ These sysctls allow you to change the value of the fixed UID and GID. The default is 65534. -panic: -====== +panic +===== The value in this file represents the number of seconds the kernel waits before rebooting on a panic. When you use the software watchdog, the recommended setting is 60. -panic_on_io_nmi: -================ +panic_on_io_nmi +=============== Controls the kernel's behavior when a CPU receives an NMI caused by an IO error. -0: try to continue operation (default) - -1: panic immediately. The IO error triggered an NMI. This indicates a - serious system condition which could result in IO data corruption. - Rather than continuing, panicking might be a better choice. Some - servers issue this sort of NMI when the dump button is pushed, - and you can use this option to take a crash dump. += ================================================================== +0 Try to continue operation (default). +1 Panic immediately. The IO error triggered an NMI. This indicates a + serious system condition which could result in IO data corruption. + Rather than continuing, panicking might be a better choice. 
Some + servers issue this sort of NMI when the dump button is pushed, + and you can use this option to take a crash dump. += ================================================================== -panic_on_oops: -============== +panic_on_oops +============= Controls the kernel's behaviour when an oops or BUG is encountered. -0: try to continue operation - -1: panic immediately. If the `panic` sysctl is also non-zero then the - machine will be rebooted. += =================================================================== +0 Try to continue operation. +1 Panic immediately. If the `panic` sysctl is also non-zero then the + machine will be rebooted. += =================================================================== -panic_on_stackoverflow: -======================= +panic_on_stackoverflow +====================== Controls the kernel's behavior when detecting the overflows of kernel, IRQ and exception stacks except a user stack. -This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. - -0: try to continue operation. +This file shows up if ``CONFIG_DEBUG_STACKOVERFLOW`` is enabled. -1: panic immediately. += ========================== +0 Try to continue operation. +1 Panic immediately. += ========================== -panic_on_unrecovered_nmi: -========================= +panic_on_unrecovered_nmi +======================== The default Linux behaviour on an NMI of either memory or unknown is to continue operation. For many environments such as scientific computing it is preferable that the box is taken out and the error dealt with than an uncorrected parity/ECC error get propagated. -A small number of systems do generate NMI's for bizarre random reasons +A small number of systems do generate NMIs for bizarre random reasons such as power management so the default is off. That sysctl works like the existing panic controls already in that directory. -panic_on_warn: -============== +panic_on_warn +============= Calls panic() in the WARN() path when set to 1. 
This is useful to avoid a kernel rebuild when attempting to kdump at the location of a WARN(). -0: only WARN(), default behaviour. - -1: call panic() after printing out WARN() location. += ================================================ +0 Only WARN(), default behaviour. +1 Call panic() after printing out WARN() location. += ================================================ -panic_print: -============ +panic_print +=========== Bitmask for printing system info when panic happens. User can chose combination of the following bits: -===== ======================================== +===== ============================================ bit 0 print all tasks info bit 1 print system memory info bit 2 print timer info -bit 3 print locks info if CONFIG_LOCKDEP is on +bit 3 print locks info if ``CONFIG_LOCKDEP`` is on bit 4 print ftrace buffer -===== ======================================== +===== ============================================ So for example to print tasks and memory info on panic, user can:: echo 3 > /proc/sys/kernel/panic_print -panic_on_rcu_stall: -=================== +panic_on_rcu_stall +================== When set to 1, calls panic() after RCU stall detection messages. This is useful to define the root cause of RCU stalls using a vmcore. -0: do not panic() when RCU stall takes place, default behavior. - -1: panic() after printing RCU stall messages. += ============================================================ +0 Do not panic() when RCU stall takes place, default behavior. +1 panic() after printing RCU stall messages. += ============================================================ -perf_cpu_time_max_percent: -========================== +perf_cpu_time_max_percent +========================= Hints to the kernel how much CPU time it should be allowed to use to handle perf sampling events. 
If the perf subsystem @@ -703,171 +661,179 @@ unexpectedly take too long to execute, the NMIs can become stacked up next to each other so much that nothing else is allowed to execute. -0: - disable the mechanism. Do not monitor or correct perf's - sampling rate no matter how CPU time it takes. +===== ======================================================== +0 Disable the mechanism. Do not monitor or correct perf's + sampling rate no matter how much CPU time it takes. -1-100: - attempt to throttle perf's sample rate to this - percentage of CPU. Note: the kernel calculates an - "expected" length of each sample event. 100 here means - 100% of that expected length. Even if this is set to - 100, you may still see sample throttling if this - length is exceeded. Set to 0 if you truly do not care - how much CPU is consumed. +1-100 Attempt to throttle perf's sample rate to this + percentage of CPU. Note: the kernel calculates an + "expected" length of each sample event. 100 here means + 100% of that expected length. Even if this is set to + 100, you may still see sample throttling if this + length is exceeded. Set to 0 if you truly do not care + how much CPU is consumed. +===== ======================================================== -perf_event_paranoid: -==================== +perf_event_paranoid +=================== Controls use of the performance events system by unprivileged users (without CAP_SYS_ADMIN). The default value is 2. === ================================================================== - -1 Allow use of (almost) all events by all users + -1 Allow use of (almost) all events by all users. - Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK + Ignore mlock limit after perf_event_mlock_kb without + ``CAP_IPC_LOCK``. ->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN +>=0 Disallow ftrace function tracepoint by users without + ``CAP_SYS_ADMIN``.
- Disallow raw tracepoint access by users without CAP_SYS_ADMIN + Disallow raw tracepoint access by users without ``CAP_SYS_ADMIN``. ->=1 Disallow CPU event access by users without CAP_SYS_ADMIN +>=1 Disallow CPU event access by users without ``CAP_SYS_ADMIN``. ->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN +>=2 Disallow kernel profiling by users without ``CAP_SYS_ADMIN``. === ================================================================== -perf_event_max_stack: -===================== +perf_event_max_stack +==================== -Controls maximum number of stack frames to copy for (attr.sample_type & -PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using -'perf record -g' or 'perf trace --call-graph fp'. +Controls maximum number of stack frames to copy for (``attr.sample_type & +PERF_SAMPLE_CALLCHAIN``) configured events, for instance, when using +'``perf record -g``' or '``perf trace --call-graph fp``'. This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. +enabled, otherwise writing to this file will return ``-EBUSY``. The default value is 127. -perf_event_mlock_kb: -==================== +perf_event_mlock_kb +=================== Control size of per-cpu ring buffer not counted agains mlock limit. The default value is 512 + 1 page -perf_event_max_contexts_per_stack: -================================== +perf_event_max_contexts_per_stack +================================= Controls maximum number of stack frame context entries for -(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for -instance, when using 'perf record -g' or 'perf trace --call-graph fp'. +(``attr.sample_type & PERF_SAMPLE_CALLCHAIN``) configured events, for +instance, when using '``perf record -g``' or '``perf trace --call-graph fp``'. This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. 
+enabled, otherwise writing to this file will return ``-EBUSY``. The default value is 8. -pid_max: -======== +pid_max +======= PID allocation wrap value. When the kernel's next PID value reaches this value, it wraps back to a minimum PID value. -PIDs of value pid_max or larger are not allocated. +PIDs of value ``pid_max`` or larger are not allocated. -ns_last_pid: -============ +ns_last_pid +=========== The last pid allocated in the current (the one task using this sysctl lives in) pid namespace. When selecting a pid for a next task on fork kernel tries to allocate a number starting from this one. -powersave-nap: (PPC only) -========================= +powersave-nap (PPC only) +======================== If set, Linux-PPC will use the 'nap' mode of powersaving, otherwise the 'doze' mode will be used. + ============================================================== -printk: -======= +printk +====== -The four values in printk denote: console_loglevel, -default_message_loglevel, minimum_console_loglevel and -default_console_loglevel respectively. +The four values in printk denote: ``console_loglevel``, +``default_message_loglevel``, ``minimum_console_loglevel`` and +``default_console_loglevel`` respectively. These values influence printk() behavior when printing or -logging error messages. See 'man 2 syslog' for more info on +logging error messages. See '``man 2 syslog``' for more info on the different loglevels. 
-- console_loglevel: - messages with a higher priority than - this will be printed to the console -- default_message_loglevel: - messages without an explicit priority - will be printed with this priority -- minimum_console_loglevel: - minimum (highest) value to which - console_loglevel can be set -- default_console_loglevel: - default value for console_loglevel +======================== ===================================== +console_loglevel messages with a higher priority than + this will be printed to the console +default_message_loglevel messages without an explicit priority + will be printed with this priority +minimum_console_loglevel minimum (highest) value to which + console_loglevel can be set +default_console_loglevel default value for console_loglevel +======================== ===================================== -printk_delay: -============= +printk_delay +============ -Delay each printk message in printk_delay milliseconds +Delay each printk message in ``printk_delay`` milliseconds Value from 0 - 10000 is allowed. -printk_ratelimit: -================= +printk_ratelimit +================ -Some warning messages are rate limited. printk_ratelimit specifies +Some warning messages are rate limited. ``printk_ratelimit`` specifies the minimum length of time between these messages (in seconds). The default value is 5 seconds. A value of 0 will disable rate limiting. -printk_ratelimit_burst: -======================= +printk_ratelimit_burst +====================== -While long term we enforce one message per printk_ratelimit +While long term we enforce one message per `printk_ratelimit`_ seconds, we do allow a burst of messages to pass through. -printk_ratelimit_burst specifies the number of messages we can +``printk_ratelimit_burst`` specifies the number of messages we can send before ratelimiting kicks in. The default value is 10 messages. 
-printk_devkmsg: -=============== - -Control the logging to /dev/kmsg from userspace: - -ratelimit: - default, ratelimited +printk_devkmsg +============== -on: unlimited logging to /dev/kmsg from userspace +Control the logging to ``/dev/kmsg`` from userspace: -off: logging to /dev/kmsg disabled +========= ============================================= +ratelimit default, ratelimited +on unlimited logging to /dev/kmsg from userspace +off logging to /dev/kmsg disabled +========= ============================================= -The kernel command line parameter printk.devkmsg= overrides this and is +The kernel command line parameter ``printk.devkmsg=`` overrides this and is a one-time setting until next reboot: once set, it cannot be changed by this sysctl interface anymore. +============================================================== -randomize_va_space: -=================== + +pty +=== + +See Documentation/filesystems/devpts.txt. + + +randomize_va_space +================== This option can be used to select the type of process address space randomization that is used in the system, for architectures @@ -882,10 +848,10 @@ that support this feature. This, among other things, implies that shared libraries will be loaded to random addresses. Also for PIE-linked binaries, the location of code start is randomized. This is the default if the - CONFIG_COMPAT_BRK option is enabled. + ``CONFIG_COMPAT_BRK`` option is enabled. 2 Additionally enable heap randomization. This is the default if - CONFIG_COMPAT_BRK is disabled. + ``CONFIG_COMPAT_BRK`` is disabled. There are a few legacy applications out there (such as some ancient versions of libc.so.5 from 1996) that assume that brk area starts @@ -895,21 +861,27 @@ that support this feature. systems it is safe to choose full randomization. 
Systems with ancient and/or broken binaries should be configured - with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + with ``CONFIG_COMPAT_BRK`` enabled, which excludes the heap from process address space randomization. == =========================================================================== -reboot-cmd: (Sparc only) -======================== +real-root-dev +============= + +See :doc:`/admin-guide/initrd`. + + +reboot-cmd (SPARC only) +======================= ??? This seems to be a way to give an argument to the Sparc ROM/Flash boot loader. Maybe to tell it what to do after rebooting. ??? -rtsig-max & rtsig-nr: -===================== +rtsig-max & rtsig-nr +==================== The file rtsig-max can be used to tune the maximum number of POSIX realtime (queued) signals that can be outstanding @@ -918,8 +890,8 @@ in the system. rtsig-nr shows the number of RT signals currently queued. -sched_energy_aware: -=================== +sched_energy_aware +================== Enables/disables Energy Aware Scheduling (EAS). EAS starts automatically on platforms where it can run (that is, @@ -929,75 +901,85 @@ requirements for EAS but you do not want to use it, change this value to 0. -sched_schedstats: -================= +sched_schedstats +================ Enables/disables scheduler statistics. Enabling this feature incurs a small amount of overhead in the scheduler but is useful for debugging and performance tuning. -sg-big-buff: -============ +seccomp +======= + +See :doc:`/userspace-api/seccomp_filter`. + + +sg-big-buff +=========== This file shows the size of the generic SCSI (sg) buffer. You can't tune it just yet, but you could change it on -compile time by editing include/scsi/sg.h and changing -the value of SG_BIG_BUFF. +compile time by editing ``include/scsi/sg.h`` and changing +the value of ``SG_BIG_BUFF``. There shouldn't be any reason to change this value. 
If you can come up with one, you probably know what you are doing anyway :) -shmall: -======= +shmall +====== This parameter sets the total amount of shared memory pages that -can be used system wide. Hence, SHMALL should always be at least -ceil(shmmax/PAGE_SIZE). +can be used system wide. Hence, ``shmall`` should always be at least +``ceil(shmmax/PAGE_SIZE)``. -If you are not sure what the default PAGE_SIZE is on your Linux -system, you can run the following command: +If you are not sure what the default ``PAGE_SIZE`` is on your Linux +system, you can run the following command:: # getconf PAGE_SIZE -shmmax: -======= +shmmax +====== This value can be used to query and set the run time limit on the maximum shared memory segment size that can be created. Shared memory segments up to 1Gb are now supported in the -kernel. This value defaults to SHMMAX. +kernel. This value defaults to ``SHMMAX``. -shm_rmid_forced: -================ +shmmni +====== + + +shm_rmid_forced +=============== Linux lets you set resource limits, including how much memory one -process can consume, via setrlimit(2). Unfortunately, shared memory +process can consume, via ``setrlimit(2)``. Unfortunately, shared memory segments are allowed to exist without association with any process, and thus might not be counted against any resource limits. If enabled, shared memory segments are automatically destroyed when their attach count becomes zero after a detach or a process termination. It will also destroy segments that were created, but never attached to, on exit -from the process. The only use left for IPC_RMID is to immediately +from the process. The only use left for ``IPC_RMID`` is to immediately destroy an unattached segment. Of course, this breaks the way things are defined, so some applications might stop working. Note that this feature will do you no good unless you also configure your resource -limits (in particular, RLIMIT_AS and RLIMIT_NPROC). 
Most systems don't +limits (in particular, ``RLIMIT_AS`` and ``RLIMIT_NPROC``). Most systems don't need this. Note that if you change this from 0 to 1, already created segments without users and with a dead originative process will be destroyed. -sysctl_writes_strict: -===================== +sysctl_writes_strict +==================== Control how file position affects the behavior of updating sysctl values -via the /proc/sys interface: +via the ``/proc/sys`` interface: == ====================================================================== -1 Legacy per-write sysctl value handling, with no printk warnings. @@ -1014,8 +996,8 @@ via the /proc/sys interface: == ====================================================================== -softlockup_all_cpu_backtrace: -============================= +softlockup_all_cpu_backtrace +============================ This value controls the soft lockup detector thread's behavior when a soft lockup condition is detected as to whether or not @@ -1025,43 +1007,56 @@ be issued an NMI and instructed to capture stack trace. This feature is only applicable for architectures which support NMI. -0: do nothing. This is the default behavior. - -1: on detection capture more debug information. += ============================================ +0 Do nothing. This is the default behavior. +1 On detection capture more debug information. += ============================================ -soft_watchdog: -============== +soft_watchdog +============= This parameter can be used to control the soft lockup detector. - 0 - disable the soft lockup detector - - 1 - enable the soft lockup detector += ================================= +0 Disable the soft lockup detector. +1 Enable the soft lockup detector. += ================================= The soft lockup detector monitors CPUs for threads that are hogging the CPUs without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads from running. 
The mechanism depends on the CPUs ability to respond to timer interrupts which are needed for the 'watchdog/N' threads to be woken up by -the watchdog timer function, otherwise the NMI watchdog - if enabled - can +the watchdog timer function, otherwise the NMI watchdog — if enabled — can detect a hard lockup condition. -stack_erasing: -============== +stack_erasing +============= This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. +of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. That erasing reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary. - 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. += ==================================================================== +0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +1 Kernel stack erasing is enabled (default), it is performed before + returning to the userspace at the end of syscalls. += ==================================================================== + + +stop-a (SPARC only) +=================== - 1: kernel stack erasing is enabled (default), it is performed before - returning to the userspace at the end of syscalls. + +sysrq +===== + +See :doc:`/admin-guide/sysrq`. tainted @@ -1091,30 +1086,30 @@ ORed together. The letters are seen in "Tainted" line of Oops reports. 131072 `(T)` The kernel was built with the struct randomization plugin ====== ===== ============================================================== -See Documentation/admin-guide/tainted-kernels.rst for more information. +See :doc:`/admin-guide/tainted-kernels` for more information. 
-threads-max: -============ +threads-max +=========== This value controls the maximum number of threads that can be created -using fork(). +using ``fork()``. During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages. -The minimum value that can be written to threads-max is 1. +The minimum value that can be written to ``threads-max`` is 1. -The maximum value that can be written to threads-max is given by the -constant FUTEX_TID_MASK (0x3fffffff). +The maximum value that can be written to ``threads-max`` is given by the +constant ``FUTEX_TID_MASK`` (0x3fffffff). -If a value outside of this range is written to threads-max an error -EINVAL occurs. +If a value outside of this range is written to ``threads-max`` an +``EINVAL`` error occurs. -unknown_nmi_panic: -================== +unknown_nmi_panic +================= The value in this file affects behavior of handling NMI. When the value is non-zero, unknown NMI is trapped and then panic occurs. At @@ -1124,37 +1119,39 @@ NMI switch that most IA32 servers have fires unknown NMI up, for example. If a system hangs up, try pressing the NMI switch. -watchdog: -========= +watchdog +======== This parameter can be used to disable or enable the soft lockup detector -_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. +*and* the NMI watchdog (i.e. the hard lockup detector) at the same time. - 0 - disable both lockup detectors - - 1 - enable both lockup detectors += ============================== +0 Disable both lockup detectors. +1 Enable both lockup detectors. += ============================== The soft lockup detector and the NMI watchdog can also be disabled or -enabled individually, using the soft_watchdog and nmi_watchdog parameters. -If the watchdog parameter is read, for example by executing:: +enabled individually, using the ``soft_watchdog`` and ``nmi_watchdog`` +parameters. 
+If the ``watchdog`` parameter is read, for example by executing:: cat /proc/sys/kernel/watchdog -the output of this command (0 or 1) shows the logical OR of soft_watchdog -and nmi_watchdog. +the output of this command (0 or 1) shows the logical OR of +``soft_watchdog`` and ``nmi_watchdog``. -watchdog_cpumask: -================= +watchdog_cpumask +================ This value can be used to control on which cpus the watchdog may run. -The default cpumask is all possible cores, but if NO_HZ_FULL is +The default cpumask is all possible cores, but if ``NO_HZ_FULL`` is enabled in the kernel config, and cores are specified with the -nohz_full= boot argument, those cores are excluded by default. +``nohz_full=`` boot argument, those cores are excluded by default. Offline cores can be included in this mask, and if the core is later brought online, the watchdog will be started based on the mask value. -Typically this value would only be touched in the nohz_full case +Typically this value would only be touched in the ``nohz_full`` case to re-enable cores that by default were not running the watchdog, if a kernel lockup was suspected on those cores. @@ -1165,12 +1162,12 @@ might say:: echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask -watchdog_thresh: -================ +watchdog_thresh +=============== This value can be used to control the frequency of hrtimer and NMI events and the soft and hard lockup thresholds. The default threshold is 10 seconds. -The softlockup threshold is (2 * watchdog_thresh). Setting this +The softlockup threshold is (``2 * watchdog_thresh``). Setting this tunable to zero will disable lockup detection altogether. -- cgit From 0317c5371e6a9b71a2e25b47013dd5c62d55d1a6 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Tue, 18 Feb 2020 13:59:17 +0100 Subject: docs: merge debugging-modules.txt into sysctl/kernel.rst This fits nicely in sysctl/kernel.rst, merge it (and rephrase it) instead of linking to it. 
Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 14 +++++++++++++- Documentation/debugging-modules.txt | 22 ---------------------- 2 files changed, 13 insertions(+), 23 deletions(-) delete mode 100644 Documentation/debugging-modules.txt diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 4872610cc491..bb56ff25d947 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -387,7 +387,19 @@ This flag controls the L2 cache of G3 processor boards. If modprobe ======== -See Documentation/debugging-modules.txt. +This gives the full path of the modprobe command which the kernel will +use to load modules. This can be used to debug module loading +requests:: + + echo '#! /bin/sh' > /tmp/modprobe + echo 'echo "$@" >> /tmp/modprobe.log' >> /tmp/modprobe + echo 'exec /sbin/modprobe "$@"' >> /tmp/modprobe + chmod a+x /tmp/modprobe + echo /tmp/modprobe > /proc/sys/kernel/modprobe + +This only applies when the *kernel* is requesting that the module be +loaded; it won't have any effect if the module is being loaded +explicitly using ``modprobe`` from userspace. modules_disabled diff --git a/Documentation/debugging-modules.txt b/Documentation/debugging-modules.txt deleted file mode 100644 index 172ad4aec493..000000000000 --- a/Documentation/debugging-modules.txt +++ /dev/null @@ -1,22 +0,0 @@ -Debugging Modules after 2.6.3 ------------------------------ - -In almost all distributions, the kernel asks for modules which don't -exist, such as "net-pf-10" or whatever. Changing "modprobe -q" to -"succeed" in this case is hacky and breaks some setups, and also we -want to know if it failed for the fallback code for old aliases in -fs/char_dev.c, for example. - -In the past a debugging message which would fill people's logs was -emitted. This debugging message has been removed. 
The correct way -of debugging module problems is something like this: - -echo '#! /bin/sh' > /tmp/modprobe -echo 'echo "$@" >> /tmp/modprobe.log' >> /tmp/modprobe -echo 'exec /sbin/modprobe "$@"' >> /tmp/modprobe -chmod a+x /tmp/modprobe -echo /tmp/modprobe > /proc/sys/kernel/modprobe - -Note that the above applies only when the *kernel* is requesting -that the module be loaded -- it won't have any effect if that module -is being loaded explicitly using "modprobe" from userspace. -- cgit From a474105bb6a6fe85ea30d7fe0a087184da32c751 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Tue, 18 Feb 2020 13:59:18 +0100 Subject: docs: drop l2cr from sysctl/kernel.rst The l2cr sysctl entry was removed in commit c2f3dabefa73 ("sysctl: kill binary sysctl KERN_PPC_L2CR"), this removes the corresponding documentation. Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 7 ------- 1 file changed, 7 deletions(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index bb56ff25d947..99569a26f93e 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -377,13 +377,6 @@ When ``kptr_restrict`` is set to 2, kernel pointers printed using %pK will be replaced with 0s regardless of privileges. -l2cr (PPC only) -=============== - -This flag controls the L2 cache of G3 processor boards. If -0, the cache is disabled. Enabled if nonzero. - - modprobe ======== -- cgit From fa5b526411bb5afe7736ce14bab18c0b68db4251 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Tue, 18 Feb 2020 13:59:19 +0100 Subject: docs: add missing IPC documentation in sysctl/kernel.rst This adds short descriptions of msgmax, msgmnb, msgmni, and shmmni, which were previously listed in kernel.rst but not described. 
Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 99569a26f93e..0ae52156db75 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -410,6 +410,15 @@ to false. Generally used with the `kexec_load_disabled`_ toggle. msgmax, msgmnb, and msgmni ========================== +``msgmax`` is the maximum size of an IPC message, in bytes. 8192 by +default (``MSGMAX``). + +``msgmnb`` is the maximum size of an IPC queue, in bytes. 16384 by +default (``MSGMNB``). + +``msgmni`` is the maximum number of IPC queues. 32000 by default +(``MSGMNI``). + msg_next_id, sem_next_id, and shm_next_id (System V IPC) ======================================================== @@ -958,6 +967,9 @@ kernel. This value defaults to ``SHMMAX``. shmmni ====== +This value determines the maximum number of shared memory segments. +4096 by default (``SHMMNI``). + shm_rmid_forced =============== -- cgit From a1ad4f15054b58636aa58f0df2961259f8781746 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Tue, 18 Feb 2020 13:59:20 +0100 Subject: docs: document stop-a in sysctl/kernel.rst This describes the SPARC-specific stop-a sysctl entry, which was previously listed in kernel.rst but not documented. Based on the implementation in arch/sparc/kernel/setup_{32,64}.c and kernel/panic.c.
Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 0ae52156db75..3cbbe4502e18 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -1069,6 +1069,16 @@ compilation sees a 1% slowdown, other systems and workloads may vary. stop-a (SPARC only) =================== +Controls Stop-A: + += ==================================== +0 Stop-A has no effect. +1 Stop-A breaks to the PROM (default). += ==================================== + +Stop-A is always enabled on a panic, so that the user can return to +the boot PROM. + sysrq ===== -- cgit From 404347e68aeb81b89dc440135ed23fcabff104f9 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Tue, 18 Feb 2020 13:59:21 +0100 Subject: docs: document panic fully in sysctl/kernel.rst MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The description of panic doesn’t cover all the supported scenarios; this patch fixes that, describing the three possibilities (no reboot, immediate reboot, reboot after a delay). Based on the implementation in kernel/panic.c. Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 3cbbe4502e18..60c97a79ff26 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -559,9 +559,15 @@ The default is 65534. panic ===== -The value in this file represents the number of seconds the kernel -waits before rebooting on a panic. When you use the software watchdog, -the recommended setting is 60. 
+The value in this file determines the behaviour of the kernel on a +panic: + +* if zero, the kernel will loop forever; +* if negative, the kernel will reboot immediately; +* if positive, the kernel will reboot after the corresponding number + of seconds. + +When you use the software watchdog, the recommended setting is 60. panic_on_io_nmi -- cgit From 8f21f54b8a9517e0213948088aca757a0f122447 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Tue, 18 Feb 2020 13:59:23 +0100 Subject: docs: sysctl/kernel: remove rtsig entries These have no corresponding code in the kernel. Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 60c97a79ff26..6c0d8c55101c 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -900,16 +900,6 @@ ROM/Flash boot loader. Maybe to tell it what to do after rebooting. ??? -rtsig-max & rtsig-nr -==================== - -The file rtsig-max can be used to tune the maximum number -of POSIX realtime (queued) signals that can be outstanding -in the system. - -rtsig-nr shows the number of RT signals currently queued. - - sched_energy_aware ================== -- cgit From dff2c2e69f308c1c7d296d49d2b0467e9675b58e Mon Sep 17 00:00:00 2001 From: Bhaskar Chowdhury Date: Tue, 18 Feb 2020 15:10:13 +0530 Subject: Replace dead urls with active urls for Mutt This patch replaces stale/dead urls with active urls for Mutt.
Signed-off-by: Bhaskar Chowdhury Signed-off-by: Jonathan Corbet --- Documentation/process/email-clients.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/process/email-clients.rst b/Documentation/process/email-clients.rst index 5273d06c8ff6..c9e4ce2613c0 100644 --- a/Documentation/process/email-clients.rst +++ b/Documentation/process/email-clients.rst @@ -237,9 +237,9 @@ using Mutt to send patches through Gmail:: The Mutt docs have lots more information: - http://dev.mutt.org/trac/wiki/UseCases/Gmail + https://gitlab.com/muttmua/mutt/-/wikis/UseCases/Gmail - http://dev.mutt.org/doc/manual.html + http://www.mutt.org/doc/manual/ Pine (TUI) ********** -- cgit From fb0e0ffe7fc8e0e91481e67665f1d646bfd071f2 Mon Sep 17 00:00:00 2001 From: Tony Fischetti Date: Sun, 16 Feb 2020 19:08:26 -0500 Subject: Documentation: bring process docs up to date The guide to the kernel dev process documentation, for example, contains references to older kernels and their timelines. In addition, one of the "long term support kernels" listed has since reached EOL, and a new one has been named. This patch brings information/tables up to date. Additionally, some very trivial grammatical errors, unclear sentences, and potentially unsavory diction have been edited. Signed-off-by: Tony Fischetti Reviewed-by: Randy Dunlap Signed-off-by: Jonathan Corbet --- Documentation/process/2.Process.rst | 108 +++++++++++++++++---------------- Documentation/process/coding-style.rst | 18 +++--- Documentation/process/howto.rst | 17 +++--- 3 files changed, 73 insertions(+), 70 deletions(-) diff --git a/Documentation/process/2.Process.rst b/Documentation/process/2.Process.rst index ae020d84d7c4..b21b5b245d13 100644 --- a/Documentation/process/2.Process.rst +++ b/Documentation/process/2.Process.rst @@ -18,18 +18,18 @@ major kernel release happening every two or three months.
The recent release history looks like this: ====== ================= - 4.11 April 30, 2017 - 4.12 July 2, 2017 - 4.13 September 3, 2017 - 4.14 November 12, 2017 - 4.15 January 28, 2018 - 4.16 April 1, 2018 + 5.0 March 3, 2019 + 5.1 May 5, 2019 + 5.2 July 7, 2019 + 5.3 September 15, 2019 + 5.4 November 24, 2019 + 5.5 January 6, 2020 ====== ================= -Every 4.x release is a major kernel release with new features, internal -API changes, and more. A typical 4.x release contain about 13,000 -changesets with changes to several hundred thousand lines of code. 4.x is -thus the leading edge of Linux kernel development; the kernel uses a +Every 5.x release is a major kernel release with new features, internal +API changes, and more. A typical release can contain about 13,000 +changesets with changes to several hundred thousand lines of code. 5.x is +the leading edge of Linux kernel development; the kernel uses a rolling development model which is continually integrating major changes. A relatively straightforward discipline is followed with regard to the @@ -48,9 +48,9 @@ detail later on). The merge window lasts for approximately two weeks. At the end of this time, Linus Torvalds will declare that the window is closed and release the -first of the "rc" kernels. For the kernel which is destined to be 2.6.40, +first of the "rc" kernels. For the kernel which is destined to be 5.6, for example, the release which happens at the end of the merge window will -be called 2.6.40-rc1. The -rc1 release is the signal that the time to +be called 5.6-rc1. The -rc1 release is the signal that the time to merge new features has passed, and that the time to stabilize the next kernel has begun. @@ -67,22 +67,23 @@ add at any time). As fixes make their way into the mainline, the patch rate will slow over time. 
Linus releases new -rc kernels about once a week; a normal series will get up to somewhere between -rc6 and -rc9 before the kernel is -considered to be sufficiently stable and the final 2.6.x release is made. +considered to be sufficiently stable and the final release is made. At that point the whole process starts over again. -As an example, here is how the 4.16 development cycle went (all dates in -2018): +As an example, here is how the 5.4 development cycle went (all dates in +2019): ============== =============================== - January 28 4.15 stable release - February 11 4.16-rc1, merge window closes - February 18 4.16-rc2 - February 25 4.16-rc3 - March 4 4.16-rc4 - March 11 4.16-rc5 - March 18 4.16-rc6 - March 25 4.16-rc7 - April 1 4.16 stable release + September 15 5.3 stable release + September 30 5.4-rc1, merge window closes + October 6 5.4-rc2 + October 13 5.4-rc3 + October 20 5.4-rc4 + October 27 5.4-rc5 + November 3 5.4-rc6 + November 10 5.4-rc7 + November 17 5.4-rc8 + November 24 5.4 stable release ============== =============================== How do the developers decide when to close the development cycle and create @@ -98,43 +99,44 @@ release is made. In the real world, this kind of perfection is hard to achieve; there are just too many variables in a project of this size. There comes a point where delaying the final release just makes the problem worse; the pile of changes waiting for the next merge window will grow -larger, creating even more regressions the next time around. So most 4.x +larger, creating even more regressions the next time around. So most 5.x kernels go out with a handful of known regressions though, hopefully, none of them are serious. Once a stable release is made, its ongoing maintenance is passed off to the -"stable team," currently consisting of Greg Kroah-Hartman. The stable team -will release occasional updates to the stable release using the 4.x.y -numbering scheme. 
To be considered for an update release, a patch must (1) -fix a significant bug, and (2) already be merged into the mainline for the -next development kernel. Kernels will typically receive stable updates for -a little more than one development cycle past their initial release. So, -for example, the 4.13 kernel's history looked like: +"stable team," currently Greg Kroah-Hartman. The stable team will release +occasional updates to the stable release using the 5.x.y numbering scheme. +To be considered for an update release, a patch must (1) fix a significant +bug, and (2) already be merged into the mainline for the next development +kernel. Kernels will typically receive stable updates for a little more +than one development cycle past their initial release. So, for example, the +5.2 kernel's history looked like this (all dates in 2019): ============== =============================== - September 3 4.13 stable release - September 13 4.13.1 - September 20 4.13.2 - September 27 4.13.3 - October 5 4.13.4 - October 12 4.13.5 + July 7 5.2 stable release + July 14 5.2.1 + July 21 5.2.2 + July 26 5.2.3 + July 28 5.2.4 + July 31 5.2.5 ... ... - November 24 4.13.16 + October 11 5.2.21 ============== =============================== -4.13.16 was the final stable update of the 4.13 release. +5.2.21 was the final stable update of the 5.2 release. Some kernels are designated "long term" kernels; they will receive support for a longer period.
As of this writing, the current long term kernels and their maintainers are: - ====== ====================== ============================== - 3.16 Ben Hutchings (very long-term stable kernel) - 4.1 Sasha Levin - 4.4 Greg Kroah-Hartman (very long-term stable kernel) - 4.9 Greg Kroah-Hartman - 4.14 Greg Kroah-Hartman - ====== ====================== ============================== + ====== ================================ ======================= + 3.16 Ben Hutchings (very long-term kernel) + 4.4 Greg Kroah-Hartman & Sasha Levin (very long-term kernel) + 4.9 Greg Kroah-Hartman & Sasha Levin + 4.14 Greg Kroah-Hartman & Sasha Levin + 4.19 Greg Kroah-Hartman & Sasha Levin + 5.4 Greg Kroah-Hartman & Sasha Levin + ====== ================================ ======================= The selection of a kernel for long-term support is purely a matter of a maintainer having the need and the time to maintain that release. There @@ -215,12 +217,12 @@ How patches get into the Kernel ------------------------------- There is exactly one person who can merge patches into the mainline kernel -repository: Linus Torvalds. But, of the over 9,500 patches which went -into the 2.6.38 kernel, only 112 (around 1.3%) were directly chosen by Linus -himself. The kernel project has long since grown to a size where no single -developer could possibly inspect and select every patch unassisted. The -way the kernel developers have addressed this growth is through the use of -a lieutenant system built around a chain of trust. +repository: Linus Torvalds. But, for example, of the over 9,500 patches +which went into the 2.6.38 kernel, only 112 (around 1.3%) were directly +chosen by Linus himself. The kernel project has long since grown to a size +where no single developer could possibly inspect and select every patch +unassisted. The way the kernel developers have addressed this growth is +through the use of a lieutenant system built around a chain of trust. 
The kernel code base is logically broken down into a set of subsystems: networking, specific architecture support, memory management, video diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index edb296c52f61..acb2f1b36350 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -284,9 +284,9 @@ context lines. 4) Naming --------- -C is a Spartan language, and so should your naming be. Unlike Modula-2 -and Pascal programmers, C programmers do not use cute names like -ThisVariableIsATemporaryCounter. A C programmer would call that +C is a Spartan language, and your naming conventions should follow suit. +Unlike Modula-2 and Pascal programmers, C programmers do not use cute +names like ThisVariableIsATemporaryCounter. A C programmer would call that variable ``tmp``, which is much easier to write, and not the least more difficult to understand. @@ -300,9 +300,9 @@ that counts the number of active users, you should call that ``count_active_users()`` or similar, you should **not** call it ``cntusr()``. Encoding the type of a function into the name (so-called Hungarian -notation) is brain damaged - the compiler knows the types anyway and can -check those, and it only confuses the programmer. No wonder MicroSoft -makes buggy programs. +notation) is asinine - the compiler knows the types anyway and can check +those, and it only confuses the programmer. No wonder Microsoft makes buggy +programs. LOCAL variable names should be short, and to the point. If you have some random integer loop counter, it should probably be called ``i``. @@ -806,9 +806,9 @@ covers RTL which is used frequently with assembly language in the kernel. ---------------------------- Kernel developers like to be seen as literate. Do mind the spelling -of kernel messages to make a good impression. Do not use crippled -words like ``dont``; use ``do not`` or ``don't`` instead. 
Make the messages -concise, clear, and unambiguous. +of kernel messages to make a good impression. Do not use incorrect +contractions like ``dont``; use ``do not`` or ``don't`` instead. Make the +messages concise, clear, and unambiguous. Kernel messages do not have to be terminated with a period. diff --git a/Documentation/process/howto.rst b/Documentation/process/howto.rst index b6f5a379ad6c..70791e153de1 100644 --- a/Documentation/process/howto.rst +++ b/Documentation/process/howto.rst @@ -243,10 +243,10 @@ branches. These different branches are: Mainline tree ~~~~~~~~~~~~~ -Mainline tree are maintained by Linus Torvalds, and can be found at +The mainline tree is maintained by Linus Torvalds, and can be found at https://kernel.org or in the repo. Its development process is as follows: - - As soon as a new kernel is released a two weeks window is open, + - As soon as a new kernel is released a two week window is open, during this period of time maintainers can submit big diffs to Linus, usually the patches that have already been included in the linux-next for a few weeks. The preferred way to submit big changes @@ -281,8 +281,9 @@ Various stable trees with multiple major numbers Kernels with 3-part versions are -stable kernels. They contain relatively small and critical fixes for security problems or significant -regressions discovered in a given major mainline release, with the first -2-part of version number are the same correspondingly. +regressions discovered in a given major mainline release. Each release +in a major stable series increments the third part of the version +number, keeping the first two parts the same. This is the recommended branch for users who want the most recent stable kernel and are not interested in helping test development/experimental @@ -359,10 +360,10 @@ Managing bug reports One of the best ways to put into practice your hacking skills is by fixing bugs reported by other people. 
Not only you will help to make the kernel -more stable, you'll learn to fix real world problems and you will improve -your skills, and other developers will be aware of your presence. Fixing -bugs is one of the best ways to get merits among other developers, because -not many people like wasting time fixing other people's bugs. +more stable, but you'll also learn to fix real world problems and you will +improve your skills, and other developers will be aware of your presence. +Fixing bugs is one of the best ways to get merits among other developers, +because not many people like wasting time fixing other people's bugs. To work in the already reported bug reports, go to https://bugzilla.kernel.org. -- cgit From 965fc39f73932041441e03730db31516e285b61a Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Sat, 15 Feb 2020 23:26:06 -0800 Subject: Documentation: sort _SPHINXDIRS for 'make help' Sort the _SPHINXDIRS so that the 'make help' output is easier to read & search and in a predictable order instead of some unknown pseudo-random order. Signed-off-by: Randy Dunlap Cc: Jonathan Corbet Cc: linux-doc@vger.kernel.org Signed-off-by: Jonathan Corbet --- Documentation/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/Makefile b/Documentation/Makefile index d77bb607aea4..79ecee62d597 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -13,7 +13,7 @@ endif SPHINXBUILD = sphinx-build SPHINXOPTS = SPHINXDIRS = . 
-_SPHINXDIRS = $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst)) +_SPHINXDIRS = $(sort $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst))) SPHINX_CONF = conf.py PAPER = BUILDDIR = $(obj)/output -- cgit From 1733ec77d34059cd67a7b9677fe2fd3ef977afb3 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Fri, 14 Feb 2020 18:41:32 +0100 Subject: docs: driver-api: edid: Fix list formatting MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Without the empty lines, Sphinx renders the list as part of the running text. Signed-off-by: Jonathan Neuschäfer Signed-off-by: Jonathan Corbet --- Documentation/driver-api/edid.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/driver-api/edid.rst b/Documentation/driver-api/edid.rst index b1b5acd501ed..7dc07942ceb2 100644 --- a/Documentation/driver-api/edid.rst +++ b/Documentation/driver-api/edid.rst @@ -11,11 +11,13 @@ Today, with the advent of Kernel Mode Setting, a graphics board is either correctly working because all components follow the standards - or the computer is unusable, because the screen remains dark after booting or it displays the wrong area. Cases when this happens are: + - The graphics board does not recognize the monitor. - The graphics board is unable to detect any EDID data. - The graphics board incorrectly forwards EDID data to the driver. - The monitor sends no or bogus EDID data. - A KVM sends its own EDID data instead of querying the connected monitor. + Adding the kernel parameter "nomodeset" helps in most cases, but causes restrictions later on. 
-- cgit From 320bfd91a985f2b945bad611c43add8a3a359845 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Fri, 14 Feb 2020 18:41:33 +0100 Subject: docs: admin-guide: Move edid.rst from driver-api MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This document describes actions that an admin can do, rather than interfaces available to driver developers, so admin-guide seems to be a more appropriate place for it. Signed-off-by: Jonathan Neuschäfer Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/edid.rst | 60 +++++++++++++++++++++++++++++++++++++ Documentation/admin-guide/index.rst | 1 + Documentation/driver-api/edid.rst | 60 ------------------------------------- Documentation/driver-api/index.rst | 1 - 4 files changed, 61 insertions(+), 61 deletions(-) create mode 100644 Documentation/admin-guide/edid.rst delete mode 100644 Documentation/driver-api/edid.rst diff --git a/Documentation/admin-guide/edid.rst b/Documentation/admin-guide/edid.rst new file mode 100644 index 000000000000..7dc07942ceb2 --- /dev/null +++ b/Documentation/admin-guide/edid.rst @@ -0,0 +1,60 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +EDID +==== + +In the good old days when graphics parameters were configured explicitly +in a file called xorg.conf, even broken hardware could be managed. + +Today, with the advent of Kernel Mode Setting, a graphics board is +either correctly working because all components follow the standards - +or the computer is unusable, because the screen remains dark after +booting or it displays the wrong area. Cases when this happens are: + +- The graphics board does not recognize the monitor. +- The graphics board is unable to detect any EDID data. +- The graphics board incorrectly forwards EDID data to the driver. +- The monitor sends no or bogus EDID data. +- A KVM sends its own EDID data instead of querying the connected monitor. 
+ +Adding the kernel parameter "nomodeset" helps in most cases, but causes +restrictions later on. + +As a remedy for such situations, the kernel configuration item +CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an +individually prepared or corrected EDID data set in the /lib/firmware +directory from where it is loaded via the firmware interface. The code +(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for +commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200, +1680x1050, 1920x1080) as binary blobs, but the kernel source tree does +not contain code to create these data. In order to elucidate the origin +of the built-in binary EDID blobs and to facilitate the creation of +individual data for a specific misbehaving monitor, commented sources +and a Makefile environment are given here. + +To create binary EDID and C source code files from the existing data +material, simply type "make". + +If you want to create your own EDID file, copy the file 1024x768.S, +replace the settings with your own data and add a new target to the +Makefile. Please note that the EDID data structure expects the timing +values in a different way as compared to the standard X11 format. + +X11: + HTimings: + hdisp hsyncstart hsyncend htotal + VTimings: + vdisp vsyncstart vsyncend vtotal + +EDID:: + + #define XPIX hdisp + #define XBLANK htotal-hdisp + #define XOFFSET hsyncstart-hdisp + #define XPULSE hsyncend-hsyncstart + + #define YPIX vdisp + #define YBLANK vtotal-vdisp + #define YOFFSET vsyncstart-vdisp + #define YPULSE vsyncend-vsyncstart diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index f1d0ccffbe72..5a6269fb8593 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -75,6 +75,7 @@ configure specific aspects of kernel behavior to your liking. 
cputopology dell_rbu device-mapper/index + edid efi-stub ext4 nfs/index diff --git a/Documentation/driver-api/edid.rst b/Documentation/driver-api/edid.rst deleted file mode 100644 index 7dc07942ceb2..000000000000 --- a/Documentation/driver-api/edid.rst +++ /dev/null @@ -1,60 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -==== -EDID -==== - -In the good old days when graphics parameters were configured explicitly -in a file called xorg.conf, even broken hardware could be managed. - -Today, with the advent of Kernel Mode Setting, a graphics board is -either correctly working because all components follow the standards - -or the computer is unusable, because the screen remains dark after -booting or it displays the wrong area. Cases when this happens are: - -- The graphics board does not recognize the monitor. -- The graphics board is unable to detect any EDID data. -- The graphics board incorrectly forwards EDID data to the driver. -- The monitor sends no or bogus EDID data. -- A KVM sends its own EDID data instead of querying the connected monitor. - -Adding the kernel parameter "nomodeset" helps in most cases, but causes -restrictions later on. - -As a remedy for such situations, the kernel configuration item -CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an -individually prepared or corrected EDID data set in the /lib/firmware -directory from where it is loaded via the firmware interface. The code -(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for -commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200, -1680x1050, 1920x1080) as binary blobs, but the kernel source tree does -not contain code to create these data. In order to elucidate the origin -of the built-in binary EDID blobs and to facilitate the creation of -individual data for a specific misbehaving monitor, commented sources -and a Makefile environment are given here. 
- -To create binary EDID and C source code files from the existing data -material, simply type "make". - -If you want to create your own EDID file, copy the file 1024x768.S, -replace the settings with your own data and add a new target to the -Makefile. Please note that the EDID data structure expects the timing -values in a different way as compared to the standard X11 format. - -X11: - HTimings: - hdisp hsyncstart hsyncend htotal - VTimings: - vdisp vsyncstart vsyncend vtotal - -EDID:: - - #define XPIX hdisp - #define XBLANK htotal-hdisp - #define XOFFSET hsyncstart-hdisp - #define XPULSE hsyncend-hsyncstart - - #define YPIX vdisp - #define YBLANK vtotal-vdisp - #define YOFFSET vsyncstart-vdisp - #define YPULSE vsyncend-vsyncstart diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 0ebe205efd0c..ea3003b3c5e5 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -74,7 +74,6 @@ available subsections can be seen below. connector console dcdbas - edid eisa ipmb isa -- cgit From b4ce545f349b711351ec4b0df7a3302d91c3dd45 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Fri, 14 Feb 2020 18:41:35 +0100 Subject: docs: admin-guide: edid: Clarify where to run "make" MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When both the documentation and the data files lived in Documentation/EDID, this wasn't necessary, but both have been moved to other directories in the meantime. 
Signed-off-by: Jonathan Neuschäfer Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/edid.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/edid.rst b/Documentation/admin-guide/edid.rst index 7dc07942ceb2..80deeb21a265 100644 --- a/Documentation/admin-guide/edid.rst +++ b/Documentation/admin-guide/edid.rst @@ -34,7 +34,7 @@ individual data for a specific misbehaving monitor, commented sources and a Makefile environment are given here. To create binary EDID and C source code files from the existing data -material, simply type "make". +material, simply type "make" in tools/edid/. If you want to create your own EDID file, copy the file 1024x768.S, replace the settings with your own data and add a new target to the -- cgit From e2c79ab7d75b4c6ed827e8078e5ebe2d059edafc Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Fri, 14 Feb 2020 18:41:34 +0100 Subject: tools/edid: Move EDID data sets from Documentation/ MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The EDID files are not really documentation. 
Signed-off-by: Jonathan Neuschäfer Signed-off-by: Jonathan Corbet --- Documentation/EDID/1024x768.S | 43 ------- Documentation/EDID/1280x1024.S | 43 ------- Documentation/EDID/1600x1200.S | 43 ------- Documentation/EDID/1680x1050.S | 43 ------- Documentation/EDID/1920x1080.S | 43 ------- Documentation/EDID/800x600.S | 40 ------ Documentation/EDID/Makefile | 37 ------ Documentation/EDID/edid.S | 274 ----------------------------------------- Documentation/EDID/hex | 1 - tools/edid/1024x768.S | 43 +++++++ tools/edid/1280x1024.S | 43 +++++++ tools/edid/1600x1200.S | 43 +++++++ tools/edid/1680x1050.S | 43 +++++++ tools/edid/1920x1080.S | 43 +++++++ tools/edid/800x600.S | 40 ++++++ tools/edid/Makefile | 37 ++++++ tools/edid/edid.S | 274 +++++++++++++++++++++++++++++++++++++++++ tools/edid/hex | 1 + 18 files changed, 567 insertions(+), 567 deletions(-) delete mode 100644 Documentation/EDID/1024x768.S delete mode 100644 Documentation/EDID/1280x1024.S delete mode 100644 Documentation/EDID/1600x1200.S delete mode 100644 Documentation/EDID/1680x1050.S delete mode 100644 Documentation/EDID/1920x1080.S delete mode 100644 Documentation/EDID/800x600.S delete mode 100644 Documentation/EDID/Makefile delete mode 100644 Documentation/EDID/edid.S delete mode 100644 Documentation/EDID/hex create mode 100644 tools/edid/1024x768.S create mode 100644 tools/edid/1280x1024.S create mode 100644 tools/edid/1600x1200.S create mode 100644 tools/edid/1680x1050.S create mode 100644 tools/edid/1920x1080.S create mode 100644 tools/edid/800x600.S create mode 100644 tools/edid/Makefile create mode 100644 tools/edid/edid.S create mode 100644 tools/edid/hex diff --git a/Documentation/EDID/1024x768.S b/Documentation/EDID/1024x768.S deleted file mode 100644 index 4aed3f9ab88a..000000000000 --- a/Documentation/EDID/1024x768.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1024x768.S: EDID data set for standard 1024x768 60 Hz monitor - - Copyright (C) 2011 Carsten Emde - - This program is free software; you can 
redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 65000 /* kHz */ -#define XPIX 1024 -#define YPIX 768 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 320 -#define YBLANK 38 -#define XOFFSET 8 -#define XPULSE 144 -#define YOFFSET 3 -#define YPULSE 6 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux XGA" -#define ESTABLISHED_TIMING2_BITS 0x08 /* Bit 3 -> 1024x768 @60 Hz */ -#define HSYNC_POL 0 -#define VSYNC_POL 0 - -#include "edid.S" diff --git a/Documentation/EDID/1280x1024.S b/Documentation/EDID/1280x1024.S deleted file mode 100644 index b26dd424cad7..000000000000 --- a/Documentation/EDID/1280x1024.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1280x1024.S: EDID data set for standard 1280x1024 60 Hz monitor - - Copyright (C) 2011 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. 
- - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 108000 /* kHz */ -#define XPIX 1280 -#define YPIX 1024 -#define XY_RATIO XY_RATIO_5_4 -#define XBLANK 408 -#define YBLANK 42 -#define XOFFSET 48 -#define XPULSE 112 -#define YOFFSET 1 -#define YPULSE 3 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux SXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1600x1200.S b/Documentation/EDID/1600x1200.S deleted file mode 100644 index 0d091b282768..000000000000 --- a/Documentation/EDID/1600x1200.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1600x1200.S: EDID data set for standard 1600x1200 60 Hz monitor - - Copyright (C) 2013 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 162000 /* kHz */ -#define XPIX 1600 -#define YPIX 1200 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 560 -#define YBLANK 50 -#define XOFFSET 64 -#define XPULSE 192 -#define YOFFSET 1 -#define YPULSE 3 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux UXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1680x1050.S b/Documentation/EDID/1680x1050.S deleted file mode 100644 index 7dfed9a33eab..000000000000 --- a/Documentation/EDID/1680x1050.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1680x1050.S: EDID data set for standard 1680x1050 60 Hz monitor - - Copyright (C) 2012 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 146250 /* kHz */ -#define XPIX 1680 -#define YPIX 1050 -#define XY_RATIO XY_RATIO_16_10 -#define XBLANK 560 -#define YBLANK 39 -#define XOFFSET 104 -#define XPULSE 176 -#define YOFFSET 3 -#define YPULSE 6 -#define DPI 96 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux WSXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1920x1080.S b/Documentation/EDID/1920x1080.S deleted file mode 100644 index d6ffbba28e95..000000000000 --- a/Documentation/EDID/1920x1080.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1920x1080.S: EDID data set for standard 1920x1080 60 Hz monitor - - Copyright (C) 2012 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 148500 /* kHz */ -#define XPIX 1920 -#define YPIX 1080 -#define XY_RATIO XY_RATIO_16_9 -#define XBLANK 280 -#define YBLANK 45 -#define XOFFSET 88 -#define XPULSE 44 -#define YOFFSET 4 -#define YPULSE 5 -#define DPI 96 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux FHD" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/800x600.S b/Documentation/EDID/800x600.S deleted file mode 100644 index a5616588de08..000000000000 --- a/Documentation/EDID/800x600.S +++ /dev/null @@ -1,40 +0,0 @@ -/* - 800x600.S: EDID data set for standard 800x600 60 Hz monitor - - Copyright (C) 2011 Carsten Emde - Copyright (C) 2014 Linaro Limited - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 40000 /* kHz */ -#define XPIX 800 -#define YPIX 600 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 256 -#define YBLANK 28 -#define XOFFSET 40 -#define XPULSE 128 -#define YOFFSET 1 -#define YPULSE 4 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux SVGA" -#define ESTABLISHED_TIMING1_BITS 0x01 /* Bit 0: 800x600 @ 60Hz */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/Makefile b/Documentation/EDID/Makefile deleted file mode 100644 index 85a927dfab02..000000000000 --- a/Documentation/EDID/Makefile +++ /dev/null @@ -1,37 +0,0 @@ - -SOURCES := $(wildcard [0-9]*x[0-9]*.S) - -BIN := $(patsubst %.S, %.bin, $(SOURCES)) - -IHEX := $(patsubst %.S, %.bin.ihex, $(SOURCES)) - -CODE := $(patsubst %.S, %.c, $(SOURCES)) - -all: $(BIN) $(IHEX) $(CODE) - -clean: - @rm -f *.o *.bin.ihex *.bin *.c - -%.o: %.S - @cc -c $^ - -%.bin.nocrc: %.o - @objcopy -Obinary $^ $@ - -%.crc: %.bin.nocrc - @list=$$(for i in `seq 1 127`; do head -c$$i $^ | tail -c1 \ - | hexdump -v -e '/1 "%02X+"'; done); \ - echo "ibase=16;100-($${list%?})%100" | bc >$@ - -%.p: %.crc %.S - @cc -c -DCRC="$$(cat $*.crc)" -o $@ $*.S - -%.bin: %.p - @objcopy -Obinary $^ $@ - -%.bin.ihex: %.p - @objcopy -Oihex $^ $@ - @dos2unix $@ 2>/dev/null - -%.c: %.bin - @echo "{" >$@; hexdump -f hex $^ >>$@; echo "};" >>$@ diff --git a/Documentation/EDID/edid.S b/Documentation/EDID/edid.S deleted file mode 100644 index c3d13815526d..000000000000 --- a/Documentation/EDID/edid.S +++ /dev/null @@ -1,274 +0,0 @@ -/* - edid.S: EDID data template - - Copyright (C) 2012 Carsten Emde - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. 
- - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - - -/* Manufacturer */ -#define MFG_LNX1 'L' -#define MFG_LNX2 'N' -#define MFG_LNX3 'X' -#define SERIAL 0 -#define YEAR 2012 -#define WEEK 5 - -/* EDID 1.3 standard definitions */ -#define XY_RATIO_16_10 0b00 -#define XY_RATIO_4_3 0b01 -#define XY_RATIO_5_4 0b10 -#define XY_RATIO_16_9 0b11 - -/* Provide defaults for the timing bits */ -#ifndef ESTABLISHED_TIMING1_BITS -#define ESTABLISHED_TIMING1_BITS 0x00 -#endif -#ifndef ESTABLISHED_TIMING2_BITS -#define ESTABLISHED_TIMING2_BITS 0x00 -#endif -#ifndef ESTABLISHED_TIMING3_BITS -#define ESTABLISHED_TIMING3_BITS 0x00 -#endif - -#define mfgname2id(v1,v2,v3) \ - ((((v1-'@')&0x1f)<<10)+(((v2-'@')&0x1f)<<5)+((v3-'@')&0x1f)) -#define swap16(v1) ((v1>>8)+((v1&0xff)<<8)) -#define lsbs2(v1,v2) (((v1&0x0f)<<4)+(v2&0x0f)) -#define msbs2(v1,v2) ((((v1>>8)&0x0f)<<4)+((v2>>8)&0x0f)) -#define msbs4(v1,v2,v3,v4) \ - ((((v1>>8)&0x03)<<6)+(((v2>>8)&0x03)<<4)+\ - (((v3>>4)&0x03)<<2)+((v4>>4)&0x03)) -#define pixdpi2mm(pix,dpi) ((pix*25)/dpi) -#define xsize pixdpi2mm(XPIX,DPI) -#define ysize pixdpi2mm(YPIX,DPI) - - .data - -/* Fixed header pattern */ -header: .byte 0x00,0xff,0xff,0xff,0xff,0xff,0xff,0x00 - -mfg_id: .hword swap16(mfgname2id(MFG_LNX1, MFG_LNX2, MFG_LNX3)) - -prod_code: .hword 0 - -/* Serial number. 32 bits, little endian. */ -serial_number: .long SERIAL - -/* Week of manufacture */ -week: .byte WEEK - -/* Year of manufacture, less 1990. 
(1990-2245) - If week=255, it is the model year instead */ -year: .byte YEAR-1990 - -version: .byte VERSION /* EDID version, usually 1 (for 1.3) */ -revision: .byte REVISION /* EDID revision, usually 3 (for 1.3) */ - -/* If Bit 7=1 Digital input. If set, the following bit definitions apply: - Bits 6-1 Reserved, must be 0 - Bit 0 Signal is compatible with VESA DFP 1.x TMDS CRGB, - 1 pixel per clock, up to 8 bits per color, MSB aligned, - If Bit 7=0 Analog input. If clear, the following bit definitions apply: - Bits 6-5 Video white and sync levels, relative to blank - 00=+0.7/-0.3 V; 01=+0.714/-0.286 V; - 10=+1.0/-0.4 V; 11=+0.7/0 V - Bit 4 Blank-to-black setup (pedestal) expected - Bit 3 Separate sync supported - Bit 2 Composite sync (on HSync) supported - Bit 1 Sync on green supported - Bit 0 VSync pulse must be serrated when somposite or - sync-on-green is used. */ -video_parms: .byte 0x6d - -/* Maximum horizontal image size, in centimetres - (max 292 cm/115 in at 16:9 aspect ratio) */ -max_hor_size: .byte xsize/10 - -/* Maximum vertical image size, in centimetres. - If either byte is 0, undefined (e.g. projector) */ -max_vert_size: .byte ysize/10 - -/* Display gamma, minus 1, times 100 (range 1.00-3.5 */ -gamma: .byte 120 - -/* Bit 7 DPMS standby supported - Bit 6 DPMS suspend supported - Bit 5 DPMS active-off supported - Bits 4-3 Display type: 00=monochrome; 01=RGB colour; - 10=non-RGB multicolour; 11=undefined - Bit 2 Standard sRGB colour space. Bytes 25-34 must contain - sRGB standard values. - Bit 1 Preferred timing mode specified in descriptor block 1. - Bit 0 GTF supported with default parameter values. */ -dsp_features: .byte 0xea - -/* Chromaticity coordinates. 
*/ -/* Red and green least-significant bits - Bits 7-6 Red x value least-significant 2 bits - Bits 5-4 Red y value least-significant 2 bits - Bits 3-2 Green x value lst-significant 2 bits - Bits 1-0 Green y value least-significant 2 bits */ -red_green_lsb: .byte 0x5e - -/* Blue and white least-significant 2 bits */ -blue_white_lsb: .byte 0xc0 - -/* Red x value most significant 8 bits. - 0-255 encodes 0-0.996 (255/256); 0-0.999 (1023/1024) with lsbits */ -red_x_msb: .byte 0xa4 - -/* Red y value most significant 8 bits */ -red_y_msb: .byte 0x59 - -/* Green x and y value most significant 8 bits */ -green_x_y_msb: .byte 0x4a,0x98 - -/* Blue x and y value most significant 8 bits */ -blue_x_y_msb: .byte 0x25,0x20 - -/* Default white point x and y value most significant 8 bits */ -white_x_y_msb: .byte 0x50,0x54 - -/* Established timings */ -/* Bit 7 720x400 @ 70 Hz - Bit 6 720x400 @ 88 Hz - Bit 5 640x480 @ 60 Hz - Bit 4 640x480 @ 67 Hz - Bit 3 640x480 @ 72 Hz - Bit 2 640x480 @ 75 Hz - Bit 1 800x600 @ 56 Hz - Bit 0 800x600 @ 60 Hz */ -estbl_timing1: .byte ESTABLISHED_TIMING1_BITS - -/* Bit 7 800x600 @ 72 Hz - Bit 6 800x600 @ 75 Hz - Bit 5 832x624 @ 75 Hz - Bit 4 1024x768 @ 87 Hz, interlaced (1024x768) - Bit 3 1024x768 @ 60 Hz - Bit 2 1024x768 @ 72 Hz - Bit 1 1024x768 @ 75 Hz - Bit 0 1280x1024 @ 75 Hz */ -estbl_timing2: .byte ESTABLISHED_TIMING2_BITS - -/* Bit 7 1152x870 @ 75 Hz (Apple Macintosh II) - Bits 6-0 Other manufacturer-specific display mod */ -estbl_timing3: .byte ESTABLISHED_TIMING3_BITS - -/* Standard timing */ -/* X resolution, less 31, divided by 8 (256-2288 pixels) */ -std_xres: .byte (XPIX/8)-31 -/* Y resolution, X:Y pixel ratio - Bits 7-6 X:Y pixel ratio: 00=16:10; 01=4:3; 10=5:4; 11=16:9. - Bits 5-0 Vertical frequency, less 60 (60-123 Hz) */ -std_vres: .byte (XY_RATIO<<6)+VFREQ-60 - .fill 7,2,0x0101 /* Unused */ - -descriptor1: -/* Pixel clock in 10 kHz units. 
(0.-655.35 MHz, little-endian) */ -clock: .hword CLOCK/10 - -/* Horizontal active pixels 8 lsbits (0-4095) */ -x_act_lsb: .byte XPIX&0xff -/* Horizontal blanking pixels 8 lsbits (0-4095) - End of active to start of next active. */ -x_blk_lsb: .byte XBLANK&0xff -/* Bits 7-4 Horizontal active pixels 4 msbits - Bits 3-0 Horizontal blanking pixels 4 msbits */ -x_msbs: .byte msbs2(XPIX,XBLANK) - -/* Vertical active lines 8 lsbits (0-4095) */ -y_act_lsb: .byte YPIX&0xff -/* Vertical blanking lines 8 lsbits (0-4095) */ -y_blk_lsb: .byte YBLANK&0xff -/* Bits 7-4 Vertical active lines 4 msbits - Bits 3-0 Vertical blanking lines 4 msbits */ -y_msbs: .byte msbs2(YPIX,YBLANK) - -/* Horizontal sync offset pixels 8 lsbits (0-1023) From blanking start */ -x_snc_off_lsb: .byte XOFFSET&0xff -/* Horizontal sync pulse width pixels 8 lsbits (0-1023) */ -x_snc_pls_lsb: .byte XPULSE&0xff -/* Bits 7-4 Vertical sync offset lines 4 lsbits (0-63) - Bits 3-0 Vertical sync pulse width lines 4 lsbits (0-63) */ -y_snc_lsb: .byte lsbs2(YOFFSET, YPULSE) -/* Bits 7-6 Horizontal sync offset pixels 2 msbits - Bits 5-4 Horizontal sync pulse width pixels 2 msbits - Bits 3-2 Vertical sync offset lines 2 msbits - Bits 1-0 Vertical sync pulse width lines 2 msbits */ -xy_snc_msbs: .byte msbs4(XOFFSET,XPULSE,YOFFSET,YPULSE) - -/* Horizontal display size, mm, 8 lsbits (0-4095 mm, 161 in) */ -x_dsp_size: .byte xsize&0xff - -/* Vertical display size, mm, 8 lsbits (0-4095 mm, 161 in) */ -y_dsp_size: .byte ysize&0xff - -/* Bits 7-4 Horizontal display size, mm, 4 msbits - Bits 3-0 Vertical display size, mm, 4 msbits */ -dsp_size_mbsb: .byte msbs2(xsize,ysize) - -/* Horizontal border pixels (each side; total is twice this) */ -x_border: .byte 0 -/* Vertical border lines (each side; total is twice this) */ -y_border: .byte 0 - -/* Bit 7 Interlaced - Bits 6-5 Stereo mode: 00=No stereo; other values depend on bit 0: - Bit 0=0: 01=Field sequential, sync=1 during right; 10=similar, - sync=1 during left; 11=4-way 
interleaved stereo - Bit 0=1 2-way interleaved stereo: 01=Right image on even lines; - 10=Left image on even lines; 11=side-by-side - Bits 4-3 Sync type: 00=Analog composite; 01=Bipolar analog composite; - 10=Digital composite (on HSync); 11=Digital separate - Bit 2 If digital separate: Vertical sync polarity (1=positive) - Other types: VSync serrated (HSync during VSync) - Bit 1 If analog sync: Sync on all 3 RGB lines (else green only) - Digital: HSync polarity (1=positive) - Bit 0 2-way line-interleaved stereo, if bits 4-3 are not 00. */ -features: .byte 0x18+(VSYNC_POL<<2)+(HSYNC_POL<<1) - -descriptor2: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xff /* Descriptor is monitor serial number (text) */ - .byte 0 /* Must be zero */ -start1: .ascii "Linux #0" -end1: .byte 0x0a /* End marker */ - .fill 12-(end1-start1), 1, 0x20 /* Padded spaces */ -descriptor3: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xfd /* Descriptor is monitor range limits */ - .byte 0 /* Must be zero */ -start2: .byte VFREQ-1 /* Minimum vertical field rate (1-255 Hz) */ - .byte VFREQ+1 /* Maximum vertical field rate (1-255 Hz) */ - .byte (CLOCK/(XPIX+XBLANK))-1 /* Minimum horizontal line rate - (1-255 kHz) */ - .byte (CLOCK/(XPIX+XBLANK))+1 /* Maximum horizontal line rate - (1-255 kHz) */ - .byte (CLOCK/10000)+1 /* Maximum pixel clock rate, rounded up - to 10 MHz multiple (10-2550 MHz) */ - .byte 0 /* No extended timing information type */ -end2: .byte 0x0a /* End marker */ - .fill 12-(end2-start2), 1, 0x20 /* Padded spaces */ -descriptor4: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xfc /* Descriptor is text */ - .byte 0 /* Must be zero */ -start3: .ascii TIMING_NAME -end3: .byte 0x0a /* End marker */ - .fill 12-(end3-start3), 1, 0x20 /* Padded spaces */ -extensions: .byte 0 /* Number of extensions to follow */ -checksum: .byte CRC /* Sum of all bytes must be 0 */ 
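Note on the `checksum` byte above: the `%.crc` rule in the Makefile sums the first 127 bytes of the block and passes the difference from 0x100 in as `CRC`, so that all 128 bytes sum to 0 modulo 256. The following is a minimal Python sketch of that rule, not part of the patch. The function names are hypothetical, and the sketch assumes a single 128-byte base block with no extension blocks. Unlike the `bc` expression, which yields 0x100 when the partial sum is already a multiple of 256, the sketch reduces the result to a single byte.

```python
# Fixed header pattern, as defined at the "header:" label in edid.S.
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def edid_checksum(block: bytes) -> int:
    """Return the checksum byte for a 128-byte EDID base block.

    Only the first 127 bytes are summed; the result is the value that
    makes the sum of all 128 bytes a multiple of 256.
    """
    if len(block) < 127:
        raise ValueError("need at least the first 127 bytes of the block")
    return (-sum(block[:127])) % 256


def edid_block_is_valid(block: bytes) -> bool:
    """Check length, fixed header and checksum of a base block."""
    return (
        len(block) == 128
        and block[:8] == EDID_HEADER
        and sum(block) % 256 == 0
    )


if __name__ == "__main__":
    # Toy block: the real header followed by zero padding (not a usable EDID).
    body = EDID_HEADER + bytes(119)
    block = body + bytes([edid_checksum(body)])
    print(hex(block[-1]), edid_block_is_valid(block))
```

A check like `edid_block_is_valid()` could be run on a generated `.bin` file before copying it to /lib/firmware. It only verifies the header and checksum, not the timing values.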
diff --git a/Documentation/EDID/hex b/Documentation/EDID/hex deleted file mode 100644 index 8873ebb618af..000000000000 --- a/Documentation/EDID/hex +++ /dev/null @@ -1 +0,0 @@ -"\t" 8/1 "0x%02x, " "\n" diff --git a/tools/edid/1024x768.S b/tools/edid/1024x768.S new file mode 100644 index 000000000000..4aed3f9ab88a --- /dev/null +++ b/tools/edid/1024x768.S @@ -0,0 +1,43 @@ +/* + 1024x768.S: EDID data set for standard 1024x768 60 Hz monitor + + Copyright (C) 2011 Carsten Emde + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
+*/ + +/* EDID */ +#define VERSION 1 +#define REVISION 3 + +/* Display */ +#define CLOCK 65000 /* kHz */ +#define XPIX 1024 +#define YPIX 768 +#define XY_RATIO XY_RATIO_4_3 +#define XBLANK 320 +#define YBLANK 38 +#define XOFFSET 8 +#define XPULSE 144 +#define YOFFSET 3 +#define YPULSE 6 +#define DPI 72 +#define VFREQ 60 /* Hz */ +#define TIMING_NAME "Linux XGA" +#define ESTABLISHED_TIMING2_BITS 0x08 /* Bit 3 -> 1024x768 @60 Hz */ +#define HSYNC_POL 0 +#define VSYNC_POL 0 + +#include "edid.S" diff --git a/tools/edid/1280x1024.S b/tools/edid/1280x1024.S new file mode 100644 index 000000000000..b26dd424cad7 --- /dev/null +++ b/tools/edid/1280x1024.S @@ -0,0 +1,43 @@ +/* + 1280x1024.S: EDID data set for standard 1280x1024 60 Hz monitor + + Copyright (C) 2011 Carsten Emde + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
+*/ + +/* EDID */ +#define VERSION 1 +#define REVISION 3 + +/* Display */ +#define CLOCK 108000 /* kHz */ +#define XPIX 1280 +#define YPIX 1024 +#define XY_RATIO XY_RATIO_5_4 +#define XBLANK 408 +#define YBLANK 42 +#define XOFFSET 48 +#define XPULSE 112 +#define YOFFSET 1 +#define YPULSE 3 +#define DPI 72 +#define VFREQ 60 /* Hz */ +#define TIMING_NAME "Linux SXGA" +/* No ESTABLISHED_TIMINGx_BITS */ +#define HSYNC_POL 1 +#define VSYNC_POL 1 + +#include "edid.S" diff --git a/tools/edid/1600x1200.S b/tools/edid/1600x1200.S new file mode 100644 index 000000000000..0d091b282768 --- /dev/null +++ b/tools/edid/1600x1200.S @@ -0,0 +1,43 @@ +/* + 1600x1200.S: EDID data set for standard 1600x1200 60 Hz monitor + + Copyright (C) 2013 Carsten Emde + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
+*/ + +/* EDID */ +#define VERSION 1 +#define REVISION 3 + +/* Display */ +#define CLOCK 162000 /* kHz */ +#define XPIX 1600 +#define YPIX 1200 +#define XY_RATIO XY_RATIO_4_3 +#define XBLANK 560 +#define YBLANK 50 +#define XOFFSET 64 +#define XPULSE 192 +#define YOFFSET 1 +#define YPULSE 3 +#define DPI 72 +#define VFREQ 60 /* Hz */ +#define TIMING_NAME "Linux UXGA" +/* No ESTABLISHED_TIMINGx_BITS */ +#define HSYNC_POL 1 +#define VSYNC_POL 1 + +#include "edid.S" diff --git a/tools/edid/1680x1050.S b/tools/edid/1680x1050.S new file mode 100644 index 000000000000..7dfed9a33eab --- /dev/null +++ b/tools/edid/1680x1050.S @@ -0,0 +1,43 @@ +/* + 1680x1050.S: EDID data set for standard 1680x1050 60 Hz monitor + + Copyright (C) 2012 Carsten Emde + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
+*/ + +/* EDID */ +#define VERSION 1 +#define REVISION 3 + +/* Display */ +#define CLOCK 146250 /* kHz */ +#define XPIX 1680 +#define YPIX 1050 +#define XY_RATIO XY_RATIO_16_10 +#define XBLANK 560 +#define YBLANK 39 +#define XOFFSET 104 +#define XPULSE 176 +#define YOFFSET 3 +#define YPULSE 6 +#define DPI 96 +#define VFREQ 60 /* Hz */ +#define TIMING_NAME "Linux WSXGA" +/* No ESTABLISHED_TIMINGx_BITS */ +#define HSYNC_POL 1 +#define VSYNC_POL 1 + +#include "edid.S" diff --git a/tools/edid/1920x1080.S b/tools/edid/1920x1080.S new file mode 100644 index 000000000000..d6ffbba28e95 --- /dev/null +++ b/tools/edid/1920x1080.S @@ -0,0 +1,43 @@ +/* + 1920x1080.S: EDID data set for standard 1920x1080 60 Hz monitor + + Copyright (C) 2012 Carsten Emde + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
+*/ + +/* EDID */ +#define VERSION 1 +#define REVISION 3 + +/* Display */ +#define CLOCK 148500 /* kHz */ +#define XPIX 1920 +#define YPIX 1080 +#define XY_RATIO XY_RATIO_16_9 +#define XBLANK 280 +#define YBLANK 45 +#define XOFFSET 88 +#define XPULSE 44 +#define YOFFSET 4 +#define YPULSE 5 +#define DPI 96 +#define VFREQ 60 /* Hz */ +#define TIMING_NAME "Linux FHD" +/* No ESTABLISHED_TIMINGx_BITS */ +#define HSYNC_POL 1 +#define VSYNC_POL 1 + +#include "edid.S" diff --git a/tools/edid/800x600.S b/tools/edid/800x600.S new file mode 100644 index 000000000000..a5616588de08 --- /dev/null +++ b/tools/edid/800x600.S @@ -0,0 +1,40 @@ +/* + 800x600.S: EDID data set for standard 800x600 60 Hz monitor + + Copyright (C) 2011 Carsten Emde + Copyright (C) 2014 Linaro Limited + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+*/ + +/* EDID */ +#define VERSION 1 +#define REVISION 3 + +/* Display */ +#define CLOCK 40000 /* kHz */ +#define XPIX 800 +#define YPIX 600 +#define XY_RATIO XY_RATIO_4_3 +#define XBLANK 256 +#define YBLANK 28 +#define XOFFSET 40 +#define XPULSE 128 +#define YOFFSET 1 +#define YPULSE 4 +#define DPI 72 +#define VFREQ 60 /* Hz */ +#define TIMING_NAME "Linux SVGA" +#define ESTABLISHED_TIMING1_BITS 0x01 /* Bit 0: 800x600 @ 60Hz */ +#define HSYNC_POL 1 +#define VSYNC_POL 1 + +#include "edid.S" diff --git a/tools/edid/Makefile b/tools/edid/Makefile new file mode 100644 index 000000000000..85a927dfab02 --- /dev/null +++ b/tools/edid/Makefile @@ -0,0 +1,37 @@ + +SOURCES := $(wildcard [0-9]*x[0-9]*.S) + +BIN := $(patsubst %.S, %.bin, $(SOURCES)) + +IHEX := $(patsubst %.S, %.bin.ihex, $(SOURCES)) + +CODE := $(patsubst %.S, %.c, $(SOURCES)) + +all: $(BIN) $(IHEX) $(CODE) + +clean: + @rm -f *.o *.bin.ihex *.bin *.c + +%.o: %.S + @cc -c $^ + +%.bin.nocrc: %.o + @objcopy -Obinary $^ $@ + +%.crc: %.bin.nocrc + @list=$$(for i in `seq 1 127`; do head -c$$i $^ | tail -c1 \ + | hexdump -v -e '/1 "%02X+"'; done); \ + echo "ibase=16;100-($${list%?})%100" | bc >$@ + +%.p: %.crc %.S + @cc -c -DCRC="$$(cat $*.crc)" -o $@ $*.S + +%.bin: %.p + @objcopy -Obinary $^ $@ + +%.bin.ihex: %.p + @objcopy -Oihex $^ $@ + @dos2unix $@ 2>/dev/null + +%.c: %.bin + @echo "{" >$@; hexdump -f hex $^ >>$@; echo "};" >>$@ diff --git a/tools/edid/edid.S b/tools/edid/edid.S new file mode 100644 index 000000000000..c3d13815526d --- /dev/null +++ b/tools/edid/edid.S @@ -0,0 +1,274 @@ +/* + edid.S: EDID data template + + Copyright (C) 2012 Carsten Emde + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. 
+ + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. +*/ + + +/* Manufacturer */ +#define MFG_LNX1 'L' +#define MFG_LNX2 'N' +#define MFG_LNX3 'X' +#define SERIAL 0 +#define YEAR 2012 +#define WEEK 5 + +/* EDID 1.3 standard definitions */ +#define XY_RATIO_16_10 0b00 +#define XY_RATIO_4_3 0b01 +#define XY_RATIO_5_4 0b10 +#define XY_RATIO_16_9 0b11 + +/* Provide defaults for the timing bits */ +#ifndef ESTABLISHED_TIMING1_BITS +#define ESTABLISHED_TIMING1_BITS 0x00 +#endif +#ifndef ESTABLISHED_TIMING2_BITS +#define ESTABLISHED_TIMING2_BITS 0x00 +#endif +#ifndef ESTABLISHED_TIMING3_BITS +#define ESTABLISHED_TIMING3_BITS 0x00 +#endif + +#define mfgname2id(v1,v2,v3) \ + ((((v1-'@')&0x1f)<<10)+(((v2-'@')&0x1f)<<5)+((v3-'@')&0x1f)) +#define swap16(v1) ((v1>>8)+((v1&0xff)<<8)) +#define lsbs2(v1,v2) (((v1&0x0f)<<4)+(v2&0x0f)) +#define msbs2(v1,v2) ((((v1>>8)&0x0f)<<4)+((v2>>8)&0x0f)) +#define msbs4(v1,v2,v3,v4) \ + ((((v1>>8)&0x03)<<6)+(((v2>>8)&0x03)<<4)+\ + (((v3>>4)&0x03)<<2)+((v4>>4)&0x03)) +#define pixdpi2mm(pix,dpi) ((pix*25)/dpi) +#define xsize pixdpi2mm(XPIX,DPI) +#define ysize pixdpi2mm(YPIX,DPI) + + .data + +/* Fixed header pattern */ +header: .byte 0x00,0xff,0xff,0xff,0xff,0xff,0xff,0x00 + +mfg_id: .hword swap16(mfgname2id(MFG_LNX1, MFG_LNX2, MFG_LNX3)) + +prod_code: .hword 0 + +/* Serial number. 32 bits, little endian. */ +serial_number: .long SERIAL + +/* Week of manufacture */ +week: .byte WEEK + +/* Year of manufacture, less 1990. 
(1990-2245) + If week=255, it is the model year instead */ +year: .byte YEAR-1990 + +version: .byte VERSION /* EDID version, usually 1 (for 1.3) */ +revision: .byte REVISION /* EDID revision, usually 3 (for 1.3) */ + +/* If Bit 7=1 Digital input. If set, the following bit definitions apply: + Bits 6-1 Reserved, must be 0 + Bit 0 Signal is compatible with VESA DFP 1.x TMDS CRGB, + 1 pixel per clock, up to 8 bits per color, MSB aligned, + If Bit 7=0 Analog input. If clear, the following bit definitions apply: + Bits 6-5 Video white and sync levels, relative to blank + 00=+0.7/-0.3 V; 01=+0.714/-0.286 V; + 10=+1.0/-0.4 V; 11=+0.7/0 V + Bit 4 Blank-to-black setup (pedestal) expected + Bit 3 Separate sync supported + Bit 2 Composite sync (on HSync) supported + Bit 1 Sync on green supported + Bit 0 VSync pulse must be serrated when composite or + sync-on-green is used. */ +video_parms: .byte 0x6d + +/* Maximum horizontal image size, in centimetres + (max 292 cm/115 in at 16:9 aspect ratio) */ +max_hor_size: .byte xsize/10 + +/* Maximum vertical image size, in centimetres. + If either byte is 0, undefined (e.g. projector) */ +max_vert_size: .byte ysize/10 + +/* Display gamma, minus 1, times 100 (range 1.00-3.5) */ +gamma: .byte 120 + +/* Bit 7 DPMS standby supported + Bit 6 DPMS suspend supported + Bit 5 DPMS active-off supported + Bits 4-3 Display type: 00=monochrome; 01=RGB colour; + 10=non-RGB multicolour; 11=undefined + Bit 2 Standard sRGB colour space. Bytes 25-34 must contain + sRGB standard values. + Bit 1 Preferred timing mode specified in descriptor block 1. + Bit 0 GTF supported with default parameter values. */ +dsp_features: .byte 0xea + +/* Chromaticity coordinates.
*/ +/* Red and green least-significant bits + Bits 7-6 Red x value least-significant 2 bits + Bits 5-4 Red y value least-significant 2 bits + Bits 3-2 Green x value least-significant 2 bits + Bits 1-0 Green y value least-significant 2 bits */ +red_green_lsb: .byte 0x5e + +/* Blue and white least-significant 2 bits */ +blue_white_lsb: .byte 0xc0 + +/* Red x value most significant 8 bits. + 0-255 encodes 0-0.996 (255/256); 0-0.999 (1023/1024) with lsbits */ +red_x_msb: .byte 0xa4 + +/* Red y value most significant 8 bits */ +red_y_msb: .byte 0x59 + +/* Green x and y value most significant 8 bits */ +green_x_y_msb: .byte 0x4a,0x98 + +/* Blue x and y value most significant 8 bits */ +blue_x_y_msb: .byte 0x25,0x20 + +/* Default white point x and y value most significant 8 bits */ +white_x_y_msb: .byte 0x50,0x54 + +/* Established timings */ +/* Bit 7 720x400 @ 70 Hz + Bit 6 720x400 @ 88 Hz + Bit 5 640x480 @ 60 Hz + Bit 4 640x480 @ 67 Hz + Bit 3 640x480 @ 72 Hz + Bit 2 640x480 @ 75 Hz + Bit 1 800x600 @ 56 Hz + Bit 0 800x600 @ 60 Hz */ +estbl_timing1: .byte ESTABLISHED_TIMING1_BITS + +/* Bit 7 800x600 @ 72 Hz + Bit 6 800x600 @ 75 Hz + Bit 5 832x624 @ 75 Hz + Bit 4 1024x768 @ 87 Hz, interlaced (1024x768) + Bit 3 1024x768 @ 60 Hz + Bit 2 1024x768 @ 72 Hz + Bit 1 1024x768 @ 75 Hz + Bit 0 1280x1024 @ 75 Hz */ +estbl_timing2: .byte ESTABLISHED_TIMING2_BITS + +/* Bit 7 1152x870 @ 75 Hz (Apple Macintosh II) + Bits 6-0 Other manufacturer-specific display modes */ +estbl_timing3: .byte ESTABLISHED_TIMING3_BITS + +/* Standard timing */ +/* X resolution, divided by 8, less 31 (256-2288 pixels) */ +std_xres: .byte (XPIX/8)-31 +/* Y resolution, X:Y pixel ratio + Bits 7-6 X:Y pixel ratio: 00=16:10; 01=4:3; 10=5:4; 11=16:9. + Bits 5-0 Vertical frequency, less 60 (60-123 Hz) */ +std_vres: .byte (XY_RATIO<<6)+VFREQ-60 + .fill 7,2,0x0101 /* Unused */ + +descriptor1: +/* Pixel clock in 10 kHz units.
(0.-655.35 MHz, little-endian) */ +clock: .hword CLOCK/10 + +/* Horizontal active pixels 8 lsbits (0-4095) */ +x_act_lsb: .byte XPIX&0xff +/* Horizontal blanking pixels 8 lsbits (0-4095) + End of active to start of next active. */ +x_blk_lsb: .byte XBLANK&0xff +/* Bits 7-4 Horizontal active pixels 4 msbits + Bits 3-0 Horizontal blanking pixels 4 msbits */ +x_msbs: .byte msbs2(XPIX,XBLANK) + +/* Vertical active lines 8 lsbits (0-4095) */ +y_act_lsb: .byte YPIX&0xff +/* Vertical blanking lines 8 lsbits (0-4095) */ +y_blk_lsb: .byte YBLANK&0xff +/* Bits 7-4 Vertical active lines 4 msbits + Bits 3-0 Vertical blanking lines 4 msbits */ +y_msbs: .byte msbs2(YPIX,YBLANK) + +/* Horizontal sync offset pixels 8 lsbits (0-1023) From blanking start */ +x_snc_off_lsb: .byte XOFFSET&0xff +/* Horizontal sync pulse width pixels 8 lsbits (0-1023) */ +x_snc_pls_lsb: .byte XPULSE&0xff +/* Bits 7-4 Vertical sync offset lines 4 lsbits (0-63) + Bits 3-0 Vertical sync pulse width lines 4 lsbits (0-63) */ +y_snc_lsb: .byte lsbs2(YOFFSET, YPULSE) +/* Bits 7-6 Horizontal sync offset pixels 2 msbits + Bits 5-4 Horizontal sync pulse width pixels 2 msbits + Bits 3-2 Vertical sync offset lines 2 msbits + Bits 1-0 Vertical sync pulse width lines 2 msbits */ +xy_snc_msbs: .byte msbs4(XOFFSET,XPULSE,YOFFSET,YPULSE) + +/* Horizontal display size, mm, 8 lsbits (0-4095 mm, 161 in) */ +x_dsp_size: .byte xsize&0xff + +/* Vertical display size, mm, 8 lsbits (0-4095 mm, 161 in) */ +y_dsp_size: .byte ysize&0xff + +/* Bits 7-4 Horizontal display size, mm, 4 msbits + Bits 3-0 Vertical display size, mm, 4 msbits */ +dsp_size_mbsb: .byte msbs2(xsize,ysize) + +/* Horizontal border pixels (each side; total is twice this) */ +x_border: .byte 0 +/* Vertical border lines (each side; total is twice this) */ +y_border: .byte 0 + +/* Bit 7 Interlaced + Bits 6-5 Stereo mode: 00=No stereo; other values depend on bit 0: + Bit 0=0: 01=Field sequential, sync=1 during right; 10=similar, + sync=1 during left; 11=4-way 
interleaved stereo + Bit 0=1 2-way interleaved stereo: 01=Right image on even lines; + 10=Left image on even lines; 11=side-by-side + Bits 4-3 Sync type: 00=Analog composite; 01=Bipolar analog composite; + 10=Digital composite (on HSync); 11=Digital separate + Bit 2 If digital separate: Vertical sync polarity (1=positive) + Other types: VSync serrated (HSync during VSync) + Bit 1 If analog sync: Sync on all 3 RGB lines (else green only) + Digital: HSync polarity (1=positive) + Bit 0 2-way line-interleaved stereo, if bits 4-3 are not 00. */ +features: .byte 0x18+(VSYNC_POL<<2)+(HSYNC_POL<<1) + +descriptor2: .byte 0,0 /* Not a detailed timing descriptor */ + .byte 0 /* Must be zero */ + .byte 0xff /* Descriptor is monitor serial number (text) */ + .byte 0 /* Must be zero */ +start1: .ascii "Linux #0" +end1: .byte 0x0a /* End marker */ + .fill 12-(end1-start1), 1, 0x20 /* Padded spaces */ +descriptor3: .byte 0,0 /* Not a detailed timing descriptor */ + .byte 0 /* Must be zero */ + .byte 0xfd /* Descriptor is monitor range limits */ + .byte 0 /* Must be zero */ +start2: .byte VFREQ-1 /* Minimum vertical field rate (1-255 Hz) */ + .byte VFREQ+1 /* Maximum vertical field rate (1-255 Hz) */ + .byte (CLOCK/(XPIX+XBLANK))-1 /* Minimum horizontal line rate + (1-255 kHz) */ + .byte (CLOCK/(XPIX+XBLANK))+1 /* Maximum horizontal line rate + (1-255 kHz) */ + .byte (CLOCK/10000)+1 /* Maximum pixel clock rate, rounded up + to 10 MHz multiple (10-2550 MHz) */ + .byte 0 /* No extended timing information type */ +end2: .byte 0x0a /* End marker */ + .fill 12-(end2-start2), 1, 0x20 /* Padded spaces */ +descriptor4: .byte 0,0 /* Not a detailed timing descriptor */ + .byte 0 /* Must be zero */ + .byte 0xfc /* Descriptor is text */ + .byte 0 /* Must be zero */ +start3: .ascii TIMING_NAME +end3: .byte 0x0a /* End marker */ + .fill 12-(end3-start3), 1, 0x20 /* Padded spaces */ +extensions: .byte 0 /* Number of extensions to follow */ +checksum: .byte CRC /* Sum of all bytes must be 0 */ 
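Note for readers checking the generated blobs by hand: the two derived fields in the template above, the manufacturer ID and the checksum, can be approximated with a short Python sketch. This is an illustration only and is not part of the patch; the function names are hypothetical. It mirrors the `mfgname2id` macro and the `%.crc` Makefile rule, except that it reduces the result modulo 0x100 once more, so that a sum that is already a multiple of 256 yields 0 rather than 0x100.

```python
def mfgname2id(name: str) -> int:
    """Pack three upper-case letters into the 15-bit EDID manufacturer ID.

    Each letter is stored as its distance from '@' ('A' -> 1 ... 'Z' -> 26),
    five bits per letter, first letter in the highest bits.
    """
    if len(name) != 3:
        raise ValueError("manufacturer ID needs exactly three letters")
    v1, v2, v3 = (ord(c) - ord("@") for c in name)
    return ((v1 & 0x1F) << 10) + ((v2 & 0x1F) << 5) + (v3 & 0x1F)


def edid_checksum(first_127: bytes) -> int:
    """Return the final byte that makes all 128 bytes sum to 0 modulo 256."""
    if len(first_127) != 127:
        raise ValueError("expected the first 127 bytes of an EDID block")
    # Extra modulo (an assumption of this sketch): keeps the result in 0..255.
    return (0x100 - sum(first_127) % 0x100) % 0x100


if __name__ == "__main__":
    # Fixed 8-byte header followed by zero padding, purely as sample input.
    sample = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00]) + bytes(119)
    print(hex(mfgname2id("LNX")), edid_checksum(sample))
```

With the template's `LNX` letters the packed ID is 0x31d8; the `swap16()` in `mfg_id` then arranges for the high byte to be emitted first on a little-endian build.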
diff --git a/tools/edid/hex b/tools/edid/hex new file mode 100644 index 000000000000..8873ebb618af --- /dev/null +++ b/tools/edid/hex @@ -0,0 +1 @@ +"\t" 8/1 "0x%02x, " "\n" -- cgit From 43e96ef8b70c50f6054f20b8c357ee5881592082 Mon Sep 17 00:00:00 2001 From: Michael Ellerman Date: Fri, 21 Feb 2020 11:48:43 +1100 Subject: docs/core-api: Add Fedora instructions for GCC plugins Add an example of how to install the necessary packages for GCC plugins on Fedora. Signed-off-by: Michael Ellerman Reviewed-by: Kees Cook Signed-off-by: Jonathan Corbet --- Documentation/core-api/gcc-plugins.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/Documentation/core-api/gcc-plugins.rst b/Documentation/core-api/gcc-plugins.rst index 8502f24396fb..4b1c10f88e30 100644 --- a/Documentation/core-api/gcc-plugins.rst +++ b/Documentation/core-api/gcc-plugins.rst @@ -72,6 +72,10 @@ e.g., on Ubuntu for gcc-4.9:: apt-get install gcc-4.9-plugin-dev +Or on Fedora:: + + dnf install gcc-plugin-devel + Enable a GCC plugin based feature in the kernel config:: CONFIG_GCC_PLUGIN_CYC_COMPLEXITY = y -- cgit From 290d5388993eb40b9d5632aefb864cf1012a2bcc Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Sat, 22 Feb 2020 10:00:01 +0100 Subject: scripts: documentation-file-ref-check: improve :doc: handling There are some issues at the script with regards to :doc: tags: - It doesn't escape files under Documentation/sphinx, leading to false positives; - It doesn't handle root URLs, like :doc:`/x86/boot`; - It doesn't output the file with a bad reference. Address those things, in order to remove false positives from the list of problems. 
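As a rough illustration of the second point, resolving a `:doc:` target to a file might look like the following Python sketch. It is not the script's Perl code, and `resolve_doc_ref` is a hypothetical name; it assumes that root-relative targets live under `Documentation/` and that every target maps to an `.rst` file.

```python
import posixpath


def resolve_doc_ref(source_file: str, doc_ref: str) -> str:
    """Map the text inside :doc:`...` to the .rst path it should point at."""
    target = doc_ref
    # ":doc:`Some title <target>`" carries the real target in angle brackets.
    if "<" in doc_ref and doc_ref.endswith(">"):
        target = doc_ref[doc_ref.rindex("<") + 1:-1]

    if target.startswith("/"):
        # Root-relative reference: anchored at the Documentation/ directory.
        return "Documentation/" + target[1:] + ".rst"

    # Otherwise the target is relative to the directory of the referencing file.
    return posixpath.dirname(source_file) + "/" + target + ".rst"


if __name__ == "__main__":
    print(resolve_doc_ref("Documentation/admin-guide/sysctl/kernel.rst",
                          "/power/video"))
```

A checker built on this would then test whether the returned path exists and, if not, report `source_file` rather than the computed path.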
Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- scripts/documentation-file-ref-check | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/scripts/documentation-file-ref-check b/scripts/documentation-file-ref-check index 7784c54aa38b..997202a18ddb 100755 --- a/scripts/documentation-file-ref-check +++ b/scripts/documentation-file-ref-check @@ -51,7 +51,9 @@ open IN, "git grep ':doc:\`' Documentation/|" or die "Failed to run git grep"; while () { next if (!m,^([^:]+):.*\:doc\:\`([^\`]+)\`,); + next if (m,sphinx/,); + my $file = $1; my $d = $1; my $doc_ref = $2; @@ -60,7 +62,12 @@ while () { $d =~ s,(.*/).*,$1,; $f =~ s,.*\<([^\>]+)\>,$1,; - $f ="$d$f.rst"; + if ($f =~ m,^/,) { + $f = "$f.rst"; + $f =~ s,^/,Documentation/,; + } else { + $f = "$d$f.rst"; + } next if (grep -e, glob("$f")); @@ -69,7 +76,7 @@ while () { } $doc_fix++; - print STDERR "$f: :doc:`$doc_ref`\n"; + print STDERR "$file: :doc:`$doc_ref`\n"; } close IN; -- cgit From a3aead706dac19ca504c31ed5d6b3e141addbaec Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Sat, 22 Feb 2020 10:00:07 +0100 Subject: docs: gpu: i915.rst: fix warnings due to file renames Fix two warnings due to file rename: WARNING: kernel-doc './scripts/kernel-doc -rst -enable-lineno -function csr support for dmc ./drivers/gpu/drm/i915/intel_csr.c' failed with return code 1 WARNING: kernel-doc './scripts/kernel-doc -rst -enable-lineno -internal ./drivers/gpu/drm/i915/intel_csr.c' failed with return code 2 Fixes: 06d3ff6e7451 ("drm/i915: move intel_csr.[ch] under display/") Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/gpu/i915.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst index e539c42a3e78..cc74e24ca3b5 100644 --- a/Documentation/gpu/i915.rst +++ b/Documentation/gpu/i915.rst @@ -207,10 +207,10 @@ DPIO CSR firmware support for DMC ---------------------------- 
-.. kernel-doc:: drivers/gpu/drm/i915/intel_csr.c +.. kernel-doc:: drivers/gpu/drm/i915/display/intel_csr.c :doc: csr support for dmc -.. kernel-doc:: drivers/gpu/drm/i915/intel_csr.c +.. kernel-doc:: drivers/gpu/drm/i915/display/intel_csr.c :internal: Video BIOS Table (VBT) -- cgit From 2bd49cb581ed5a5fbd43811b952fe9552b737408 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Fri, 21 Feb 2020 17:55:02 +0100 Subject: docs: sysctl/kernel: document acpi_video_flags Based on the implementation in arch/x86/kernel/acpi/sleep.c, in particular the acpi_sleep_setup() function. Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 6c0d8c55101c..6586e0e0c11f 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -51,8 +51,15 @@ free space valid for 30 seconds. acpi_video_flags ================ -See Documentation/kernel/power/video.txt, it allows mode of video boot -to be set during run time. +See :doc:`/power/video`. 
This allows the video resume mode to be set, +in a similar fashion to the ``acpi_sleep`` kernel parameter, by +combining the following values: + += ======= +1 s3_bios +2 s3_mode +4 s3_beep += ======= auto_msgmni -- cgit From bf347b9da9bbba14b4af845b00d443f24d17d46d Mon Sep 17 00:00:00 2001 From: Alex Hung Date: Wed, 19 Feb 2020 12:21:33 -0700 Subject: Documentation: fix a typo for intel_iommu=nobounce "untrusted" was mis-spelled as "unstrusted" Signed-off-by: Alex Hung Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/kernel-parameters.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 47cd55e339a5..56bf9b2a9ddf 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1775,7 +1775,7 @@ provided by tboot because it makes the system vulnerable to DMA attacks. nobounce [Default off] - Disable bounce buffer for unstrusted devices such as + Disable bounce buffer for untrusted devices such as the Thunderbolt devices. This will treat the untrusted devices as the trusted ones, hence might expose security risks of DMA attacks. -- cgit From 021622df556b7213cffec1c0713f093fc7d045e3 Mon Sep 17 00:00:00 2001 From: Stephen Kitt Date: Wed, 19 Feb 2020 16:34:42 +0100 Subject: docs: add a script to check sysctl docs This script allows sysctl documentation to be checked against the kernel source code, to identify missing or obsolete entries. Running it against 5.5 shows for example that sysctl/kernel.rst has two obsolete entries and is missing 52 entries. 
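The comparison the script performs can be pictured as two set differences. The Python sketch below is only an illustration with hypothetical names and made-up sample entries; the real script is written in gawk and derives both sets by parsing the documentation and the `ctl_table` definitions.

```python
def compare_sysctl_docs(documented, implemented):
    """Return (missing, obsolete) entry names, each sorted alphabetically.

    missing:  implemented in the source but absent from the documentation
    obsolete: documented but no longer implemented anywhere
    """
    documented = set(documented)
    implemented = set(implemented)
    missing = sorted(implemented - documented)
    obsolete = sorted(documented - implemented)
    return missing, obsolete


if __name__ == "__main__":
    # Placeholder names, not real sysctl entries.
    docs = {"entry_a", "entry_b", "entry_old"}
    code = {"entry_a", "entry_b", "entry_new"}
    missing, obsolete = compare_sysctl_docs(docs, code)
    print("missing:", missing)
    print("obsolete:", obsolete)
```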
Signed-off-by: Stephen Kitt Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/sysctl/kernel.rst | 3 + scripts/check-sysctl-docs | 181 ++++++++++++++++++++++++++++ 2 files changed, 184 insertions(+) create mode 100755 scripts/check-sysctl-docs diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 6586e0e0c11f..1c48ab4bfe30 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -2,6 +2,9 @@ Documentation for /proc/sys/kernel/ =================================== +.. See scripts/check-sysctl-docs to keep this up to date + + Copyright (c) 1998, 1999, Rik van Riel Copyright (c) 2009, Shen Feng diff --git a/scripts/check-sysctl-docs b/scripts/check-sysctl-docs new file mode 100755 index 000000000000..8bcb9e26c7bc --- /dev/null +++ b/scripts/check-sysctl-docs @@ -0,0 +1,181 @@ +#!/usr/bin/gawk -f +# SPDX-License-Identifier: GPL-2.0 + +# Script to check sysctl documentation against source files +# +# Copyright (c) 2020 Stephen Kitt + +# Example invocation: +# scripts/check-sysctl-docs -vtable="kernel" \ +# Documentation/admin-guide/sysctl/kernel.rst \ +# $(git grep -l register_sysctl_) +# +# Specify -vdebug=1 to see debugging information + +BEGIN { + if (!table) { + print "Please specify the table to look for using the table variable" > "/dev/stderr" + exit 1 + } +} + +# The following globals are used: +# children: maps ctl_table names and procnames to child ctl_table names +# documented: maps documented entries (each key is an entry) +# entries: maps ctl_table names and procnames to counts (so +# enumerating the subkeys for a given ctl_table lists its +# procnames) +# files: maps procnames to source file names +# paths: maps ctl_path names to paths +# curpath: the name of the current ctl_path struct +# curtable: the name of the current ctl_table struct +# curentry: the name of the current proc entry (procname when parsing +# a ctl_table, constructed path 
when parsing a ctl_path) + + +# Remove punctuation from the given value +function trimpunct(value) { + while (value ~ /^["&]/) { + value = substr(value, 2) + } + while (value ~ /[]["&,}]$/) { + value = substr(value, 1, length(value) - 1) + } + return value +} + +# Print the information for the given entry +function printentry(entry) { + seen[entry]++ + printf "* %s from %s", entry, file[entry] + if (documented[entry]) { + printf " (documented)" + } + print "" +} + + +# Stage 1: build the list of documented entries +FNR == NR && /^=+$/ { + if (prevline ~ /Documentation for/) { + # This is the main title + next + } + + # The previous line is a section title, parse it + $0 = prevline + if (debug) print "Parsing " $0 + inbrackets = 0 + for (i = 1; i <= NF; i++) { + if (length($i) == 0) { + continue + } + if (!inbrackets && substr($i, 1, 1) == "(") { + inbrackets = 1 + } + if (!inbrackets) { + token = trimpunct($i) + if (length(token) > 0 && token != "and") { + if (debug) print trimpunct($i) + documented[trimpunct($i)]++ + } + } + if (inbrackets && substr($i, length($i), 1) == ")") { + inbrackets = 0 + } + } +} + +FNR == NR { + prevline = $0 + next +} + + +# Stage 2: process each file and find all sysctl tables +BEGINFILE { + delete children + delete entries + delete paths + curpath = "" + curtable = "" + curentry = "" + if (debug) print "Processing file " FILENAME +} + +/^static struct ctl_path/ { + match($0, /static struct ctl_path ([^][]+)/, tables) + curpath = tables[1] + if (debug) print "Processing path " curpath +} + +/^static struct ctl_table/ { + match($0, /static struct ctl_table ([^][]+)/, tables) + curtable = tables[1] + if (debug) print "Processing table " curtable +} + +/^};$/ { + curpath = "" + curtable = "" + curentry = "" +} + +curpath && /\.procname[\t ]*=[\t ]*".+"/ { + match($0, /.procname[\t ]*=[\t ]*"([^"]+)"/, names) + if (curentry) { + curentry = curentry "/" names[1] + } else { + curentry = names[1] + } + if (debug) print "Setting path " curpath 
" to " curentry + paths[curpath] = curentry +} + +curtable && /\.procname[\t ]*=[\t ]*".+"/ { + match($0, /.procname[\t ]*=[\t ]*"([^"]+)"/, names) + curentry = names[1] + if (debug) print "Adding entry " curentry " to table " curtable + entries[curtable][curentry]++ + file[curentry] = FILENAME +} + +/\.child[\t ]*=/ { + child = trimpunct($NF) + if (debug) print "Linking child " child " to table " curtable " entry " curentry + children[curtable][curentry] = child +} + +/register_sysctl_table\(.*\)/ { + match($0, /register_sysctl_table\(([^)]+)\)/, tables) + if (debug) print "Registering table " tables[1] + if (children[tables[1]][table]) { + for (entry in entries[children[tables[1]][table]]) { + printentry(entry) + } + } +} + +/register_sysctl_paths\(.*\)/ { + match($0, /register_sysctl_paths\(([^)]+), ([^)]+)\)/, tables) + if (debug) print "Attaching table " tables[2] " to path " tables[1] + if (paths[tables[1]] == table) { + for (entry in entries[tables[2]]) { + printentry(entry) + } + } + split(paths[tables[1]], components, "/") + if (length(components) > 1 && components[1] == table) { + # Count the first subdirectory as seen + seen[components[2]]++ + } +} + + +END { + for (entry in documented) { + if (!seen[entry]) { + print "No implementation for " entry + } + } +} -- cgit From ef45e78fdc11ac1794940c2ff4a6bf3bc4c45372 Mon Sep 17 00:00:00 2001 From: Manivannan Sadhasivam Date: Thu, 13 Feb 2020 18:23:11 +0530 Subject: docs: kref: Clarify the use of two kref_put() in example code Eventhough the current documentation explains that the reference count gets incremented by both kref_init() and kref_get(), it is often misunderstood that only one instance of kref_put() is needed in the example code. So let's clarify that a bit. 
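To make the counting concrete, here is a toy Python model of the reference count. It is a sketch only: `ToyKref` is a hypothetical stand-in, not the kernel's `struct kref` API, and it assumes that initialisation leaves the count at 1 and that each get adds 1.

```python
class ToyKref:
    """Minimal reference counter that calls `release` when the count hits 0."""

    def __init__(self, release):
        # Models kref_init(): the creator starts out holding one reference.
        self.refcount = 1
        self._release = release

    def get(self):
        # Models kref_get(): a second holder takes its own reference.
        self.refcount += 1

    def put(self):
        # Models kref_put(): drop one reference, release on the last one.
        self.refcount -= 1
        if self.refcount == 0:
            self._release()
            return True
        return False


if __name__ == "__main__":
    freed = []
    ref = ToyKref(lambda: freed.append("data"))
    ref.get()                 # reference handed to a second user
    first = ref.put()         # one holder is done; object must survive
    second = ref.put()        # last holder is done; release runs now
    print(first, second, freed)
```

Dropping only one reference would leave the count at 1 and the release callback would never run, which is the misunderstanding the patch addresses.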
Signed-off-by: Manivannan Sadhasivam Signed-off-by: Jonathan Corbet --- Documentation/kref.txt | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/Documentation/kref.txt b/Documentation/kref.txt index 3af384156d7e..c61eea6f1bf2 100644 --- a/Documentation/kref.txt +++ b/Documentation/kref.txt @@ -128,6 +128,10 @@ since we already have a valid pointer that we own a refcount for. The put needs no lock because nothing tries to get the data without already holding a pointer. +In the above example, kref_put() will be called 2 times in both success +and error paths. This is necessary because the reference count got +incremented 2 times by kref_init() and kref_get(). + Note that the "before" in rule 1 is very important. You should never do something like:: -- cgit From 0a464ea4dc1248e8e640ae0f7eee90b99732eaf0 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Sat, 29 Feb 2020 18:35:14 +0100 Subject: docs: dev-tools: gcov: Remove a stray single-quote MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Jonathan Neuschäfer Link: https://lore.kernel.org/r/20200229173515.13868-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Corbet --- Documentation/dev-tools/gcov.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/dev-tools/gcov.rst b/Documentation/dev-tools/gcov.rst index 46aae52a41d0..7bd013596217 100644 --- a/Documentation/dev-tools/gcov.rst +++ b/Documentation/dev-tools/gcov.rst @@ -203,7 +203,7 @@ Cause may not correctly copy files from sysfs. Solution - Use ``cat``' to read ``.gcda`` files and ``cp -d`` to copy links. + Use ``cat`` to read ``.gcda`` files and ``cp -d`` to copy links. Alternatively use the mechanism shown in Appendix B. 
-- cgit From 7fe068dba8335ff0f9ec608db9589b1fce4663e0 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Sat, 29 Feb 2020 14:27:48 +0100 Subject: docs: admin-guide: kernel-parameters: Document earlycon options for i.MX UARTs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit drivers/tty/serial/imx.c implements these earlycon options. Signed-off-by: Jonathan Neuschäfer Link: https://lore.kernel.org/r/20200229132750.2783-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/kernel-parameters.txt | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 56bf9b2a9ddf..3e3fd0d19e53 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1095,6 +1095,12 @@ A valid base address must be provided, and the serial port must already be setup and configured. + ec_imx21, + ec_imx6q, + Start an early, polled-mode, output-only console on the + Freescale i.MX UART at the specified address. The UART + must already be setup and configured. + ar3700_uart, Start an early, polled-mode console on the Armada 3700 serial port at the specified -- cgit From adf3f38a87bbd8b8b69487988c7e8392d141f558 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Fri, 28 Feb 2020 21:41:45 +0100 Subject: docs: kernel-docs: Remove "Here is its" at the end of lines MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Before commit 9e03ea7f683e ("Documentation/kernel-docs.txt: convert it to ReST markup"), it read: Description: Linux Journal Kernel Korner article. Here is its abstract: "..." In Sphinx' HTML formatting, however, the "Here is its" doesn't make sense anymore, because the "Abstract:" is clearly separated. 
Signed-off-by: Jonathan Neuschäfer Link: https://lore.kernel.org/r/20200228204147.8622-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Corbet --- Documentation/process/kernel-docs.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Documentation/process/kernel-docs.rst b/Documentation/process/kernel-docs.rst index 7a45a8e36ea7..9d6d0ac4fca9 100644 --- a/Documentation/process/kernel-docs.rst +++ b/Documentation/process/kernel-docs.rst @@ -313,7 +313,7 @@ On-line docs :URL: http://www.linuxjournal.com/article.php?sid=2391 :Date: 1997 :Keywords: RAID, MD driver. - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *A description of the implementation of the RAID-1, RAID-4 and RAID-5 personalities of the MD device driver in the Linux kernel, providing users with high performance and reliable, @@ -338,7 +338,7 @@ On-line docs :Date: 1996 :Keywords: device driver, module, loading/unloading modules, allocating resources. - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *This is the first of a series of four articles co-authored by Alessandro Rubini and Georg Zezchwitz which present a practical approach to writing Linux device drivers as kernel @@ -354,7 +354,7 @@ On-line docs :Keywords: character driver, init_module, clean_up module, autodetection, mayor number, minor number, file operations, open(), close(). - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *This article, the second of four, introduces part of the actual code to create custom module implementing a character device driver. It describes the code for module initialization and @@ -367,7 +367,7 @@ On-line docs :Date: 1996 :Keywords: read(), write(), select(), ioctl(), blocking/non blocking mode, interrupt handler. 
- :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *This article, the third of four on writing character device drivers, introduces concepts of reading, writing, and using ioctl-calls*. @@ -378,7 +378,7 @@ On-line docs :URL: http://www.linuxjournal.com/article.php?sid=1222 :Date: 1996 :Keywords: interrupts, irqs, DMA, bottom halves, task queues. - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *This is the fourth in a series of articles about writing character device drivers as loadable kernel modules. This month, we further investigate the field of interrupt handling. -- cgit From d0c3bacb3e37488b00feb307c0ce43105a6fd23e Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 27 Feb 2020 16:06:49 -0800 Subject: doc: cgroup: improve formatting Fix a tabs vs spaces issue which causes the line to be considered a new list entry. Signed-off-by: Jakub Kicinski Acked-by: Johannes Weiner Link: https://lore.kernel.org/r/20200228000653.1572553-2-kuba@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/cgroup-v2.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 3f801461f0f3..723c8bd422cc 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1103,7 +1103,7 @@ PAGE_SIZE multiple when read back. proportionally to the overage, reducing reclaim pressure for smaller overages. - Effective min boundary is limited by memory.min values of + Effective min boundary is limited by memory.min values of all ancestor cgroups.
If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get -- cgit From 2551cab59927e3b50f45f3f04f7ce0c9708eb5fb Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 27 Feb 2020 16:06:50 -0800 Subject: doc: cgroup: improve formatting of mem stats If there is an empty line between item and description Sphinx does not emphasize the item. First half of the list does not have the empty line and is emphasized correctly. Signed-off-by: Jakub Kicinski Acked-by: Johannes Weiner Link: https://lore.kernel.org/r/20200228000653.1572553-3-kuba@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/cgroup-v2.rst | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 723c8bd422cc..ab8b91014afb 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1313,53 +1313,41 @@ PAGE_SIZE multiple when read back. Number of major page faults incurred workingset_refault - Number of refaults of previously evicted pages workingset_activate - Number of refaulted pages that were immediately activated workingset_nodereclaim - Number of times a shadow node has been reclaimed pgrefill - Amount of scanned pages (in an active LRU list) pgscan - Amount of scanned pages (in an inactive LRU list) pgsteal - Amount of reclaimed pages pgactivate - Amount of pages moved to the active LRU list pgdeactivate - Amount of pages moved to the inactive LRU list pglazyfree - Amount of pages postponed to be freed under memory pressure pglazyfreed - Amount of reclaimed lazyfree pages thp_fault_alloc - Number of transparent hugepages which were allocated to satisfy a page fault, including COW faults. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set. 
thp_collapse_alloc - Number of transparent hugepages which were allocated to allow collapsing an existing range of pages. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set. -- cgit From 69654d37cfa66bd0483f172d174180daf4527fea Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 27 Feb 2020 16:06:51 -0800 Subject: doc: cgroup: improve formatting of io example We need a literal section, like few paragraphs below. Signed-off-by: Jakub Kicinski Acked-by: Johannes Weiner Link: https://lore.kernel.org/r/20200228000653.1572553-4-kuba@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/cgroup-v2.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index ab8b91014afb..9d16fbc5df63 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1466,7 +1466,7 @@ IO Interface Files dios Number of discard IOs ====== ===================== - An example read output follows: + An example read output follows:: 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 -- cgit From f3431ba715b5e7ecf6ae9634c0aa84305339e286 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 27 Feb 2020 16:06:52 -0800 Subject: doc: cgroup: improve formatting of cpuset examples We need literal sections otherwise the entire example is rendered as a single line. 
Signed-off-by: Jakub Kicinski Acked-by: Johannes Weiner Link: https://lore.kernel.org/r/20200228000653.1572553-5-kuba@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/cgroup-v2.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 9d16fbc5df63..308d096af071 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1841,7 +1841,7 @@ Cpuset Interface Files from the requested CPUs. The CPU numbers are comma-separated numbers or ranges. - For example: + For example:: # cat cpuset.cpus 0-4,6,8-10 @@ -1880,7 +1880,7 @@ Cpuset Interface Files from the requested memory nodes. The memory node numbers are comma-separated numbers or ranges. - For example: + For example:: # cat cpuset.mems 0-1,3 -- cgit From 373e8ffafd665ad114b96c547decce54b9621af4 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Thu, 27 Feb 2020 16:06:53 -0800 Subject: doc: cgroup: improve formatting of references Annotate references to other documents to make them clickable. Signed-off-by: Jakub Kicinski Acked-by: Johannes Weiner Link: https://lore.kernel.org/r/20200228000653.1572553-6-kuba@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/accounting/psi.rst | 2 ++ Documentation/admin-guide/cgroup-v1/index.rst | 2 ++ Documentation/admin-guide/cgroup-v2.rst | 8 ++++---- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst index 621111ce5740..f2b3439edcc2 100644 --- a/Documentation/accounting/psi.rst +++ b/Documentation/accounting/psi.rst @@ -1,3 +1,5 @@ +.. 
_psi: + ================================ PSI - Pressure Stall Information ================================ diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst index 10bf48bae0b0..226f64473e8e 100644 --- a/Documentation/admin-guide/cgroup-v1/index.rst +++ b/Documentation/admin-guide/cgroup-v1/index.rst @@ -1,3 +1,5 @@ +.. _cgroup-v1: + ======================== Control Groups version 1 ======================== diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 308d096af071..fbb111616705 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and conventions of cgroup v2. It describes all userland-visible aspects of cgroup including core and specific controller behaviors. All future changes must be reflected in this document. Documentation for -v1 is available under Documentation/admin-guide/cgroup-v1/. +v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`. .. CONTENTS @@ -1023,7 +1023,7 @@ All time durations are in microseconds. A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for CPU. See - Documentation/accounting/psi.rst for details. + :ref:`Documentation/accounting/psi.rst <psi>` for details. cpu.uclamp.min A read-write single value file which exists on non-root cgroups. @@ -1391,7 +1391,7 @@ PAGE_SIZE multiple when read back. A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for memory. See - Documentation/accounting/psi.rst for details. + :ref:`Documentation/accounting/psi.rst <psi>` for details. Usage Guidelines @@ -1631,7 +1631,7 @@ IO Interface Files A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for IO. See - Documentation/accounting/psi.rst for details.
+ :ref:`Documentation/accounting/psi.rst <psi>` for details. Writeback -- cgit From 669a5cc8c5d997147a0551c809d0e5f795867341 Mon Sep 17 00:00:00 2001 From: Sameer Rahmani Date: Tue, 25 Feb 2020 22:21:24 +0000 Subject: Documentation: Converted the `kobject.txt` to rst format Reviewed and converted the `kobject.txt` format to rst in place. Signed-off-by: Sameer Rahmani Link: https://lore.kernel.org/r/20200225222125.61874-1-lxsameer@gnu.org Signed-off-by: Jonathan Corbet --- Documentation/kobject.txt | 78 +++++++++++++++++++++++------------------------ 1 file changed, 39 insertions(+), 39 deletions(-) diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt index ff4c25098119..1f62d4d7d966 100644 --- a/Documentation/kobject.txt +++ b/Documentation/kobject.txt @@ -25,7 +25,7 @@ some terms we will be working with. usually embedded within some other structure which contains the stuff the code is really interested in. - No structure should EVER have more than one kobject embedded within it. + No structure should **EVER** have more than one kobject embedded within it. If it does, the reference counting for the object is sure to be messed up and incorrect, and your code will be buggy. So do not do this. @@ -55,7 +55,7 @@ a larger, domain-specific object. To this end, kobjects will be found embedded in other structures. If you are used to thinking of things in object-oriented terms, kobjects can be seen as a top-level, abstract class from which other classes are derived. A kobject implements a set of -capabilities which are not particularly useful by themselves, but which are +capabilities which are not particularly useful by themselves, but are nice to have in other objects. The C language does not allow for the direct expression of inheritance, so other techniques - such as structure embedding - must be used.
@@ -65,12 +65,12 @@ this is analogous as to how "list_head" structs are rarely useful on their own, but are invariably found embedded in the larger objects of interest.) -So, for example, the UIO code in drivers/uio/uio.c has a structure that +So, for example, the UIO code in ``drivers/uio/uio.c`` has a structure that defines the memory region associated with a uio device:: struct uio_map { - struct kobject kobj; - struct uio_mem *mem; + struct kobject kobj; + struct uio_mem *mem; }; If you have a struct uio_map structure, finding its embedded kobject is @@ -78,30 +78,30 @@ just a matter of using the kobj member. Code that works with kobjects will often have the opposite problem, however: given a struct kobject pointer, what is the pointer to the containing structure? You must avoid tricks (such as assuming that the kobject is at the beginning of the structure) -and, instead, use the container_of() macro, found in :: +and, instead, use the container_of() macro, found in ````:: container_of(pointer, type, member) where: - * "pointer" is the pointer to the embedded kobject, - * "type" is the type of the containing structure, and - * "member" is the name of the structure field to which "pointer" points. + * ``pointer`` is the pointer to the embedded kobject, + * ``type`` is the type of the containing structure, and + * ``member`` is the name of the structure field to which ``pointer`` points. The return value from container_of() is a pointer to the corresponding -container type. So, for example, a pointer "kp" to a struct kobject -embedded *within* a struct uio_map could be converted to a pointer to the -*containing* uio_map structure with:: +container type. 
So, for example, a pointer ``kp`` to a struct kobject +embedded **within** a struct uio_map could be converted to a pointer to the +**containing** uio_map structure with:: struct uio_map *u_map = container_of(kp, struct uio_map, kobj); -For convenience, programmers often define a simple macro for "back-casting" +For convenience, programmers often define a simple macro for **back-casting** kobject pointers to the containing type. Exactly this happens in the -earlier drivers/uio/uio.c, as you can see here:: +earlier ``drivers/uio/uio.c``, as you can see here:: struct uio_map { - struct kobject kobj; - struct uio_mem *mem; + struct kobject kobj; + struct uio_mem *mem; }; #define to_map(map) container_of(map, struct uio_map, kobj) @@ -125,7 +125,7 @@ must have an associated kobj_type. After calling kobject_init(), to register the kobject with sysfs, the function kobject_add() must be called:: int kobject_add(struct kobject *kobj, struct kobject *parent, - const char *fmt, ...); + const char *fmt, ...); This sets up the parent of the kobject and the name for the kobject properly. If the kobject is to be associated with a specific kset, @@ -172,13 +172,13 @@ call to kobject_uevent():: int kobject_uevent(struct kobject *kobj, enum kobject_action action); -Use the KOBJ_ADD action for when the kobject is first added to the kernel. +Use the **KOBJ_ADD** action for when the kobject is first added to the kernel. This should be done only after any attributes or children of the kobject have been initialized properly, as userspace will instantly start to look for them when this call happens. When the kobject is removed from the kernel (details on how to do that are -below), the uevent for KOBJ_REMOVE will be automatically created by the +below), the uevent for **KOBJ_REMOVE** will be automatically created by the kobject core, so the caller does not have to worry about doing that by hand. 
@@ -238,7 +238,7 @@ Both types of attributes used here, with a kobject that has been created with the kobject_create_and_add(), can be of type kobj_attribute, so no special custom attribute is needed to be created. -See the example module, samples/kobject/kobject-example.c for an +See the example module, ``samples/kobject/kobject-example.c`` for an implementation of a simple kobject and attributes. @@ -270,10 +270,10 @@ such a method has a form like:: void my_object_release(struct kobject *kobj) { - struct my_object *mine = container_of(kobj, struct my_object, kobj); + struct my_object *mine = container_of(kobj, struct my_object, kobj); - /* Perform any additional cleanup on this object, then... */ - kfree(mine); + /* Perform any additional cleanup on this object, then... */ + kfree(mine); } One important point cannot be overstated: every kobject must have a @@ -297,11 +297,11 @@ instead, it is associated with the ktype. So let us introduce struct kobj_type:: struct kobj_type { - void (*release)(struct kobject *kobj); - const struct sysfs_ops *sysfs_ops; - struct attribute **default_attrs; - const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); - const void *(*namespace)(struct kobject *kobj); + void (*release)(struct kobject *kobj); + const struct sysfs_ops *sysfs_ops; + struct attribute **default_attrs; + const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); + const void *(*namespace)(struct kobject *kobj); }; This structure is used to describe a particular type of kobject (or, more @@ -352,8 +352,8 @@ created and never declared statically or on the stack. To create a new kset use:: struct kset *kset_create_and_add(const char *name, - struct kset_uevent_ops *u, - struct kobject *parent); + struct kset_uevent_ops *u, + struct kobject *parent); When you are finished with the kset, call:: @@ -365,16 +365,16 @@ Because other references to the kset may still exist, the release may happen after kset_unregister() returns. 
An example of using a kset can be seen in the -samples/kobject/kset-example.c file in the kernel tree. +``samples/kobject/kset-example.c`` file in the kernel tree. If a kset wishes to control the uevent operations of the kobjects associated with it, it can use the struct kset_uevent_ops to handle it:: struct kset_uevent_ops { - int (*filter)(struct kset *kset, struct kobject *kobj); - const char *(*name)(struct kset *kset, struct kobject *kobj); - int (*uevent)(struct kset *kset, struct kobject *kobj, - struct kobj_uevent_env *env); + int (*filter)(struct kset *kset, struct kobject *kobj); + const char *(*name)(struct kset *kset, struct kobject *kobj); + int (*uevent)(struct kset *kset, struct kobject *kobj, + struct kobj_uevent_env *env); }; @@ -408,8 +408,8 @@ Kobject removal After a kobject has been registered with the kobject core successfully, it must be cleaned up when the code is finished with it. To do that, call kobject_put(). By doing this, the kobject core will automatically clean up -all of the memory allocated by this kobject. If a KOBJ_ADD uevent has been -sent for the object, a corresponding KOBJ_REMOVE uevent will be sent, and +all of the memory allocated by this kobject. If a ``KOBJ_ADD`` uevent has been +sent for the object, a corresponding ``KOBJ_REMOVE`` uevent will be sent, and any other sysfs housekeeping will be handled for the caller properly. If you need to do a two-stage delete of the kobject (say you are not @@ -430,5 +430,5 @@ Example code to copy from ========================= For a more complete example of using ksets and kobjects properly, see the -example programs samples/kobject/{kobject-example.c,kset-example.c}, -which will be built as loadable modules if you select CONFIG_SAMPLE_KOBJECT. +example programs ``samples/kobject/{kobject-example.c,kset-example.c}``, +which will be built as loadable modules if you select ``CONFIG_SAMPLE_KOBJECT``. 
-- cgit From 5fed00dcaca8bbd428742a6db1980753290eb204 Mon Sep 17 00:00:00 2001 From: Sameer Rahmani Date: Tue, 25 Feb 2020 22:21:25 +0000 Subject: Documentation: kobject.txt has been moved to core-api/kobject.rst Moved the `kobject.txt` to `core-api/kobject.rst` and updated the `core-api` index to point to it. Signed-off-by: Sameer Rahmani [jc: moved it down from the top of core-api/index.rst] Link: https://lore.kernel.org/r/20200225222125.61874-2-lxsameer@gnu.org Signed-off-by: Jonathan Corbet --- Documentation/core-api/index.rst | 1 + Documentation/core-api/kobject.rst | 434 +++++++++++++++++++++++++++++++++++++ Documentation/kobject.txt | 434 ------------------------------------- 3 files changed, 435 insertions(+), 434 deletions(-) create mode 100644 Documentation/core-api/kobject.rst delete mode 100644 Documentation/kobject.txt diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index a501dc1c90d0..d02b26917931 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -12,6 +12,7 @@ Core utilities :maxdepth: 1 kernel-api + kobject assoc_array atomic_ops cachetlb diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst new file mode 100644 index 000000000000..1f62d4d7d966 --- /dev/null +++ b/Documentation/core-api/kobject.rst @@ -0,0 +1,434 @@ +===================================================================== +Everything you never wanted to know about kobjects, ksets, and ktypes +===================================================================== + +:Author: Greg Kroah-Hartman +:Last updated: December 19, 2007 + +Based on an original article by Jon Corbet for lwn.net written October 1, +2003 and located at http://lwn.net/Articles/51437/ + +Part of the difficulty in understanding the driver model - and the kobject +abstraction upon which it is built - is that there is no obvious starting +place. 
Dealing with kobjects requires understanding a few different types, +all of which make reference to each other. In an attempt to make things +easier, we'll take a multi-pass approach, starting with vague terms and +adding detail as we go. To that end, here are some quick definitions of +some terms we will be working with. + + - A kobject is an object of type struct kobject. Kobjects have a name + and a reference count. A kobject also has a parent pointer (allowing + objects to be arranged into hierarchies), a specific type, and, + usually, a representation in the sysfs virtual filesystem. + + Kobjects are generally not interesting on their own; instead, they are + usually embedded within some other structure which contains the stuff + the code is really interested in. + + No structure should **EVER** have more than one kobject embedded within it. + If it does, the reference counting for the object is sure to be messed + up and incorrect, and your code will be buggy. So do not do this. + + - A ktype is the type of object that embeds a kobject. Every structure + that embeds a kobject needs a corresponding ktype. The ktype controls + what happens to the kobject when it is created and destroyed. + + - A kset is a group of kobjects. These kobjects can be of the same ktype + or belong to different ktypes. The kset is the basic container type for + collections of kobjects. Ksets contain their own kobjects, but you can + safely ignore that implementation detail as the kset core code handles + this kobject automatically. + + When you see a sysfs directory full of other directories, generally each + of those directories corresponds to a kobject in the same kset. + +We'll look at how to create and manipulate all of these types. A bottom-up +approach will be taken, so we'll go back to kobjects. + + +Embedding kobjects +================== + +It is rare for kernel code to create a standalone kobject, with one major +exception explained below. 
Instead, kobjects are used to control access to +a larger, domain-specific object. To this end, kobjects will be found +embedded in other structures. If you are used to thinking of things in +object-oriented terms, kobjects can be seen as a top-level, abstract class +from which other classes are derived. A kobject implements a set of +capabilities which are not particularly useful by themselves, but are +nice to have in other objects. The C language does not allow for the +direct expression of inheritance, so other techniques - such as structure +embedding - must be used. + +(As an aside, for those familiar with the kernel linked list implementation, +this is analogous as to how "list_head" structs are rarely useful on +their own, but are invariably found embedded in the larger objects of +interest.) + +So, for example, the UIO code in ``drivers/uio/uio.c`` has a structure that +defines the memory region associated with a uio device:: + + struct uio_map { + struct kobject kobj; + struct uio_mem *mem; + }; + +If you have a struct uio_map structure, finding its embedded kobject is +just a matter of using the kobj member. Code that works with kobjects will +often have the opposite problem, however: given a struct kobject pointer, +what is the pointer to the containing structure? You must avoid tricks +(such as assuming that the kobject is at the beginning of the structure) +and, instead, use the container_of() macro, found in ````:: + + container_of(pointer, type, member) + +where: + + * ``pointer`` is the pointer to the embedded kobject, + * ``type`` is the type of the containing structure, and + * ``member`` is the name of the structure field to which ``pointer`` points. + +The return value from container_of() is a pointer to the corresponding +container type. 
So, for example, a pointer ``kp`` to a struct kobject +embedded **within** a struct uio_map could be converted to a pointer to the +**containing** uio_map structure with:: + + struct uio_map *u_map = container_of(kp, struct uio_map, kobj); + +For convenience, programmers often define a simple macro for **back-casting** +kobject pointers to the containing type. Exactly this happens in the +earlier ``drivers/uio/uio.c``, as you can see here:: + + struct uio_map { + struct kobject kobj; + struct uio_mem *mem; + }; + + #define to_map(map) container_of(map, struct uio_map, kobj) + +where the macro argument "map" is a pointer to the struct kobject in +question. That macro is subsequently invoked with:: + + struct uio_map *map = to_map(kobj); + + +Initialization of kobjects +========================== + +Code which creates a kobject must, of course, initialize that object. Some +of the internal fields are setup with a (mandatory) call to kobject_init():: + + void kobject_init(struct kobject *kobj, struct kobj_type *ktype); + +The ktype is required for a kobject to be created properly, as every kobject +must have an associated kobj_type. After calling kobject_init(), to +register the kobject with sysfs, the function kobject_add() must be called:: + + int kobject_add(struct kobject *kobj, struct kobject *parent, + const char *fmt, ...); + +This sets up the parent of the kobject and the name for the kobject +properly. If the kobject is to be associated with a specific kset, +kobj->kset must be assigned before calling kobject_add(). If a kset is +associated with a kobject, then the parent for the kobject can be set to +NULL in the call to kobject_add() and then the kobject's parent will be the +kset itself. + +As the name of the kobject is set when it is added to the kernel, the name +of the kobject should never be manipulated directly. 
If you must change +the name of the kobject, call kobject_rename():: + + int kobject_rename(struct kobject *kobj, const char *new_name); + +kobject_rename does not perform any locking or have a solid notion of +what names are valid so the caller must provide their own sanity checking +and serialization. + +There is a function called kobject_set_name() but that is legacy cruft and +is being removed. If your code needs to call this function, it is +incorrect and needs to be fixed. + +To properly access the name of the kobject, use the function +kobject_name():: + + const char *kobject_name(const struct kobject * kobj); + +There is a helper function to both initialize and add the kobject to the +kernel at the same time, called surprisingly enough kobject_init_and_add():: + + int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype, + struct kobject *parent, const char *fmt, ...); + +The arguments are the same as the individual kobject_init() and +kobject_add() functions described above. + + +Uevents +======= + +After a kobject has been registered with the kobject core, you need to +announce to the world that it has been created. This can be done with a +call to kobject_uevent():: + + int kobject_uevent(struct kobject *kobj, enum kobject_action action); + +Use the **KOBJ_ADD** action for when the kobject is first added to the kernel. +This should be done only after any attributes or children of the kobject +have been initialized properly, as userspace will instantly start to look +for them when this call happens. + +When the kobject is removed from the kernel (details on how to do that are +below), the uevent for **KOBJ_REMOVE** will be automatically created by the +kobject core, so the caller does not have to worry about doing that by +hand. + + +Reference counts +================ + +One of the key functions of a kobject is to serve as a reference counter +for the object in which it is embedded. 
As long as references to the object +exist, the object (and the code which supports it) must continue to exist. +The low-level functions for manipulating a kobject's reference counts are:: + + struct kobject *kobject_get(struct kobject *kobj); + void kobject_put(struct kobject *kobj); + +A successful call to kobject_get() will increment the kobject's reference +counter and return the pointer to the kobject. + +When a reference is released, the call to kobject_put() will decrement the +reference count and, possibly, free the object. Note that kobject_init() +sets the reference count to one, so the code which sets up the kobject will +need to do a kobject_put() eventually to release that reference. + +Because kobjects are dynamic, they must not be declared statically or on +the stack, but instead, always allocated dynamically. Future versions of +the kernel will contain a run-time check for kobjects that are created +statically and will warn the developer of this improper usage. + +If all that you want to use a kobject for is to provide a reference counter +for your structure, please use the struct kref instead; a kobject would be +overkill. For more information on how to use struct kref, please see the +file Documentation/kref.txt in the Linux kernel source tree. + + +Creating "simple" kobjects +========================== + +Sometimes all that a developer wants is a way to create a simple directory +in the sysfs hierarchy, and not have to mess with the whole complication of +ksets, show and store functions, and other details. This is the one +exception where a single kobject should be created. To create such an +entry, use the function:: + + struct kobject *kobject_create_and_add(char *name, struct kobject *parent); + +This function will create a kobject and place it in sysfs in the location +underneath the specified parent kobject. 
To create simple attributes +associated with this kobject, use:: + + int sysfs_create_file(struct kobject *kobj, struct attribute *attr); + +or:: + + int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp); + +Both types of attributes used here, with a kobject that has been created +with the kobject_create_and_add(), can be of type kobj_attribute, so no +special custom attribute is needed to be created. + +See the example module, ``samples/kobject/kobject-example.c`` for an +implementation of a simple kobject and attributes. + + + +ktypes and release methods +========================== + +One important thing still missing from the discussion is what happens to a +kobject when its reference count reaches zero. The code which created the +kobject generally does not know when that will happen; if it did, there +would be little point in using a kobject in the first place. Even +predictable object lifecycles become more complicated when sysfs is brought +in as other portions of the kernel can get a reference on any kobject that +is registered in the system. + +The end result is that a structure protected by a kobject cannot be freed +before its reference count goes to zero. The reference count is not under +the direct control of the code which created the kobject. So that code must +be notified asynchronously whenever the last reference to one of its +kobjects goes away. + +Once you registered your kobject via kobject_add(), you must never use +kfree() to free it directly. The only safe way is to use kobject_put(). It +is good practice to always use kobject_put() after kobject_init() to avoid +errors creeping in. + +This notification is done through a kobject's release() method. Usually +such a method has a form like:: + + void my_object_release(struct kobject *kobj) + { + struct my_object *mine = container_of(kobj, struct my_object, kobj); + + /* Perform any additional cleanup on this object, then... 
*/ + kfree(mine); + } + +One important point cannot be overstated: every kobject must have a +release() method, and the kobject must persist (in a consistent state) +until that method is called. If these constraints are not met, the code is +flawed. Note that the kernel will warn you if you forget to provide a +release() method. Do not try to get rid of this warning by providing an +"empty" release function. + +If all your cleanup function needs to do is call kfree(), then you must +create a wrapper function which uses container_of() to upcast to the correct +type (as shown in the example above) and then calls kfree() on the overall +structure. + +Note, the name of the kobject is available in the release function, but it +must NOT be changed within this callback. Otherwise there will be a memory +leak in the kobject core, which makes people unhappy. + +Interestingly, the release() method is not stored in the kobject itself; +instead, it is associated with the ktype. So let us introduce struct +kobj_type:: + + struct kobj_type { + void (*release)(struct kobject *kobj); + const struct sysfs_ops *sysfs_ops; + struct attribute **default_attrs; + const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); + const void *(*namespace)(struct kobject *kobj); + }; + +This structure is used to describe a particular type of kobject (or, more +correctly, of containing object). Every kobject needs to have an associated +kobj_type structure; a pointer to that structure must be specified when you +call kobject_init() or kobject_init_and_add(). + +The release field in struct kobj_type is, of course, a pointer to the +release() method for this type of kobject. The other two fields (sysfs_ops +and default_attrs) control how objects of this type are represented in +sysfs; they are beyond the scope of this document. + +The default_attrs pointer is a list of default attributes that will be +automatically created for any kobject that is registered with this ktype. 
+ + +ksets +===== + +A kset is merely a collection of kobjects that want to be associated with +each other. There is no restriction that they be of the same ktype, but be +very careful if they are not. + +A kset serves these functions: + + - It serves as a bag containing a group of objects. A kset can be used by + the kernel to track "all block devices" or "all PCI device drivers." + + - A kset is also a subdirectory in sysfs, where the associated kobjects + with the kset can show up. Every kset contains a kobject which can be + set up to be the parent of other kobjects; the top-level directories of + the sysfs hierarchy are constructed in this way. + + - Ksets can support the "hotplugging" of kobjects and influence how + uevent events are reported to user space. + +In object-oriented terms, "kset" is the top-level container class; ksets +contain their own kobject, but that kobject is managed by the kset code and +should not be manipulated by any other user. + +A kset keeps its children in a standard kernel linked list. Kobjects point +back to their containing kset via their kset field. In almost all cases, +the kobjects belonging to a kset have that kset (or, strictly, its embedded +kobject) in their parent. + +As a kset contains a kobject within it, it should always be dynamically +created and never declared statically or on the stack. To create a new +kset use:: + + struct kset *kset_create_and_add(const char *name, + struct kset_uevent_ops *u, + struct kobject *parent); + +When you are finished with the kset, call:: + + void kset_unregister(struct kset *kset); + +to destroy it. This removes the kset from sysfs and decrements its reference +count. When the reference count goes to zero, the kset will be released. +Because other references to the kset may still exist, the release may happen +after kset_unregister() returns. + +An example of using a kset can be seen in the +``samples/kobject/kset-example.c`` file in the kernel tree. 
+ +If a kset wishes to control the uevent operations of the kobjects +associated with it, it can use the struct kset_uevent_ops to handle it:: + + struct kset_uevent_ops { + int (*filter)(struct kset *kset, struct kobject *kobj); + const char *(*name)(struct kset *kset, struct kobject *kobj); + int (*uevent)(struct kset *kset, struct kobject *kobj, + struct kobj_uevent_env *env); + }; + + +The filter function allows a kset to prevent a uevent from being emitted to +userspace for a specific kobject. If the function returns 0, the uevent +will not be emitted. + +The name function will be called to override the default name of the kset +that the uevent sends to userspace. By default, the name will be the same +as the kset itself, but this function, if present, can override that name. + +The uevent function will be called when the uevent is about to be sent to +userspace to allow more environment variables to be added to the uevent. + +One might ask how, exactly, a kobject is added to a kset, given that no +functions which perform that function have been presented. The answer is +that this task is handled by kobject_add(). When a kobject is passed to +kobject_add(), its kset member should point to the kset to which the +kobject will belong. kobject_add() will handle the rest. + +If the kobject belonging to a kset has no parent kobject set, it will be +added to the kset's directory. Not all members of a kset do necessarily +live in the kset directory. If an explicit parent kobject is assigned +before the kobject is added, the kobject is registered with the kset, but +added below the parent kobject. + + +Kobject removal +=============== + +After a kobject has been registered with the kobject core successfully, it +must be cleaned up when the code is finished with it. To do that, call +kobject_put(). By doing this, the kobject core will automatically clean up +all of the memory allocated by this kobject. 
If a ``KOBJ_ADD`` uevent has been +sent for the object, a corresponding ``KOBJ_REMOVE`` uevent will be sent, and +any other sysfs housekeeping will be handled for the caller properly. + +If you need to do a two-stage delete of the kobject (say you are not +allowed to sleep when you need to destroy the object), then call +kobject_del() which will unregister the kobject from sysfs. This makes the +kobject "invisible", but it is not cleaned up, and the reference count of +the object is still the same. At a later time, call kobject_put() to finish +the cleanup of the memory associated with the kobject. + +kobject_del() can be used to drop the reference to the parent object, if +circular references are constructed. It is valid in some cases for a +parent object to reference a child. Circular references _must_ be broken +with an explicit call to kobject_del(), so that the release functions will be +called, and the objects in the former circle release each other. + + +Example code to copy from +========================= + +For a more complete example of using ksets and kobjects properly, see the +example programs ``samples/kobject/{kobject-example.c,kset-example.c}``, +which will be built as loadable modules if you select ``CONFIG_SAMPLE_KOBJECT``.
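The two-stage delete and the circular-reference case described in the "Kobject removal" section can be modelled in userspace C. This is a rough sketch under stated assumptions, not kernel code: ``struct node`` and the ``node_*()`` helpers are hypothetical names invented for the example, a ``released`` flag stands in for freeing memory so the outcome can be inspected, and a ``held`` pointer models a parent that keeps its own reference on a child.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct kobject. */
struct node {
	int refcount;
	int visible;		/* stands in for "registered in sysfs" */
	int released;		/* set where real code would free memory */
	struct node *parent;	/* counted reference, taken by node_add() */
	struct node *held;	/* extra reference dropped on release */
};

void node_put(struct node *n);

void node_init(struct node *n)
{
	n->refcount = 1;	/* the creator's reference */
	n->visible = 0;
	n->released = 0;
	n->parent = NULL;
	n->held = NULL;
}

struct node *node_get(struct node *n)
{
	n->refcount++;
	return n;
}

void node_add(struct node *n, struct node *parent)
{
	n->parent = parent ? node_get(parent) : NULL;
	n->visible = 1;
}

/* Stage one: hide the node and drop its reference on the parent. */
void node_del(struct node *n)
{
	struct node *parent = n->parent;

	n->visible = 0;
	n->parent = NULL;
	if (parent)
		node_put(parent);
}

/* Stage two: drop a reference; the last one triggers the release. */
void node_put(struct node *n)
{
	assert(n->refcount > 0);
	if (--n->refcount > 0)
		return;
	if (n->visible)
		node_del(n);	/* a node that is still visible is hidden first */
	n->released = 1;
	if (n->held)
		node_put(n->held);
}
```

In this model, a child added below a parent pins the parent, and a parent whose ``held`` pointer references that child pins the child. After both creators drop their references, neither count reaches zero until ``node_del()`` is called on the child, which is the situation the text says must be broken explicitly.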
diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt deleted file mode 100644 index 1f62d4d7d966..000000000000 --- a/Documentation/kobject.txt +++ /dev/null @@ -1,434 +0,0 @@ -===================================================================== -Everything you never wanted to know about kobjects, ksets, and ktypes -===================================================================== - -:Author: Greg Kroah-Hartman -:Last updated: December 19, 2007 - -Based on an original article by Jon Corbet for lwn.net written October 1, -2003 and located at http://lwn.net/Articles/51437/ - -Part of the difficulty in understanding the driver model - and the kobject -abstraction upon which it is built - is that there is no obvious starting -place. Dealing with kobjects requires understanding a few different types, -all of which make reference to each other. In an attempt to make things -easier, we'll take a multi-pass approach, starting with vague terms and -adding detail as we go. To that end, here are some quick definitions of -some terms we will be working with. - - - A kobject is an object of type struct kobject. Kobjects have a name - and a reference count. A kobject also has a parent pointer (allowing - objects to be arranged into hierarchies), a specific type, and, - usually, a representation in the sysfs virtual filesystem. - - Kobjects are generally not interesting on their own; instead, they are - usually embedded within some other structure which contains the stuff - the code is really interested in. - - No structure should **EVER** have more than one kobject embedded within it. - If it does, the reference counting for the object is sure to be messed - up and incorrect, and your code will be buggy. So do not do this. - - - A ktype is the type of object that embeds a kobject. Every structure - that embeds a kobject needs a corresponding ktype. The ktype controls - what happens to the kobject when it is created and destroyed. 
- - - A kset is a group of kobjects. These kobjects can be of the same ktype - or belong to different ktypes. The kset is the basic container type for - collections of kobjects. Ksets contain their own kobjects, but you can - safely ignore that implementation detail as the kset core code handles - this kobject automatically. - - When you see a sysfs directory full of other directories, generally each - of those directories corresponds to a kobject in the same kset. - -We'll look at how to create and manipulate all of these types. A bottom-up -approach will be taken, so we'll go back to kobjects. - - -Embedding kobjects -================== - -It is rare for kernel code to create a standalone kobject, with one major -exception explained below. Instead, kobjects are used to control access to -a larger, domain-specific object. To this end, kobjects will be found -embedded in other structures. If you are used to thinking of things in -object-oriented terms, kobjects can be seen as a top-level, abstract class -from which other classes are derived. A kobject implements a set of -capabilities which are not particularly useful by themselves, but are -nice to have in other objects. The C language does not allow for the -direct expression of inheritance, so other techniques - such as structure -embedding - must be used. - -(As an aside, for those familiar with the kernel linked list implementation, -this is analogous as to how "list_head" structs are rarely useful on -their own, but are invariably found embedded in the larger objects of -interest.) - -So, for example, the UIO code in ``drivers/uio/uio.c`` has a structure that -defines the memory region associated with a uio device:: - - struct uio_map { - struct kobject kobj; - struct uio_mem *mem; - }; - -If you have a struct uio_map structure, finding its embedded kobject is -just a matter of using the kobj member. 
Code that works with kobjects will -often have the opposite problem, however: given a struct kobject pointer, -what is the pointer to the containing structure? You must avoid tricks -(such as assuming that the kobject is at the beginning of the structure) -and, instead, use the container_of() macro, found in ````:: - - container_of(pointer, type, member) - -where: - - * ``pointer`` is the pointer to the embedded kobject, - * ``type`` is the type of the containing structure, and - * ``member`` is the name of the structure field to which ``pointer`` points. - -The return value from container_of() is a pointer to the corresponding -container type. So, for example, a pointer ``kp`` to a struct kobject -embedded **within** a struct uio_map could be converted to a pointer to the -**containing** uio_map structure with:: - - struct uio_map *u_map = container_of(kp, struct uio_map, kobj); - -For convenience, programmers often define a simple macro for **back-casting** -kobject pointers to the containing type. Exactly this happens in the -earlier ``drivers/uio/uio.c``, as you can see here:: - - struct uio_map { - struct kobject kobj; - struct uio_mem *mem; - }; - - #define to_map(map) container_of(map, struct uio_map, kobj) - -where the macro argument "map" is a pointer to the struct kobject in -question. That macro is subsequently invoked with:: - - struct uio_map *map = to_map(kobj); - - -Initialization of kobjects -========================== - -Code which creates a kobject must, of course, initialize that object. Some -of the internal fields are setup with a (mandatory) call to kobject_init():: - - void kobject_init(struct kobject *kobj, struct kobj_type *ktype); - -The ktype is required for a kobject to be created properly, as every kobject -must have an associated kobj_type. 
After calling kobject_init(), to -register the kobject with sysfs, the function kobject_add() must be called:: - - int kobject_add(struct kobject *kobj, struct kobject *parent, - const char *fmt, ...); - -This sets up the parent of the kobject and the name for the kobject -properly. If the kobject is to be associated with a specific kset, -kobj->kset must be assigned before calling kobject_add(). If a kset is -associated with a kobject, then the parent for the kobject can be set to -NULL in the call to kobject_add() and then the kobject's parent will be the -kset itself. - -As the name of the kobject is set when it is added to the kernel, the name -of the kobject should never be manipulated directly. If you must change -the name of the kobject, call kobject_rename():: - - int kobject_rename(struct kobject *kobj, const char *new_name); - -kobject_rename does not perform any locking or have a solid notion of -what names are valid so the caller must provide their own sanity checking -and serialization. - -There is a function called kobject_set_name() but that is legacy cruft and -is being removed. If your code needs to call this function, it is -incorrect and needs to be fixed. - -To properly access the name of the kobject, use the function -kobject_name():: - - const char *kobject_name(const struct kobject * kobj); - -There is a helper function to both initialize and add the kobject to the -kernel at the same time, called surprisingly enough kobject_init_and_add():: - - int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype, - struct kobject *parent, const char *fmt, ...); - -The arguments are the same as the individual kobject_init() and -kobject_add() functions described above. - - -Uevents -======= - -After a kobject has been registered with the kobject core, you need to -announce to the world that it has been created. 
This can be done with a -call to kobject_uevent():: - - int kobject_uevent(struct kobject *kobj, enum kobject_action action); - -Use the **KOBJ_ADD** action for when the kobject is first added to the kernel. -This should be done only after any attributes or children of the kobject -have been initialized properly, as userspace will instantly start to look -for them when this call happens. - -When the kobject is removed from the kernel (details on how to do that are -below), the uevent for **KOBJ_REMOVE** will be automatically created by the -kobject core, so the caller does not have to worry about doing that by -hand. - - -Reference counts -================ - -One of the key functions of a kobject is to serve as a reference counter -for the object in which it is embedded. As long as references to the object -exist, the object (and the code which supports it) must continue to exist. -The low-level functions for manipulating a kobject's reference counts are:: - - struct kobject *kobject_get(struct kobject *kobj); - void kobject_put(struct kobject *kobj); - -A successful call to kobject_get() will increment the kobject's reference -counter and return the pointer to the kobject. - -When a reference is released, the call to kobject_put() will decrement the -reference count and, possibly, free the object. Note that kobject_init() -sets the reference count to one, so the code which sets up the kobject will -need to do a kobject_put() eventually to release that reference. - -Because kobjects are dynamic, they must not be declared statically or on -the stack, but instead, always allocated dynamically. Future versions of -the kernel will contain a run-time check for kobjects that are created -statically and will warn the developer of this improper usage. - -If all that you want to use a kobject for is to provide a reference counter -for your structure, please use the struct kref instead; a kobject would be -overkill. 
For more information on how to use struct kref, please see the -file Documentation/kref.txt in the Linux kernel source tree. - - -Creating "simple" kobjects -========================== - -Sometimes all that a developer wants is a way to create a simple directory -in the sysfs hierarchy, and not have to mess with the whole complication of -ksets, show and store functions, and other details. This is the one -exception where a single kobject should be created. To create such an -entry, use the function:: - - struct kobject *kobject_create_and_add(char *name, struct kobject *parent); - -This function will create a kobject and place it in sysfs in the location -underneath the specified parent kobject. To create simple attributes -associated with this kobject, use:: - - int sysfs_create_file(struct kobject *kobj, struct attribute *attr); - -or:: - - int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp); - -Both types of attributes used here, with a kobject that has been created -with the kobject_create_and_add(), can be of type kobj_attribute, so no -special custom attribute is needed to be created. - -See the example module, ``samples/kobject/kobject-example.c`` for an -implementation of a simple kobject and attributes. - - - -ktypes and release methods -========================== - -One important thing still missing from the discussion is what happens to a -kobject when its reference count reaches zero. The code which created the -kobject generally does not know when that will happen; if it did, there -would be little point in using a kobject in the first place. Even -predictable object lifecycles become more complicated when sysfs is brought -in as other portions of the kernel can get a reference on any kobject that -is registered in the system. - -The end result is that a structure protected by a kobject cannot be freed -before its reference count goes to zero. 
The reference count is not under -the direct control of the code which created the kobject. So that code must -be notified asynchronously whenever the last reference to one of its -kobjects goes away. - -Once you registered your kobject via kobject_add(), you must never use -kfree() to free it directly. The only safe way is to use kobject_put(). It -is good practice to always use kobject_put() after kobject_init() to avoid -errors creeping in. - -This notification is done through a kobject's release() method. Usually -such a method has a form like:: - - void my_object_release(struct kobject *kobj) - { - struct my_object *mine = container_of(kobj, struct my_object, kobj); - - /* Perform any additional cleanup on this object, then... */ - kfree(mine); - } - -One important point cannot be overstated: every kobject must have a -release() method, and the kobject must persist (in a consistent state) -until that method is called. If these constraints are not met, the code is -flawed. Note that the kernel will warn you if you forget to provide a -release() method. Do not try to get rid of this warning by providing an -"empty" release function. - -If all your cleanup function needs to do is call kfree(), then you must -create a wrapper function which uses container_of() to upcast to the correct -type (as shown in the example above) and then calls kfree() on the overall -structure. - -Note, the name of the kobject is available in the release function, but it -must NOT be changed within this callback. Otherwise there will be a memory -leak in the kobject core, which makes people unhappy. - -Interestingly, the release() method is not stored in the kobject itself; -instead, it is associated with the ktype. 
So let us introduce struct -kobj_type:: - - struct kobj_type { - void (*release)(struct kobject *kobj); - const struct sysfs_ops *sysfs_ops; - struct attribute **default_attrs; - const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); - const void *(*namespace)(struct kobject *kobj); - }; - -This structure is used to describe a particular type of kobject (or, more -correctly, of containing object). Every kobject needs to have an associated -kobj_type structure; a pointer to that structure must be specified when you -call kobject_init() or kobject_init_and_add(). - -The release field in struct kobj_type is, of course, a pointer to the -release() method for this type of kobject. The other two fields (sysfs_ops -and default_attrs) control how objects of this type are represented in -sysfs; they are beyond the scope of this document. - -The default_attrs pointer is a list of default attributes that will be -automatically created for any kobject that is registered with this ktype. - - -ksets -===== - -A kset is merely a collection of kobjects that want to be associated with -each other. There is no restriction that they be of the same ktype, but be -very careful if they are not. - -A kset serves these functions: - - - It serves as a bag containing a group of objects. A kset can be used by - the kernel to track "all block devices" or "all PCI device drivers." - - - A kset is also a subdirectory in sysfs, where the associated kobjects - with the kset can show up. Every kset contains a kobject which can be - set up to be the parent of other kobjects; the top-level directories of - the sysfs hierarchy are constructed in this way. - - - Ksets can support the "hotplugging" of kobjects and influence how - uevent events are reported to user space. - -In object-oriented terms, "kset" is the top-level container class; ksets -contain their own kobject, but that kobject is managed by the kset code and -should not be manipulated by any other user. 
- -A kset keeps its children in a standard kernel linked list. Kobjects point -back to their containing kset via their kset field. In almost all cases, -the kobjects belonging to a kset have that kset (or, strictly, its embedded -kobject) in their parent. - -As a kset contains a kobject within it, it should always be dynamically -created and never declared statically or on the stack. To create a new -kset use:: - - struct kset *kset_create_and_add(const char *name, - struct kset_uevent_ops *u, - struct kobject *parent); - -When you are finished with the kset, call:: - - void kset_unregister(struct kset *kset); - -to destroy it. This removes the kset from sysfs and decrements its reference -count. When the reference count goes to zero, the kset will be released. -Because other references to the kset may still exist, the release may happen -after kset_unregister() returns. - -An example of using a kset can be seen in the -``samples/kobject/kset-example.c`` file in the kernel tree. - -If a kset wishes to control the uevent operations of the kobjects -associated with it, it can use the struct kset_uevent_ops to handle it:: - - struct kset_uevent_ops { - int (*filter)(struct kset *kset, struct kobject *kobj); - const char *(*name)(struct kset *kset, struct kobject *kobj); - int (*uevent)(struct kset *kset, struct kobject *kobj, - struct kobj_uevent_env *env); - }; - - -The filter function allows a kset to prevent a uevent from being emitted to -userspace for a specific kobject. If the function returns 0, the uevent -will not be emitted. - -The name function will be called to override the default name of the kset -that the uevent sends to userspace. By default, the name will be the same -as the kset itself, but this function, if present, can override that name. - -The uevent function will be called when the uevent is about to be sent to -userspace to allow more environment variables to be added to the uevent. 
- -One might ask how, exactly, a kobject is added to a kset, given that no -functions which perform that function have been presented. The answer is -that this task is handled by kobject_add(). When a kobject is passed to -kobject_add(), its kset member should point to the kset to which the -kobject will belong. kobject_add() will handle the rest. - -If the kobject belonging to a kset has no parent kobject set, it will be -added to the kset's directory. Not all members of a kset do necessarily -live in the kset directory. If an explicit parent kobject is assigned -before the kobject is added, the kobject is registered with the kset, but -added below the parent kobject. - - -Kobject removal -=============== - -After a kobject has been registered with the kobject core successfully, it -must be cleaned up when the code is finished with it. To do that, call -kobject_put(). By doing this, the kobject core will automatically clean up -all of the memory allocated by this kobject. If a ``KOBJ_ADD`` uevent has been -sent for the object, a corresponding ``KOBJ_REMOVE`` uevent will be sent, and -any other sysfs housekeeping will be handled for the caller properly. - -If you need to do a two-stage delete of the kobject (say you are not -allowed to sleep when you need to destroy the object), then call -kobject_del() which will unregister the kobject from sysfs. This makes the -kobject "invisible", but it is not cleaned up, and the reference count of -the object is still the same. At a later time call kobject_put() to finish -the cleanup of the memory associated with the kobject. - -kobject_del() can be used to drop the reference to the parent object, if -circular references are constructed. It is valid in some cases, that a -parent objects references a child. Circular references _must_ be broken -with an explicit call to kobject_del(), so that a release functions will be -called, and the objects in the former circle release each other. 
- - -Example code to copy from -========================= - -For a more complete example of using ksets and kobjects properly, see the -example programs ``samples/kobject/{kobject-example.c,kset-example.c}``, -which will be built as loadable modules if you select ``CONFIG_SAMPLE_KOBJECT``. -- cgit From ae5977765acb25c1eafb348f81a6597cb7a88eba Mon Sep 17 00:00:00 2001 From: Zenghui Yu Date: Tue, 25 Feb 2020 20:40:52 +0800 Subject: Documentation: kthread: Fix WQ_SYSFS workqueues path name The set of WQ_SYSFS workqueues should be displayed using "ls /sys/devices/virtual/workqueue", add the missing '/'. Signed-off-by: Zenghui Yu Link: https://lore.kernel.org/r/20200225124052.1506-1-yuzenghui@huawei.com Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/kernel-per-CPU-kthreads.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index baeeba8762ae..21818aca4708 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -234,7 +234,7 @@ To reduce its OS jitter, do any of the following: Such a workqueue can be confined to a given subset of the CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs files. The set of WQ_SYSFS workqueues can be displayed using - "ls sys/devices/virtual/workqueue". That said, the workqueues + "ls /sys/devices/virtual/workqueue". That said, the workqueues maintainer would like to caution people against indiscriminately sprinkling WQ_SYSFS across all the workqueues. 
The reason for caution is that it is easy to add WQ_SYSFS, but because sysfs is -- cgit From c428cd52282dcc967b2a936d80f1eec4cb80d6d5 Mon Sep 17 00:00:00 2001 From: Tim Bird Date: Mon, 24 Feb 2020 18:34:41 -0700 Subject: scripts/sphinx-pre-install: add '-p python3' to virtualenv With Ubuntu 16.04 (and presumably Debian distros of the same age), the instructions for setting up a python virtual environment should do so with the python 3 interpreter. On these older distros, the default python (and virtualenv command) might be python2 based. Some of the packages that sphinx relies on are now only available for python3. If you don't specify the python3 interpreter for the virtualenv, you get errors when doing the pip installs for various packages Fix this by adding '-p python3' to the virtualenv recommendation line. Signed-off-by: Tim Bird Link: https://lore.kernel.org/r/1582594481-23221-1-git-send-email-tim.bird@sony.com Signed-off-by: Jonathan Corbet --- scripts/sphinx-pre-install | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/scripts/sphinx-pre-install b/scripts/sphinx-pre-install index a8f0c002a340..fa3fb05cd54b 100755 --- a/scripts/sphinx-pre-install +++ b/scripts/sphinx-pre-install @@ -701,11 +701,26 @@ sub check_needs() } else { my $rec_activate = "$virtenv_dir/bin/activate"; my $virtualenv = findprog("virtualenv-3"); + my $rec_python3 = ""; $virtualenv = findprog("virtualenv-3.5") if (!$virtualenv); $virtualenv = findprog("virtualenv") if (!$virtualenv); $virtualenv = "virtualenv" if (!$virtualenv); - printf "\t$virtualenv $virtenv_dir\n"; + my $rel = ""; + if (index($system_release, "Ubuntu") != -1) { + $rel = $1 if ($system_release =~ /Ubuntu\s+(\d+)[.]/); + if ($rel && $rel >= 16) { + $rec_python3 = " -p python3"; + } + } + if (index($system_release, "Debian") != -1) { + $rel = $1 if ($system_release =~ /Debian\s+(\d+)/); + if ($rel && $rel >= 7) { + $rec_python3 = " -p python3"; + } + } + + printf 
"\t$virtualenv$rec_python3 $virtenv_dir\n"; printf "\t. $rec_activate\n"; printf "\tpip install -r $requirement_file\n"; deactivate_help(); -- cgit From 3eb30c51a6dda26d0c5b8824b7c0515502f1c161 Mon Sep 17 00:00:00 2001 From: Niklas Söderlund Date: Wed, 12 Feb 2020 19:13:32 +0100 Subject: Documentation: nfsroot.rst: Fix references to nfsroot.rst MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When converting and moving nfsroot.txt to nfsroot.rst the references to the old text file was not updated to match the change, fix this. Fixes: f9a9349846f92b2d ("Documentation: nfsroot.txt: convert to ReST") Signed-off-by: Niklas Söderlund Reviewed-by: Geert Uytterhoeven Link: https://lore.kernel.org/r/20200212181332.520545-1-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/kernel-parameters.txt | 8 ++++---- Documentation/filesystems/cifs/cifsroot.txt | 2 +- fs/nfs/Kconfig | 2 +- net/ipv4/Kconfig | 6 +++--- net/ipv4/ipconfig.c | 2 +- 5 files changed, 10 insertions(+), 10 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3e3fd0d19e53..4220477079bd 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1885,7 +1885,7 @@ No delay ip= [IP_PNP] - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. ipcmni_extend [KNL] Extend the maximum number of unique System V IPC identifiers from 32,768 to 16,777,216. @@ -2855,13 +2855,13 @@ Default value is 0. nfsaddrs= [NFS] Deprecated. Use ip= instead. - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfsroot= [NFS] nfs root filesystem for disk-less boxes. - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfsrootdebug [NFS] enable nfsroot debugging messages. 
- See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfs.callback_nr_threads= [NFSv4] set the total number of threads that the diff --git a/Documentation/filesystems/cifs/cifsroot.txt b/Documentation/filesystems/cifs/cifsroot.txt index 0fa1a2c36a40..947b7ec6ce9e 100644 --- a/Documentation/filesystems/cifs/cifsroot.txt +++ b/Documentation/filesystems/cifs/cifsroot.txt @@ -13,7 +13,7 @@ network by utilizing SMB or CIFS protocol. In order to mount, the network stack will also need to be set up by using 'ip=' config option. For more details, see -Documentation/filesystems/nfs/nfsroot.txt. +Documentation/admin-guide/nfs/nfsroot.rst. A CIFS root mount currently requires the use of SMB1+UNIX Extensions which is only supported by the Samba server. SMB1 is the older diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig index 40b6c5ac46c0..88e1763e02f3 100644 --- a/fs/nfs/Kconfig +++ b/fs/nfs/Kconfig @@ -164,7 +164,7 @@ config ROOT_NFS If you want your system to mount its root file system via NFS, choose Y here. This is common practice for managing systems without local permanent storage. For details, read - . + . Most people say N here. diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index f96bd489b362..fb1dc8d02f6d 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -129,7 +129,7 @@ config IP_PNP_DHCP If unsure, say Y. Note that if you want to use DHCP, a DHCP server must be operating on your network. Read - for details. + for details. config IP_PNP_BOOTP bool "IP: BOOTP support" @@ -144,7 +144,7 @@ config IP_PNP_BOOTP does BOOTP itself, providing all necessary information on the kernel command line, you can say N here. If unsure, say Y. Note that if you want to use BOOTP, a BOOTP server must be operating on your network. - Read for details. + Read for details. config IP_PNP_RARP bool "IP: RARP support" @@ -157,7 +157,7 @@ config IP_PNP_RARP older protocol which is being obsoleted by BOOTP and DHCP), say Y here. 
Note that if you want to use RARP, a RARP server must be operating on your network. Read - for details. + for details. config NET_IPIP tristate "IP: tunneling" diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c index 4438f6b12335..561f15b5a944 100644 --- a/net/ipv4/ipconfig.c +++ b/net/ipv4/ipconfig.c @@ -1621,7 +1621,7 @@ late_initcall(ip_auto_config); /* * Decode any IP configuration options in the "ip=" or "nfsaddrs=" kernel - * command line parameter. See Documentation/filesystems/nfs/nfsroot.txt. + * command line parameter. See Documentation/admin-guide/nfs/nfsroot.rst. */ static int __init ic_proto_name(char *name) { -- cgit From 07d241fd66ba99111d43a0a4c4abeeb972468d1d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:47 +0100 Subject: docs: filesystems: convert 9p.txt to ReST - Add a SPDX header; - Add a document title; - Adjust section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/96a060b7b5c0c3838ab1751addfe4d6d3bc37bd6.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/9p.rst | 185 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/9p.txt | 161 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 186 insertions(+), 161 deletions(-) create mode 100644 Documentation/filesystems/9p.rst delete mode 100644 Documentation/filesystems/9p.txt diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst new file mode 100644 index 000000000000..f054d1c45e86 --- /dev/null +++ b/Documentation/filesystems/9p.rst @@ -0,0 +1,185 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +======================================= +v9fs: Plan 9 Resource Sharing for Linux +======================================= + +About +===== + +v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. + +This software was originally developed by Ron Minnich +and Maya Gokhale. Additional development by Greg Watson + and most recently Eric Van Hensbergen +, Latchesar Ionkov and Russ Cox +. + +The best detailed explanation of the Linux implementation and applications of +the 9p client is available in the form of a USENIX paper: + + http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html + +Other applications are described in the following papers: + + * XCPU & Clustering + http://xcpu.org/papers/xcpu-talk.pdf + * KVMFS: control file system for KVM + http://xcpu.org/papers/kvmfs.pdf + * CellFS: A New Programming Model for the Cell BE + http://xcpu.org/papers/cellfs-talk.pdf + * PROSE I/O: Using 9p to enable Application Partitions + http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf + * VirtFS: A Virtualization Aware File System pass-through + http://goo.gl/3WPDg + +Usage +===== + +For remote file server:: + + mount -t 9p 10.10.1.2 /mnt/9 + +For Plan 9 From User Space applications (http://swtch.com/plan9):: + + mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER + +For server running on QEMU host with virtio transport:: + + mount -t 9p -o trans=virtio <mount_tag> /mnt/9 + +where mount_tag is the tag associated by the server to each of the exported +mount points. Each 9P export is seen by the client as a virtio device with an +associated "mount_tag" property. Available mount tags can be +seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio/mount_tag files. + +Options +======= + + ============= =============================================================== + trans=name select an alternative transport.
Valid options are + currently: + + ======== ============================================ + unix specifying a named pipe mount point + tcp specifying a normal TCP/IP connection + fd use passed file descriptors for connection + (see rfdno and wfdno) + virtio connect to the next virtio channel available + (from QEMU with trans_virtio module) + rdma connect to a specified RDMA channel + ======== ============================================ + + uname=name user name to attempt mount as on the remote server. The + server may override or ignore this value. Certain user + names may require authentication. + + aname=name aname specifies the file tree to access when the server is + offering several exported file systems. + + cache=mode specifies a caching policy. By default, no caches are used. + + none + default no cache policy, metadata and data + alike are synchronous. + loose + no attempts are made at consistency, + intended for exclusive, read-only mounts + fscache + use FS-Cache for a persistent, read-only + cache backend. + mmap + minimal cache that is only used for read-write + mmap. Nothing else is cached, like cache=none + + debug=n specifies debug level. The debug level is a bitmask.
+ + ===== ================================ + 0x01 display verbose error messages + 0x02 developer debug (DEBUG_CURRENT) + 0x04 display 9p trace + 0x08 display VFS trace + 0x10 display Marshalling debug + 0x20 display RPC debug + 0x40 display transport debug + 0x80 display allocation debug + 0x100 display protocol message debug + 0x200 display Fid debug + 0x400 display packet debug + 0x800 display fscache tracing debug + ===== ================================ + + rfdno=n the file descriptor for reading with trans=fd + + wfdno=n the file descriptor for writing with trans=fd + + msize=n the number of bytes to use for 9p packet payload + + port=n port to connect to on the remote server + + noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) + + version=name Select 9P protocol version. Valid options are: + + ======== ============================== + 9p2000 Legacy mode (same as noextend) + 9p2000.u Use 9P2000.u protocol + 9p2000.L Use 9P2000.L protocol + ======== ============================== + + dfltuid attempt to mount as a particular uid + + dfltgid attempt to mount with a particular gid + + afid security channel - used by Plan 9 authentication protocols + + nodevmap do not map special files - represent them as normal files. + This can be used to share devices/named pipes/sockets between + hosts. This functionality will be expanded in later versions. + + access there are four access modes. + user + if a user tries to access a file on v9fs + filesystem for the first time, v9fs sends an + attach command (Tattach) for that user. + This is the default mode. + + allows only user with uid= to access + the files on the mounted filesystem + any + v9fs does single attach and performs all + operations as one user + client + ACL based access check on the 9p client + side for access validation + + cachetag cache tag to use the specified persistent cache. + cache tags for existing cache sessions can be listed at + /sys/fs/9p/caches.
(applies only to cache=fscache) + ============= =============================================================== + +Resources +========= + +Protocol specifications are maintained on github: +http://ericvh.github.com/9p-rfc/ + +9p client and server implementations are listed on +http://9p.cat-v.org/implementations + +A 9p2000.L server is being developed by LLNL and can be found +at http://code.google.com/p/diod/ + +There are user and developer mailing lists available through the v9fs project +on sourceforge (http://sourceforge.net/projects/v9fs). + +News and other information is maintained on a Wiki. +(http://sf.net/apps/mediawiki/v9fs/index.php). + +Bug reports are best issued via the mailing list. + +For more information on the Plan 9 Operating System check out +http://plan9.bell-labs.com/plan9 + +For information on Plan 9 from User Space (Plan 9 applications and libraries +ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt deleted file mode 100644 index fec7144e817c..000000000000 --- a/Documentation/filesystems/9p.txt +++ /dev/null @@ -1,161 +0,0 @@ - v9fs: Plan 9 Resource Sharing for Linux - ======================================= - -ABOUT -===== - -v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. - -This software was originally developed by Ron Minnich -and Maya Gokhale. Additional development by Greg Watson - and most recently Eric Van Hensbergen -, Latchesar Ionkov and Russ Cox -. 
- -The best detailed explanation of the Linux implementation and applications of -the 9p client is available in the form of a USENIX paper: - http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html - -Other applications are described in the following papers: - * XCPU & Clustering - http://xcpu.org/papers/xcpu-talk.pdf - * KVMFS: control file system for KVM - http://xcpu.org/papers/kvmfs.pdf - * CellFS: A New Programming Model for the Cell BE - http://xcpu.org/papers/cellfs-talk.pdf - * PROSE I/O: Using 9p to enable Application Partitions - http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf - * VirtFS: A Virtualization Aware File System pass-through - http://goo.gl/3WPDg - -USAGE -===== - -For remote file server: - - mount -t 9p 10.10.1.2 /mnt/9 - -For Plan 9 From User Space applications (http://swtch.com/plan9) - - mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER - -For server running on QEMU host with virtio transport: - - mount -t 9p -o trans=virtio /mnt/9 - -where mount_tag is the tag associated by the server to each of the exported -mount points. Each 9P export is seen by the client as a virtio device with an -associated "mount_tag" property. Available mount tags can be -seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio/mount_tag files. - -OPTIONS -======= - - trans=name select an alternative transport. Valid options are - currently: - unix - specifying a named pipe mount point - tcp - specifying a normal TCP/IP connection - fd - used passed file descriptors for connection - (see rfdno and wfdno) - virtio - connect to the next virtio channel available - (from QEMU with trans_virtio module) - rdma - connect to a specified RDMA channel - - uname=name user name to attempt mount as on the remote server. The - server may override or ignore this value. Certain user - names may require authentication. - - aname=name aname specifies the file tree to access when the server is - offering several exported file systems. 
- - cache=mode specifies a caching policy. By default, no caches are used. - none = default no cache policy, metadata and data - alike are synchronous. - loose = no attempts are made at consistency, - intended for exclusive, read-only mounts - fscache = use FS-Cache for a persistent, read-only - cache backend. - mmap = minimal cache that is only used for read-write - mmap. Northing else is cached, like cache=none - - debug=n specifies debug level. The debug level is a bitmask. - 0x01 = display verbose error messages - 0x02 = developer debug (DEBUG_CURRENT) - 0x04 = display 9p trace - 0x08 = display VFS trace - 0x10 = display Marshalling debug - 0x20 = display RPC debug - 0x40 = display transport debug - 0x80 = display allocation debug - 0x100 = display protocol message debug - 0x200 = display Fid debug - 0x400 = display packet debug - 0x800 = display fscache tracing debug - - rfdno=n the file descriptor for reading with trans=fd - - wfdno=n the file descriptor for writing with trans=fd - - msize=n the number of bytes to use for 9p packet payload - - port=n port to connect to on the remote server - - noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) - - version=name Select 9P protocol version. Valid options are: - 9p2000 - Legacy mode (same as noextend) - 9p2000.u - Use 9P2000.u protocol - 9p2000.L - Use 9P2000.L protocol - - dfltuid attempt to mount as a particular uid - - dfltgid attempt to mount with a particular gid - - afid security channel - used by Plan 9 authentication protocols - - nodevmap do not map special files - represent them as normal files. - This can be used to share devices/named pipes/sockets between - hosts. This functionality will be expanded in later versions. - - access there are four access modes. - user = if a user tries to access a file on v9fs - filesystem for the first time, v9fs sends an - attach command (Tattach) for that user. - This is the default mode. 
- = allows only user with uid= to access - the files on the mounted filesystem - any = v9fs does single attach and performs all - operations as one user - client = ACL based access check on the 9p client - side for access validation - - cachetag cache tag to use the specified persistent cache. - cache tags for existing cache sessions can be listed at - /sys/fs/9p/caches. (applies only to cache=fscache) - -RESOURCES -========= - -Protocol specifications are maintained on github: -http://ericvh.github.com/9p-rfc/ - -9p client and server implementations are listed on -http://9p.cat-v.org/implementations - -A 9p2000.L server is being developed by LLNL and can be found -at http://code.google.com/p/diod/ - -There are user and developer mailing lists available through the v9fs project -on sourceforge (http://sourceforge.net/projects/v9fs). - -News and other information is maintained on a Wiki. -(http://sf.net/apps/mediawiki/v9fs/index.php). - -Bug reports are best issued via the mailing list. - -For more information on the Plan 9 Operating System check out -http://plan9.bell-labs.com/plan9 - -For information on Plan 9 from User Space (Plan 9 applications and libraries -ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 45d791905e91..a9330c3f8c2e 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -46,6 +46,7 @@ Documentation for filesystem implementations. .. toctree:: :maxdepth: 2 + 9p autofs fuse overlayfs -- cgit From 348739003d4f7e777ef935a44a91e7494f8ab786 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:48 +0100 Subject: docs: filesystems: convert adfs.txt to ReST - Add a SPDX header; - Add a document title; - Adjust section titles; - Mark literal blocks as such; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/15ee92f03ec917e5d26bd7b863565dec88c843f6.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/adfs.rst | 108 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/adfs.txt | 99 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 109 insertions(+), 99 deletions(-) create mode 100644 Documentation/filesystems/adfs.rst delete mode 100644 Documentation/filesystems/adfs.txt diff --git a/Documentation/filesystems/adfs.rst b/Documentation/filesystems/adfs.rst new file mode 100644 index 000000000000..5b22cae38e5e --- /dev/null +++ b/Documentation/filesystems/adfs.rst @@ -0,0 +1,108 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Acorn Disc Filing System - ADFS +=============================== + +Filesystems supported by ADFS +----------------------------- + +The ADFS module supports the following Filecore formats which have: + +- new maps +- new directories or big directories + +In terms of the named formats, this means we support: + +- E and E+, with or without boot block +- F and F+ + +We fully support reading files from these filesystems, and writing to +existing files within their existing allocation. Essentially, we do +not support changing any of the filesystem metadata. + +This is intended to support loopback mounted Linux native filesystems +on a RISC OS Filecore filesystem, but will allow the data within files +to be changed. + +If write support (ADFS_FS_RW) is configured, we allow rudimentary +directory updates, specifically updating the access mode and timestamp. + +Mount options for ADFS +---------------------- + + ============ ====================================================== + uid=nnn All files in the partition will be owned by + user id nnn. Default 0 (root). + gid=nnn All files in the partition will be in group + nnn. Default 0 (root). 
+ ownmask=nnn The permission mask for ADFS 'owner' permissions + will be nnn. Default 0700. + othmask=nnn The permission mask for ADFS 'other' permissions + will be nnn. Default 0077. + ftsuffix=n When ftsuffix=0, no file type suffix will be applied. + When ftsuffix=1, a hexadecimal suffix corresponding to + the RISC OS file type will be added. Default 0. + ============ ====================================================== + +Mapping of ADFS permissions to Linux permissions +------------------------------------------------ + + ADFS permissions consist of the following: + + - Owner read + - Owner write + - Other read + - Other write + + (In older versions, an 'execute' permission did exist, but this + does not hold the same meaning as the Linux 'execute' permission + and is now obsolete). + + The mapping is performed as follows:: + + Owner read -> -r--r--r-- + Owner write -> --w--w--w- + Owner read and filetype UnixExec -> ---x--x--x + These are then masked by ownmask, eg 700 -> -rwx------ + Possible owner mode permissions -> -rwx------ + + Other read -> -r--r--r-- + Other write -> --w--w--w- + Other read and filetype UnixExec -> ---x--x--x + These are then masked by othmask, eg 077 -> ----rwxrwx + Possible other mode permissions -> ----rwxrwx + + Hence, with the default masks, if a file is owner read/write, and + not a UnixExec filetype, then the permissions will be:: + + -rw------- + + However, if the masks were ownmask=0770,othmask=0007, then this would + be modified to:: + + -rw-rw---- + + There is no restriction on what you can do with these masks. You may + wish that either read bits give read access to the file for all, but + keep the default write protection (ownmask=0755,othmask=0577):: + + -rw-r--r-- + + You can therefore tailor the permission translation to whatever you + desire the permissions should be under Linux. + +RISC OS file type suffix +------------------------ + + RISC OS file types are stored in bits 19..8 of the file load address.
+ + To enable non-RISC OS systems to be used to store files without losing + file type information, a file naming convention was devised (initially + for use with NFS) such that a hexadecimal suffix of the form ,xyz + denoted the file type: e.g. BasicFile,ffb is a BASIC (0xffb) file. This + naming convention is now also used by RISC OS emulators such as RPCEmu. + + Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file + type suffixes to be appended to file names read from a directory. If the + ftsuffix option is zero or omitted, no file type suffixes will be added. diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt deleted file mode 100644 index 0baa8e8c1fc1..000000000000 --- a/Documentation/filesystems/adfs.txt +++ /dev/null @@ -1,99 +0,0 @@ -Filesystems supported by ADFS ------------------------------ - -The ADFS module supports the following Filecore formats which have: - -- new maps -- new directories or big directories - -In terms of the named formats, this means we support: - -- E and E+, with or without boot block -- F and F+ - -We fully support reading files from these filesystems, and writing to -existing files within their existing allocation. Essentially, we do -not support changing any of the filesystem metadata. - -This is intended to support loopback mounted Linux native filesystems -on a RISC OS Filecore filesystem, but will allow the data within files -to be changed. - -If write support (ADFS_FS_RW) is configured, we allow rudimentary -directory updates, specifically updating the access mode and timestamp. - -Mount options for ADFS ----------------------- - - uid=nnn All files in the partition will be owned by - user id nnn. Default 0 (root). - gid=nnn All files in the partition will be in group - nnn. Default 0 (root). - ownmask=nnn The permission mask for ADFS 'owner' permissions - will be nnn. Default 0700. - othmask=nnn The permission mask for ADFS 'other' permissions - will be nnn. 
Default 0077. - ftsuffix=n When ftsuffix=0, no file type suffix will be applied. - When ftsuffix=1, a hexadecimal suffix corresponding to - the RISC OS file type will be added. Default 0. - -Mapping of ADFS permissions to Linux permissions ------------------------------------------------- - - ADFS permissions consist of the following: - - Owner read - Owner write - Other read - Other write - - (In older versions, an 'execute' permission did exist, but this - does not hold the same meaning as the Linux 'execute' permission - and is now obsolete). - - The mapping is performed as follows: - - Owner read -> -r--r--r-- - Owner write -> --w--w---w - Owner read and filetype UnixExec -> ---x--x--x - These are then masked by ownmask, eg 700 -> -rwx------ - Possible owner mode permissions -> -rwx------ - - Other read -> -r--r--r-- - Other write -> --w--w--w- - Other read and filetype UnixExec -> ---x--x--x - These are then masked by othmask, eg 077 -> ----rwxrwx - Possible other mode permissions -> ----rwxrwx - - Hence, with the default masks, if a file is owner read/write, and - not a UnixExec filetype, then the permissions will be: - - -rw------- - - However, if the masks were ownmask=0770,othmask=0007, then this would - be modified to: - -rw-rw---- - - There is no restriction on what you can do with these masks. You may - wish that either read bits give read access to the file for all, but - keep the default write protection (ownmask=0755,othmask=0577): - - -rw-r--r-- - - You can therefore tailor the permission translation to whatever you - desire the permissions should be under Linux. - -RISC OS file type suffix ------------------------- - - RISC OS file types are stored in bits 19..8 of the file load address. - - To enable non-RISC OS systems to be used to store files without losing - file type information, a file naming convention was devised (initially - for use with NFS) such that a hexadecimal suffix of the form ,xyz - denoted the file type: e.g. 
BasicFile,ffb is a BASIC (0xffb) file. This - naming convention is now also used by RISC OS emulators such as RPCEmu. - - Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file - type suffixes to be appended to file names read from a directory. If the - ftsuffix option is zero or omitted, no file type suffixes will be added. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index a9330c3f8c2e..14dc89c94822 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -47,6 +47,7 @@ Documentation for filesystem implementations. :maxdepth: 2 9p + adfs autofs fuse overlayfs -- cgit From 7627216830d808572fff8225964e9209249ba196 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:49 +0100 Subject: docs: filesystems: convert affs.txt to ReST - Add a SPDX header; - Adjust document title; - Add table markups; - Mark literal blocks as such; - Some whitespace fixes and new line breaks; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: David Sterba Link: https://lore.kernel.org/r/b44c56befe0e28cbc0eb1b3e281ad7d99737ff16.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/affs.rst | 246 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/affs.txt | 222 -------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 247 insertions(+), 222 deletions(-) create mode 100644 Documentation/filesystems/affs.rst delete mode 100644 Documentation/filesystems/affs.txt diff --git a/Documentation/filesystems/affs.rst b/Documentation/filesystems/affs.rst new file mode 100644 index 000000000000..7f1a40dce6d3 --- /dev/null +++ b/Documentation/filesystems/affs.rst @@ -0,0 +1,246 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +============================= +Overview of Amiga Filesystems +============================= + +Not all varieties of the Amiga filesystems are supported for reading and +writing. The Amiga currently knows six different filesystems: + +============== =============================================================== +DOS\0 The old or original filesystem, not really suited for + hard disks and normally not used on them, either. + Supported read/write. + +DOS\1 The original Fast File System. Supported read/write. + +DOS\2 The old "international" filesystem. International means that + a bug has been fixed so that accented ("international") letters + in file names are case-insensitive, as they ought to be. + Supported read/write. + +DOS\3 The "international" Fast File System. Supported read/write. + +DOS\4 The original filesystem with directory cache. The directory + cache speeds up directory accesses on floppies considerably, + but slows down file creation/deletion. Doesn't make much + sense on hard disks. Supported read only. + +DOS\5 The Fast File System with directory cache. Supported read only. +============== =============================================================== + +All of the above filesystems allow block sizes from 512 to 32K bytes. +Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks +speed up almost everything at the expense of wasted disk space. The speed +gain above 4K seems not really worth the price, so you don't lose too +much here, either. + +The muFS (multi user File System) equivalents of the above file systems +are supported, too. + +Mount options for the AFFS +========================== + +protect + If this option is set, the protection bits cannot be altered. + +setuid[=uid] + This sets the owner of all files and directories in the file + system to uid or the uid of the current user, respectively. + +setgid[=gid] + Same as above, but for gid. 
+ +mode=mode + Sets the mode flags to the given (octal) value, regardless + of the original permissions. Directories will get an x + permission if the corresponding r bit is set. + This is useful since most of the plain AmigaOS files + will map to 600. + +nofilenametruncate + The file system will return an error when filename exceeds + standard maximum filename length (30 characters). + +reserved=num + Sets the number of reserved blocks at the start of the + partition to num. You should never need this option. + Default is 2. + +root=block + Sets the block number of the root block. This should never + be necessary. + +bs=blksize + Sets the blocksize to blksize. Valid block sizes are 512, + 1024, 2048 and 4096. Like the root option, this should + never be necessary, as the affs can figure it out itself. + +quiet + The file system will not return an error for disallowed + mode changes. + +verbose + The volume name, file system type and block size will + be written to the syslog when the filesystem is mounted. + +mufs + The filesystem is really a muFS, also it doesn't + identify itself as one. This option is necessary if + the filesystem wasn't formatted as muFS, but is used + as one. + +prefix=path + Path will be prefixed to every absolute path name of + symbolic links on an AFFS partition. Default = "/". + (See below.) + +volume=name + When symbolic links with an absolute path are created + on an AFFS partition, name will be prepended as the + volume name. Default = "" (empty string). + (See below.) + +Handling of the Users/Groups and protection flags +================================================= + +Amiga -> Linux: + +The Amiga protection flags RWEDRWEDHSPARWED are handled as follows: + + - R maps to r for user, group and others. On directories, R implies x. + + - If both W and D are allowed, w will be set. + + - E maps to x. + + - H and P are always retained and ignored under Linux. + + - A is always reset when a file is written to. 
+ +User id and group id will be used unless set[gu]id are given as mount +options. Since most of the Amiga file systems are single user systems +they will be owned by root. The root directory (the mount point) of the +Amiga filesystem will be owned by the user who actually mounts the +filesystem (the root directory doesn't have uid/gid fields). + +Linux -> Amiga: + +The Linux rwxrwxrwx file mode is handled as follows: + + - r permission will set R for user, group and others. + + - w permission will set W and D for user, group and others. + + - x permission of the user will set E for plain files. + + - All other flags (suid, sgid, ...) are ignored and will + not be retained. + +Newly created files and directories will get the user and group ID +of the current user and a mode according to the umask. + +Symbolic links +============== + +Although the Amiga and Linux file systems resemble each other, there +are some, not always subtle, differences. One of them becomes apparent +with symbolic links. While Linux has a file system with exactly one +root directory, the Amiga has a separate root directory for each +file system (for example, partition, floppy disk, ...). With the Amiga, +these entities are called "volumes". They have symbolic names which +can be used to access them. Thus, symbolic links can point to a +different volume. AFFS turns the volume name into a directory name +and prepends the prefix path (see prefix option) to it. + +Example: +You mount all your Amiga partitions under /amiga/ (where + is the name of the volume), and you give the option +"prefix=/amiga/" when mounting all your AFFS partitions. (They +might be "User", "WB" and "Graphics", the mount points /amiga/User, +/amiga/WB and /amiga/Graphics). A symbolic link referring to +"User:sc/include/dos/dos.h" will be followed to +"/amiga/User/sc/include/dos/dos.h". 
+ +Examples +======== + +Command line:: + + mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose + mount /dev/sda3 /Amiga -t affs + +/etc/fstab entry:: + + /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0 + +IMPORTANT NOTE +============== + +If you boot Windows 95 (don't know about 3.x, 98 and NT) while you +have an Amiga harddisk connected to your PC, it will overwrite +the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating +the Rigid Disk Block. Sheer luck has it that this is an unused +area of the RDB, so only the checksum doesn't match anymore. +Linux will ignore this garbage and recognize the RDB anyway, but +before you connect that drive to your Amiga again, you must +restore or repair your RDB. So please do make a backup copy of it +before booting Windows! + +If the damage is already done, the following should fix the RDB +(where is the device name). + +DO AT YOUR OWN RISK:: + + dd if=/dev/ of=rdb.tmp count=1 + cp rdb.tmp rdb.fixed + dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4 + dd if=rdb.fixed of=/dev/ + +Bugs, Restrictions, Caveats +=========================== + +Quite a few things may not work as advertised. Not everything is +tested, though several hundred MB have been read and written using +this fs. For a most up-to-date list of bugs please consult +fs/affs/Changes. + +By default, filenames are truncated to 30 characters without warning. +'nofilenametruncate' mount option can change that behavior. + +Case is ignored by the affs in filename matching, but Linux shells +do care about the case. Example (with /wb being an affs mounted fs):: + + rm /wb/WRONGCASE + +will remove /mnt/wrongcase, but:: + + rm /wb/WR* + +will not since the names are matched by the shell. + +The block allocation is designed for hard disk partitions. If more +than 1 process writes to a (small) diskette, the blocks are allocated +in an ugly way (but the real AFFS doesn't do much better). This +is also true when space gets tight. 
+ +You cannot execute programs on an OFS (Old File System), since the +program files cannot be memory mapped due to the 488 byte blocks. +For the same reason you cannot mount an image on such a filesystem +via the loopback device. + +The bitmap valid flag in the root block may not be accurate when the +system crashes while an affs partition is mounted. There's currently +no way to fix a garbled filesystem without an Amiga (disk validator) +or manually (who would do this?). Maybe later. + +If you mount affs partitions on system startup, you may want to tell +fsck that the fs should not be checked (place a '0' in the sixth field +of /etc/fstab). + +It's not possible to read floppy disks with a normal PC or workstation +due to an incompatibility with the Amiga floppy controller. + +If you are interested in an Amiga Emulator for Linux, look at + +http://web.archive.org/web/%2E/http://www.freiburg.linux.de/~uae/ diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt deleted file mode 100644 index 71b63c2b9841..000000000000 --- a/Documentation/filesystems/affs.txt +++ /dev/null @@ -1,222 +0,0 @@ -Overview of Amiga Filesystems -============================= - -Not all varieties of the Amiga filesystems are supported for reading and -writing. The Amiga currently knows six different filesystems: - -DOS\0 The old or original filesystem, not really suited for - hard disks and normally not used on them, either. - Supported read/write. - -DOS\1 The original Fast File System. Supported read/write. - -DOS\2 The old "international" filesystem. International means that - a bug has been fixed so that accented ("international") letters - in file names are case-insensitive, as they ought to be. - Supported read/write. - -DOS\3 The "international" Fast File System. Supported read/write. - -DOS\4 The original filesystem with directory cache. The directory - cache speeds up directory accesses on floppies considerably, - but slows down file creation/deletion. 
Doesn't make much - sense on hard disks. Supported read only. - -DOS\5 The Fast File System with directory cache. Supported read only. - -All of the above filesystems allow block sizes from 512 to 32K bytes. -Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks -speed up almost everything at the expense of wasted disk space. The speed -gain above 4K seems not really worth the price, so you don't lose too -much here, either. - -The muFS (multi user File System) equivalents of the above file systems -are supported, too. - -Mount options for the AFFS -========================== - -protect If this option is set, the protection bits cannot be altered. - -setuid[=uid] This sets the owner of all files and directories in the file - system to uid or the uid of the current user, respectively. - -setgid[=gid] Same as above, but for gid. - -mode=mode Sets the mode flags to the given (octal) value, regardless - of the original permissions. Directories will get an x - permission if the corresponding r bit is set. - This is useful since most of the plain AmigaOS files - will map to 600. - -nofilenametruncate - The file system will return an error when filename exceeds - standard maximum filename length (30 characters). - -reserved=num Sets the number of reserved blocks at the start of the - partition to num. You should never need this option. - Default is 2. - -root=block Sets the block number of the root block. This should never - be necessary. - -bs=blksize Sets the blocksize to blksize. Valid block sizes are 512, - 1024, 2048 and 4096. Like the root option, this should - never be necessary, as the affs can figure it out itself. - -quiet The file system will not return an error for disallowed - mode changes. - -verbose The volume name, file system type and block size will - be written to the syslog when the filesystem is mounted. - -mufs The filesystem is really a muFS, also it doesn't - identify itself as one. 
This option is necessary if - the filesystem wasn't formatted as muFS, but is used - as one. - -prefix=path Path will be prefixed to every absolute path name of - symbolic links on an AFFS partition. Default = "/". - (See below.) - -volume=name When symbolic links with an absolute path are created - on an AFFS partition, name will be prepended as the - volume name. Default = "" (empty string). - (See below.) - -Handling of the Users/Groups and protection flags -================================================= - -Amiga -> Linux: - -The Amiga protection flags RWEDRWEDHSPARWED are handled as follows: - - - R maps to r for user, group and others. On directories, R implies x. - - - If both W and D are allowed, w will be set. - - - E maps to x. - - - H and P are always retained and ignored under Linux. - - - A is always reset when a file is written to. - -User id and group id will be used unless set[gu]id are given as mount -options. Since most of the Amiga file systems are single user systems -they will be owned by root. The root directory (the mount point) of the -Amiga filesystem will be owned by the user who actually mounts the -filesystem (the root directory doesn't have uid/gid fields). - -Linux -> Amiga: - -The Linux rwxrwxrwx file mode is handled as follows: - - - r permission will set R for user, group and others. - - - w permission will set W and D for user, group and others. - - - x permission of the user will set E for plain files. - - - All other flags (suid, sgid, ...) are ignored and will - not be retained. - -Newly created files and directories will get the user and group ID -of the current user and a mode according to the umask. - -Symbolic links -============== - -Although the Amiga and Linux file systems resemble each other, there -are some, not always subtle, differences. One of them becomes apparent -with symbolic links. 
While Linux has a file system with exactly one -root directory, the Amiga has a separate root directory for each -file system (for example, partition, floppy disk, ...). With the Amiga, -these entities are called "volumes". They have symbolic names which -can be used to access them. Thus, symbolic links can point to a -different volume. AFFS turns the volume name into a directory name -and prepends the prefix path (see prefix option) to it. - -Example: -You mount all your Amiga partitions under /amiga/ (where - is the name of the volume), and you give the option -"prefix=/amiga/" when mounting all your AFFS partitions. (They -might be "User", "WB" and "Graphics", the mount points /amiga/User, -/amiga/WB and /amiga/Graphics). A symbolic link referring to -"User:sc/include/dos/dos.h" will be followed to -"/amiga/User/sc/include/dos/dos.h". - -Examples -======== - -Command line: - mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose - mount /dev/sda3 /Amiga -t affs - -/etc/fstab entry: - /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0 - -IMPORTANT NOTE -============== - -If you boot Windows 95 (don't know about 3.x, 98 and NT) while you -have an Amiga harddisk connected to your PC, it will overwrite -the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating -the Rigid Disk Block. Sheer luck has it that this is an unused -area of the RDB, so only the checksum doesn't match anymore. -Linux will ignore this garbage and recognize the RDB anyway, but -before you connect that drive to your Amiga again, you must -restore or repair your RDB. So please do make a backup copy of it -before booting Windows! - -If the damage is already done, the following should fix the RDB -(where is the device name). 
-DO AT YOUR OWN RISK: - - dd if=/dev/ of=rdb.tmp count=1 - cp rdb.tmp rdb.fixed - dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4 - dd if=rdb.fixed of=/dev/ - -Bugs, Restrictions, Caveats -=========================== - -Quite a few things may not work as advertised. Not everything is -tested, though several hundred MB have been read and written using -this fs. For a most up-to-date list of bugs please consult -fs/affs/Changes. - -By default, filenames are truncated to 30 characters without warning. -'nofilenametruncate' mount option can change that behavior. - -Case is ignored by the affs in filename matching, but Linux shells -do care about the case. Example (with /wb being an affs mounted fs): - rm /wb/WRONGCASE -will remove /mnt/wrongcase, but - rm /wb/WR* -will not since the names are matched by the shell. - -The block allocation is designed for hard disk partitions. If more -than 1 process writes to a (small) diskette, the blocks are allocated -in an ugly way (but the real AFFS doesn't do much better). This -is also true when space gets tight. - -You cannot execute programs on an OFS (Old File System), since the -program files cannot be memory mapped due to the 488 byte blocks. -For the same reason you cannot mount an image on such a filesystem -via the loopback device. - -The bitmap valid flag in the root block may not be accurate when the -system crashes while an affs partition is mounted. There's currently -no way to fix a garbled filesystem without an Amiga (disk validator) -or manually (who would do this?). Maybe later. - -If you mount affs partitions on system startup, you may want to tell -fsck that the fs should not be checked (place a '0' in the sixth field -of /etc/fstab). - -It's not possible to read floppy disks with a normal PC or workstation -due to an incompatibility with the Amiga floppy controller. 
- -If you are interested in an Amiga Emulator for Linux, look at - -http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/ diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 14dc89c94822..273d802ad5fb 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -48,6 +48,7 @@ Documentation for filesystem implementations. 9p adfs + affs autofs fuse overlayfs -- cgit From ca6e9049a0934fe72ffea6990c889205aff0a2cf Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:50 +0100 Subject: docs: filesystems: convert afs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Comment out text-only ToC; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/d77f5afdb5da0f8b0ec3dbe720aef23f1ce73bb5.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/afs.rst | 251 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/afs.txt | 258 ------------------------------------ Documentation/filesystems/index.rst | 1 + 3 files changed, 252 insertions(+), 258 deletions(-) create mode 100644 Documentation/filesystems/afs.rst delete mode 100644 Documentation/filesystems/afs.txt diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst new file mode 100644 index 000000000000..c4ec39a5966e --- /dev/null +++ b/Documentation/filesystems/afs.rst @@ -0,0 +1,251 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +kAFS: AFS FILESYSTEM +==================== + +.. Contents: + + - Overview. + - Usage. + - Mountpoints. + - Dynamic root. + - Proc filesystem. + - The cell database. + - Security. + - The @sys substitution. + + +Overview +======== + +This filesystem provides a fairly simple secure AFS filesystem driver. It is +under development and does not yet provide the full feature set. 
The features +it does support include: + + (*) Security (currently only AFS kaserver and KerberosIV tickets). + + (*) File reading and writing. + + (*) Automounting. + + (*) Local caching (via fscache). + +It does not yet support the following AFS features: + + (*) pioctl() system call. + + +Compilation +=========== + +The filesystem should be enabled by turning on the kernel configuration +options:: + + CONFIG_AF_RXRPC - The RxRPC protocol transport + CONFIG_RXKAD - The RxRPC Kerberos security handler + CONFIG_AFS - The AFS filesystem + +Additionally, the following can be turned on to aid debugging:: + + CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled + CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled + +They permit the debugging messages to be turned on dynamically by manipulating +the masks in the following files:: + + /sys/module/af_rxrpc/parameters/debug + /sys/module/kafs/parameters/debug + + +Usage +===== + +When inserting the driver modules the root cell must be specified along with a +list of volume location server IP addresses:: + + modprobe rxrpc + modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + +The first module is the AF_RXRPC network protocol driver. This provides the +RxRPC remote operation protocol and may also be accessed from userspace. See: + + Documentation/networking/rxrpc.txt + +The second module is the kerberos RxRPC security driver, and the third module +is the actual filesystem driver for the AFS filesystem. + +Once the module has been loaded, more modules can be added by the following +procedure:: + + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells + +Where the parameters to the "add" command are the name of a cell and a list of +volume location servers within that cell, with the latter separated by colons. + +Filesystems can be mounted anywhere by commands similar to the following:: + + mount -t afs "%cambridge.redhat.com:root.afs." 
/afs + mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge + mount -t afs "#root.afs." /afs + mount -t afs "#root.cell." /afs/cambridge + +Where the initial character is either a hash or a percent symbol depending on +whether you definitely want a R/W volume (percent) or whether you'd prefer a +R/O volume, but are willing to use a R/W volume instead (hash). + +The name of the volume can be suffixed with ".backup" or ".readonly" to +specify connection to only volumes of those types. + +The name of the cell is optional, and if not given during a mount, then the +named volume will be looked up in the cell specified during modprobe. + +Additional cells can be added through /proc (see later section). + + +Mountpoints +=========== + +AFS has a concept of mountpoints. In AFS terms, these are specially formatted +symbolic links (of the same form as the "device name" passed to mount). kAFS +presents these to the user as directories that have a follow-link capability +(ie: symbolic link semantics). If anyone attempts to access them, they will +automatically cause the target volume to be mounted (if possible) on that site. + +Automatically mounted filesystems will be automatically unmounted approximately +twenty minutes after they were last used. Alternatively they can be unmounted +directly with the umount() system call. + +Manually unmounting an AFS volume will cause any idle submounts upon it to be +culled first. If all are culled, then the requested volume will also be +unmounted, otherwise error EBUSY will be returned. + +This can be used by the administrator to attempt to unmount the whole AFS tree +mounted on /afs in one go by doing:: + + umount /afs + + +Dynamic Root +============ + +A mount option is available to create a serverless mount that is only usable +for dynamic lookup. Creating such a mount can be done by, for example:: + + mount -t afs none /afs -o dyn + +This creates a mount that just has an empty directory at the root.
Attempting +to look up a name in this directory will cause a mountpoint to be created that +looks up a cell of the same name, for example:: + + ls /afs/grand.central.org/ + + +Proc Filesystem +=============== + +The AFS module creates a "/proc/fs/afs/" directory and populates it: + + (*) A "cells" file that lists cells currently known to the afs module and + their usage counts:: + + [root@andromeda ~]# cat /proc/fs/afs/cells + USE NAME + 3 cambridge.redhat.com + + (*) A directory per cell that contains files that list volume location + servers, volumes, and active servers known within that cell:: + + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers + USE ADDR STATE + 4 172.16.18.91 0 + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers + ADDRESS + 172.16.18.91 + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes + USE STT VLID[0] VLID[1] VLID[2] NAME + 1 Val 20000000 20000001 20000002 root.afs + + +The Cell Database +================= + +The filesystem maintains an internal database of all the cells it knows and the +IP addresses of the volume location servers for those cells. The cell to which +the system belongs is added to the database when modprobe is performed by the +"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on +the kernel command line. + +Further cells can be added by commands similar to the following:: + + echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells + +No other cell database operations are available at this time. + + +Security +======== + +Secure operations are initiated by acquiring a key using the klog program.
A +very primitive klog program is available at: + + http://people.redhat.com/~dhowells/rxrpc/klog.c + +This should be compiled by:: + + make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils" + +And then run as:: + + ./klog + +Assuming it's successful, this adds a key of type RxRPC, named for the service +and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or +by cat'ing /proc/keys:: + + [root@andromeda ~]# keyctl show + Session Keyring + -3 --alswrv 0 0 keyring: _ses.3268 + 2 --alswrv 0 0 \_ keyring: _uid.0 + 111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM + +Currently the username, realm, password and proposed ticket lifetime are +compiled in to the program. + +It is not required to acquire a key before using AFS facilities, but if one is +not acquired then all operations will be governed by the anonymous user parts +of the ACLs. + +If a key is acquired, then all AFS operations, including mounts and automounts, +made by a possessor of that key will be secured with that key. + +If a file is opened with a particular key and then the file descriptor is +passed to a process that doesn't have that key (perhaps over an AF_UNIX +socket), then the operations on the file will be made with the key that was used to +open the file. + + +The @sys Substitution +===================== + +The list of up to 16 @sys substitutions for the current network namespace can +be configured by writing a list to /proc/fs/afs/sysname:: + + [root@andromeda ~]# echo foo amd64_linux_26 >/proc/fs/afs/sysname + +or cleared entirely by writing an empty list:: + + [root@andromeda ~]# echo >/proc/fs/afs/sysname + +The current list for the current network namespace can be retrieved by:: + + [root@andromeda ~]# cat /proc/fs/afs/sysname + foo + amd64_linux_26 + +When @sys is being substituted for, each element of the list is tried in the +order given. + +By default, the list will contain one item that conforms to the pattern +"<arch>_linux_26", amd64 being the name for x86_64.
diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt deleted file mode 100644 index 8c6ea7b41048..000000000000 --- a/Documentation/filesystems/afs.txt +++ /dev/null @@ -1,258 +0,0 @@ - ==================== - kAFS: AFS FILESYSTEM - ==================== - -Contents: - - - Overview. - - Usage. - - Mountpoints. - - Dynamic root. - - Proc filesystem. - - The cell database. - - Security. - - The @sys substitution. - - -======== -OVERVIEW -======== - -This filesystem provides a fairly simple secure AFS filesystem driver. It is -under development and does not yet provide the full feature set. The features -it does support include: - - (*) Security (currently only AFS kaserver and KerberosIV tickets). - - (*) File reading and writing. - - (*) Automounting. - - (*) Local caching (via fscache). - -It does not yet support the following AFS features: - - (*) pioctl() system call. - - -=========== -COMPILATION -=========== - -The filesystem should be enabled by turning on the kernel configuration -options: - - CONFIG_AF_RXRPC - The RxRPC protocol transport - CONFIG_RXKAD - The RxRPC Kerberos security handler - CONFIG_AFS - The AFS filesystem - -Additionally, the following can be turned on to aid debugging: - - CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled - CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled - -They permit the debugging messages to be turned on dynamically by manipulating -the masks in the following files: - - /sys/module/af_rxrpc/parameters/debug - /sys/module/kafs/parameters/debug - - -===== -USAGE -===== - -When inserting the driver modules the root cell must be specified along with a -list of volume location server IP addresses: - - modprobe rxrpc - modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 - -The first module is the AF_RXRPC network protocol driver. This provides the -RxRPC remote operation protocol and may also be accessed from userspace. 
See: - - Documentation/networking/rxrpc.txt - -The second module is the kerberos RxRPC security driver, and the third module -is the actual filesystem driver for the AFS filesystem. - -Once the module has been loaded, more modules can be added by the following -procedure: - - echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells - -Where the parameters to the "add" command are the name of a cell and a list of -volume location servers within that cell, with the latter separated by colons. - -Filesystems can be mounted anywhere by commands similar to the following: - - mount -t afs "%cambridge.redhat.com:root.afs." /afs - mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge - mount -t afs "#root.afs." /afs - mount -t afs "#root.cell." /afs/cambridge - -Where the initial character is either a hash or a percent symbol depending on -whether you definitely want a R/W volume (percent) or whether you'd prefer a -R/O volume, but are willing to use a R/W volume instead (hash). - -The name of the volume can be suffixes with ".backup" or ".readonly" to -specify connection to only volumes of those types. - -The name of the cell is optional, and if not given during a mount, then the -named volume will be looked up in the cell specified during modprobe. - -Additional cells can be added through /proc (see later section). - - -=========== -MOUNTPOINTS -=========== - -AFS has a concept of mountpoints. In AFS terms, these are specially formatted -symbolic links (of the same form as the "device name" passed to mount). kAFS -presents these to the user as directories that have a follow-link capability -(ie: symbolic link semantics). If anyone attempts to access them, they will -automatically cause the target volume to be mounted (if possible) on that site. - -Automatically mounted filesystems will be automatically unmounted approximately -twenty minutes after they were last used. 
Alternatively they can be unmounted -directly with the umount() system call. - -Manually unmounting an AFS volume will cause any idle submounts upon it to be -culled first. If all are culled, then the requested volume will also be -unmounted, otherwise error EBUSY will be returned. - -This can be used by the administrator to attempt to unmount the whole AFS tree -mounted on /afs in one go by doing: - - umount /afs - - -============ -DYNAMIC ROOT -============ - -A mount option is available to create a serverless mount that is only usable -for dynamic lookup. Creating such a mount can be done by, for example: - - mount -t afs none /afs -o dyn - -This creates a mount that just has an empty directory at the root. Attempting -to look up a name in this directory will cause a mountpoint to be created that -looks up a cell of the same name, for example: - - ls /afs/grand.central.org/ - - -=============== -PROC FILESYSTEM -=============== - -The AFS modules creates a "/proc/fs/afs/" directory and populates it: - - (*) A "cells" file that lists cells currently known to the afs module and - their usage counts: - - [root@andromeda ~]# cat /proc/fs/afs/cells - USE NAME - 3 cambridge.redhat.com - - (*) A directory per cell that contains files that list volume location - servers, volumes, and active servers known within that cell. - - [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers - USE ADDR STATE - 4 172.16.18.91 0 - [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers - ADDRESS - 172.16.18.91 - [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes - USE STT VLID[0] VLID[1] VLID[2] NAME - 1 Val 20000000 20000001 20000002 root.afs - - -================= -THE CELL DATABASE -================= - -The filesystem maintains an internal database of all the cells it knows and the -IP addresses of the volume location servers for those cells. 
The cell to which -the system belongs is added to the database when modprobe is performed by the -"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on -the kernel command line. - -Further cells can be added by commands similar to the following: - - echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells - echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells - -No other cell database operations are available at this time. - - -======== -SECURITY -======== - -Secure operations are initiated by acquiring a key using the klog program. A -very primitive klog program is available at: - - http://people.redhat.com/~dhowells/rxrpc/klog.c - -This should be compiled by: - - make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils" - -And then run as: - - ./klog - -Assuming it's successful, this adds a key of type RxRPC, named for the service -and cell, eg: "afs@". This can be viewed with the keyctl program or -by cat'ing /proc/keys: - - [root@andromeda ~]# keyctl show - Session Keyring - -3 --alswrv 0 0 keyring: _ses.3268 - 2 --alswrv 0 0 \_ keyring: _uid.0 - 111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM - -Currently the username, realm, password and proposed ticket lifetime are -compiled in to the program. - -It is not required to acquire a key before using AFS facilities, but if one is -not acquired then all operations will be governed by the anonymous user parts -of the ACLs. - -If a key is acquired, then all AFS operations, including mounts and automounts, -made by a possessor of that key will be secured with that key. - -If a file is opened with a particular key and then the file descriptor is -passed to a process that doesn't have that key (perhaps over an AF_UNIX -socket), then the operations on the file will be made with key that was used to -open the file. 
- - -===================== -THE @SYS SUBSTITUTION -===================== - -The list of up to 16 @sys substitutions for the current network namespace can -be configured by writing a list to /proc/fs/afs/sysname: - - [root@andromeda ~]# echo foo amd64_linux_26 >/proc/fs/afs/sysname - -or cleared entirely by writing an empty list: - - [root@andromeda ~]# echo >/proc/fs/afs/sysname - -The current list for current network namespace can be retrieved by: - - [root@andromeda ~]# cat /proc/fs/afs/sysname - foo - amd64_linux_26 - -When @sys is being substituted for, each element of the list is tried in the -order given. - -By default, the list will contain one item that conforms to the pattern -"_linux_26", amd64 being the name for x86_64. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 273d802ad5fb..0598bc52abdc 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -49,6 +49,7 @@ Documentation for filesystem implementations. 9p adfs affs + afs autofs fuse overlayfs -- cgit From c64d3dc69f38a08d082813f1c0425d7a108ef950 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:51 +0100 Subject: docs: filesystems: convert autofs-mount-control.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/8cae057ae244d0f5b58d3c209bcdae5ed82bc52c.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/autofs-mount-control.rst | 410 +++++++++++++++++++++ Documentation/filesystems/autofs-mount-control.txt | 408 -------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 411 insertions(+), 408 deletions(-) create mode 100644 Documentation/filesystems/autofs-mount-control.rst delete mode 100644 Documentation/filesystems/autofs-mount-control.txt diff --git a/Documentation/filesystems/autofs-mount-control.rst b/Documentation/filesystems/autofs-mount-control.rst new file mode 100644 index 000000000000..2903aed92316 --- /dev/null +++ b/Documentation/filesystems/autofs-mount-control.rst @@ -0,0 +1,410 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================================================== +Miscellaneous Device control operations for the autofs kernel module +==================================================================== + +The problem +=========== + +There is a problem with active restarts in autofs (that is to say +restarting autofs when there are busy mounts). + +During normal operation autofs uses a file descriptor opened on the +directory that is being managed in order to be able to issue control +operations. Using a file descriptor gives ioctl operations access to +autofs specific information stored in the super block. The operations +are things such as setting an autofs mount catatonic, setting the +expire timeout and requesting expire checks. As is explained below, +certain types of autofs triggered mounts can end up covering an autofs +mount itself which prevents us being able to use open(2) to obtain a +file descriptor for these operations if we don't already have one open. + +Currently autofs uses "umount -l" (lazy umount) to clear active mounts +at restart. 
While using lazy umount works for most cases, anything that +needs to walk back up the mount tree to construct a path, such as +getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works +because the point from which the path is constructed has been detached +from the mount tree. + +The actual problem with autofs is that it can't reconnect to existing +mounts. One immediately thinks that just adding the ability to remount +autofs file systems would solve it, but alas, that can't work. This is +because autofs direct mounts and the implementation of "on demand mount +and expire" of nested mount trees have the file system mounted directly +on top of the mount trigger directory dentry. + +For example, there are two types of automount maps, direct (in the kernel +module source you will see a third type called an offset, which is just +a direct mount in disguise) and indirect. + +Here is a master map with direct and indirect map entries:: + + /- /etc/auto.direct + /test /etc/auto.indirect + +and the corresponding map files:: + + /etc/auto.direct: + + /automount/dparse/g6 budgie:/autofs/export1 + /automount/dparse/g1 shark:/autofs/export1 + and so on. + +/etc/auto.indirect:: + + g1 shark:/autofs/export1 + g6 budgie:/autofs/export1 + and so on. + +For the above indirect map an autofs file system is mounted on /test and +mounts are triggered for each sub-directory key by the inode lookup +operation. So we see a mount of shark:/autofs/export1 on /test/g1, for +example. + +The way that direct mounts are handled is by making an autofs mount on +each full path, such as /automount/dparse/g1, and using it as a mount +trigger. So when we walk on the path we mount shark:/autofs/export1 "on +top of this mount point". Since these are always directories we can +use the follow_link inode operation to trigger the mount. + +But, each entry in direct and indirect maps can have offsets (making +them multi-mount map entries).
+ +For example, an indirect mount map entry could also be:: + + g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export1 \ + /s2/ss2 shark:/autofs/export2 + +and similarly a direct mount map entry could also be:: + + /automount/dparse/g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export2 \ + /s2/ss2 shark:/autofs/export2 + +One of the issues with version 4 of autofs was that, when mounting an +entry with a large number of offsets, possibly with nesting, we needed +to mount and umount all of the offsets as a single unit. Not really a +problem, except for people with a large number of offsets in map entries. +This mechanism is used for the well known "hosts" map and we have seen +cases (in 2.4) where the available number of mounts is exhausted or +where the number of privileged ports available is exhausted. + +In version 5 we mount only as we go down the tree of offsets and +similarly for expiring them which resolves the above problem. There is +somewhat more detail to the implementation but it isn't needed for the +sake of the problem explanation. The one important detail is that these +offsets are implemented using the same mechanism as the direct mounts +above and so the mount points can be covered by a mount. + +The current autofs implementation uses an ioctl file descriptor opened +on the mount point for control operations. The references held by the +descriptor are accounted for in checks made to determine if a mount is +in use, and the descriptor is also used to access autofs file system information held +in the mount super block. So the use of a file handle needs to be +retained.
+ + +The Solution +============ + +To be able to restart autofs leaving existing direct, indirect and +offset mounts in place we need to be able to obtain a file handle +for these potentially covered autofs mount points. Rather than just +implement an isolated operation it was decided to re-implement the +existing ioctl interface and add new operations to provide this +functionality. + +In addition, to be able to reconstruct a mount tree that has busy mounts, +the uid and gid of the last user that triggered the mount needs to be +available because these can be used as macro substitution variables in +autofs maps. They are recorded at mount request time and an operation +has been added to retrieve them. + +Since we're re-implementing the control interface, a couple of other +problems with the existing interface have been addressed. First, when +a mount or expire operation completes a status is returned to the +kernel by either a "send ready" or a "send fail" operation. The +"send fail" operation of the ioctl interface could only ever send +ENOENT so the re-implementation allows user space to send an actual +status. Another expensive operation in user space, for those using +very large maps, is discovering if a mount is present. Usually this +involves scanning /proc/mounts and since it needs to be done quite +often it can introduce significant overhead when there are many entries +in the mount table. An operation to lookup the mount status of a mount +point dentry (covered or not) has also been added. + +Current kernel development policy recommends avoiding the use of the +ioctl mechanism in favor of systems such as Netlink. An implementation +using this system was attempted to evaluate its suitability and it was +found to be inadequate, in this case. The Generic Netlink system was +used for this as raw Netlink would lead to a significant increase in +complexity. 
There's no question that the Generic Netlink system is an +elegant solution for common case ioctl functions but it's not a complete +replacement probably because its primary purpose in life is to be a +message bus implementation rather than specifically an ioctl replacement. +While it would be possible to work around this there is one concern +that led to the decision to not use it. This is that the autofs +expire in the daemon has become far too complex because umount +candidates are enumerated, almost for no other reason than to "count" +the number of times to call the expire ioctl. This involves scanning +the mount table which has proved to be a big overhead for users with +large maps. The best way to improve this is to try to get back to the +way the expire was done long ago. That is, when an expire request is +issued for a mount (file handle) we should continually call back to +the daemon until we can't umount any more mounts, then return the +appropriate status to the daemon. At the moment we just expire one +mount at a time. A Generic Netlink implementation would exclude this +possibility for future development due to the requirements of the +message bus architecture. + + +autofs Miscellaneous Device mount control interface +==================================================== + +The control interface is accessed by opening a device node, typically /dev/autofs.
+ +All the ioctls use a common structure to pass the needed parameter +information and return operation results:: + + struct autofs_dev_ioctl { + __u32 ver_major; + __u32 ver_minor; + __u32 size; /* total size of data passed in + * including this struct */ + __s32 ioctlfd; /* automount command fd */ + + /* Command parameters */ + union { + struct args_protover protover; + struct args_protosubver protosubver; + struct args_openmount openmount; + struct args_ready ready; + struct args_fail fail; + struct args_setpipefd setpipefd; + struct args_timeout timeout; + struct args_requester requester; + struct args_expire expire; + struct args_askumount askumount; + struct args_ismountpoint ismountpoint; + }; + + char path[0]; + }; + +The ioctlfd field is a mount point file descriptor of an autofs mount +point. It is returned by the open call and is used by all calls except +the check for whether a given path is a mount point, where it may +optionally be used to check a specific mount corresponding to a given +mount point file descriptor, and when requesting the uid and gid of the +last successful mount on a directory within the autofs file system. + +The union is used to communicate parameters and results of calls made +as described below. + +The path field is used to pass a path where it is needed and the size field +is used to account for the increased structure length when translating the +structure sent from user space. + +This structure can be initialized before setting specific fields by using +the void function call init_autofs_dev_ioctl(``struct autofs_dev_ioctl *``). + +All of the ioctls perform a copy of this structure from user space to +kernel space and return -EINVAL if the size parameter is smaller than +the structure size itself, -ENOMEM if the kernel memory allocation fails +or -EFAULT if the copy itself fails. Other checks include a version check +of the compiled in user space version against the module version and a +mismatch results in a -EINVAL return.
If the size field is greater than +the structure size then a path is assumed to be present and is checked to +ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is +returned. Following these checks, for all ioctl commands except +AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and +AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is +not a valid descriptor or doesn't correspond to an autofs mount point +an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is +returned. + + +The ioctls +========== + +An example of an implementation which uses this interface can be seen +in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the +distribution tar available for download from kernel.org in directory +/pub/linux/daemons/autofs/v5. + +The device node ioctl operations implemented by this interface are: + + +AUTOFS_DEV_IOCTL_VERSION +------------------------ + +Get the major and minor version of the autofs device ioctl kernel module +implementation. It requires an initialized struct autofs_dev_ioctl as an +input parameter and sets the version information in the passed in structure. +It returns 0 on success or the error -EINVAL if a version mismatch is +detected. + + +AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD +------------------------------------------------------------------ + +Get the major and minor version of the autofs protocol version understood +by the loaded module. This call requires an initialized struct autofs_dev_ioctl +with the ioctlfd field set to a valid autofs mount point descriptor +and sets the requested version number in the version field of struct args_protover +or the sub_version field of struct args_protosubver. These commands return +0 on success or one of the negative error codes if validation fails.
+ + +AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT +---------------------------------------------------------- + +Obtain and release a file descriptor for an autofs managed mount point +path. The open call requires an initialized struct autofs_dev_ioctl with +the path field set and the size field adjusted appropriately as well +as the devid field of struct args_openmount set to the device number of +the autofs mount. The device number can be obtained from the mount options +shown in /proc/mounts. The close call requires an initialized struct +autofs_dev_ioctl with the ioctlfd field set to the descriptor obtained +from the open call. The release of the file descriptor can also be done +with close(2) so any open descriptors will also be closed at process exit. +The close call is included in the implemented operations largely for +completeness and to provide for a consistent user space implementation. + + +AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD +-------------------------------------------------------- + +Return mount and expire result status from user space to the kernel. +Both of these calls require an initialized struct autofs_dev_ioctl +with the ioctlfd field set to the descriptor obtained from the open +call and the token field of struct args_ready or struct args_fail set +to the wait queue token number, received by user space in the foregoing +mount or expire request. The status field of struct args_fail is set to +the errno of the operation. It is set to 0 on success. + + +AUTOFS_DEV_IOCTL_SETPIPEFD_CMD +------------------------------ + +Set the pipe file descriptor used for kernel communication to the daemon. +Normally this is set at mount time using an option but when reconnecting +to an existing mount we need to use this to tell the autofs mount about +the new kernel pipe descriptor. In order to protect mounts against +incorrectly setting the pipe descriptor we also require that the autofs +mount be catatonic (see next call).
+ +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +the pipefd field of struct args_setpipefd set to the descriptor of the pipe. +On success the call also sets the process group id used to identify the +controlling process (e.g. the owning automount(8) daemon) to the process +group of the caller. + + +AUTOFS_DEV_IOCTL_CATATONIC_CMD +------------------------------ + +Make the autofs mount point catatonic. The autofs mount will no longer +issue mount requests, the kernel communication pipe descriptor is released +and any remaining waits in the queue released. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_TIMEOUT_CMD +---------------------------- + +Set the expire timeout for mounts within an autofs mount point. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_REQUESTER_CMD +------------------------------ + +Return the uid and gid of the last process to successfully trigger the +mount on the given path dentry. + +The call requires an initialized struct autofs_dev_ioctl with the path +field set to the mount point in question and the size field adjusted +appropriately. Upon return the uid field of struct args_requester contains +the uid and the gid field contains the gid. + +When reconstructing an autofs mount tree with active mounts we need to +re-connect to mounts that may have used the original process uid and +gid (or string variations of them) for mount lookups within the map entry. +This call provides the ability to obtain this uid and gid so they may be +used by user space for the mount map lookups. + + +AUTOFS_DEV_IOCTL_EXPIRE_CMD +--------------------------- + +Issue an expire request to the kernel for an autofs mount.
Typically +this ioctl is called until no further expire candidates are found. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. In +addition an immediate expire that's independent of the mount timeout, +and a forced expire that's independent of whether the mount is busy, +can be requested by setting the how field of struct args_expire to +AUTOFS_EXP_IMMEDIATE or AUTOFS_EXP_FORCED, respectively. If no +expire candidates can be found the ioctl returns -1 with errno set to +EAGAIN. + +This call causes the kernel module to check the mount corresponding +to the given ioctlfd for mounts that can be expired, issue an expire +request back to the daemon and wait for completion. + +AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD +------------------------------ + +Checks if an autofs mount point is in use. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +it returns the result in the may_umount field of struct args_askumount, +1 for busy and 0 otherwise. + + +AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD +--------------------------------- + +Check if the given path is a mountpoint. + +The call requires an initialized struct autofs_dev_ioctl. There are two +possible variations. Both use the path field set to the path of the mount +point to check and the size field adjusted appropriately. One uses the +ioctlfd field to identify a specific mount point to check while the other +variation uses the path and optionally in.type field of struct args_ismountpoint +set to an autofs mount type. The call returns 1 if this is a mount point +and sets out.devid field to the device number of the mount and out.magic +field to the relevant super block magic number (described below) or 0 if +it isn't a mountpoint. In both cases the device number (as returned +by new_encode_dev()) is returned in out.devid field.
+ +If supplied with a file descriptor we're looking for a specific mount, +not necessarily at the top of the mounted stack. In this case the path +the descriptor corresponds to is considered a mountpoint if it is itself +a mountpoint or contains a mount, such as a multi-mount without a root +mount. In this case we return 1 if the descriptor corresponds to a mount +point and also return the super magic of the covering mount if there +is one or 0 if it isn't a mountpoint. + +If a path is supplied (and the ioctlfd field is set to -1) then the path +is looked up and is checked to see if it is the root of a mount. If a +type is also given we are looking for a particular autofs mount and if +a match isn't found a fail is returned. If the located path is the +root of a mount 1 is returned along with the super magic of the mount +or 0 otherwise. diff --git a/Documentation/filesystems/autofs-mount-control.txt b/Documentation/filesystems/autofs-mount-control.txt deleted file mode 100644 index acc02fc57993..000000000000 --- a/Documentation/filesystems/autofs-mount-control.txt +++ /dev/null @@ -1,408 +0,0 @@ - -Miscellaneous Device control operations for the autofs kernel module -==================================================================== - -The problem -=========== - -There is a problem with active restarts in autofs (that is to say -restarting autofs when there are busy mounts). - -During normal operation autofs uses a file descriptor opened on the -directory that is being managed in order to be able to issue control -operations. Using a file descriptor gives ioctl operations access to -autofs specific information stored in the super block. The operations -are things such as setting an autofs mount catatonic, setting the -expire timeout and requesting expire checks.
As is explained below, -certain types of autofs triggered mounts can end up covering an autofs -mount itself which prevents us being able to use open(2) to obtain a -file descriptor for these operations if we don't already have one open. - -Currently autofs uses "umount -l" (lazy umount) to clear active mounts -at restart. While using lazy umount works for most cases, anything that -needs to walk back up the mount tree to construct a path, such as -getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works -because the point from which the path is constructed has been detached -from the mount tree. - -The actual problem with autofs is that it can't reconnect to existing -mounts. Immediately one thinks of just adding the ability to remount -autofs file systems would solve it, but alas, that can't work. This is -because autofs direct mounts and the implementation of "on demand mount -and expire" of nested mount trees have the file system mounted directly -on top of the mount trigger directory dentry. - -For example, there are two types of automount maps, direct (in the kernel -module source you will see a third type called an offset, which is just -a direct mount in disguise) and indirect. - -Here is a master map with direct and indirect map entries: - -/- /etc/auto.direct -/test /etc/auto.indirect - -and the corresponding map files: - -/etc/auto.direct: - -/automount/dparse/g6 budgie:/autofs/export1 -/automount/dparse/g1 shark:/autofs/export1 -and so on. - -/etc/auto.indirect: - -g1 shark:/autofs/export1 -g6 budgie:/autofs/export1 -and so on. - -For the above indirect map an autofs file system is mounted on /test and -mounts are triggered for each sub-directory key by the inode lookup -operation. So we see a mount of shark:/autofs/export1 on /test/g1, for -example. - -The way that direct mounts are handled is by making an autofs mount on -each full path, such as /automount/dparse/g1, and using it as a mount -trigger.
So when we walk on the path we mount shark:/autofs/export1 "on -top of this mount point". Since these are always directories we can -use the follow_link inode operation to trigger the mount. - -But, each entry in direct and indirect maps can have offsets (making -them multi-mount map entries). - -For example, an indirect mount map entry could also be: - -g1 \ - / shark:/autofs/export5/testing/test \ - /s1 shark:/autofs/export/testing/test/s1 \ - /s2 shark:/autofs/export5/testing/test/s2 \ - /s1/ss1 shark:/autofs/export1 \ - /s2/ss2 shark:/autofs/export2 - -and a similarly a direct mount map entry could also be: - -/automount/dparse/g1 \ - / shark:/autofs/export5/testing/test \ - /s1 shark:/autofs/export/testing/test/s1 \ - /s2 shark:/autofs/export5/testing/test/s2 \ - /s1/ss1 shark:/autofs/export2 \ - /s2/ss2 shark:/autofs/export2 - -One of the issues with version 4 of autofs was that, when mounting an -entry with a large number of offsets, possibly with nesting, we needed -to mount and umount all of the offsets as a single unit. Not really a -problem, except for people with a large number of offsets in map entries. -This mechanism is used for the well known "hosts" map and we have seen -cases (in 2.4) where the available number of mounts are exhausted or -where the number of privileged ports available is exhausted. - -In version 5 we mount only as we go down the tree of offsets and -similarly for expiring them which resolves the above problem. There is -somewhat more detail to the implementation but it isn't needed for the -sake of the problem explanation. The one important detail is that these -offsets are implemented using the same mechanism as the direct mounts -above and so the mount points can be covered by a mount. - -The current autofs implementation uses an ioctl file descriptor opened -on the mount point for control operations. 
The references held by the -descriptor are accounted for in checks made to determine if a mount is -in use and is also used to access autofs file system information held -in the mount super block. So the use of a file handle needs to be -retained. - - -The Solution -============ - -To be able to restart autofs leaving existing direct, indirect and -offset mounts in place we need to be able to obtain a file handle -for these potentially covered autofs mount points. Rather than just -implement an isolated operation it was decided to re-implement the -existing ioctl interface and add new operations to provide this -functionality. - -In addition, to be able to reconstruct a mount tree that has busy mounts, -the uid and gid of the last user that triggered the mount needs to be -available because these can be used as macro substitution variables in -autofs maps. They are recorded at mount request time and an operation -has been added to retrieve them. - -Since we're re-implementing the control interface, a couple of other -problems with the existing interface have been addressed. First, when -a mount or expire operation completes a status is returned to the -kernel by either a "send ready" or a "send fail" operation. The -"send fail" operation of the ioctl interface could only ever send -ENOENT so the re-implementation allows user space to send an actual -status. Another expensive operation in user space, for those using -very large maps, is discovering if a mount is present. Usually this -involves scanning /proc/mounts and since it needs to be done quite -often it can introduce significant overhead when there are many entries -in the mount table. An operation to lookup the mount status of a mount -point dentry (covered or not) has also been added. - -Current kernel development policy recommends avoiding the use of the -ioctl mechanism in favor of systems such as Netlink. 
An implementation -using this system was attempted to evaluate its suitability and it was -found to be inadequate, in this case. The Generic Netlink system was -used for this as raw Netlink would lead to a significant increase in -complexity. There's no question that the Generic Netlink system is an -elegant solution for common case ioctl functions but it's not a complete -replacement probably because its primary purpose in life is to be a -message bus implementation rather than specifically an ioctl replacement. -While it would be possible to work around this there is one concern -that lead to the decision to not use it. This is that the autofs -expire in the daemon has become far to complex because umount -candidates are enumerated, almost for no other reason than to "count" -the number of times to call the expire ioctl. This involves scanning -the mount table which has proved to be a big overhead for users with -large maps. The best way to improve this is try and get back to the -way the expire was done long ago. That is, when an expire request is -issued for a mount (file handle) we should continually call back to -the daemon until we can't umount any more mounts, then return the -appropriate status to the daemon. At the moment we just expire one -mount at a time. A Generic Netlink implementation would exclude this -possibility for future development due to the requirements of the -message bus architecture. - - -autofs Miscellaneous Device mount control interface -==================================================== - -The control interface is opening a device node, typically /dev/autofs. 
- -All the ioctls use a common structure to pass the needed parameter -information and return operation results: - -struct autofs_dev_ioctl { - __u32 ver_major; - __u32 ver_minor; - __u32 size; /* total size of data passed in - * including this struct */ - __s32 ioctlfd; /* automount command fd */ - - /* Command parameters */ - union { - struct args_protover protover; - struct args_protosubver protosubver; - struct args_openmount openmount; - struct args_ready ready; - struct args_fail fail; - struct args_setpipefd setpipefd; - struct args_timeout timeout; - struct args_requester requester; - struct args_expire expire; - struct args_askumount askumount; - struct args_ismountpoint ismountpoint; - }; - - char path[0]; -}; - -The ioctlfd field is a mount point file descriptor of an autofs mount -point. It is returned by the open call and is used by all calls except -the check for whether a given path is a mount point, where it may -optionally be used to check a specific mount corresponding to a given -mount point file descriptor, and when requesting the uid and gid of the -last successful mount on a directory within the autofs file system. - -The union is used to communicate parameters and results of calls made -as described below. - -The path field is used to pass a path where it is needed and the size field -is used account for the increased structure length when translating the -structure sent from user space. - -This structure can be initialized before setting specific fields by using -the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *). - -All of the ioctls perform a copy of this structure from user space to -kernel space and return -EINVAL if the size parameter is smaller than -the structure size itself, -ENOMEM if the kernel memory allocation fails -or -EFAULT if the copy itself fails. Other checks include a version check -of the compiled in user space version against the module version and a -mismatch results in a -EINVAL return. 
If the size field is greater than -the structure size then a path is assumed to be present and is checked to -ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is -returned. Following these checks, for all ioctl commands except -AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and -AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is -not a valid descriptor or doesn't correspond to an autofs mount point -an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is -returned. - - -The ioctls -========== - -An example of an implementation which uses this interface can be seen -in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the -distribution tar available for download from kernel.org in directory -/pub/linux/daemons/autofs/v5. - -The device node ioctl operations implemented by this interface are: - - -AUTOFS_DEV_IOCTL_VERSION ------------------------- - -Get the major and minor version of the autofs device ioctl kernel module -implementation. It requires an initialized struct autofs_dev_ioctl as an -input parameter and sets the version information in the passed in structure. -It returns 0 on success or the error -EINVAL if a version mismatch is -detected. - - -AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD ------------------------------------------------------------------- - -Get the major and minor version of the autofs protocol version understood -by loaded module. This call requires an initialized struct autofs_dev_ioctl -with the ioctlfd field set to a valid autofs mount point descriptor -and sets the requested version number in version field of struct args_protover -or sub_version field of struct args_protosubver. These commands return -0 on success or one of the negative error codes if validation fails. 
- - -AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT ----------------------------------------------------------- - -Obtain and release a file descriptor for an autofs managed mount point -path. The open call requires an initialized struct autofs_dev_ioctl with -the path field set and the size field adjusted appropriately as well -as the devid field of struct args_openmount set to the device number of -the autofs mount. The device number can be obtained from the mount options -shown in /proc/mounts. The close call requires an initialized struct -autofs_dev_ioct with the ioctlfd field set to the descriptor obtained -from the open call. The release of the file descriptor can also be done -with close(2) so any open descriptors will also be closed at process exit. -The close call is included in the implemented operations largely for -completeness and to provide for a consistent user space implementation. - - -AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD --------------------------------------------------------- - -Return mount and expire result status from user space to the kernel. -Both of these calls require an initialized struct autofs_dev_ioctl -with the ioctlfd field set to the descriptor obtained from the open -call and the token field of struct args_ready or struct args_fail set -to the wait queue token number, received by user space in the foregoing -mount or expire request. The status field of struct args_fail is set to -the errno of the operation. It is set to 0 on success. - - -AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ------------------------------- - -Set the pipe file descriptor used for kernel communication to the daemon. -Normally this is set at mount time using an option but when reconnecting -to a existing mount we need to use this to tell the autofs mount about -the new kernel pipe descriptor. In order to protect mounts against -incorrectly setting the pipe descriptor we also require that the autofs -mount be catatonic (see next call). 
- -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call and -the pipefd field of struct args_setpipefd set to descriptor of the pipe. -On success the call also sets the process group id used to identify the -controlling process (eg. the owning automount(8) daemon) to the process -group of the caller. - - -AUTOFS_DEV_IOCTL_CATATONIC_CMD ------------------------------- - -Make the autofs mount point catatonic. The autofs mount will no longer -issue mount requests, the kernel communication pipe descriptor is released -and any remaining waits in the queue released. - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call. - - -AUTOFS_DEV_IOCTL_TIMEOUT_CMD ----------------------------- - -Set the expire timeout for mounts within an autofs mount point. - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call. - - -AUTOFS_DEV_IOCTL_REQUESTER_CMD ------------------------------- - -Return the uid and gid of the last process to successfully trigger a the -mount on the given path dentry. - -The call requires an initialized struct autofs_dev_ioctl with the path -field set to the mount point in question and the size field adjusted -appropriately. Upon return the uid field of struct args_requester contains -the uid and gid field the gid. - -When reconstructing an autofs mount tree with active mounts we need to -re-connect to mounts that may have used the original process uid and -gid (or string variations of them) for mount lookups within the map entry. -This call provides the ability to obtain this uid and gid so they may be -used by user space for the mount map lookups. - - -AUTOFS_DEV_IOCTL_EXPIRE_CMD ---------------------------- - -Issue an expire request to the kernel for an autofs mount. 
Typically -this ioctl is called until no further expire candidates are found. - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call. In -addition an immediate expire that's independent of the mount timeout, -and a forced expire that's independent of whether the mount is busy, -can be requested by setting the how field of struct args_expire to -AUTOFS_EXP_IMMEDIATE or AUTOFS_EXP_FORCED, respectively . If no -expire candidates can be found the ioctl returns -1 with errno set to -EAGAIN. - -This call causes the kernel module to check the mount corresponding -to the given ioctlfd for mounts that can be expired, issues an expire -request back to the daemon and waits for completion. - -AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD ------------------------------- - -Checks if an autofs mount point is in use. - -The call requires an initialized struct autofs_dev_ioctl with the -ioctlfd field set to the descriptor obtained from the open call and -it returns the result in the may_umount field of struct args_askumount, -1 for busy and 0 otherwise. - - -AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD ---------------------------------- - -Check if the given path is a mountpoint. - -The call requires an initialized struct autofs_dev_ioctl. There are two -possible variations. Both use the path field set to the path of the mount -point to check and the size field adjusted appropriately. One uses the -ioctlfd field to identify a specific mount point to check while the other -variation uses the path and optionally in.type field of struct args_ismountpoint -set to an autofs mount type. The call returns 1 if this is a mount point -and sets out.devid field to the device number of the mount and out.magic -field to the relevant super block magic number (described below) or 0 if -it isn't a mountpoint. In both cases the the device number (as returned -by new_encode_dev()) is returned in out.devid field. 
- -If supplied with a file descriptor we're looking for a specific mount, -not necessarily at the top of the mounted stack. In this case the path -the descriptor corresponds to is considered a mountpoint if it is itself -a mountpoint or contains a mount, such as a multi-mount without a root -mount. In this case we return 1 if the descriptor corresponds to a mount -point and and also returns the super magic of the covering mount if there -is one or 0 if it isn't a mountpoint. - -If a path is supplied (and the ioctlfd field is set to -1) then the path -is looked up and is checked to see if it is the root of a mount. If a -type is also given we are looking for a particular autofs mount and if -a match isn't found a fail is returned. If the the located path is the -root of a mount 1 is returned along with the super magic of the mount -or 0 otherwise. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 0598bc52abdc..c9480138d47e 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -51,6 +51,7 @@ Documentation for filesystem implementations. affs afs autofs + autofs-mount-control fuse overlayfs virtiofs -- cgit From c54ad9a4e8faf080e6b395cc4b8298dfc5170255 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:52 +0100 Subject: docs: filesystems: convert befs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/3e29ea6df6cd569021cfa953ccb8ed7dfc146f3d.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/befs.rst | 128 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/befs.txt | 117 -------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 129 insertions(+), 117 deletions(-) create mode 100644 Documentation/filesystems/befs.rst delete mode 100644 Documentation/filesystems/befs.txt diff --git a/Documentation/filesystems/befs.rst b/Documentation/filesystems/befs.rst new file mode 100644 index 000000000000..79f9740d76ff --- /dev/null +++ b/Documentation/filesystems/befs.rst @@ -0,0 +1,128 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +BeOS filesystem for Linux +========================= + +Document last updated: Dec 6, 2001 + +Warning +======= +Make sure you understand that this is alpha software. This means that the +implementation is neither complete nor well-tested. + +I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! + +License +======= +This software is covered by the GNU General Public License. +See the file COPYING for the complete text of the license. +Or the GNU website: + +Author +====== +The largest part of the code written by Will Dyson +He has been working on the code since Aug 13, 2001. See the changelog for +details. + +Original Author: Makoto Kato + +His original code can still be found at: + + +Does anyone know of a more current email address for Makoto? He doesn't +respond to the address given above... + +This filesystem doesn't have a maintainer. + +What is this Driver? +==================== +This module implements the native filesystem of BeOS http://www.beincorporated.com/ +for the linux 2.4.1 and later kernels. Currently it is a read-only +implementation. + +Which is it, BFS or BEFS? 
+========================= +Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". +But Unixware Boot Filesystem is called bfs, too. And they are already in +the kernel. Because of this naming conflict, on Linux the BeOS +filesystem is called befs. + +How to Install +============== +step 1. Install the BeFS patch into the source code tree of linux. + +Apply the patchfile to your kernel source tree. +Assuming that your kernel source is in /foo/bar/linux and the patchfile +is called patch-befs-xxx, you would do the following:: + + cd /foo/bar/linux + patch -p1 < /path/to/patch-befs-xxx + +If the patching step fails (i.e. there are rejected hunks), you can try to +figure it out yourself (it shouldn't be hard), or mail the maintainer +(Will Dyson ) for help. + +step 2. Configuration & make kernel + +The linux kernel has many compile-time options. Most of them are beyond the +scope of this document. I suggest the Kernel-HOWTO document as a good general +reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html + +However, to use the BeFS module, you must enable it at configure time:: + + cd /foo/bar/linux + make menuconfig (or xconfig) + +The BeFS module is not a standard part of the linux kernel, so you must first +enable support for experimental code under the "Code maturity level" menu. + +Then, under the "Filesystems" menu will be an option called "BeFS +filesystem (experimental)", or something like that. Enable that option +(it is fine to make it a module). + +Save your kernel configuration and then build your kernel. + +step 3. Install + +See the kernel howto for +instructions on this critical step. + +Using BFS +========= +To use the BeOS filesystem, use filesystem type 'befs'. + +ex:: + + mount -t befs /dev/fd0 /beos + +Mount Options +============= + +============= =========================================================== +uid=nnn All files in the partition will be owned by user id nnn.
+gid=nnn All files in the partition will be in group nnn. +iocharset=xxx Use xxx as the name of the NLS translation table. +debug The driver will output debugging information to the syslog. +============= =========================================================== + +How to Get Latest Version +========================== + +The latest version is currently available at: + + +Any Known Bugs? +=============== +As of Jan 20, 2002: + + None + +Special Thanks +============== +Dominic Giampaolo ... Writing "Practical file system design with Be filesystem" + +Hiroyuki Yamada ... Testing LinuxPPC. + + + diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt deleted file mode 100644 index da45e6c842b8..000000000000 --- a/Documentation/filesystems/befs.txt +++ /dev/null @@ -1,117 +0,0 @@ -BeOS filesystem for Linux - -Document last updated: Dec 6, 2001 - -WARNING -======= -Make sure you understand that this is alpha software. This means that the -implementation is neither complete nor well-tested. - -I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! - -LICENSE -===== -This software is covered by the GNU General Public License. -See the file COPYING for the complete text of the license. -Or the GNU website: - -AUTHOR -===== -The largest part of the code written by Will Dyson -He has been working on the code since Aug 13, 2001. See the changelog for -details. - -Original Author: Makoto Kato -His original code can still be found at: - -Does anyone know of a more current email address for Makoto? He doesn't -respond to the address given above... - -This filesystem doesn't have a maintainer. - -WHAT IS THIS DRIVER? -================== -This module implements the native filesystem of BeOS http://www.beincorporated.com/ -for the linux 2.4.1 and later kernels. Currently it is a read-only -implementation. - -Which is it, BFS or BEFS? -================ -Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
-But Unixware Boot Filesystem is called bfs, too. And they are already in -the kernel. Because of this naming conflict, on Linux the BeOS -filesystem is called befs. - -HOW TO INSTALL -============== -step 1. Install the BeFS patch into the source code tree of linux. - -Apply the patchfile to your kernel source tree. -Assuming that your kernel source is in /foo/bar/linux and the patchfile -is called patch-befs-xxx, you would do the following: - - cd /foo/bar/linux - patch -p1 < /path/to/patch-befs-xxx - -if the patching step fails (i.e. there are rejected hunks), you can try to -figure it out yourself (it shouldn't be hard), or mail the maintainer -(Will Dyson ) for help. - -step 2. Configuration & make kernel - -The linux kernel has many compile-time options. Most of them are beyond the -scope of this document. I suggest the Kernel-HOWTO document as a good general -reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html - -However, to use the BeFS module, you must enable it at configure time. - - cd /foo/bar/linux - make menuconfig (or xconfig) - -The BeFS module is not a standard part of the linux kernel, so you must first -enable support for experimental code under the "Code maturity level" menu. - -Then, under the "Filesystems" menu will be an option called "BeFS -filesystem (experimental)", or something like that. Enable that option -(it is fine to make it a module). - -Save your kernel configuration and then build your kernel. - -step 3. Install - -See the kernel howto for -instructions on this critical step. - -USING BFS -========= -To use the BeOS filesystem, use filesystem type 'befs'. - -ex) - mount -t befs /dev/fd0 /beos - -MOUNT OPTIONS -============= -uid=nnn All files in the partition will be owned by user id nnn. -gid=nnn All files in the partition will be in group nnn. -iocharset=xxx Use xxx as the name of the NLS translation table. -debug The driver will output debugging information to the syslog. 
- -HOW TO GET LASTEST VERSION -========================== - -The latest version is currently available at: - - -ANY KNOWN BUGS? -=========== -As of Jan 20, 2002: - - None - -SPECIAL THANKS -============== -Dominic Giampalo ... Writing "Practical file system design with Be filesystem" -Hiroyuki Yamada ... Testing LinuxPPC. - - - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c9480138d47e..98de437f5500 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -52,6 +52,7 @@ Documentation for filesystem implementations. afs autofs autofs-mount-control + befs fuse overlayfs virtiofs -- cgit From ee68f34d7e7e553ffb74f09df0f3764fbfcf5d4b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:53 +0100 Subject: docs: filesystems: convert bfs.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/93991bcc05e419368ee1e585c81057fb2c7c8d2b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/bfs.rst | 60 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/bfs.txt | 57 ----------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 61 insertions(+), 57 deletions(-) create mode 100644 Documentation/filesystems/bfs.rst delete mode 100644 Documentation/filesystems/bfs.txt diff --git a/Documentation/filesystems/bfs.rst b/Documentation/filesystems/bfs.rst new file mode 100644 index 000000000000..ce14b9018807 --- /dev/null +++ b/Documentation/filesystems/bfs.rst @@ -0,0 +1,60 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +======================== +BFS Filesystem for Linux +======================== + +The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which +usually contains the kernel image and a few other files required for the +boot process. + +In order to access /stand partition under Linux you obviously need to +know the partition number and the kernel must support UnixWare disk slices +(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not +depend on having UnixWare disklabel support because one can also mount +BFS filesystem via loopback:: + + # losetup /dev/loop0 stand.img + # mount -t bfs /dev/loop0 /mnt/stand + +where stand.img is a file containing the image of BFS filesystem. +When you have finished using it and umounted you need to also deallocate +/dev/loop0 device by:: + + # losetup -d /dev/loop0 + +You can simplify mounting by just typing:: + + # mount -t bfs -o loop stand.img /mnt/stand + +this will allocate the first available loopback device (and load loop.o +kernel module if necessary) automatically. If the loopback driver is not +loaded automatically, make sure that you have compiled the module and +that modprobe is functioning. Beware that umount will not deallocate +/dev/loopN device if /etc/mtab file on your system is a symbolic link to +/proc/mounts. You will need to do it manually using "-d" switch of +losetup(8). Read losetup(8) manpage for more info. + +To create the BFS image under UnixWare you need to find out first which +slice contains it. The command prtvtoc(1M) is your friend:: + + # prtvtoc /dev/rdsk/c0b0t0d0s0 + +(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you +look for the slice with tag "STAND", which is usually slice 10. 
With this +information you can use dd(1) to create the BFS image:: + + # umount /stand + # dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 + +Just in case, you can verify that you have done the right thing by checking +the magic number:: + + # od -Ad -tx4 stand.img | more + +The first 4 bytes should be 0x1badface. + +If you have any patches, questions or suggestions regarding this BFS +implementation please contact the author: + +Tigran Aivazian diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt deleted file mode 100644 index 843ce91a2e40..000000000000 --- a/Documentation/filesystems/bfs.txt +++ /dev/null @@ -1,57 +0,0 @@ -BFS FILESYSTEM FOR LINUX -======================== - -The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which -usually contains the kernel image and a few other files required for the -boot process. - -In order to access /stand partition under Linux you obviously need to -know the partition number and the kernel must support UnixWare disk slices -(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not -depend on having UnixWare disklabel support because one can also mount -BFS filesystem via loopback: - -# losetup /dev/loop0 stand.img -# mount -t bfs /dev/loop0 /mnt/stand - -where stand.img is a file containing the image of BFS filesystem. -When you have finished using it and umounted you need to also deallocate -/dev/loop0 device by: - -# losetup -d /dev/loop0 - -You can simplify mounting by just typing: - -# mount -t bfs -o loop stand.img /mnt/stand - -this will allocate the first available loopback device (and load loop.o -kernel module if necessary) automatically. If the loopback driver is not -loaded automatically, make sure that you have compiled the module and -that modprobe is functioning. Beware that umount will not deallocate -/dev/loopN device if /etc/mtab file on your system is a symbolic link to -/proc/mounts. 
You will need to do it manually using "-d" switch of -losetup(8). Read losetup(8) manpage for more info. - -To create the BFS image under UnixWare you need to find out first which -slice contains it. The command prtvtoc(1M) is your friend: - -# prtvtoc /dev/rdsk/c0b0t0d0s0 - -(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you -look for the slice with tag "STAND", which is usually slice 10. With this -information you can use dd(1) to create the BFS image: - -# umount /stand -# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 - -Just in case, you can verify that you have done the right thing by checking -the magic number: - -# od -Ad -tx4 stand.img | more - -The first 4 bytes should be 0x1badface. - -If you have any patches, questions or suggestions regarding this BFS -implementation please contact the author: - -Tigran Aivazian diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 98de437f5500..f74e6b273d9f 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -53,6 +53,7 @@ Documentation for filesystem implementations. autofs autofs-mount-control befs + bfs fuse overlayfs virtiofs -- cgit From 5d43e1bc2dfccbb07ea662fa4536544f1b6efd43 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:54 +0100 Subject: docs: filesystems: convert btrfs.txt to ReST Just trivial changes: - Add a SPDX header; - Add it to filesystems/index.rst. While here, adjust document title, just to make it use the same style of the other docs. 
Signed-off-by: Mauro Carvalho Chehab Acked-by: David Sterba Link: https://lore.kernel.org/r/1ef76da4ac24a9a6f6187723554733c702ea19ae.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/btrfs.rst | 34 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/btrfs.txt | 31 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 35 insertions(+), 31 deletions(-) create mode 100644 Documentation/filesystems/btrfs.rst delete mode 100644 Documentation/filesystems/btrfs.txt diff --git a/Documentation/filesystems/btrfs.rst b/Documentation/filesystems/btrfs.rst new file mode 100644 index 000000000000..d0904f602819 --- /dev/null +++ b/Documentation/filesystems/btrfs.rst @@ -0,0 +1,34 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===== +BTRFS +===== + +Btrfs is a copy on write filesystem for Linux aimed at implementing advanced +features while focusing on fault tolerance, repair and easy administration. +Jointly developed by several companies, licensed under the GPL and open for +contribution from anyone. + +The main Btrfs features include: + + * Extent based file storage (2^64 max file size) + * Space efficient packing of small files + * Space efficient indexed directories + * Dynamic inode allocation + * Writable snapshots + * Subvolumes (separate internal filesystem roots) + * Object level mirroring and striping + * Checksums on data and metadata (multiple algorithms available) + * Compression + * Integrated multiple device support, with several raid algorithms + * Offline filesystem check + * Efficient incremental backup and FS mirroring + * Online filesystem defragmentation + +For more information please refer to the wiki + + https://btrfs.wiki.kernel.org + +that maintains information about administration tasks, frequently asked +questions, use cases, mount options, comprehensible changelogs, features, +manual pages, source code repositories, contacts etc. 
diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt deleted file mode 100644 index f9dad22d95ce..000000000000 --- a/Documentation/filesystems/btrfs.txt +++ /dev/null @@ -1,31 +0,0 @@ -BTRFS -===== - -Btrfs is a copy on write filesystem for Linux aimed at implementing advanced -features while focusing on fault tolerance, repair and easy administration. -Jointly developed by several companies, licensed under the GPL and open for -contribution from anyone. - -The main Btrfs features include: - - * Extent based file storage (2^64 max file size) - * Space efficient packing of small files - * Space efficient indexed directories - * Dynamic inode allocation - * Writable snapshots - * Subvolumes (separate internal filesystem roots) - * Object level mirroring and striping - * Checksums on data and metadata (multiple algorithms available) - * Compression - * Integrated multiple device support, with several raid algorithms - * Offline filesystem check - * Efficient incremental backup and FS mirroring - * Online filesystem defragmentation - -For more information please refer to the wiki - - https://btrfs.wiki.kernel.org - -that maintains information about administration tasks, frequently asked -questions, use cases, mount options, comprehensible changelogs, features, -manual pages, source code repositories, contacts etc. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f74e6b273d9f..dae862cf167e 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -54,6 +54,7 @@ Documentation for filesystem implementations. 
autofs-mount-control befs bfs + btrfs fuse overlayfs virtiofs -- cgit From 471379a174aa444b326d1b74e9f96a8b4b766b79 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:55 +0100 Subject: docs: filesystems: convert ceph.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Jeff Layton Link: https://lore.kernel.org/r/df2f142b5ca5842e030d8209482dfd62dcbe020f.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ceph.rst | 190 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ceph.txt | 186 ----------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 191 insertions(+), 186 deletions(-) create mode 100644 Documentation/filesystems/ceph.rst delete mode 100644 Documentation/filesystems/ceph.txt diff --git a/Documentation/filesystems/ceph.rst b/Documentation/filesystems/ceph.rst new file mode 100644 index 000000000000..b46a7218248f --- /dev/null +++ b/Documentation/filesystems/ceph.rst @@ -0,0 +1,190 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +Ceph Distributed File System +============================ + +Ceph is a distributed network file system designed to provide good +performance, reliability, and scalability. + +Basic features include: + + * POSIX semantics + * Seamless scaling from 1 to many thousands of nodes + * High availability and reliability. No single point of failure. 
+ * N-way replication of data across storage nodes + * Fast recovery from node failures + * Automatic rebalancing of data on node addition/removal + * Easy deployment: most FS components are userspace daemons + +Also, + + * Flexible snapshots (on any directory) + * Recursive accounting (nested files, directories, bytes) + +In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely +on symmetric access by all clients to shared block devices, Ceph +separates data and metadata management into independent server +clusters, similar to Lustre. Unlike Lustre, however, metadata and +storage nodes run entirely as user space daemons. File data is striped +across storage nodes in large chunks to distribute workload and +facilitate high throughputs. When storage nodes fail, data is +re-replicated in a distributed fashion by the storage nodes themselves +(with some minimal coordination from a cluster monitor), making the +system extremely efficient and scalable. + +Metadata servers effectively form a large, consistent, distributed +in-memory cache above the file namespace that is extremely scalable, +dynamically redistributes metadata in response to workload changes, +and can tolerate arbitrary (well, non-Byzantine) node failures. The +metadata server takes a somewhat unconventional approach to metadata +storage to significantly improve performance for common workloads. In +particular, inodes with only a single link are embedded in +directories, allowing entire directories of dentries and inodes to be +loaded into its cache with a single I/O operation. The contents of +extremely large directories can be fragmented and managed by +independent metadata servers, allowing scalable concurrent access. + +The system offers automatic data rebalancing/migration when scaling +from a small cluster of just a few nodes to many hundreds, without +requiring an administrator carve the data set into static volumes or +go through the tedious process of migrating data between servers. 
+When the file system approaches full, new nodes can be easily added +and things will "just work." + +Ceph includes a flexible snapshot mechanism that allows a user to create +a snapshot on any subdirectory (and its nested contents) in the +system. Snapshot creation and deletion are as simple as 'mkdir +.snap/foo' and 'rmdir .snap/foo'. + +Ceph also provides some recursive accounting on directories for nested +files and bytes. That is, a 'getfattr -d foo' on any directory in the +system will reveal the total number of nested regular files and +subdirectories, and a summation of all nested file sizes. This makes +the identification of large disk space consumers relatively quick, as +no 'du' or similar recursive scan of the file system is required. + +Finally, Ceph also allows quotas to be set on any directory in the system. +The quota can restrict the number of bytes or the number of files stored +beneath that point in the directory hierarchy. Quotas can be set using +extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', e.g.:: + + setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir + getfattr -n ceph.quota.max_bytes /some/dir + +A limitation of the current quotas implementation is that it relies on the +cooperation of the client mounting the file system to stop writers when a +limit is reached. A modified or adversarial client cannot be prevented +from writing as much data as it needs. + +Mount Syntax +============ + +The basic mount syntax is:: + + # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt + +You only need to specify a single monitor, as the client will get the +full list when it connects. (However, if the monitor you specify +happens to be down, the mount won't succeed.) The port can be left +off if the monitor is using the default. So if the monitor is at +1.2.3.4:: + + # mount -t ceph 1.2.3.4:/ /mnt/ceph + +is sufficient. If /sbin/mount.ceph is installed, a hostname can be +used instead of an IP address.
+ + + +Mount Options +============= + + ip=A.B.C.D[:N] + Specify the IP and/or port the client should bind to locally. + There is normally not much reason to do this. If the IP is not + specified, the client's IP address is determined by looking at the + address its connection to the monitor originates from. + + wsize=X + Specify the maximum write size in bytes. Default: 16 MB. + + rsize=X + Specify the maximum read size in bytes. Default: 16 MB. + + rasize=X + Specify the maximum readahead size in bytes. Default: 8 MB. + + mount_timeout=X + Specify the timeout value for mount (in seconds), in the case + of a non-responsive Ceph file system. The default is 30 + seconds. + + caps_max=X + Specify the maximum number of caps to hold. Unused caps are released + when number of caps exceeds the limit. The default is 0 (no limit) + + rbytes + When stat() is called on a directory, set st_size to 'rbytes', + the summation of file sizes over all files nested beneath that + directory. This is the default. + + norbytes + When stat() is called on a directory, set st_size to the + number of entries in that directory. + + nocrc + Disable CRC32C calculation for data writes. If set, the storage node + must rely on TCP's error correction to detect data corruption + in the data payload. + + dcache + Use the dcache contents to perform negative lookups and + readdir when the client has the entire directory contents in + its cache. (This does not change correctness; the client uses + cached metadata only when a lease or capability ensures it is + valid.) + + nodcache + Do not use the dcache as above. This avoids a significant amount of + complex code, sacrificing performance without affecting correctness, + and is useful for tracking down bugs. + + noasyncreaddir + Do not use the dcache as above for readdir. + + noquotadf + Report overall filesystem usage in statfs instead of using the root + directory quota. 
+ + nocopyfrom + Don't use the RADOS 'copy-from' operation to perform remote object + copies. Currently, it's only used in copy_file_range, which will revert + to the default VFS implementation if this option is used. + + recover_session= + Set auto reconnect mode in the case where the client is blacklisted. The + available modes are "no" and "clean". The default is "no". + + * no: never attempt to reconnect when client detects that it has been + blacklisted. Operations will generally fail after being blacklisted. + + * clean: client reconnects to the ceph cluster automatically when it + detects that it has been blacklisted. During reconnect, client drops + dirty data/metadata, invalidates page caches and writable file handles. + After reconnect, file locks become stale because the MDS loses track + of them. If an inode contains any stale file locks, read/write on the + inode is not allowed until applications release all stale file locks. + +More Information +================ + +For more information on Ceph, see the home page at + https://ceph.com/ + +The Linux kernel client source tree is available at + - https://github.com/ceph/ceph-client.git + - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git + +and the source for the full system is at + https://github.com/ceph/ceph.git diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt deleted file mode 100644 index b19b6a03f91c..000000000000 --- a/Documentation/filesystems/ceph.txt +++ /dev/null @@ -1,186 +0,0 @@ -Ceph Distributed File System -============================ - -Ceph is a distributed network file system designed to provide good -performance, reliability, and scalability. - -Basic features include: - - * POSIX semantics - * Seamless scaling from 1 to many thousands of nodes - * High availability and reliability. No single point of failure. 
- * N-way replication of data across storage nodes - * Fast recovery from node failures - * Automatic rebalancing of data on node addition/removal - * Easy deployment: most FS components are userspace daemons - -Also, - * Flexible snapshots (on any directory) - * Recursive accounting (nested files, directories, bytes) - -In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely -on symmetric access by all clients to shared block devices, Ceph -separates data and metadata management into independent server -clusters, similar to Lustre. Unlike Lustre, however, metadata and -storage nodes run entirely as user space daemons. File data is striped -across storage nodes in large chunks to distribute workload and -facilitate high throughputs. When storage nodes fail, data is -re-replicated in a distributed fashion by the storage nodes themselves -(with some minimal coordination from a cluster monitor), making the -system extremely efficient and scalable. - -Metadata servers effectively form a large, consistent, distributed -in-memory cache above the file namespace that is extremely scalable, -dynamically redistributes metadata in response to workload changes, -and can tolerate arbitrary (well, non-Byzantine) node failures. The -metadata server takes a somewhat unconventional approach to metadata -storage to significantly improve performance for common workloads. In -particular, inodes with only a single link are embedded in -directories, allowing entire directories of dentries and inodes to be -loaded into its cache with a single I/O operation. The contents of -extremely large directories can be fragmented and managed by -independent metadata servers, allowing scalable concurrent access. - -The system offers automatic data rebalancing/migration when scaling -from a small cluster of just a few nodes to many hundreds, without -requiring an administrator carve the data set into static volumes or -go through the tedious process of migrating data between servers. 
-When the file system approaches full, new nodes can be easily added -and things will "just work." - -Ceph includes flexible snapshot mechanism that allows a user to create -a snapshot on any subdirectory (and its nested contents) in the -system. Snapshot creation and deletion are as simple as 'mkdir -.snap/foo' and 'rmdir .snap/foo'. - -Ceph also provides some recursive accounting on directories for nested -files and bytes. That is, a 'getfattr -d foo' on any directory in the -system will reveal the total number of nested regular files and -subdirectories, and a summation of all nested file sizes. This makes -the identification of large disk space consumers relatively quick, as -no 'du' or similar recursive scan of the file system is required. - -Finally, Ceph also allows quotas to be set on any directory in the system. -The quota can restrict the number of bytes or the number of files stored -beneath that point in the directory hierarchy. Quotas can be set using -extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg: - - setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir - getfattr -n ceph.quota.max_bytes /some/dir - -A limitation of the current quotas implementation is that it relies on the -cooperation of the client mounting the file system to stop writers when a -limit is reached. A modified or adversarial client cannot be prevented -from writing as much data as it needs. - -Mount Syntax -============ - -The basic mount syntax is: - - # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt - -You only need to specify a single monitor, as the client will get the -full list when it connects. (However, if the monitor you specify -happens to be down, the mount won't succeed.) The port can be left -off if the monitor is using the default. So if the monitor is at -1.2.3.4, - - # mount -t ceph 1.2.3.4:/ /mnt/ceph - -is sufficient. If /sbin/mount.ceph is installed, a hostname can be -used instead of an IP address. 
- - - -Mount Options -============= - - ip=A.B.C.D[:N] - Specify the IP and/or port the client should bind to locally. - There is normally not much reason to do this. If the IP is not - specified, the client's IP address is determined by looking at the - address its connection to the monitor originates from. - - wsize=X - Specify the maximum write size in bytes. Default: 16 MB. - - rsize=X - Specify the maximum read size in bytes. Default: 16 MB. - - rasize=X - Specify the maximum readahead size in bytes. Default: 8 MB. - - mount_timeout=X - Specify the timeout value for mount (in seconds), in the case - of a non-responsive Ceph file system. The default is 30 - seconds. - - caps_max=X - Specify the maximum number of caps to hold. Unused caps are released - when number of caps exceeds the limit. The default is 0 (no limit) - - rbytes - When stat() is called on a directory, set st_size to 'rbytes', - the summation of file sizes over all files nested beneath that - directory. This is the default. - - norbytes - When stat() is called on a directory, set st_size to the - number of entries in that directory. - - nocrc - Disable CRC32C calculation for data writes. If set, the storage node - must rely on TCP's error correction to detect data corruption - in the data payload. - - dcache - Use the dcache contents to perform negative lookups and - readdir when the client has the entire directory contents in - its cache. (This does not change correctness; the client uses - cached metadata only when a lease or capability ensures it is - valid.) - - nodcache - Do not use the dcache as above. This avoids a significant amount of - complex code, sacrificing performance without affecting correctness, - and is useful for tracking down bugs. - - noasyncreaddir - Do not use the dcache as above for readdir. - - noquotadf - Report overall filesystem usage in statfs instead of using the root - directory quota. 
- - nocopyfrom - Don't use the RADOS 'copy-from' operation to perform remote object - copies. Currently, it's only used in copy_file_range, which will revert - to the default VFS implementation if this option is used. - - recover_session= - Set auto reconnect mode in the case where the client is blacklisted. The - available modes are "no" and "clean". The default is "no". - - * no: never attempt to reconnect when client detects that it has been - blacklisted. Operations will generally fail after being blacklisted. - - * clean: client reconnects to the ceph cluster automatically when it - detects that it has been blacklisted. During reconnect, client drops - dirty data/metadata, invalidates page caches and writable file handles. - After reconnect, file locks become stale because the MDS loses track - of them. If an inode contains any stale file locks, read/write on the - inode is not allowed until applications release all stale file locks. - -More Information -================ - -For more information on Ceph, see the home page at - https://ceph.com/ - -The Linux kernel client source tree is available at - https://github.com/ceph/ceph-client.git - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git - -and the source for the full system is at - https://github.com/ceph/ceph.git diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index dae862cf167e..ddd8f7b2bb25 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -55,6 +55,7 @@ Documentation for filesystem implementations. 
befs bfs btrfs + ceph fuse overlayfs virtiofs -- cgit From f1fa0e6028d395c5f0d1a0929a795b8dc0d43295 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:56 +0100 Subject: docs: filesystems: convert cramfs.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Nicolas Pitre Link: https://lore.kernel.org/r/e87b267e71f99974b7bb3fc0a4a08454ff58165e.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/cramfs.rst | 123 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/cramfs.txt | 118 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 124 insertions(+), 118 deletions(-) create mode 100644 Documentation/filesystems/cramfs.rst delete mode 100644 Documentation/filesystems/cramfs.txt diff --git a/Documentation/filesystems/cramfs.rst b/Documentation/filesystems/cramfs.rst new file mode 100644 index 000000000000..afbdbde98bd2 --- /dev/null +++ b/Documentation/filesystems/cramfs.rst @@ -0,0 +1,123 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== +Cramfs - cram a filesystem onto a small ROM +=========================================== + +cramfs is designed to be simple and small, and to compress things well. + +It uses the zlib routines to compress a file one page at a time, and +allows random page access. The meta-data is not compressed, but is +expressed in a very terse representation to make it use much less +diskspace than traditional filesystems. + +You can't write to a cramfs filesystem (making it compressible and +compact also makes it _very_ hard to update on-the-fly), so you have to +create the disk image with the "mkcramfs" utility. + + +Usage Notes +----------- + +File sizes are limited to less than 16MB. 
+ +Maximum filesystem size is a little over 256MB. (The last file on the +filesystem is allowed to extend past 256MB.) + +Only the low 8 bits of gid are stored. The current version of +mkcramfs simply truncates to 8 bits, which is a potential security +issue. + +Hard links are supported, but hard linked files +will still have a link count of 1 in the cramfs image. + +Cramfs directories have no ``.`` or ``..`` entries. Directories (like +every other file on cramfs) always have a link count of 1. (There's +no need to use -noleaf in ``find``, btw.) + +No timestamps are stored in a cramfs, so these default to the epoch +(1970 GMT). Recently-accessed files may have updated timestamps, but +the update lasts only as long as the inode is cached in memory, after +which the timestamp reverts to 1970, i.e. moves backwards in time. + +Currently, cramfs must be written and read with architectures of the +same endianness, and can be read only by kernels with PAGE_SIZE +== 4096. At least the latter of these is a bug, but it hasn't been +decided what the best fix is. For the moment if you have larger pages +you can just change the #define in mkcramfs.c, so long as you don't +mind the filesystem becoming unreadable to future kernels. + + +Memory Mapped cramfs image +-------------------------- + +The CRAMFS_MTD Kconfig option adds support for loading data directly from +a physical linear memory range (usually non volatile memory like Flash) +instead of going through the block device layer. This saves some memory +since no intermediate buffering is necessary to hold the data before +decompressing. + +And when data blocks are kept uncompressed and properly aligned, they will +automatically be mapped directly into user space whenever possible providing +eXecute-In-Place (XIP) from ROM of read-only segments. Data segments mapped +read-write (hence they have to be copied to RAM) may still be compressed in +the cramfs image in the same file along with non compressed read-only +segments. 
Both MMU and no-MMU systems are supported. This is particularly +handy for tiny embedded systems with very tight memory constraints. + +The location of the cramfs image in memory is system dependent. You must +know the proper physical address where the cramfs image is located and +configure an MTD device for it. Also, that MTD device must be supported +by a map driver that implements the "point" method. Examples of such +MTD drivers are cfi_cmdset_0001 (Intel/Sharp CFI flash) or physmap +(Flash device in physical memory map). MTD partitions based on such devices +are fine too. Then that device should be specified with the "mtd:" prefix +as the mount device argument. For example, to mount the MTD device named +"fs_partition" on the /mnt directory:: + + $ mount -t cramfs mtd:fs_partition /mnt + +To boot a kernel with this as the root filesystem, it suffices to specify +something like "root=mtd:fs_partition" on the kernel command line. + + +Tools +----- + +A version of mkcramfs that can take advantage of the latest capabilities +described above can be found here: + +https://github.com/npitre/cramfs-tools + + +For /usr/share/magic +-------------------- + +===== ======================= ======================= +0 ulelong 0x28cd3d45 Linux cramfs offset 0 +>4 ulelong x size %d +>8 ulelong x flags 0x%x +>12 ulelong x future 0x%x +>16 string >\0 signature "%.16s" +>32 ulelong x fsid.crc 0x%x +>36 ulelong x fsid.edition %d +>40 ulelong x fsid.blocks %d +>44 ulelong x fsid.files %d +>48 string >\0 name "%.16s" +512 ulelong 0x28cd3d45 Linux cramfs offset 512 +>516 ulelong x size %d +>520 ulelong x flags 0x%x +>524 ulelong x future 0x%x +>528 string >\0 signature "%.16s" +>544 ulelong x fsid.crc 0x%x +>548 ulelong x fsid.edition %d +>552 ulelong x fsid.blocks %d +>556 ulelong x fsid.files %d +>560 string >\0 name "%.16s" +===== ======================= ======================= + + +Hacker Notes +------------ + +See fs/cramfs/README for filesystem layout and implementation notes.
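The /usr/share/magic entries in the cramfs document say that an image is recognised by the little-endian 32-bit value 0x28cd3d45 at byte offset 0 or at byte offset 512. A minimal user-space sketch of that check follows. The names `read_le32` and `cramfs_magic_offset` are made up for this illustration and are not part of mkcramfs or any other cramfs tool, and only the little-endian form listed in the table is checked.

```c
#include <stddef.h>
#include <stdint.h>

#define CRAMFS_MAGIC 0x28cd3d45u

/* Decode a little-endian 32-bit value ("ulelong" in the magic file). */
static uint32_t read_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: return the offset (0 or 512) at which the cramfs
 * magic is found in buf, or -1 if neither location matches.
 */
static long cramfs_magic_offset(const unsigned char *buf, size_t len)
{
	static const size_t offsets[] = { 0, 512 };
	size_t i;

	for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
		/* Only look at an offset if four bytes are available there. */
		if (len >= offsets[i] + 4 &&
		    read_le32(buf + offsets[i]) == CRAMFS_MAGIC)
			return (long)offsets[i];
	}
	return -1;
}
```

Offset 0 is tried first, so an image that happens to carry the magic in both places is reported at offset 0.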
diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt deleted file mode 100644 index 8e19a53d648b..000000000000 --- a/Documentation/filesystems/cramfs.txt +++ /dev/null @@ -1,118 +0,0 @@ - - Cramfs - cram a filesystem onto a small ROM - -cramfs is designed to be simple and small, and to compress things well. - -It uses the zlib routines to compress a file one page at a time, and -allows random page access. The meta-data is not compressed, but is -expressed in a very terse representation to make it use much less -diskspace than traditional filesystems. - -You can't write to a cramfs filesystem (making it compressible and -compact also makes it _very_ hard to update on-the-fly), so you have to -create the disk image with the "mkcramfs" utility. - - -Usage Notes ------------ - -File sizes are limited to less than 16MB. - -Maximum filesystem size is a little over 256MB. (The last file on the -filesystem is allowed to extend past 256MB.) - -Only the low 8 bits of gid are stored. The current version of -mkcramfs simply truncates to 8 bits, which is a potential security -issue. - -Hard links are supported, but hard linked files -will still have a link count of 1 in the cramfs image. - -Cramfs directories have no `.' or `..' entries. Directories (like -every other file on cramfs) always have a link count of 1. (There's -no need to use -noleaf in `find', btw.) - -No timestamps are stored in a cramfs, so these default to the epoch -(1970 GMT). Recently-accessed files may have updated timestamps, but -the update lasts only as long as the inode is cached in memory, after -which the timestamp reverts to 1970, i.e. moves backwards in time. - -Currently, cramfs must be written and read with architectures of the -same endianness, and can be read only by kernels with PAGE_SIZE -== 4096. At least the latter of these is a bug, but it hasn't been -decided what the best fix is. 
For the moment if you have larger pages -you can just change the #define in mkcramfs.c, so long as you don't -mind the filesystem becoming unreadable to future kernels. - - -Memory Mapped cramfs image --------------------------- - -The CRAMFS_MTD Kconfig option adds support for loading data directly from -a physical linear memory range (usually non volatile memory like Flash) -instead of going through the block device layer. This saves some memory -since no intermediate buffering is necessary to hold the data before -decompressing. - -And when data blocks are kept uncompressed and properly aligned, they will -automatically be mapped directly into user space whenever possible providing -eXecute-In-Place (XIP) from ROM of read-only segments. Data segments mapped -read-write (hence they have to be copied to RAM) may still be compressed in -the cramfs image in the same file along with non compressed read-only -segments. Both MMU and no-MMU systems are supported. This is particularly -handy for tiny embedded systems with very tight memory constraints. - -The location of the cramfs image in memory is system dependent. You must -know the proper physical address where the cramfs image is located and -configure an MTD device for it. Also, that MTD device must be supported -by a map driver that implements the "point" method. Examples of such -MTD drivers are cfi_cmdset_0001 (Intel/Sharp CFI flash) or physmap -(Flash device in physical memory map). MTD partitions based on such devices -are fine too. Then that device should be specified with the "mtd:" prefix -as the mount device argument. For example, to mount the MTD device named -"fs_partition" on the /mnt directory: - -$ mount -t cramfs mtd:fs_partition /mnt - -To boot a kernel with this as root filesystem, suffice to specify -something like "root=mtd:fs_partition" on the kernel command line. 
- - -Tools ------ - -A version of mkcramfs that can take advantage of the latest capabilities -described above can be found here: - -https://github.com/npitre/cramfs-tools - - -For /usr/share/magic --------------------- - -0 ulelong 0x28cd3d45 Linux cramfs offset 0 ->4 ulelong x size %d ->8 ulelong x flags 0x%x ->12 ulelong x future 0x%x ->16 string >\0 signature "%.16s" ->32 ulelong x fsid.crc 0x%x ->36 ulelong x fsid.edition %d ->40 ulelong x fsid.blocks %d ->44 ulelong x fsid.files %d ->48 string >\0 name "%.16s" -512 ulelong 0x28cd3d45 Linux cramfs offset 512 ->516 ulelong x size %d ->520 ulelong x flags 0x%x ->524 ulelong x future 0x%x ->528 string >\0 signature "%.16s" ->544 ulelong x fsid.crc 0x%x ->548 ulelong x fsid.edition %d ->552 ulelong x fsid.blocks %d ->556 ulelong x fsid.files %d ->560 string >\0 name "%.16s" - - -Hacker Notes ------------- - -See fs/cramfs/README for filesystem layout and implementation notes. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index ddd8f7b2bb25..8fe848ea04af 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -56,6 +56,7 @@ Documentation for filesystem implementations. bfs btrfs ceph + cramfs fuse overlayfs virtiofs -- cgit From 57443789849cd79e66488301a01f01c6340942ce Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:57 +0100 Subject: docs: filesystems: convert debugfs.txt to ReST - Add a SPDX header; - Use copyright symbol; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Use footnote markups; - Add it to filesystems/index.rst.
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/42db8f9db17a5d8b619130815ae63d1615951d50.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/debugfs.rst | 247 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/debugfs.txt | 241 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 248 insertions(+), 241 deletions(-) create mode 100644 Documentation/filesystems/debugfs.rst delete mode 100644 Documentation/filesystems/debugfs.txt diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst new file mode 100644 index 000000000000..c89d2d335dfb --- /dev/null +++ b/Documentation/filesystems/debugfs.rst @@ -0,0 +1,247 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +======= +DebugFS +======= + +Copyright |copy| 2009 Jonathan Corbet + +Debugfs exists as a simple way for kernel developers to make information +available to user space. Unlike /proc, which is only meant for information +about a process, or sysfs, which has strict one-value-per-file rules, +debugfs has no rules at all. Developers can put any information they want +there. The debugfs filesystem is also intended to not serve as a stable +ABI to user space; in theory, there are no stability constraints placed on +files exported there. The real world is not always so simple, though [1]_; +even debugfs interfaces are best designed with the idea that they will need +to be maintained forever. + +Debugfs is typically mounted with a command like:: + + mount -t debugfs none /sys/kernel/debug + +(Or an equivalent /etc/fstab line). +The debugfs root directory is accessible only to the root user by +default. To change access to the tree the "uid", "gid" and "mode" mount +options can be used. + +Note that the debugfs API is exported GPL-only to modules. + +Code using debugfs should include . 
Then, the first order +of business will be to create at least one directory to hold a set of +debugfs files:: + + struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); + +This call, if successful, will make a directory called name underneath the +indicated parent directory. If parent is NULL, the directory will be +created in the debugfs root. On success, the return value is a struct +dentry pointer which can be used to create files in the directory (and to +clean it up at the end). An ERR_PTR(-ERROR) return value indicates that +something went wrong. If ERR_PTR(-ENODEV) is returned, that is an +indication that the kernel has been built without debugfs support and none +of the functions described below will work. + +The most general way to create a file within a debugfs directory is with:: + + struct dentry *debugfs_create_file(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops); + +Here, name is the name of the file to create, mode describes the access +permissions the file should have, parent indicates the directory which +should hold the file, data will be stored in the i_private field of the +resulting inode structure, and fops is a set of file operations which +implement the file's behavior. At a minimum, the read() and/or write() +operations should be provided; others can be included as needed. Again, +the return value will be a dentry pointer to the created file, +ERR_PTR(-ERROR) on error, or ERR_PTR(-ENODEV) if debugfs support is +missing. + +To create a file with an initial size, the following function can be used +instead:: + + struct dentry *debugfs_create_file_size(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops, + loff_t file_size); + +file_size is the initial file size. The other parameters are the same +as those of debugfs_create_file().
+ +In a number of cases, the creation of a set of file operations is not +actually necessary; the debugfs code provides a number of helper functions +for simple situations. Files containing a single integer value can be +created with any of:: + + void debugfs_create_u8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + void debugfs_create_u16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_u32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + void debugfs_create_u64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +These files support both reading and writing the given value; if a specific +file should not be written to, simply set the mode bits accordingly. The +values in these files are in decimal; if hexadecimal is more appropriate, +the following functions can be used instead:: + + void debugfs_create_x8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + void debugfs_create_x16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + void debugfs_create_x32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + void debugfs_create_x64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +These functions are useful as long as the developer knows the size of the +value to be exported. Some types can have different widths on different +architectures, though, complicating the situation somewhat. There are +functions meant to help out in such special cases:: + + void debugfs_create_size_t(const char *name, umode_t mode, + struct dentry *parent, size_t *value); + +As might be expected, this function will create a debugfs file to represent +a variable of type size_t. 
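Seen from user space, the files created by these integer helpers are plain text. The following is a minimal sketch of parsing one value after it has been read into a string. It assumes that the decimal helpers emit a bare decimal number without leading zeros and that the hexadecimal helpers emit a 0x-prefixed number, each followed by a newline. `parse_debugfs_value` is a hypothetical name, not a kernel or libc function.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text read from a debugfs integer file.
 * Base 0 lets strtoull() accept both "42\n" and "0x0000002a\n".
 * Returns 0 on success, -1 if the text is not a clean number.
 */
static int parse_debugfs_value(const char *text, unsigned long long *out)
{
	char *end;

	errno = 0;
	*out = strtoull(text, &end, 0);
	if (errno != 0 || end == text)
		return -1;
	/* Allow a single trailing newline, nothing else. */
	if (*end == '\n')
		end++;
	return *end == '\0' ? 0 : -1;
}
```

Base 0 is a convenience for this sketch. A caller that knows which helper created the file could pass base 10 or 16 explicitly instead.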
+ +Similarly, there are helpers for variables of type unsigned long, in decimal +and hexadecimal:: + + struct dentry *debugfs_create_ulong(const char *name, umode_t mode, + struct dentry *parent, + unsigned long *value); + void debugfs_create_xul(const char *name, umode_t mode, + struct dentry *parent, unsigned long *value); + +Boolean values can be placed in debugfs with:: + + struct dentry *debugfs_create_bool(const char *name, umode_t mode, + struct dentry *parent, bool *value); + +A read on the resulting file will yield either Y (for non-zero values) or +N, followed by a newline. If written to, it will accept either upper- or +lower-case values, or 1 or 0. Any other input will be silently ignored. + +Also, atomic_t values can be placed in debugfs with:: + + void debugfs_create_atomic_t(const char *name, umode_t mode, + struct dentry *parent, atomic_t *value); + +A read of this file returns the current atomic_t value, and a write of +this file sets it. + +Another option is exporting a block of arbitrary binary data, with +this structure and function:: + + struct debugfs_blob_wrapper { + void *data; + unsigned long size; + }; + + struct dentry *debugfs_create_blob(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_blob_wrapper *blob); + +A read of this file will return the data pointed to by the +debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way +to return several lines of (static) formatted text output. This function +can be used to export binary information, but there does not appear to be +any code which does so in the mainline. Note that all files created with +debugfs_create_blob() are read-only. + +If you want to dump a block of registers (something that happens quite +often during development, even if little such code reaches mainline),
+Debugfs offers two functions: one to make a registers-only file, and +another to insert a register block in the middle of another sequential +file:: + + struct debugfs_reg32 { + char *name; + unsigned long offset; + }; + + struct debugfs_regset32 { + struct debugfs_reg32 *regs; + int nregs; + void __iomem *base; + }; + + struct dentry *debugfs_create_regset32(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_regset32 *regset); + + void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, + int nregs, void __iomem *base, char *prefix); + +The "base" argument may be 0, but you may want to build the reg32 array +using __stringify, and a number of register names (macros) are actually +byte offsets over a base for the register block. + +If you want to dump a u32 array in debugfs, you can create a file with:: + + void debugfs_create_u32_array(const char *name, umode_t mode, + struct dentry *parent, + u32 *array, u32 elements); + +The "array" argument provides data, and the "elements" argument is +the number of elements in the array. Note: Once the array is created, its +size cannot be changed. + +There is a helper function to create a device-related seq_file:: + + struct dentry *debugfs_create_devm_seqfile(struct device *dev, + const char *name, + struct dentry *parent, + int (*read_fn)(struct seq_file *s, + void *data)); + +The "dev" argument is the device related to this debugfs file, and +the "read_fn" is a function pointer which will be called to print the +seq_file content. + +There are a couple of other directory-oriented helper functions:: + + struct dentry *debugfs_rename(struct dentry *old_dir, + struct dentry *old_dentry, + struct dentry *new_dir, + const char *new_name); + + struct dentry *debugfs_create_symlink(const char *name, + struct dentry *parent, + const char *target); + +A call to debugfs_rename() will give a new name to an existing debugfs +file, possibly in a different directory.
The new_name must not exist prior +to the call; the return value is old_dentry with updated information. +Symbolic links can be created with debugfs_create_symlink(). + +There is one important thing that all debugfs users must take into account: +there is no automatic cleanup of any directories created in debugfs. If a +module is unloaded without explicitly removing debugfs entries, the result +will be a lot of stale pointers and no end of highly antisocial behavior. +So all debugfs users - at least those which can be built as modules - must +be prepared to remove all files and directories they create there. A file +can be removed with:: + + void debugfs_remove(struct dentry *dentry); + +The dentry value can be NULL or an error value, in which case nothing will +be removed. + +Once upon a time, debugfs users were required to remember the dentry +pointer for every debugfs file they created so that all files could be +cleaned up. We live in more civilized times now, though, and debugfs users +can call:: + + void debugfs_remove_recursive(struct dentry *dentry); + +If this function is passed a pointer for the dentry corresponding to the +top-level directory, the entire hierarchy below that directory will be +removed. + +.. [1] http://lwn.net/Articles/309298/ diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.txt deleted file mode 100644 index dc497b96fa4f..000000000000 --- a/Documentation/filesystems/debugfs.txt +++ /dev/null @@ -1,241 +0,0 @@ -Copyright 2009 Jonathan Corbet - -Debugfs exists as a simple way for kernel developers to make information -available to user space. Unlike /proc, which is only meant for information -about a process, or sysfs, which has strict one-value-per-file rules, -debugfs has no rules at all. Developers can put any information they want -there. 
The debugfs filesystem is also intended to not serve as a stable -ABI to user space; in theory, there are no stability constraints placed on -files exported there. The real world is not always so simple, though [1]; -even debugfs interfaces are best designed with the idea that they will need -to be maintained forever. - -Debugfs is typically mounted with a command like: - - mount -t debugfs none /sys/kernel/debug - -(Or an equivalent /etc/fstab line). -The debugfs root directory is accessible only to the root user by -default. To change access to the tree the "uid", "gid" and "mode" mount -options can be used. - -Note that the debugfs API is exported GPL-only to modules. - -Code using debugfs should include . Then, the first order -of business will be to create at least one directory to hold a set of -debugfs files: - - struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); - -This call, if successful, will make a directory called name underneath the -indicated parent directory. If parent is NULL, the directory will be -created in the debugfs root. On success, the return value is a struct -dentry pointer which can be used to create files in the directory (and to -clean it up at the end). An ERR_PTR(-ERROR) return value indicates that -something went wrong. If ERR_PTR(-ENODEV) is returned, that is an -indication that the kernel has been built without debugfs support and none -of the functions described below will work. 
- -The most general way to create a file within a debugfs directory is with: - - struct dentry *debugfs_create_file(const char *name, umode_t mode, - struct dentry *parent, void *data, - const struct file_operations *fops); - -Here, name is the name of the file to create, mode describes the access -permissions the file should have, parent indicates the directory which -should hold the file, data will be stored in the i_private field of the -resulting inode structure, and fops is a set of file operations which -implement the file's behavior. At a minimum, the read() and/or write() -operations should be provided; others can be included as needed. Again, -the return value will be a dentry pointer to the created file, -ERR_PTR(-ERROR) on error, or ERR_PTR(-ENODEV) if debugfs support is -missing. - -Create a file with an initial size, the following function can be used -instead: - - struct dentry *debugfs_create_file_size(const char *name, umode_t mode, - struct dentry *parent, void *data, - const struct file_operations *fops, - loff_t file_size); - -file_size is the initial file size. The other parameters are the same -as the function debugfs_create_file. - -In a number of cases, the creation of a set of file operations is not -actually necessary; the debugfs code provides a number of helper functions -for simple situations. Files containing a single integer value can be -created with any of: - - void debugfs_create_u8(const char *name, umode_t mode, - struct dentry *parent, u8 *value); - void debugfs_create_u16(const char *name, umode_t mode, - struct dentry *parent, u16 *value); - struct dentry *debugfs_create_u32(const char *name, umode_t mode, - struct dentry *parent, u32 *value); - void debugfs_create_u64(const char *name, umode_t mode, - struct dentry *parent, u64 *value); - -These files support both reading and writing the given value; if a specific -file should not be written to, simply set the mode bits accordingly. 
The -values in these files are in decimal; if hexadecimal is more appropriate, -the following functions can be used instead: - - void debugfs_create_x8(const char *name, umode_t mode, - struct dentry *parent, u8 *value); - void debugfs_create_x16(const char *name, umode_t mode, - struct dentry *parent, u16 *value); - void debugfs_create_x32(const char *name, umode_t mode, - struct dentry *parent, u32 *value); - void debugfs_create_x64(const char *name, umode_t mode, - struct dentry *parent, u64 *value); - -These functions are useful as long as the developer knows the size of the -value to be exported. Some types can have different widths on different -architectures, though, complicating the situation somewhat. There are -functions meant to help out in such special cases: - - void debugfs_create_size_t(const char *name, umode_t mode, - struct dentry *parent, size_t *value); - -As might be expected, this function will create a debugfs file to represent -a variable of type size_t. - -Similarly, there are helpers for variables of type unsigned long, in decimal -and hexadecimal: - - struct dentry *debugfs_create_ulong(const char *name, umode_t mode, - struct dentry *parent, - unsigned long *value); - void debugfs_create_xul(const char *name, umode_t mode, - struct dentry *parent, unsigned long *value); - -Boolean values can be placed in debugfs with: - - struct dentry *debugfs_create_bool(const char *name, umode_t mode, - struct dentry *parent, bool *value); - -A read on the resulting file will yield either Y (for non-zero values) or -N, followed by a newline. If written to, it will accept either upper- or -lower-case values, or 1 or 0. Any other input will be silently ignored. - -Also, atomic_t values can be placed in debugfs with: - - void debugfs_create_atomic_t(const char *name, umode_t mode, - struct dentry *parent, atomic_t *value) - -A read of this file will get atomic_t values, and a write of this file -will set atomic_t values. 
- -Another option is exporting a block of arbitrary binary data, with -this structure and function: - - struct debugfs_blob_wrapper { - void *data; - unsigned long size; - }; - - struct dentry *debugfs_create_blob(const char *name, umode_t mode, - struct dentry *parent, - struct debugfs_blob_wrapper *blob); - -A read of this file will return the data pointed to by the -debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way -to return several lines of (static) formatted text output. This function -can be used to export binary information, but there does not appear to be -any code which does so in the mainline. Note that all files created with -debugfs_create_blob() are read-only. - -If you want to dump a block of registers (something that happens quite -often during development, even if little such code reaches mainline. -Debugfs offers two functions: one to make a registers-only file, and -another to insert a register block in the middle of another sequential -file. - - struct debugfs_reg32 { - char *name; - unsigned long offset; - }; - - struct debugfs_regset32 { - struct debugfs_reg32 *regs; - int nregs; - void __iomem *base; - }; - - struct dentry *debugfs_create_regset32(const char *name, umode_t mode, - struct dentry *parent, - struct debugfs_regset32 *regset); - - void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, - int nregs, void __iomem *base, char *prefix); - -The "base" argument may be 0, but you may want to build the reg32 array -using __stringify, and a number of register names (macros) are actually -byte offsets over a base for the register block. - -If you want to dump an u32 array in debugfs, you can create file with: - - void debugfs_create_u32_array(const char *name, umode_t mode, - struct dentry *parent, - u32 *array, u32 elements); - -The "array" argument provides data, and the "elements" argument is -the number of elements in the array. Note: Once array is created its -size can not be changed. 
- -There is a helper function to create device related seq_file: - - struct dentry *debugfs_create_devm_seqfile(struct device *dev, - const char *name, - struct dentry *parent, - int (*read_fn)(struct seq_file *s, - void *data)); - -The "dev" argument is the device related to this debugfs file, and -the "read_fn" is a function pointer which to be called to print the -seq_file content. - -There are a couple of other directory-oriented helper functions: - - struct dentry *debugfs_rename(struct dentry *old_dir, - struct dentry *old_dentry, - struct dentry *new_dir, - const char *new_name); - - struct dentry *debugfs_create_symlink(const char *name, - struct dentry *parent, - const char *target); - -A call to debugfs_rename() will give a new name to an existing debugfs -file, possibly in a different directory. The new_name must not exist prior -to the call; the return value is old_dentry with updated information. -Symbolic links can be created with debugfs_create_symlink(). - -There is one important thing that all debugfs users must take into account: -there is no automatic cleanup of any directories created in debugfs. If a -module is unloaded without explicitly removing debugfs entries, the result -will be a lot of stale pointers and no end of highly antisocial behavior. -So all debugfs users - at least those which can be built as modules - must -be prepared to remove all files and directories they create there. A file -can be removed with: - - void debugfs_remove(struct dentry *dentry); - -The dentry value can be NULL or an error value, in which case nothing will -be removed. - -Once upon a time, debugfs users were required to remember the dentry -pointer for every debugfs file they created so that all files could be -cleaned up. 
We live in more civilized times now, though, and debugfs users -can call: - - void debugfs_remove_recursive(struct dentry *dentry); - -If this function is passed a pointer for the dentry corresponding to the -top-level directory, the entire hierarchy below that directory will be -removed. - -Notes: - [1] http://lwn.net/Articles/309298/ diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 8fe848ea04af..ab3b656bbe60 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -57,6 +57,7 @@ Documentation for filesystem implementations. btrfs ceph cramfs + debugfs fuse overlayfs virtiofs -- cgit From 14a19fa5cf759ea18bc7d692cd8fe326af3c4d0a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:58 +0100 Subject: docs: filesystems: convert dlmfs.txt to ReST - Add a SPDX header; - Use copyright symbol; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/efc9e59925723e17d1a4741b11049616c221463e.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/dlmfs.rst | 140 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/dlmfs.txt | 130 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 141 insertions(+), 130 deletions(-) create mode 100644 Documentation/filesystems/dlmfs.rst delete mode 100644 Documentation/filesystems/dlmfs.txt diff --git a/Documentation/filesystems/dlmfs.rst b/Documentation/filesystems/dlmfs.rst new file mode 100644 index 000000000000..68daaa7facf9 --- /dev/null +++ b/Documentation/filesystems/dlmfs.rst @@ -0,0 +1,140 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +===== +DLMFS +===== + +A minimal DLM userspace interface implemented via a virtual file +system. 
+ +dlmfs is built with OCFS2 as it requires most of its infrastructure. + +:Project web page: http://ocfs2.wiki.kernel.org +:Tools web page: https://github.com/markfasheh/ocfs2-tools +:OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ + +All code copyright 2005 Oracle except when otherwise noted. + +Credits +======= + +Some code taken from ramfs which is Copyright |copy| 2000 Linus Torvalds +and Transmeta Corp. + +Mark Fasheh + +Caveats +======= +- Right now it only works with the OCFS2 DLM, though support for other + DLM implementations should not be a major issue. + +Mount options +============= +None + +Usage +===== + +If you're just interested in OCFS2, then please see ocfs2.txt. The +rest of this document will be geared towards those who want to use +dlmfs for easy to setup and easy to use clustered locking in +userspace. + +Setup +===== + +dlmfs requires that the OCFS2 cluster infrastructure be in +place. Please download ocfs2-tools from the above url and configure a +cluster. + +You'll want to start heartbeating on a volume which all the nodes in +your lockspace can access. The easiest way to do this is via +ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires +that an OCFS2 file system be in place so that it can automatically +find its heartbeat area, though it will eventually support heartbeat +against raw disks. + +Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed +with ocfs2-tools. + +Once you're heartbeating, DLM lock 'domains' can be easily created / +destroyed and locks within them accessed. + +Locking +======= + +Users may access dlmfs via standard file system calls, or they can use +'libo2dlm' (distributed with ocfs2-tools) which abstracts the file +system calls and presents a more traditional locking api. + +dlmfs handles lock caching automatically for the user, so a lock +request for an already acquired lock will not generate another DLM +call. 
Userspace programs are assumed to handle their own local +locking. + +Two levels of locks are supported - Shared Read, and Exclusive. +Also supported is a Trylock operation. + +For information on the libo2dlm interface, please see o2dlm.h, +distributed with ocfs2-tools. + +Lock value blocks can be read and written to a resource via read(2) +and write(2) against the fd obtained via your open(2) call. The +maximum currently supported LVB length is 64 bytes (though that is an +OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share +small amounts of data amongst their nodes. + +mkdir(2) signals dlmfs to join a domain (which will have the same name +as the resulting directory) + +rmdir(2) signals dlmfs to leave the domain + +Locks for a given domain are represented by regular inodes inside the +domain directory. Locking against them is done via the open(2) system +call. + +The open(2) call will not return until your lock has been granted or +an error has occurred, unless it has been instructed to do a trylock +operation. If the lock succeeds, you'll get an fd. + +open(2) with O_CREAT to ensure the resource inode is created - dlmfs does +not automatically create inodes for existing lock resources. + +============ =========================== +Open Flag Lock Request Type +============ =========================== +O_RDONLY Shared Read +O_RDWR Exclusive +============ =========================== + + +============ =========================== +Open Flag Resulting Locking Behavior +============ =========================== +O_NONBLOCK Trylock operation +============ =========================== + +You must provide exactly one of O_RDONLY or O_RDWR. + +If O_NONBLOCK is also provided and the trylock operation was valid but +could not lock the resource then open(2) will return ETXTBUSY. + +close(2) drops the lock associated with your fd. + +Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is +supported locally as well. 
This means you can use them to restrict +access to the resources via dlmfs on your local node only. + +The resource LVB may be read from the fd in either Shared Read or +Exclusive modes via the read(2) system call. It can be written via +write(2) only when open in Exclusive mode. + +Once written, an LVB will be visible to other nodes who obtain Read +Only or higher level locks on the resource. + +See Also +======== +http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf + +For more information on the VMS distributed locking API. diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt deleted file mode 100644 index fcf4d509d118..000000000000 --- a/Documentation/filesystems/dlmfs.txt +++ /dev/null @@ -1,130 +0,0 @@ -dlmfs -================== -A minimal DLM userspace interface implemented via a virtual file -system. - -dlmfs is built with OCFS2 as it requires most of its infrastructure. - -Project web page: http://ocfs2.wiki.kernel.org -Tools web page: https://github.com/markfasheh/ocfs2-tools -OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ - -All code copyright 2005 Oracle except when otherwise noted. - -CREDITS -======= - -Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds -and Transmeta Corp. - -Mark Fasheh - -Caveats -======= -- Right now it only works with the OCFS2 DLM, though support for other - DLM implementations should not be a major issue. - -Mount options -============= -None - -Usage -===== - -If you're just interested in OCFS2, then please see ocfs2.txt. The -rest of this document will be geared towards those who want to use -dlmfs for easy to setup and easy to use clustered locking in -userspace. - -Setup -===== - -dlmfs requires that the OCFS2 cluster infrastructure be in -place. Please download ocfs2-tools from the above url and configure a -cluster. - -You'll want to start heartbeating on a volume which all the nodes in -your lockspace can access. 
The easiest way to do this is via -ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires -that an OCFS2 file system be in place so that it can automatically -find its heartbeat area, though it will eventually support heartbeat -against raw disks. - -Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed -with ocfs2-tools. - -Once you're heartbeating, DLM lock 'domains' can be easily created / -destroyed and locks within them accessed. - -Locking -======= - -Users may access dlmfs via standard file system calls, or they can use -'libo2dlm' (distributed with ocfs2-tools) which abstracts the file -system calls and presents a more traditional locking api. - -dlmfs handles lock caching automatically for the user, so a lock -request for an already acquired lock will not generate another DLM -call. Userspace programs are assumed to handle their own local -locking. - -Two levels of locks are supported - Shared Read, and Exclusive. -Also supported is a Trylock operation. - -For information on the libo2dlm interface, please see o2dlm.h, -distributed with ocfs2-tools. - -Lock value blocks can be read and written to a resource via read(2) -and write(2) against the fd obtained via your open(2) call. The -maximum currently supported LVB length is 64 bytes (though that is an -OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share -small amounts of data amongst their nodes. - -mkdir(2) signals dlmfs to join a domain (which will have the same name -as the resulting directory) - -rmdir(2) signals dlmfs to leave the domain - -Locks for a given domain are represented by regular inodes inside the -domain directory. Locking against them is done via the open(2) system -call. - -The open(2) call will not return until your lock has been granted or -an error has occurred, unless it has been instructed to do a trylock -operation. If the lock succeeds, you'll get an fd. 
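The open(2) semantics described for dlmfs (O_RDONLY for Shared Read, O_RDWR for Exclusive, O_NONBLOCK for a trylock, O_CREAT to ensure the resource inode exists) can be sketched in C roughly as follows. This is a minimal illustration, not code from dlmfs or libo2dlm: the enum and all function names are hypothetical, and the busy check assumes that the "ETXTBUSY" error named in the text is the errno value that C headers spell ETXTBSY.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical names for the two lock levels dlmfs supports. */
enum dlmfs_level { DLMFS_SHARED_READ, DLMFS_EXCLUSIVE };

/* Translate a lock request into open(2) flags:
 * exactly one of O_RDONLY / O_RDWR, plus O_NONBLOCK for a trylock.
 * O_CREAT is always set so the resource inode is created if missing. */
static int dlmfs_open_flags(enum dlmfs_level level, int trylock)
{
    int flags = O_CREAT;

    flags |= (level == DLMFS_EXCLUSIVE) ? O_RDWR : O_RDONLY;
    if (trylock)
        flags |= O_NONBLOCK;
    return flags;
}

/* Request a lock on a resource file inside a domain directory.
 * Returns an fd on success or -1 with errno set; close(fd) drops the lock. */
static int dlmfs_lock(const char *resource_path, enum dlmfs_level level,
                      int trylock)
{
    assert(resource_path != NULL);
    return open(resource_path, dlmfs_open_flags(level, trylock), 0600);
}

/* Assumption: a valid trylock that could not get the lock reports ETXTBSY. */
static int dlmfs_trylock_busy(int err)
{
    return err == ETXTBSY;
}
```

A caller would create the domain with mkdir(2) first, then call dlmfs_lock() on a path inside that directory and inspect errno with dlmfs_trylock_busy() if a trylock returns -1.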
- -open(2) with O_CREAT to ensure the resource inode is created - dlmfs does -not automatically create inodes for existing lock resources. - -Open Flag Lock Request Type ---------- ----------------- -O_RDONLY Shared Read -O_RDWR Exclusive - -Open Flag Resulting Locking Behavior ---------- -------------------------- -O_NONBLOCK Trylock operation - -You must provide exactly one of O_RDONLY or O_RDWR. - -If O_NONBLOCK is also provided and the trylock operation was valid but -could not lock the resource then open(2) will return ETXTBUSY. - -close(2) drops the lock associated with your fd. - -Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is -supported locally as well. This means you can use them to restrict -access to the resources via dlmfs on your local node only. - -The resource LVB may be read from the fd in either Shared Read or -Exclusive modes via the read(2) system call. It can be written via -write(2) only when open in Exclusive mode. - -Once written, an LVB will be visible to other nodes who obtain Read -Only or higher level locks on the resource. - -See Also -======== -http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf - -For more information on the VMS distributed locking API. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index ab3b656bbe60..c6885c7ef781 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -58,6 +58,7 @@ Documentation for filesystem implementations. ceph cramfs debugfs + dlmfs fuse overlayfs virtiofs -- cgit From b02a17cb8ae23479c9bf306e96d2dd71422de63f Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:11:59 +0100 Subject: docs: filesystems: convert ecryptfs.txt to ReST - Add a SPDX header; - Add a document title; - use :field: markup; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Acked-by: Tyler Hicks Link: https://lore.kernel.org/r/6e13841ebd00c8d988027115c75c58821bb41a0c.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ecryptfs.rst | 87 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/ecryptfs.txt | 77 ------------------------------ Documentation/filesystems/index.rst | 1 + 3 files changed, 88 insertions(+), 77 deletions(-) create mode 100644 Documentation/filesystems/ecryptfs.rst delete mode 100644 Documentation/filesystems/ecryptfs.txt diff --git a/Documentation/filesystems/ecryptfs.rst b/Documentation/filesystems/ecryptfs.rst new file mode 100644 index 000000000000..7236172300ef --- /dev/null +++ b/Documentation/filesystems/ecryptfs.rst @@ -0,0 +1,87 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================================== +eCryptfs: A stacked cryptographic filesystem for Linux +====================================================== + +eCryptfs is free software. Please see the file COPYING for details. +For documentation, please see the files in the doc/ subdirectory. For +building and installation instructions please see the INSTALL file. + +:Maintainer: Phillip Hellewell +:Lead developer: Michael A. Halcrow +:Developers: Michael C. Thompson + Kent Yoder +:Web Site: http://ecryptfs.sf.net + +This software is currently undergoing development. Make sure to +maintain a backup copy of any data you write into eCryptfs. 
+ +eCryptfs requires the userspace tools downloadable from the +SourceForge site: + +http://sourceforge.net/projects/ecryptfs/ + +Userspace requirements include: + +- David Howells' userspace keyring headers and libraries (version + 1.0 or higher), obtainable from + http://people.redhat.com/~dhowells/keyutils/ +- Libgcrypt + + +Notes +===== + +In the beta/experimental releases of eCryptfs, when you upgrade +eCryptfs, you should copy the files to an unencrypted location and +then copy the files back into the new eCryptfs mount to migrate the +files. + + +Mount-wide Passphrase +===================== + +Create a new directory into which eCryptfs will write its encrypted +files (i.e., /root/crypt). Then, create the mount point directory +(i.e., /mnt/crypt). Now it's time to mount eCryptfs:: + + mount -t ecryptfs /root/crypt /mnt/crypt + +You should be prompted for a passphrase and a salt (the salt may be +blank). + +Try writing a new file:: + + echo "Hello, World" > /mnt/crypt/hello.txt + +The operation will complete. Notice that there is a new file in +/root/crypt that is at least 12288 bytes in size (depending on your +host page size). This is the encrypted underlying file for what you +just wrote. To test reading, from start to finish, you need to clear +the user session keyring: + +keyctl clear @u + +Then umount /mnt/crypt and mount again per the instructions given +above. + +:: + + cat /mnt/crypt/hello.txt + + +Notes +===== + +eCryptfs version 0.1 should only be mounted on (1) empty directories +or (2) directories containing files only created by eCryptfs. If you +mount a directory that has pre-existing files not created by eCryptfs, +then behavior is undefined. Do not run eCryptfs in higher verbosity +levels unless you are doing so for the sole purpose of debugging or +development, since secret values will be written out to the system log +in that case. 
+ + +Mike Halcrow +mhalcrow@us.ibm.com diff --git a/Documentation/filesystems/ecryptfs.txt b/Documentation/filesystems/ecryptfs.txt deleted file mode 100644 index 01d8a08351ac..000000000000 --- a/Documentation/filesystems/ecryptfs.txt +++ /dev/null @@ -1,77 +0,0 @@ -eCryptfs: A stacked cryptographic filesystem for Linux - -eCryptfs is free software. Please see the file COPYING for details. -For documentation, please see the files in the doc/ subdirectory. For -building and installation instructions please see the INSTALL file. - -Maintainer: Phillip Hellewell -Lead developer: Michael A. Halcrow -Developers: Michael C. Thompson - Kent Yoder -Web Site: http://ecryptfs.sf.net - -This software is currently undergoing development. Make sure to -maintain a backup copy of any data you write into eCryptfs. - -eCryptfs requires the userspace tools downloadable from the -SourceForge site: - -http://sourceforge.net/projects/ecryptfs/ - -Userspace requirements include: - - David Howells' userspace keyring headers and libraries (version - 1.0 or higher), obtainable from - http://people.redhat.com/~dhowells/keyutils/ - - Libgcrypt - - -NOTES - -In the beta/experimental releases of eCryptfs, when you upgrade -eCryptfs, you should copy the files to an unencrypted location and -then copy the files back into the new eCryptfs mount to migrate the -files. - - -MOUNT-WIDE PASSPHRASE - -Create a new directory into which eCryptfs will write its encrypted -files (i.e., /root/crypt). Then, create the mount point directory -(i.e., /mnt/crypt). Now it's time to mount eCryptfs: - -mount -t ecryptfs /root/crypt /mnt/crypt - -You should be prompted for a passphrase and a salt (the salt may be -blank). - -Try writing a new file: - -echo "Hello, World" > /mnt/crypt/hello.txt - -The operation will complete. Notice that there is a new file in -/root/crypt that is at least 12288 bytes in size (depending on your -host page size). This is the encrypted underlying file for what you -just wrote. 
To test reading, from start to finish, you need to clear -the user session keyring: - -keyctl clear @u - -Then umount /mnt/crypt and mount again per the instructions given -above. - -cat /mnt/crypt/hello.txt - - -NOTES - -eCryptfs version 0.1 should only be mounted on (1) empty directories -or (2) directories containing files only created by eCryptfs. If you -mount a directory that has pre-existing files not created by eCryptfs, -then behavior is undefined. Do not run eCryptfs in higher verbosity -levels unless you are doing so for the sole purpose of debugging or -development, since secret values will be written out to the system log -in that case. - - -Mike Halcrow -mhalcrow@us.ibm.com diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c6885c7ef781..d6d69f1c9287 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -59,6 +59,7 @@ Documentation for filesystem implementations. cramfs debugfs dlmfs + ecryptfs fuse overlayfs virtiofs -- cgit From 06dedb45b79c6550b878244879f33b6e614126bd Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:00 +0100 Subject: docs: filesystems: convert efivarfs.txt to ReST Trivial changes: - Add a SPDX header; - Adjust document title; - Mark a literal block as such; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/215691d747055c4ccb038ec7d78d8d1fe87fe2c0.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/efivarfs.rst | 26 ++++++++++++++++++++++++++ Documentation/filesystems/efivarfs.txt | 23 ----------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 27 insertions(+), 23 deletions(-) create mode 100644 Documentation/filesystems/efivarfs.rst delete mode 100644 Documentation/filesystems/efivarfs.txt diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst new file mode 100644 index 000000000000..90ac65683e7e --- /dev/null +++ b/Documentation/filesystems/efivarfs.rst @@ -0,0 +1,26 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +efivarfs - a (U)EFI variable filesystem +======================================= + +The efivarfs filesystem was created to address the shortcomings of +using entries in sysfs to maintain EFI variables. The old sysfs EFI +variables code only supported variables of up to 1024 bytes. This +limitation existed in version 0.99 of the EFI specification, but was +removed before any full releases. Since variables can now be larger +than a single page, sysfs isn't the best interface for this. + +Variables can be created, deleted and modified with the efivarfs +filesystem. + +efivarfs is typically mounted like this:: + + mount -t efivarfs none /sys/firmware/efi/efivars + +Due to the presence of numerous firmware bugs where removing non-standard +UEFI variables causes the system firmware to fail to POST, efivarfs +files that are not well-known standardized variables are created +as immutable files. This doesn't prevent removal - "chattr -i" will work - +but it does prevent this kind of failure from being accomplished +accidentally. 
diff --git a/Documentation/filesystems/efivarfs.txt b/Documentation/filesystems/efivarfs.txt deleted file mode 100644 index 686a64bba775..000000000000 --- a/Documentation/filesystems/efivarfs.txt +++ /dev/null @@ -1,23 +0,0 @@ - -efivarfs - a (U)EFI variable filesystem - -The efivarfs filesystem was created to address the shortcomings of -using entries in sysfs to maintain EFI variables. The old sysfs EFI -variables code only supported variables of up to 1024 bytes. This -limitation existed in version 0.99 of the EFI specification, but was -removed before any full releases. Since variables can now be larger -than a single page, sysfs isn't the best interface for this. - -Variables can be created, deleted and modified with the efivarfs -filesystem. - -efivarfs is typically mounted like this, - - mount -t efivarfs none /sys/firmware/efi/efivars - -Due to the presence of numerous firmware bugs where removing non-standard -UEFI variables causes the system firmware to fail to POST, efivarfs -files that are not well-known standardized variables are created -as immutable files. This doesn't prevent removal - "chattr -i" will work - -but it does prevent this kind of failure from being accomplished -accidentally. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index d6d69f1c9287..4230f49d2732 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -60,6 +60,7 @@ Documentation for filesystem implementations. debugfs dlmfs ecryptfs + efivarfs fuse overlayfs virtiofs -- cgit From e66d8631ddb3306bd9f463324c2d9a5d9dc559f7 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:01 +0100 Subject: docs: filesystems: convert erofs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add lists markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/402d1d2f7252b8a683f7a9c6867bc5428da64026.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/erofs.rst | 240 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/erofs.txt | 211 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 241 insertions(+), 211 deletions(-) create mode 100644 Documentation/filesystems/erofs.rst delete mode 100644 Documentation/filesystems/erofs.txt diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst new file mode 100644 index 000000000000..bf145171c2bf --- /dev/null +++ b/Documentation/filesystems/erofs.rst @@ -0,0 +1,240 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +Enhanced Read-Only File System - EROFS +====================================== + +Overview +======== + +EROFS file-system stands for Enhanced Read-Only File System. Different +from other read-only file systems, it aims to be designed for flexibility, +scalability, but be kept simple and high performance. 
+ +It is designed as a better filesystem solution for the following scenarios: + + - read-only storage media or + + - part of a fully trusted read-only solution, which means it needs to be + immutable and bit-for-bit identical to the official golden image for + their releases due to security and other considerations and + + - hope to save some extra storage space with guaranteed end-to-end performance + by using reduced metadata and transparent file compression, especially + for those embedded devices with limited memory (ex, smartphone); + +Here is the main features of EROFS: + + - Little endian on-disk design; + + - Currently 4KB block size (nobh) and therefore maximum 16TB address space; + + - Metadata & data could be mixed by design; + + - 2 inode versions for different requirements: + + ===================== ============ ===================================== + compact (v1) extended (v2) + ===================== ============ ===================================== + Inode metadata size 32 bytes 64 bytes + Max file size 4 GB 16 EB (also limited by max. vol size) + Max uids/gids 65536 4294967296 + File change time no yes (64 + 32-bit timestamp) + Max hardlinks 65536 4294967296 + Metadata reserved 4 bytes 14 bytes + ===================== ============ ===================================== + + - Support extended attributes (xattrs) as an option; + + - Support xattr inline and tail-end data inline for all files; + + - Support POSIX.1e ACLs by using xattrs; + + - Support transparent file compression as an option: + LZ4 algorithm with 4 KB fixed-sized output compression for high performance. 
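The 16TB address space quoted in the feature list follows from the block size. A minimal sketch of that arithmetic, assuming block addresses are 32 bits wide (an assumption; the list states only the 4KB block size and the resulting limit). The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes addressable with addr_bits-bit
 * block addresses when each block holds 2^block_bits bytes. */
static uint64_t max_volume_bytes(unsigned int addr_bits, unsigned int block_bits)
{
    /* The result, 2^(addr_bits + block_bits), must fit in 64 bits. */
    assert(addr_bits + block_bits < 64);
    return (uint64_t)1 << (addr_bits + block_bits);
}
```

With 32-bit block addresses and 4KB (2^12 byte) blocks, max_volume_bytes(32, 12) is 2^44 bytes, which is the 16TB limit stated above.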
+ +The following git tree provides the file system user-space tools under +development (ex, formatting tool mkfs.erofs): + +- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git + +Bugs and patches are welcome, please kindly help us and send to the following +linux-erofs mailing list: + +- linux-erofs mailing list + +Mount options +============= + +=================== ========================================================= +(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled + by default if CONFIG_EROFS_FS_XATTR is selected. +(no)acl Setup POSIX Access Control List. Note: acl is enabled + by default if CONFIG_EROFS_FS_POSIX_ACL is selected. +cache_strategy=%s Select a strategy for cached decompression from now on: + + ========== ============================================= + disabled In-place I/O decompression only; + readahead Cache the last incomplete compressed physical + cluster for further reading. It still does + in-place I/O decompression for the rest + compressed physical clusters; + readaround Cache the both ends of incomplete compressed + physical clusters for further reading. + It still does in-place I/O decompression + for the rest compressed physical clusters. + ========== ============================================= +=================== ========================================================= + +On-disk details +=============== + +Summary +------- +Different from other read-only file systems, an EROFS volume is designed +to be as simple as possible:: + + |-> aligned with the block size + ____________________________________________________________ + | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | + |_|__|_|_____|__________|_____|______|__________|_____|______| + 0 +1K + +All data areas should be aligned with the block size, but metadata areas +may not. All metadatas can be now observed in two different spaces (views): + + 1. 
Inode metadata space + + Each valid inode should be aligned with an inode slot, which is a fixed + value (32 bytes) and designed to be kept in line with compact inode size. + + Each inode can be directly found with the following formula: + inode offset = meta_blkaddr * block_size + 32 * nid + + :: + + |-> aligned with 8B + |-> followed closely + + meta_blkaddr blocks |-> another slot + _____________________________________________________________________ + | ... | inode | xattrs | extents | data inline | ... | inode ... + |________|_______|(optional)|(optional)|__(optional)_|_____|__________ + |-> aligned with the inode slot size + . . + . . + . . + . . + . . + . . + .____________________________________________________|-> aligned with 4B + | xattr_ibody_header | shared xattrs | inline xattrs | + |____________________|_______________|_______________| + |-> 12 bytes <-|->x * 4 bytes<-| . + . . . + . . . + . . . + ._______________________________.______________________. + | id | id | id | id | ... | id | ent | ... | ent| ... | + |____|____|____|____|______|____|_____|_____|____|_____| + |-> aligned with 4B + |-> aligned with 4B + + Inode could be 32 or 64 bytes, which can be distinguished from a common + field which all inode versions have -- i_format:: + + __________________ __________________ + | i_format | | i_format | + |__________________| |__________________| + | ... | | ... | + | | | | + |__________________| 32 bytes | | + | | + |__________________| 64 bytes + + Xattrs, extents, data inline are followed by the corresponding inode with + proper alignment, and they could be optional for different data mappings. 
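The inode offset formula given above can be written as a small C helper. This is a sketch for illustration only, not the kernel's implementation: the function name, the macro name and the integer widths chosen for meta_blkaddr and nid are assumptions; only the formula and the 32-byte slot size come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Inode slot size in bytes, as stated in the text. Macro name is hypothetical. */
#define EROFS_SKETCH_SLOT_SIZE 32u

/* Byte offset of the inode with number nid:
 *   inode offset = meta_blkaddr * block_size + 32 * nid
 * Argument widths are assumptions made for this sketch. */
static uint64_t inode_offset(uint32_t meta_blkaddr, uint32_t block_size,
                             uint64_t nid)
{
    assert(block_size != 0);
    return (uint64_t)meta_blkaddr * block_size + EROFS_SKETCH_SLOT_SIZE * nid;
}
```

For example, with meta_blkaddr = 1 and 4096-byte blocks, nid 0 sits at byte 4096 and nid 3 at byte 4192, since each slot is 32 bytes.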
+ _currently_ total 4 valid data mappings are supported: + + == ==================================================================== + 0 flat file data without data inline (no extent); + 1 fixed-sized output data compression (with non-compacted indexes); + 2 flat file data with tail packing data inline (no extent); + 3 fixed-sized output data compression (with compacted indexes, v5.3+). + == ==================================================================== + + The size of the optional xattrs is indicated by i_xattr_count in inode + header. Large xattrs or xattrs shared by many different files can be + stored in shared xattrs metadata rather than inlined right after inode. + + 2. Shared xattrs metadata space + + Shared xattrs space is similar to the above inode space, started with + a specific block indicated by xattr_blkaddr, organized one by one with + proper align. + + Each share xattr can also be directly found by the following formula: + xattr offset = xattr_blkaddr * block_size + 4 * xattr_id + + :: + + |-> aligned by 4 bytes + + xattr_blkaddr blocks |-> aligned with 4 bytes + _________________________________________________________________________ + | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... + |________|_____________|_____________|_____|______________|_______________ + +Directories +----------- +All directories are now organized in a compact on-disk format. Note that +each directory block is divided into index and name areas in order to support +random file lookup, and all directory entries are _strictly_ recorded in +alphabetical order in order to support improved prefix binary search +algorithm (could refer to the related source code). + +:: + + ___________________________ + / | + / ______________|________________ + / / | nameoff1 | nameoffN-1 + ____________.______________._______________v________________v__________ + | dirent | dirent | ... | dirent | filename | filename | ... 
| filename | + |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| + \ ^ + \ | * could have + \ | trailing '\0' + \________________________| nameoff0 + + Directory block + +Note that apart from the offset of the first filename, nameoff0 also indicates +the total number of directory entries in this block since it is no need to +introduce another on-disk field at all. + +Compression +----------- +Currently, EROFS supports 4KB fixed-sized output transparent file compression, +as illustrated below:: + + |---- Variant-Length Extent ----|-------- VLE --------|----- VLE ----- + clusterofs clusterofs clusterofs + | | | logical data + _________v_______________________________v_____________________v_______________ + ... | . | | . | | . | ... + ____|____.________|_____________|________.____|_____________|__.__________|____ + |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-| + size size size size size + . . . . + . . . . + . . . . + _______._____________._____________._____________._____________________ + ... | | | | ... physical data + _______|_____________|_____________|_____________|_____________________ + |-> cluster <-|-> cluster <-|-> cluster <-| + size size size + +Currently each on-disk physical cluster can contain 4KB (un)compressed data +at most. For each logical cluster, there is a corresponding on-disk index to +describe its cluster type, physical cluster address, etc. + +See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. diff --git a/Documentation/filesystems/erofs.txt b/Documentation/filesystems/erofs.txt deleted file mode 100644 index db6d39c3ae71..000000000000 --- a/Documentation/filesystems/erofs.txt +++ /dev/null @@ -1,211 +0,0 @@ -Overview -======== - -EROFS file-system stands for Enhanced Read-Only File System. Different -from other read-only file systems, it aims to be designed for flexibility, -scalability, but be kept simple and high performance. 
- -It is designed as a better filesystem solution for the following scenarios: - - read-only storage media or - - - part of a fully trusted read-only solution, which means it needs to be - immutable and bit-for-bit identical to the official golden image for - their releases due to security and other considerations and - - - hope to save some extra storage space with guaranteed end-to-end performance - by using reduced metadata and transparent file compression, especially - for those embedded devices with limited memory (ex, smartphone); - -Here is the main features of EROFS: - - Little endian on-disk design; - - - Currently 4KB block size (nobh) and therefore maximum 16TB address space; - - - Metadata & data could be mixed by design; - - - 2 inode versions for different requirements: - compact (v1) extended (v2) - Inode metadata size: 32 bytes 64 bytes - Max file size: 4 GB 16 EB (also limited by max. vol size) - Max uids/gids: 65536 4294967296 - File change time: no yes (64 + 32-bit timestamp) - Max hardlinks: 65536 4294967296 - Metadata reserved: 4 bytes 14 bytes - - - Support extended attributes (xattrs) as an option; - - - Support xattr inline and tail-end data inline for all files; - - - Support POSIX.1e ACLs by using xattrs; - - - Support transparent file compression as an option: - LZ4 algorithm with 4 KB fixed-sized output compression for high performance. - -The following git tree provides the file system user-space tools under -development (ex, formatting tool mkfs.erofs): ->> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git - -Bugs and patches are welcome, please kindly help us and send to the following -linux-erofs mailing list: ->> linux-erofs mailing list - -Mount options -============= - -(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled - by default if CONFIG_EROFS_FS_XATTR is selected. -(no)acl Setup POSIX Access Control List. Note: acl is enabled - by default if CONFIG_EROFS_FS_POSIX_ACL is selected. 
-cache_strategy=%s Select a strategy for cached decompression from now on: - disabled: In-place I/O decompression only; - readahead: Cache the last incomplete compressed physical - cluster for further reading. It still does - in-place I/O decompression for the rest - compressed physical clusters; - readaround: Cache the both ends of incomplete compressed - physical clusters for further reading. - It still does in-place I/O decompression - for the rest compressed physical clusters. - -On-disk details -=============== - -Summary -------- -Different from other read-only file systems, an EROFS volume is designed -to be as simple as possible: - - |-> aligned with the block size - ____________________________________________________________ - | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | - |_|__|_|_____|__________|_____|______|__________|_____|______| - 0 +1K - -All data areas should be aligned with the block size, but metadata areas -may not. All metadatas can be now observed in two different spaces (views): - 1. Inode metadata space - Each valid inode should be aligned with an inode slot, which is a fixed - value (32 bytes) and designed to be kept in line with compact inode size. - - Each inode can be directly found with the following formula: - inode offset = meta_blkaddr * block_size + 32 * nid - - |-> aligned with 8B - |-> followed closely - + meta_blkaddr blocks |-> another slot - _____________________________________________________________________ - | ... | inode | xattrs | extents | data inline | ... | inode ... - |________|_______|(optional)|(optional)|__(optional)_|_____|__________ - |-> aligned with the inode slot size - . . - . . - . . - . . - . . - . . - .____________________________________________________|-> aligned with 4B - | xattr_ibody_header | shared xattrs | inline xattrs | - |____________________|_______________|_______________| - |-> 12 bytes <-|->x * 4 bytes<-| . - . . . - . . . - . . . 
- ._______________________________.______________________. - | id | id | id | id | ... | id | ent | ... | ent| ... | - |____|____|____|____|______|____|_____|_____|____|_____| - |-> aligned with 4B - |-> aligned with 4B - - Inode could be 32 or 64 bytes, which can be distinguished from a common - field which all inode versions have -- i_format: - - __________________ __________________ - | i_format | | i_format | - |__________________| |__________________| - | ... | | ... | - | | | | - |__________________| 32 bytes | | - | | - |__________________| 64 bytes - - Xattrs, extents, data inline are followed by the corresponding inode with - proper alignment, and they could be optional for different data mappings. - _currently_ total 4 valid data mappings are supported: - - 0 flat file data without data inline (no extent); - 1 fixed-sized output data compression (with non-compacted indexes); - 2 flat file data with tail packing data inline (no extent); - 3 fixed-sized output data compression (with compacted indexes, v5.3+). - - The size of the optional xattrs is indicated by i_xattr_count in inode - header. Large xattrs or xattrs shared by many different files can be - stored in shared xattrs metadata rather than inlined right after inode. - - 2. Shared xattrs metadata space - Shared xattrs space is similar to the above inode space, started with - a specific block indicated by xattr_blkaddr, organized one by one with - proper align. - - Each share xattr can also be directly found by the following formula: - xattr offset = xattr_blkaddr * block_size + 4 * xattr_id - - |-> aligned by 4 bytes - + xattr_blkaddr blocks |-> aligned with 4 bytes - _________________________________________________________________________ - | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... - |________|_____________|_____________|_____|______________|_______________ - -Directories ------------ -All directories are now organized in a compact on-disk format. 
Note that -each directory block is divided into index and name areas in order to support -random file lookup, and all directory entries are _strictly_ recorded in -alphabetical order in order to support improved prefix binary search -algorithm (could refer to the related source code). - - ___________________________ - / | - / ______________|________________ - / / | nameoff1 | nameoffN-1 - ____________.______________._______________v________________v__________ -| dirent | dirent | ... | dirent | filename | filename | ... | filename | -|___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| - \ ^ - \ | * could have - \ | trailing '\0' - \________________________| nameoff0 - - Directory block - -Note that apart from the offset of the first filename, nameoff0 also indicates -the total number of directory entries in this block since it is no need to -introduce another on-disk field at all. - -Compression ------------ -Currently, EROFS supports 4KB fixed-sized output transparent file compression, -as illustrated below: - - |---- Variant-Length Extent ----|-------- VLE --------|----- VLE ----- - clusterofs clusterofs clusterofs - | | | logical data -_________v_______________________________v_____________________v_______________ -... | . | | . | | . | ... -____|____.________|_____________|________.____|_____________|__.__________|____ - |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-| - size size size size size - . . . . - . . . . - . . . . - _______._____________._____________._____________._____________________ - ... | | | | ... physical data - _______|_____________|_____________|_____________|_____________________ - |-> cluster <-|-> cluster <-|-> cluster <-| - size size size - -Currently each on-disk physical cluster can contain 4KB (un)compressed data -at most. For each logical cluster, there is a corresponding on-disk index to -describe its cluster type, physical cluster address, etc. 
- -See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 4230f49d2732..03a493b27920 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -61,6 +61,7 @@ Documentation for filesystem implementations. dlmfs ecryptfs efivarfs + erofs fuse overlayfs virtiofs -- cgit From 6e29ad2ea34f63f2b959807370672af569861378 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:02 +0100 Subject: docs: filesystems: convert ext2.txt to ReST - Add a SPDX header; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Use footnoote markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Jan Kara Link: https://lore.kernel.org/r/fde6721f0303259d830391e351dbde48f67f3ec7.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ext2.rst | 399 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ext2.txt | 388 ----------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 400 insertions(+), 388 deletions(-) create mode 100644 Documentation/filesystems/ext2.rst delete mode 100644 Documentation/filesystems/ext2.txt diff --git a/Documentation/filesystems/ext2.rst b/Documentation/filesystems/ext2.rst new file mode 100644 index 000000000000..d83dbbb162e2 --- /dev/null +++ b/Documentation/filesystems/ext2.rst @@ -0,0 +1,399 @@ +.. SPDX-License-Identifier: GPL-2.0 + + +The Second Extended Filesystem +============================== + +ext2 was originally released in January 1993. Written by R\'emy Card, +Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the +Extended Filesystem. It is currently still (April 2001) the predominant +filesystem in use by Linux. 
There are also implementations available +for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS. + +Options +======= + +Most defaults are determined by the filesystem superblock, and can be +set using tune2fs(8). Kernel-determined defaults are indicated by (*). + +==================== === ================================================ +bsddf (*) Makes ``df`` act like BSD. +minixdf Makes ``df`` act like Minix. + +check=none, nocheck (*) Don't do extra checking of bitmaps on mount + (check=normal and check=strict options removed) + +dax Use direct access (no page cache). See + Documentation/filesystems/dax.txt. + +debug Extra debugging information is sent to the + kernel syslog. Useful for developers. + +errors=continue Keep going on a filesystem error. +errors=remount-ro Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. + +grpid, bsdgroups Give objects the same group ID as their parent. +nogrpid, sysvgroups New objects have the group ID of their creator. + +nouid32 Use 16-bit UIDs and GIDs. + +oldalloc Enable the old block allocator. Orlov should + have better performance, we'd like to get some + feedback if it's the contrary for you. +orlov (*) Use the Orlov block allocator. + (See http://lwn.net/Articles/14633/ and + http://lwn.net/Articles/14446/.) + +resuid=n The user ID which may use the reserved blocks. +resgid=n The group ID which may use the reserved blocks. + +sb=n Use alternate superblock at this location. + +user_xattr Enable "user." POSIX Extended Attributes + (requires CONFIG_EXT2_FS_XATTR). +nouser_xattr Don't support "user." extended attributes. + +acl Enable POSIX Access Control Lists support + (requires CONFIG_EXT2_FS_POSIX_ACL). +noacl Don't support POSIX ACLs. + +nobh Do not attach buffer_heads to file pagecache. + +quota, usrquota Enable user disk quota support + (requires CONFIG_QUOTA). + +grpquota Enable group disk quota support + (requires CONFIG_QUOTA). 
+==================== === ================================================ + +noquota option ls silently ignored by ext2. + + +Specification +============= + +ext2 shares many properties with traditional Unix filesystems. It has +the concepts of blocks, inodes and directories. It has space in the +specification for Access Control Lists (ACLs), fragments, undeletion and +compression though these are not yet implemented (some are available as +separate patches). There is also a versioning mechanism to allow new +features (such as journalling) to be added in a maximally compatible +manner. + +Blocks +------ + +The space in the device or file is split up into blocks. These are +a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems), +which is decided when the filesystem is created. Smaller blocks mean +less wasted space per file, but require slightly more accounting overhead, +and also impose other limits on the size of files and the filesystem. + +Block Groups +------------ + +Blocks are clustered into block groups in order to reduce fragmentation +and minimise the amount of head seeking when reading a large amount +of consecutive data. Information about each block group is kept in a +descriptor table stored in the block(s) immediately after the superblock. +Two blocks near the start of each group are reserved for the block usage +bitmap and the inode usage bitmap which show which blocks and inodes +are in use. Since each bitmap is limited to a single block, this means +that the maximum size of a block group is 8 times the size of a block. + +The block(s) following the bitmaps in each block group are designated +as the inode table for that block group and the remainder are the data +blocks. The block allocation algorithm attempts to allocate data blocks +in the same block group as the inode which contains them. + +The Superblock +-------------- + +The superblock contains all the information about the configuration of +the filing system. 
The primary copy of the superblock is stored at an +offset of 1024 bytes from the start of the device, and it is essential +to mounting the filesystem. Since it is so important, backup copies of +the superblock are stored in block groups throughout the filesystem. +The first version of ext2 (revision 0) stores a copy at the start of +every block group, along with backups of the group descriptor block(s). +Because this can consume a considerable amount of space for large +filesystems, later revisions can optionally reduce the number of backup +copies by only putting backups in specific groups (this is the sparse +superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. + +The information in the superblock contains fields such as the total +number of inodes and blocks in the filesystem and how many are free, +how many inodes and blocks are in each block group, when the filesystem +was mounted (and if it was cleanly unmounted), when it was modified, +what version of the filesystem it is (see the Revisions section below) +and which OS created it. + +If the filesystem is revision 1 or higher, then there are extra fields, +such as a volume name, a unique identification number, the inode size, +and space for optional filesystem features to store configuration info. + +All fields in the superblock (as in all other ext2 structures) are stored +on the disc in little endian format, so a filesystem is portable between +machines without having to know what machine it was created on. + +Inodes +------ + +The inode (index node) is a fundamental concept in the ext2 filesystem. +Each object in the filesystem is represented by an inode. The inode +structure contains pointers to the filesystem blocks which contain the +data held in the object and all of the metadata about an object except +its name. 
The metadata about an object includes the permissions, owner, +group, flags, size, number of blocks used, access time, change time, +modification time, deletion time, number of links, fragments, version +(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs). + +There are some reserved fields which are currently unused in the inode +structure and several which are overloaded. One field is reserved for the +directory ACL if the inode is a directory and alternately for the top 32 +bits of the file size if the inode is a regular file (allowing file sizes +larger than 2GB). The translator field is unused under Linux, but is used +by the HURD to reference the inode of a program which will be used to +interpret this object. Most of the remaining reserved fields have been +used up for both Linux and the HURD for larger owner and group fields. +The HURD also has a larger mode field so it uses another of the remaining +fields to store the extra mode bits. + +There are pointers to the first 12 blocks which contain the file's data +in the inode. There is a pointer to an indirect block (which contains +pointers to the next set of blocks), a pointer to a doubly-indirect +block (which contains pointers to indirect blocks) and a pointer to a +trebly-indirect block (which contains pointers to doubly-indirect blocks). + +The flags field contains some ext2-specific flags which aren't catered +for by the standard chmod flags. These flags can be listed with lsattr +and changed with the chattr command, and allow specific filesystem +behaviour on a per-file basis. There are flags for secure deletion, +undeletable, compression, synchronous updates, immutability, append-only, +dumpable, no-atime, indexed directories, and data-journaling. Not all +of these are supported yet. + +Directories +----------- + +A directory is a filesystem object and has an inode just like a file.
+It is a specially formatted file containing records which associate +each name with an inode number. Later revisions of the filesystem also +encode the type of the object (file, directory, symlink, device, fifo, +socket) to avoid the need to check the inode itself for this information +(support for taking advantage of this feature does not yet exist in +Glibc 2.2). + +The inode allocation code tries to assign inodes which are in the same +block group as the directory in which they are first created. + +The current implementation of ext2 uses a singly-linked list to store +the filenames in the directory; a pending enhancement uses hashing of the +filenames to allow lookup without the need to scan the entire directory. + +The current implementation never removes empty directory blocks once they +have been allocated to hold more files. + +Special files +------------- + +Symbolic links are also filesystem objects with inodes. They deserve +special mention because the data for them is stored within the inode +itself if the symlink is less than 60 bytes long. It uses the fields +which would normally be used to store the pointers to data blocks. +This is a worthwhile optimisation as we avoid allocating a full +block for the symlink, and most symlinks are less than 60 characters long. + +Character and block special devices never have data blocks assigned to +them. Instead, their device number is stored in the inode, again reusing +the fields which would be used to point to the data blocks. + +Reserved Space +-------------- + +In ext2, there is a mechanism for reserving a certain number of blocks +for a particular user (normally the super-user). This is intended to +allow for the system to continue functioning even if non-privileged users +fill up all the space available to them (this is independent of filesystem +quotas). It also keeps the filesystem from filling up entirely which +helps combat fragmentation.
+ +Filesystem check +---------------- + +At boot time, most systems run a consistency check (e2fsck) on their +filesystems. The superblock of the ext2 filesystem contains several +fields which indicate whether fsck should actually run (since checking +the filesystem at boot can take a long time if it is large). fsck will +run if the filesystem was not cleanly unmounted, if the maximum mount +count has been exceeded or if the maximum time between checks has been +exceeded. + +Feature Compatibility +--------------------- + +The compatibility feature mechanism used in ext2 is sophisticated. +It safely allows features to be added to the filesystem, without +unnecessarily sacrificing compatibility with older versions of the +filesystem code. The feature compatibility mechanism is not supported by +the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in +revision 1. There are three 32-bit fields, one for compatible features +(COMPAT), one for read-only compatible (RO_COMPAT) features and one for +incompatible (INCOMPAT) features. + +These feature flags have specific meanings for the kernel as follows: + +A COMPAT flag indicates that a feature is present in the filesystem, +but the on-disk format is 100% compatible with older on-disk formats, so +a kernel which didn't know anything about this feature could read/write +the filesystem without any chance of corrupting the filesystem (or even +making it inconsistent). This is essentially just a flag which says +"this filesystem has a (hidden) feature" that the kernel or e2fsck may +want to be aware of (more on e2fsck and feature flags later). The ext3 +HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply +a regular file with data blocks in it so the kernel does not need to +take any special notice of it if it doesn't understand ext3 journaling. + +An RO_COMPAT flag indicates that the on-disk format is 100% compatible +with older on-disk formats for reading (i.e. 
the feature does not change +the visible on-disk format). However, an old kernel writing to such a +filesystem would/could corrupt the filesystem, so this is prevented. The +most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because +sparse groups allow file data blocks where superblock/group descriptor +backups used to live, and ext2_free_blocks() refuses to free these blocks, +which would lead to inconsistent bitmaps. An old kernel would also +get an error if it tried to free a series of blocks which crossed a group +boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem. + +An INCOMPAT flag indicates the on-disk format has changed in some +way that makes it unreadable by older kernels, or would otherwise +cause a problem if an old kernel tried to mount it. FILETYPE is an +INCOMPAT flag because older kernels would think a filename was longer +than 256 characters, which would lead to corrupt directory listings. +The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel +doesn't understand compression, you would just get garbage back from +read() instead of it automatically decompressing your data. The ext3 +RECOVER flag is needed to prevent a kernel which does not understand the +ext3 journal from mounting the filesystem without replaying the journal. + +e2fsck needs to be stricter than the kernel in its handling of these +flags. If it doesn't understand ANY of the COMPAT, +RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem, +because it has no way of verifying whether a given feature is valid +or not. Allowing e2fsck to succeed on a filesystem with an unknown +feature would give the user a false sense of security. Refusing to check +a filesystem with unknown features is a good incentive for the user to +update to the latest e2fsck. This also means that anyone adding feature +flags to ext2 also needs to update e2fsck to verify these features.
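
As a rough illustration of the rules above, the following Python sketch models how a kernel and a checker such as e2fsck might each react to feature bits they do not recognise. It is not taken from the ext2 source: the function names and the three KNOWN_* masks are hypothetical, and the bit values are chosen only for the example.

```python
# Feature bits this imaginary implementation understands.
# The masks and their values are assumptions made for this example.
KNOWN_COMPAT = 0x0004
KNOWN_RO_COMPAT = 0x0001
KNOWN_INCOMPAT = 0x0002


def kernel_mount_mode(compat, ro_compat, incompat):
    """Return 'refuse', 'read-only' or 'read-write' for a kernel-like mounter."""
    # Unknown INCOMPAT bits: the on-disk format cannot be interpreted safely.
    if incompat & ~KNOWN_INCOMPAT:
        return "refuse"
    # Unknown RO_COMPAT bits: reading is safe, but writing could corrupt data.
    if ro_compat & ~KNOWN_RO_COMPAT:
        return "read-only"
    # Unknown COMPAT bits are harmless, so 'compat' is deliberately ignored.
    return "read-write"


def fsck_can_check(compat, ro_compat, incompat):
    """A checker is stricter: an unknown bit in any field means refusal."""
    unknown = (
        (compat & ~KNOWN_COMPAT)
        | (ro_compat & ~KNOWN_RO_COMPAT)
        | (incompat & ~KNOWN_INCOMPAT)
    )
    return unknown == 0
```

The asymmetry is the point of the sketch: the mounter only degrades as far as safety requires, while the checker treats every unknown bit as a reason to stop.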
+ +Metadata +-------- + +It is frequently claimed that the ext2 implementation of writing +asynchronous metadata is faster than the ffs synchronous metadata +scheme but less reliable. Both methods are equally resolvable by their +respective fsck programs. + +If you're exceptionally paranoid, there are 3 ways of making metadata +writes synchronous on ext2: + +- per-file if you have the program source: use the O_SYNC flag to open() +- per-file if you don't have the source: use "chattr +S" on the file +- per-filesystem: add the "sync" option to mount (or in /etc/fstab) + +the first and last are not ext2 specific but do force the metadata to +be written synchronously. See also Journaling below. + +Limitations +----------- + +There are various limits imposed by the on-disk layout of ext2. Other +limits are imposed by the current implementation of the kernel code. +Many of the limits are determined at the time the filesystem is first +created, and depend upon the block size chosen. The ratio of inodes to +data blocks is fixed at filesystem creation time, so the only way to +increase the number of inodes is to increase the size of the filesystem. +No tools currently exist which can change the ratio of inodes to blocks. + +Most of these limits could be overcome with slight changes in the on-disk +format and using a compatibility flag to signal the format change (at +the expense of some compatibility). + +===================== ======= ======= ======= ======== +Filesystem block size 1kB 2kB 4kB 8kB +===================== ======= ======= ======= ======== +File size limit 16GB 256GB 2048GB 2048GB +Filesystem size limit 2047GB 8192GB 16384GB 32768GB +===================== ======= ======= ======= ======== + +There is a 2.4 kernel limit of 2048GB for a single block device, so no +filesystem larger than that can be created at this time. 
There is also +an upper limit on the block size imposed by the page size of the kernel, +so 8kB blocks are only allowed on Alpha systems (and other architectures +which support larger pages). + +There is an upper limit of 32000 subdirectories in a single directory. + +There is a "soft" upper limit of about 10-15k files in a single directory +with the current linear linked-list directory implementation. This limit +stems from performance problems when creating and deleting (and also +finding) files in such large directories. Using a hashed directory index +(under development) allows 100k-1M+ files in a single directory without +performance problems (although RAM size becomes an issue at this point). + +The (meaningless) absolute upper limit of files in a single directory +(imposed by the file size, the realistic limit is obviously much less) +is over 130 trillion files. It would be higher except there are not +enough 4-character names to make up unique directory entries, so they +have to be 8 character filenames, even then we are fairly close to +running out of unique filenames. + +Journaling +---------- + +A journaling extension to the ext2 code has been developed by Stephen +Tweedie. It avoids the risks of metadata corruption and the need to +wait for e2fsck to complete after a crash, without requiring a change +to the on-disk ext2 layout. In a nutshell, the journal is a regular +file which stores whole metadata (and optionally data) blocks that have +been modified, prior to writing them into the filesystem. This means +it is possible to add a journal to an existing ext2 filesystem without +the need for data conversion. + +When changes to the filesystem (e.g. a file is renamed) they are stored in +a transaction in the journal and can either be complete or incomplete at +the time of a crash. 
If a transaction is complete at the time of a crash +(or in the normal case where the system does not crash), then any blocks +in that transaction are guaranteed to represent a valid filesystem state, +and are copied into the filesystem. If a transaction is incomplete at +the time of the crash, then there is no guarantee of consistency for +the blocks in that transaction so they are discarded (which means any +filesystem changes they represent are also lost). +Check Documentation/filesystems/ext4/ if you want to read more about +ext4 and journaling. + +References +========== + +======================= =============================================== +The kernel source file:/usr/src/linux/fs/ext2/ +e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ +Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html +Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ +Filesystem Resizing http://ext2resize.sourceforge.net/ +Compression [1]_ http://e2compr.sourceforge.net/ +======================= =============================================== + +Implementations for: + +======================= =========================================================== +Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs +Windows 95 [1]_ http://www.yipton.net/content.html#FSDEXT2 +DOS client [1]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +OS/2 [2]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/ +======================= =========================================================== + +.. [1] no longer actively developed/supported (as of Apr 2001) +.. 
[2] no longer actively developed/supported (as of Mar 2009) diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt deleted file mode 100644 index 94c2cf0292f5..000000000000 --- a/Documentation/filesystems/ext2.txt +++ /dev/null @@ -1,388 +0,0 @@ - -The Second Extended Filesystem -============================== - -ext2 was originally released in January 1993. Written by R\'emy Card, -Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the -Extended Filesystem. It is currently still (April 2001) the predominant -filesystem in use by Linux. There are also implementations available -for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS. - -Options -======= - -Most defaults are determined by the filesystem superblock, and can be -set using tune2fs(8). Kernel-determined defaults are indicated by (*). - -bsddf (*) Makes `df' act like BSD. -minixdf Makes `df' act like Minix. - -check=none, nocheck (*) Don't do extra checking of bitmaps on mount - (check=normal and check=strict options removed) - -dax Use direct access (no page cache). See - Documentation/filesystems/dax.txt. - -debug Extra debugging information is sent to the - kernel syslog. Useful for developers. - -errors=continue Keep going on a filesystem error. -errors=remount-ro Remount the filesystem read-only on an error. -errors=panic Panic and halt the machine if an error occurs. - -grpid, bsdgroups Give objects the same group ID as their parent. -nogrpid, sysvgroups New objects have the group ID of their creator. - -nouid32 Use 16-bit UIDs and GIDs. - -oldalloc Enable the old block allocator. Orlov should - have better performance, we'd like to get some - feedback if it's the contrary for you. -orlov (*) Use the Orlov block allocator. - (See http://lwn.net/Articles/14633/ and - http://lwn.net/Articles/14446/.) - -resuid=n The user ID which may use the reserved blocks. -resgid=n The group ID which may use the reserved blocks. 
- -sb=n Use alternate superblock at this location. - -user_xattr Enable "user." POSIX Extended Attributes - (requires CONFIG_EXT2_FS_XATTR). -nouser_xattr Don't support "user." extended attributes. - -acl Enable POSIX Access Control Lists support - (requires CONFIG_EXT2_FS_POSIX_ACL). -noacl Don't support POSIX ACLs. - -nobh Do not attach buffer_heads to file pagecache. - -quota, usrquota Enable user disk quota support - (requires CONFIG_QUOTA). - -grpquota Enable group disk quota support - (requires CONFIG_QUOTA). - -noquota option ls silently ignored by ext2. - - -Specification -============= - -ext2 shares many properties with traditional Unix filesystems. It has -the concepts of blocks, inodes and directories. It has space in the -specification for Access Control Lists (ACLs), fragments, undeletion and -compression though these are not yet implemented (some are available as -separate patches). There is also a versioning mechanism to allow new -features (such as journalling) to be added in a maximally compatible -manner. - -Blocks ------- - -The space in the device or file is split up into blocks. These are -a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems), -which is decided when the filesystem is created. Smaller blocks mean -less wasted space per file, but require slightly more accounting overhead, -and also impose other limits on the size of files and the filesystem. - -Block Groups ------------- - -Blocks are clustered into block groups in order to reduce fragmentation -and minimise the amount of head seeking when reading a large amount -of consecutive data. Information about each block group is kept in a -descriptor table stored in the block(s) immediately after the superblock. -Two blocks near the start of each group are reserved for the block usage -bitmap and the inode usage bitmap which show which blocks and inodes -are in use. 
Since each bitmap is limited to a single block, this means -that the maximum size of a block group is 8 times the size of a block. - -The block(s) following the bitmaps in each block group are designated -as the inode table for that block group and the remainder are the data -blocks. The block allocation algorithm attempts to allocate data blocks -in the same block group as the inode which contains them. - -The Superblock --------------- - -The superblock contains all the information about the configuration of -the filing system. The primary copy of the superblock is stored at an -offset of 1024 bytes from the start of the device, and it is essential -to mounting the filesystem. Since it is so important, backup copies of -the superblock are stored in block groups throughout the filesystem. -The first version of ext2 (revision 0) stores a copy at the start of -every block group, along with backups of the group descriptor block(s). -Because this can consume a considerable amount of space for large -filesystems, later revisions can optionally reduce the number of backup -copies by only putting backups in specific groups (this is the sparse -superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. - -The information in the superblock contains fields such as the total -number of inodes and blocks in the filesystem and how many are free, -how many inodes and blocks are in each block group, when the filesystem -was mounted (and if it was cleanly unmounted), when it was modified, -what version of the filesystem it is (see the Revisions section below) -and which OS created it. - -If the filesystem is revision 1 or higher, then there are extra fields, -such as a volume name, a unique identification number, the inode size, -and space for optional filesystem features to store configuration info. 
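
The sparse superblock rule described above (backups only in groups 0, 1 and powers of 3, 5 and 7) can be sketched in a few lines of Python. This is an illustrative model rather than the kernel's code, and both function names are hypothetical.

```python
def is_power_of(n, base):
    """True if n equals base**k for some integer k >= 0."""
    if n < 1:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def group_has_superblock_backup(group):
    """Model of the sparse superblock rule for a given block group number."""
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))
```

Under this rule the groups below 30 that carry a copy are 0, 1, 3, 5, 7, 9, 25 and 27, so the number of backups grows only logarithmically with the size of the filesystem.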
- -All fields in the superblock (as in all other ext2 structures) are stored -on the disc in little endian format, so a filesystem is portable between -machines without having to know what machine it was created on. - -Inodes ------- - -The inode (index node) is a fundamental concept in the ext2 filesystem. -Each object in the filesystem is represented by an inode. The inode -structure contains pointers to the filesystem blocks which contain the -data held in the object and all of the metadata about an object except -its name. The metadata about an object includes the permissions, owner, -group, flags, size, number of blocks used, access time, change time, -modification time, deletion time, number of links, fragments, version -(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs). - -There are some reserved fields which are currently unused in the inode -structure and several which are overloaded. One field is reserved for the -directory ACL if the inode is a directory and alternately for the top 32 -bits of the file size if the inode is a regular file (allowing file sizes -larger than 2GB). The translator field is unused under Linux, but is used -by the HURD to reference the inode of a program which will be used to -interpret this object. Most of the remaining reserved fields have been -used up for both Linux and the HURD for larger owner and group fields, -The HURD also has a larger mode field so it uses another of the remaining -fields to store the extra more bits. - -There are pointers to the first 12 blocks which contain the file's data -in the inode. There is a pointer to an indirect block (which contains -pointers to the next set of blocks), a pointer to a doubly-indirect -block (which contains pointers to indirect blocks) and a pointer to a -trebly-indirect block (which contains pointers to doubly-indirect blocks). - -The flags field contains some ext2-specific flags which aren't catered -for by the standard chmod flags. 
These flags can be listed with lsattr -and changed with the chattr command, and allow specific filesystem -behaviour on a per-file basis. There are flags for secure deletion, -undeletable, compression, synchronous updates, immutability, append-only, -dumpable, no-atime, indexed directories, and data-journaling. Not all -of these are supported yet. - -Directories ------------ - -A directory is a filesystem object and has an inode just like a file. -It is a specially formatted file containing records which associate -each name with an inode number. Later revisions of the filesystem also -encode the type of the object (file, directory, symlink, device, fifo, -socket) to avoid the need to check the inode itself for this information -(support for taking advantage of this feature does not yet exist in -Glibc 2.2). - -The inode allocation code tries to assign inodes which are in the same -block group as the directory in which they are first created. - -The current implementation of ext2 uses a singly-linked list to store -the filenames in the directory; a pending enhancement uses hashing of the -filenames to allow lookup without the need to scan the entire directory. - -The current implementation never removes empty directory blocks once they -have been allocated to hold more files. - -Special files -------------- - -Symbolic links are also filesystem objects with inodes. They deserve -special mention because the data for them is stored within the inode -itself if the symlink is less than 60 bytes long. It uses the fields -which would normally be used to store the pointers to data blocks. -This is a worthwhile optimisation as we avoid allocating a full -block for the symlink, and most symlinks are less than 60 characters long. - -Character and block special devices never have data blocks assigned to -them. Instead, their device number is stored in the inode, again reusing -the fields which would be used to point to the data blocks.
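The 60-byte threshold for in-inode symlinks is consistent with the pointer layout described under Inodes: 12 direct pointers plus one each of indirect, doubly-indirect and trebly-indirect. The sketch below assumes 4-byte block pointers, which the text above does not state explicitly; the names are hypothetical.

```python
DIRECT_POINTERS = 12    # from the Inodes section
INDIRECT_POINTERS = 3   # indirect, doubly-indirect, trebly-indirect
POINTER_SIZE = 4        # bytes per pointer (assumption: 32-bit block numbers)

# Space in the inode that normally holds block pointers.
INLINE_AREA_BYTES = (DIRECT_POINTERS + INDIRECT_POINTERS) * POINTER_SIZE


def symlink_fits_in_inode(target):
    """True if a symlink target (bytes) is short enough to be stored in the inode.

    Follows the text's rule of "less than 60 bytes long".
    """
    return len(target) < INLINE_AREA_BYTES
```

Under these assumptions the pointer area comes to 15 * 4 = 60 bytes, so a 59-byte target is stored in the inode while a 60-byte target needs a data block.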
- -Reserved Space --------------- - -In ext2, there is a mechanism for reserving a certain number of blocks -for a particular user (normally the super-user). This is intended to -allow for the system to continue functioning even if non-privileged users -fill up all the space available to them (this is independent of filesystem -quotas). It also keeps the filesystem from filling up entirely which -helps combat fragmentation. - -Filesystem check ----------------- - -At boot time, most systems run a consistency check (e2fsck) on their -filesystems. The superblock of the ext2 filesystem contains several -fields which indicate whether fsck should actually run (since checking -the filesystem at boot can take a long time if it is large). fsck will -run if the filesystem was not cleanly unmounted, if the maximum mount -count has been exceeded or if the maximum time between checks has been -exceeded. - -Feature Compatibility ---------------------- - -The compatibility feature mechanism used in ext2 is sophisticated. -It safely allows features to be added to the filesystem, without -unnecessarily sacrificing compatibility with older versions of the -filesystem code. The feature compatibility mechanism is not supported by -the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in -revision 1. There are three 32-bit fields, one for compatible features -(COMPAT), one for read-only compatible (RO_COMPAT) features and one for -incompatible (INCOMPAT) features. - -These feature flags have specific meanings for the kernel as follows: - -A COMPAT flag indicates that a feature is present in the filesystem, -but the on-disk format is 100% compatible with older on-disk formats, so -a kernel which didn't know anything about this feature could read/write -the filesystem without any chance of corrupting the filesystem (or even -making it inconsistent). 
This is essentially just a flag which says -"this filesystem has a (hidden) feature" that the kernel or e2fsck may -want to be aware of (more on e2fsck and feature flags later). The ext3 -HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply -a regular file with data blocks in it so the kernel does not need to -take any special notice of it if it doesn't understand ext3 journaling. - -An RO_COMPAT flag indicates that the on-disk format is 100% compatible -with older on-disk formats for reading (i.e. the feature does not change -the visible on-disk format). However, an old kernel writing to such a -filesystem would/could corrupt the filesystem, so this is prevented. The -most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because -sparse groups allow file data blocks where superblock/group descriptor -backups used to live, and ext2_free_blocks() refuses to free these blocks, -which would lead to inconsistent bitmaps. An old kernel would also -get an error if it tried to free a series of blocks which crossed a group -boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem. - -An INCOMPAT flag indicates the on-disk format has changed in some -way that makes it unreadable by older kernels, or would otherwise -cause a problem if an old kernel tried to mount it. FILETYPE is an -INCOMPAT flag because older kernels would think a filename was longer -than 256 characters, which would lead to corrupt directory listings. -The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel -doesn't understand compression, you would just get garbage back from -read() instead of it automatically decompressing your data. The ext3 -RECOVER flag is needed to prevent a kernel which does not understand the -ext3 journal from mounting the filesystem without replaying the journal. - -For e2fsck, it needs to be more strict with the handling of these -flags than the kernel.
If it doesn't understand ANY of the COMPAT, -RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem, -because it has no way of verifying whether a given feature is valid -or not. Allowing e2fsck to succeed on a filesystem with an unknown -feature is a false sense of security for the user. Refusing to check -a filesystem with unknown features is a good incentive for the user to -update to the latest e2fsck. This also means that anyone adding feature -flags to ext2 also needs to update e2fsck to verify these features. - -Metadata --------- - -It is frequently claimed that the ext2 implementation of writing -asynchronous metadata is faster than the ffs synchronous metadata -scheme but less reliable. Both methods are equally resolvable by their -respective fsck programs. - -If you're exceptionally paranoid, there are 3 ways of making metadata -writes synchronous on ext2: - -per-file if you have the program source: use the O_SYNC flag to open() -per-file if you don't have the source: use "chattr +S" on the file -per-filesystem: add the "sync" option to mount (or in /etc/fstab) - -the first and last are not ext2 specific but do force the metadata to -be written synchronously. See also Journaling below. - -Limitations ------------ - -There are various limits imposed by the on-disk layout of ext2. Other -limits are imposed by the current implementation of the kernel code. -Many of the limits are determined at the time the filesystem is first -created, and depend upon the block size chosen. The ratio of inodes to -data blocks is fixed at filesystem creation time, so the only way to -increase the number of inodes is to increase the size of the filesystem. -No tools currently exist which can change the ratio of inodes to blocks. - -Most of these limits could be overcome with slight changes in the on-disk -format and using a compatibility flag to signal the format change (at -the expense of some compatibility). 
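As a rough check on the file size limits tabulated for this section, the sketch below derives them from the inode pointer layout described under Inodes (12 direct pointers plus indirect, doubly-indirect and trebly-indirect blocks). Two values are assumptions not stated in the text: block pointers are taken to be 4 bytes wide, and the 2048GB ceiling is assumed to come from a 32-bit count of 512-byte sectors. The function names are hypothetical.

```python
DIRECT_POINTERS = 12             # direct block pointers in the inode
POINTER_SIZE = 4                 # bytes per block pointer (assumption)
SECTOR_COUNT_CAP = 2**32 * 512   # assumed cap: 32-bit count of 512-byte sectors


def addressable_blocks(block_size):
    """Blocks reachable through direct, indirect, double and triple pointers."""
    per_block = block_size // POINTER_SIZE  # pointers that fit in one block
    return DIRECT_POINTERS + per_block + per_block**2 + per_block**3


def file_size_limit(block_size):
    """Largest file size in bytes under the assumptions stated above."""
    return min(addressable_blocks(block_size) * block_size, SECTOR_COUNT_CAP)


for block_size in (1024, 2048, 4096, 8192):
    print(block_size, file_size_limit(block_size) // 2**30, "GiB")
```

Rounded down to whole GiB, this gives 16, 256, 2048 and 2048 for block sizes of 1kB, 2kB, 4kB and 8kB, which agrees with the "File size limit" row of the table.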
- -Filesystem block size: 1kB 2kB 4kB 8kB - -File size limit: 16GB 256GB 2048GB 2048GB -Filesystem size limit: 2047GB 8192GB 16384GB 32768GB - -There is a 2.4 kernel limit of 2048GB for a single block device, so no -filesystem larger than that can be created at this time. There is also -an upper limit on the block size imposed by the page size of the kernel, -so 8kB blocks are only allowed on Alpha systems (and other architectures -which support larger pages). - -There is an upper limit of 32000 subdirectories in a single directory. - -There is a "soft" upper limit of about 10-15k files in a single directory -with the current linear linked-list directory implementation. This limit -stems from performance problems when creating and deleting (and also -finding) files in such large directories. Using a hashed directory index -(under development) allows 100k-1M+ files in a single directory without -performance problems (although RAM size becomes an issue at this point). - -The (meaningless) absolute upper limit of files in a single directory -(imposed by the file size, the realistic limit is obviously much less) -is over 130 trillion files. It would be higher except there are not -enough 4-character names to make up unique directory entries, so they -have to be 8 character filenames, even then we are fairly close to -running out of unique filenames. - -Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. 
a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). -Check Documentation/filesystems/ext4/ if you want to read more about -ext4 and journaling. - -References -========== - -The kernel source file:/usr/src/linux/fs/ext2/ -e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ -Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html -Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ -Filesystem Resizing http://ext2resize.sourceforge.net/ -Compression (*) http://e2compr.sourceforge.net/ - -Implementations for: -Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs -Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2 -DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ -OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ -RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/ - -(*) no longer actively developed/supported (as of Apr 2001) -(+) no longer actively developed/supported (as of Mar 2009) diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 03a493b27920..102b3b65486a 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -62,6 +62,7 @@ Documentation for filesystem implementations. 
ecryptfs efivarfs erofs + ext2 fuse overlayfs virtiofs -- cgit From 7dc62406320c4103bbdeeeecd0a7ef03e3e58009 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:03 +0100 Subject: docs: filesystems: convert ext3.txt to ReST Nothing really required here. Just renaming would be enough. Yet, while here, lets add a SPDX header and adjust document title to met the same standard we're using on most docs. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/26960235e3e7c972bd543f5dd59f1ef4f3a877c6.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ext3.rst | 14 ++++++++++++++ Documentation/filesystems/ext3.txt | 12 ------------ Documentation/filesystems/index.rst | 1 + 3 files changed, 15 insertions(+), 12 deletions(-) create mode 100644 Documentation/filesystems/ext3.rst delete mode 100644 Documentation/filesystems/ext3.txt diff --git a/Documentation/filesystems/ext3.rst b/Documentation/filesystems/ext3.rst new file mode 100644 index 000000000000..c06cec3a8fdc --- /dev/null +++ b/Documentation/filesystems/ext3.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Ext3 Filesystem +=============== + +Ext3 was originally released in September 1999. Written by Stephen Tweedie +for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, +Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. + +Ext3 is the ext2 filesystem enhanced with journalling capabilities. The +filesystem is a subset of ext4 filesystem so use ext4 driver for accessing +ext3 filesystems. + diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt deleted file mode 100644 index 58758fbef9e0..000000000000 --- a/Documentation/filesystems/ext3.txt +++ /dev/null @@ -1,12 +0,0 @@ - -Ext3 Filesystem -=============== - -Ext3 was originally released in September 1999. 
Written by Stephen Tweedie -for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, -Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. - -Ext3 is the ext2 filesystem enhanced with journalling capabilities. The -filesystem is a subset of ext4 filesystem so use ext4 driver for accessing -ext3 filesystems. - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 102b3b65486a..aa2c3d1de3de 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -63,6 +63,7 @@ Documentation for filesystem implementations. efivarfs erofs ext2 + ext3 fuse overlayfs virtiofs -- cgit From 89272ca1102e000f7dbca724b7b106e688199a5d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:04 +0100 Subject: docs: filesystems: convert f2fs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/8dd156320b0c015dec6d3f848d03ea057042a15b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/f2fs.rst | 762 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/f2fs.txt | 730 ---------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 763 insertions(+), 730 deletions(-) create mode 100644 Documentation/filesystems/f2fs.rst delete mode 100644 Documentation/filesystems/f2fs.txt diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst new file mode 100644 index 000000000000..d681203728d7 --- /dev/null +++ b/Documentation/filesystems/f2fs.rst @@ -0,0 +1,762 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +WHAT IS Flash-Friendly File System (F2FS)? 
+========================================== + +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have +been equipped on a variety systems ranging from mobile to server systems. Since +they are known to have different characteristics from the conventional rotating +disks, a file system, an upper layer to the storage device, should adapt to the +changes from the sketch in the design level. + +F2FS is a file system exploiting NAND flash memory-based storage devices, which +is based on Log-structured File System (LFS). The design has been focused on +addressing the fundamental issues in LFS, which are snowball effect of wandering +tree and high cleaning overhead. + +Since a NAND flash memory-based storage device shows different characteristic +according to its internal geometry or flash memory management scheme, namely FTL, +F2FS and its tools support various parameters not only for configuring on-disk +layout, but also for selecting allocation and cleaning algorithms. + +The following git tree provides the file system formatting tool (mkfs.f2fs), +a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs). + +- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git + +For reporting bugs and sending patches, please use the following mailing list: + +- linux-f2fs-devel@lists.sourceforge.net + +Background and Design issues +============================ + +Log-structured File System (LFS) +-------------------------------- +"A log-structured file system writes all modifications to disk sequentially in +a log-like structure, thereby speeding up both file writing and crash recovery. +The log is the only structure on disk; it contains indexing information so that +files can be read back from the log efficiently. In order to maintain large free +areas on disk for fast writing, we divide the log into segments and use a +segment cleaner to compress the live information from heavily fragmented +segments." from Rosenblum, M. 
and Ousterhout, J. K., 1992, "The design and +implementation of a log-structured file system", ACM Trans. Computer Systems +10, 1, 26–52. + +Wandering Tree Problem +---------------------- +In LFS, when file data is updated and written to the end of the log, its direct +pointer block is updated due to the changed location. Then the indirect pointer +block is also updated due to the direct pointer block update. In this manner, +the upper index structures such as inode, inode map, and checkpoint block are +also updated recursively. This problem is called the wandering tree problem [1], +and in order to enhance performance, the file system should eliminate or relax +the update propagation as much as possible. + +[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/ + +Cleaning Overhead +----------------- +Since LFS is based on out-of-place writes, it produces many obsolete blocks +scattered across the whole storage. In order to provide new empty log space, it +needs to reclaim these obsolete blocks transparently to users. This job is called +the cleaning process. + +The process consists of four operations as follows. + +1. A victim segment is selected by referencing the segment usage table. +2. It loads parent index structures of all the data in the victim identified by + segment summary blocks. +3. It checks the cross-reference between the data and its parent index structure. +4. It moves valid data selectively. + +This cleaning job may cause unexpectedly long delays, so the most important goal +is to hide the latencies from users. It should also reduce the +amount of valid data to be moved, and move it quickly as well.
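The cleaning steps listed above can be illustrated with a toy model. This is a minimal sketch, not the f2fs implementation: it assumes a simple greedy policy (pick the segment with the fewest valid blocks), and the dictionaries standing in for the segment usage table, the summary blocks and the parent index are hypothetical.

```python
def select_victim(usage_table):
    """Pick the segment with the fewest valid blocks (assumed greedy policy).

    usage_table maps segment number -> count of valid blocks.
    """
    return min(usage_table, key=lambda segno: usage_table[segno])


def collect_valid_blocks(victim, segments, summaries, parent_index):
    """Return the (owner, data) pairs in the victim that are still live.

    segments:     segment number -> list of block contents
    summaries:    segment number -> list of owners, one per block
    parent_index: owner -> (segment number, offset) of its current block
    """
    live = []
    for offset, owner in enumerate(summaries[victim]):
        # Cross-reference check: the block is valid only if the parent
        # index still points at this exact location.
        if parent_index.get(owner) == (victim, offset):
            live.append((owner, segments[victim][offset]))
    return live
```

The blocks returned by `collect_valid_blocks` are the ones a cleaner would have to rewrite elsewhere before the victim segment can be reused, which is why choosing a victim with few valid blocks reduces the amount of data moved.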
+ +Key Features +============ + +Flash Awareness +--------------- +- Enlarge the random write area for better performance, but provide the high + spatial locality +- Align FS data structures to the operational units in FTL as best efforts + +Wandering Tree Problem +---------------------- +- Use a term, “node”, that represents inodes as well as various pointer blocks +- Introduce Node Address Table (NAT) containing the locations of all the “node” + blocks; this will cut off the update propagation. + +Cleaning Overhead +----------------- +- Support a background cleaning process +- Support greedy and cost-benefit algorithms for victim selection policies +- Support multi-head logs for static/dynamic hot and cold data separation +- Introduce adaptive logging for efficient block allocation + +Mount Options +============= + + +====================== ============================================================ +background_gc=%s Turn on/off cleaning operations, namely garbage + collection, triggered in background when I/O subsystem is + idle. If background_gc=on, it will turn on the garbage + collection and if background_gc=off, garbage collection + will be turned off. If background_gc=sync, it will turn + on synchronous garbage collection running in background. + Default value for this option is on. So garbage + collection is on by default. +disable_roll_forward Disable the roll-forward recovery routine +norecovery Disable the roll-forward recovery routine, mounted read- + only (i.e., -o ro,disable_roll_forward) +discard/nodiscard Enable/disable real-time discard in f2fs, if discard is + enabled, f2fs will issue discard/TRIM commands when a + segment is cleaned. +no_heap Disable heap-style segment allocation which finds free + segments for data from the beginning of main area, while + for node from the end of main area. +nouser_xattr Disable Extended User Attributes. Note: xattr is enabled + by default if CONFIG_F2FS_FS_XATTR is selected. 
+noacl Disable POSIX Access Control List. Note: acl is enabled + by default if CONFIG_F2FS_FS_POSIX_ACL is selected. +active_logs=%u Support configuring the number of active logs. In the + current design, f2fs supports only 2, 4, and 6 logs. + Default number is 6. +disable_ext_identify Disable the extension list configured by mkfs, so f2fs + does not aware of cold files such as media files. +inline_xattr Enable the inline xattrs feature. +noinline_xattr Disable the inline xattrs feature. +inline_xattr_size=%u Support configuring inline xattr size, it depends on + flexible inline xattr feature. +inline_data Enable the inline data feature: New created small(<~3.4k) + files can be written into inode block. +inline_dentry Enable the inline dir feature: data in new created + directory entries can be written into inode block. The + space of inode block which is used to store inline + dentries is limited to ~3.4k. +noinline_dentry Disable the inline dentry feature. +flush_merge Merge concurrent cache_flush commands as much as possible + to eliminate redundant command issues. If the underlying + device handles the cache_flush command relatively slowly, + recommend to enable this option. +nobarrier This option can be used if underlying storage guarantees + its cached data should be written to the novolatile area. + If this option is set, no cache_flush commands are issued + but f2fs still guarantees the write ordering of all the + data writes. +fastboot This option is used when a system wants to reduce mount + time as much as possible, even though normal performance + can be sacrificed. +extent_cache Enable an extent cache based on rb-tree, it can cache + as many as extent which map between contiguous logical + address and physical address per inode, resulting in + increasing the cache hit ratio. Set by default. +noextent_cache Disable an extent cache based on rb-tree explicitly, see + the above extent_cache mount option. 
+noinline_data Disable the inline data feature, inline data feature is + enabled by default. +data_flush Enable data flushing before checkpoint in order to + persist data of regular and symlink. +reserve_root=%d Support configuring reserved space which is used for + allocation from a privileged user with specified uid or + gid, unit: 4KB, the default limit is 0.2% of user blocks. +resuid=%d The user ID which may use the reserved blocks. +resgid=%d The group ID which may use the reserved blocks. +fault_injection=%d Enable fault injection in all supported types with + specified injection rate. +fault_type=%d Support configuring fault injection type, should be + enabled with fault_injection option, fault type value + is shown below, it supports single or combined type. + + =================== =========== + Type_Name Type_Value + =================== =========== + FAULT_KMALLOC 0x000000001 + FAULT_KVMALLOC 0x000000002 + FAULT_PAGE_ALLOC 0x000000004 + FAULT_PAGE_GET 0x000000008 + FAULT_ALLOC_BIO 0x000000010 + FAULT_ALLOC_NID 0x000000020 + FAULT_ORPHAN 0x000000040 + FAULT_BLOCK 0x000000080 + FAULT_DIR_DEPTH 0x000000100 + FAULT_EVICT_INODE 0x000000200 + FAULT_TRUNCATE 0x000000400 + FAULT_READ_IO 0x000000800 + FAULT_CHECKPOINT 0x000001000 + FAULT_DISCARD 0x000002000 + FAULT_WRITE_IO 0x000004000 + =================== =========== +mode=%s Control block allocation mode which supports "adaptive" + and "lfs". In "lfs" mode, there should be no random + writes towards main area. +io_bits=%u Set the bit size of write IO requests. It should be set + with "mode=lfs". +usrquota Enable plain user disk quota accounting. +grpquota Enable plain group disk quota accounting. +prjquota Enable plain project quota accounting. +usrjquota= Appoint specified file and type during mount, so that quota +grpjquota= information can be properly updated during recovery flow, +prjjquota= : must be in root directory; +jqfmt= : [vfsold,vfsv0,vfsv1]. +offusrjquota Turn off user journelled quota. 
+offgrpjquota Turn off group journelled quota. +offprjjquota Turn off project journelled quota. +quota Enable plain user disk quota accounting. +noquota Disable all plain disk quota option. +whint_mode=%s Control which write hints are passed down to block + layer. This supports "off", "user-based", and + "fs-based". In "off" mode (default), f2fs does not pass + down hints. In "user-based" mode, f2fs tries to pass + down hints given by users. And in "fs-based" mode, f2fs + passes down hints with its policy. +alloc_mode=%s Adjust block allocation policy, which supports "reuse" + and "default". +fsync_mode=%s Control the policy of fsync. Currently supports "posix", + "strict", and "nobarrier". In "posix" mode, which is + default, fsync will follow POSIX semantics and does a + light operation to improve the filesystem performance. + In "strict" mode, fsync will be heavy and behaves in line + with xfs, ext4 and btrfs, where xfstest generic/342 will + pass, but the performance will regress. "nobarrier" is + based on "posix", but doesn't issue flush command for + non-atomic files likewise "nobarrier" mount option. +test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt + context. The fake fscrypt context is used by xfstests. +checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable" + to reenable checkpointing. Is enabled by default. While + disabled, any unmounting or unexpected shutdowns will cause + the filesystem contents to appear as they did when the + filesystem was mounted with that option. + While mounting with checkpoint=disabled, the filesystem must + run garbage collection to ensure that all available space can + be used. If this takes too much time, the mount may return + EAGAIN. You may optionally add a value to indicate how much + of the disk you would be willing to temporarily give up to + avoid additional garbage collection. This can be given as a + number of blocks, or as a percent. 
For instance, mounting + with checkpoint=disable:100% would always succeed, but it may + hide up to all remaining free space. The actual space that + would be unusable can be viewed at /sys/fs/f2fs//unusable + This space is reclaimed once checkpoint=enable. +compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo" + and "lz4" algorithm. +compress_log_size=%u Support configuring compress cluster size, the size will + be 4KB * (1 << %u), 16KB is minimum size, also it's + default size. +compress_extension=%s Support adding specified extension, so that f2fs can enable + compression on those corresponding files, e.g. if all files + with '.ext' has high compression rate, we can set the '.ext' + on compression extension list and enable compression on + these file by default rather than to enable it via ioctl. + For other files, we can still enable compression via ioctl. +====================== ============================================================ + +Debugfs Entries +=============== + +/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as +f2fs. Each file shows the whole f2fs information. + +/sys/kernel/debug/f2fs/status includes: + + - major file system information managed by f2fs currently + - average SIT information about whole segments + - current memory footprint consumed by f2fs. + +Sysfs Entries +============= + +Information about mounted f2fs file systems can be found in +/sys/fs/f2fs. Each mounted filesystem will have a directory in +/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda). +The files in each per-device directory are shown in table below. + +Files in /sys/fs/f2fs/ +(see also Documentation/ABI/testing/sysfs-fs-f2fs) + +Usage +===== + +1. Download userland tools and compile them. + +2. Skip, if f2fs was compiled statically inside kernel. + Otherwise, insert the f2fs.ko module:: + + # insmod f2fs.ko + +3. Create a directory trying to mount:: + + # mkdir /mnt/f2fs + +4. 
Format the block device, and then mount as f2fs:: + + # mkfs.f2fs -l label /dev/block_device + # mount -t f2fs /dev/block_device /mnt/f2fs + +mkfs.f2fs +--------- +The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem, +which builds a basic on-disk layout. + +The options consist of: + +=============== =========================================================== +``-l [label]`` Give a volume label, up to 512 unicode name. +``-a [0 or 1]`` Split start location of each area for heap-based allocation. + + 1 is set by default, which performs this. +``-o [int]`` Set overprovision ratio in percent over volume size. + + 5 is set by default. +``-s [int]`` Set the number of segments per section. + + 1 is set by default. +``-z [int]`` Set the number of sections per zone. + + 1 is set by default. +``-e [str]`` Set basic extension list. e.g. "mp3,gif,mov" +``-t [0 or 1]`` Disable discard command or not. + + 1 is set by default, which conducts discard. +=============== =========================================================== + +fsck.f2fs +--------- +The fsck.f2fs is a tool to check the consistency of an f2fs-formatted +partition, which examines whether the filesystem metadata and user-made data +are cross-referenced correctly or not. +Note that, initial version of the tool does not fix any inconsistency. + +The options consist of:: + + -d debug level [default:0] + +dump.f2fs +--------- +The dump.f2fs shows the information of specific inode and dumps SSA and SIT to +file. Each file is dump_ssa and dump_sit. + +The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem. +It shows on-disk inode information recognized by a given inode number, and is +able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and +./dump_sit respectively. 
+ +The options consist of:: + + -d debug level [default:0] + -i inode no (hex) + -s [SIT dump segno from #1~#2 (decimal), for all 0~-1] + -a [SSA dump segno from #1~#2 (decimal), for all 0~-1] + +Examples:: + + # dump.f2fs -i [ino] /dev/sdx + # dump.f2fs -s 0~-1 /dev/sdx (SIT dump) + # dump.f2fs -a 0~-1 /dev/sdx (SSA dump) + +Design +====== + +On-disk Layout +-------------- + +F2FS divides the whole volume into a number of segments, each of which is fixed +to 2MB in size. A section is composed of consecutive segments, and a zone +consists of a set of sections. By default, section and zone sizes are set to one +segment size identically, but users can easily modify the sizes by mkfs. + +F2FS splits the entire volume into six areas, and all the areas except superblock +consists of multiple segments as described below:: + + align with the zone size <-| + |-> align with the segment size + _________________________________________________________________________ + | | | Segment | Node | Segment | | + | Superblock | Checkpoint | Info. | Address | Summary | Main | + | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | | + |____________|_____2______|______N______|______N______|______N_____|__N___| + . . + . . + . . + ._________________________________________. + |_Segment_|_..._|_Segment_|_..._|_Segment_| + . . + ._________._________ + |_section_|__...__|_ + . . + .________. + |__zone__| + +- Superblock (SB) + It is located at the beginning of the partition, and there exist two copies + to avoid file system crash. It contains basic partition information and some + default parameters of f2fs. + +- Checkpoint (CP) + It contains file system information, bitmaps for valid NAT/SIT sets, orphan + inode lists, and summary entries of current active segments. + +- Segment Information Table (SIT) + It contains segment information such as valid block count and bitmap for the + validity of all the blocks. 
+ +- Node Address Table (NAT) + It is composed of a block address table for all the node blocks stored in + Main area. + +- Segment Summary Area (SSA) + It contains summary entries which contains the owner information of all the + data and node blocks stored in Main area. + +- Main Area + It contains file and directory data including their indices. + +In order to avoid misalignment between file system and flash-based storage, F2FS +aligns the start block address of CP with the segment size. Also, it aligns the +start block address of Main area with the zone size by reserving some segments +in SSA area. + +Reference the following survey for additional technical details. +https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey + +File System Metadata Structure +------------------------------ + +F2FS adopts the checkpointing scheme to maintain file system consistency. At +mount time, F2FS first tries to find the last valid checkpoint data by scanning +CP area. In order to reduce the scanning time, F2FS uses only two copies of CP. +One of them always indicates the last valid data, which is called as shadow copy +mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism. + +For file system consistency, each CP points to which NAT and SIT copies are +valid, as shown as below:: + + +--------+----------+---------+ + | CP | SIT | NAT | + +--------+----------+---------+ + . . . . + . . . . + . . . . + +-------+-------+--------+--------+--------+--------+ + | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 | + +-------+-------+--------+--------+--------+--------+ + | ^ ^ + | | | + `----------------------------------------' + +Index Structure +--------------- + +The key data structure to manage the data locations is a "node". Similar to +traditional file structures, F2FS has three types of node: inode, direct node, +indirect node. 
F2FS assigns 4KB to an inode block which contains 923 data block +indices, two direct node pointers, two indirect node pointers, and one double +indirect node pointer as described below. One direct node block contains 1018 +data blocks, and one indirect node block contains also 1018 node blocks. Thus, +one inode block (i.e., a file) covers:: + + 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB. + + Inode block (4KB) + |- data (923) + |- direct node (2) + | `- data (1018) + |- indirect node (2) + | `- direct node (1018) + | `- data (1018) + `- double indirect node (1) + `- indirect node (1018) + `- direct node (1018) + `- data (1018) + +Note that, all the node blocks are mapped by NAT which means the location of +each node is translated by the NAT table. In the consideration of the wandering +tree problem, F2FS is able to cut off the propagation of node updates caused by +leaf data writes. + +Directory Structure +------------------- + +A directory entry occupies 11 bytes, which consists of the following attributes. + +- hash hash value of the file name +- ino inode number +- len the length of file name +- type file type such as directory, symlink, etc + +A dentry block consists of 214 dentry slots and file names. Therein a bitmap is +used to represent whether each dentry is valid or not. A dentry block occupies +4KB with the following composition. + +:: + + Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) + + dentries(11 * 214 bytes) + file name (8 * 214 bytes) + + [Bucket] + +--------------------------------+ + |dentry block 1 | dentry block 2 | + +--------------------------------+ + . . + . . + . [Dentry Block Structure: 4KB] . + +--------+----------+----------+------------+ + | bitmap | reserved | dentries | file names | + +--------+----------+----------+------------+ + [Dentry Block: 4KB] . . + . . + . . 
+ +------+------+-----+------+ + | hash | ino | len | type | + +------+------+-----+------+ + [Dentry Structure: 11 bytes] + +F2FS implements multi-level hash tables for directory structure. Each level has +a hash table with a dedicated number of hash buckets as shown below. Note that +"A(2B)" means a bucket includes 2 data blocks. + +:: + + ---------------------- + A : bucket + B : block + N : MAX_DIR_HASH_DEPTH + ---------------------- + + level #0 | A(2B) + | + level #1 | A(2B) - A(2B) + | + level #2 | A(2B) - A(2B) - A(2B) - A(2B) + . | . . . . + level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) + . | . . . . + level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B) + +The number of blocks and buckets are determined by:: + + ,- 2, if n < MAX_DIR_HASH_DEPTH / 2, + # of blocks in level #n = | + `- 4, Otherwise + + ,- 2^(n + dir_level), + | if n + dir_level < MAX_DIR_HASH_DEPTH / 2, + # of buckets in level #n = | + `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), + Otherwise + +When F2FS looks up a file name in a directory, it first calculates a hash value +of the file name. Then, F2FS scans the hash table in level #0 to find the +dentry consisting of the file name and its inode number. If not found, F2FS +scans the next hash table in level #1. In this way, F2FS scans the hash table in +each level incrementally from 1 to N. In each level, F2FS needs to scan only +one bucket determined by the following equation, which shows O(log(# of files)) +complexity:: + + bucket number to scan in level #n = (hash value) % (# of buckets in level #n) + +In the case of file creation, F2FS finds empty consecutive slots that cover the +file name. F2FS searches for the empty slots in the hash tables of all levels +from 1 to N in the same way as the lookup operation. 
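The bucket and block formulas above can be read as a small piece of arithmetic. The following Python sketch is only an illustration of those formulas, not the kernel implementation: the function names are hypothetical, and the value 63 for MAX_DIR_HASH_DEPTH is an assumption that should be checked against the kernel headers in use.

```python
# Illustrative sketch of the directory hash-table sizing rules quoted above.
# MAX_DIR_HASH_DEPTH = 63 is an assumed value; the function names are
# hypothetical and do not correspond to kernel symbols.
MAX_DIR_HASH_DEPTH = 63


def blocks_per_bucket(level: int) -> int:
    """Dentry blocks in each bucket of `level`: 2 in the lower half, else 4."""
    return 2 if level < MAX_DIR_HASH_DEPTH // 2 else 4


def buckets_in_level(level: int, dir_level: int = 0) -> int:
    """Number of hash buckets in `level`, capped once the midpoint is reached."""
    if level + dir_level < MAX_DIR_HASH_DEPTH // 2:
        return 1 << (level + dir_level)
    return 1 << (MAX_DIR_HASH_DEPTH // 2 - 1)


def bucket_to_scan(name_hash: int, level: int, dir_level: int = 0) -> int:
    """Index of the single bucket that a lookup must scan in `level`."""
    return name_hash % buckets_in_level(level, dir_level)
```

Because only one bucket is scanned per level and the number of levels is bounded, a lookup touches a small, predictable number of dentry blocks even in large directories.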
+ +The following figure shows an example of two cases holding children:: + + --------------> Dir <-------------- + | | + child child + + child - child [hole] - child + + child - child - child [hole] - [hole] - child + + Case 1: Case 2: + Number of children = 6, Number of children = 3, + File size = 7 File size = 7 + +Default Block Allocation +------------------------ + +At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node +and Hot/Warm/Cold data. + +- Hot node contains direct node blocks of directories. +- Warm node contains direct node blocks except hot node blocks. +- Cold node contains indirect node blocks +- Hot data contains dentry blocks +- Warm data contains data blocks except hot and cold data blocks +- Cold data contains multimedia data or migrated data blocks + +LFS has two schemes for free space management: threaded log and copy-and-compac- +tion. The copy-and-compaction scheme which is known as cleaning, is well-suited +for devices showing very good sequential write performance, since free segments +are served all the time for writing new data. However, it suffers from cleaning +overhead under high utilization. Contrarily, the threaded log scheme suffers +from random writes, but no cleaning process is needed. F2FS adopts a hybrid +scheme where the copy-and-compaction scheme is adopted by default, but the +policy is dynamically changed to the threaded log scheme according to the file +system status. + +In order to align F2FS with underlying flash-based storage, F2FS allocates a +segment in a unit of section. F2FS expects that the section size would be the +same as the unit size of garbage collection in FTL. Furthermore, with respect +to the mapping granularity in FTL, F2FS allocates each section of the active +logs from different zones as much as possible, since FTL can write the data in +the active logs into one allocation unit according to its mapping granularity. 
+ +Cleaning process +---------------- + +F2FS does cleaning both on demand and in the background. On-demand cleaning is +triggered when there are not enough free segments to serve VFS calls. Background +cleaner is operated by a kernel thread, and triggers the cleaning job when the +system is idle. + +F2FS supports two victim selection policies: greedy and cost-benefit algorithms. +In the greedy algorithm, F2FS selects a victim segment having the smallest number +of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment +according to the segment age and the number of valid blocks in order to address +log block thrashing problem in the greedy algorithm. F2FS adopts the greedy +algorithm for on-demand cleaner, while background cleaner adopts cost-benefit +algorithm. + +In order to identify whether the data in the victim segment are valid or not, +F2FS manages a bitmap. Each bit represents the validity of a block, and the +bitmap is composed of a bit stream covering whole blocks in main area. + +Write-hint Policy +----------------- + +1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET. + +2) whint_mode=user-based. F2FS tries to pass down hints given by +users. 
+ +===================== ======================== =================== +User F2FS Block +===================== ======================== =================== + META WRITE_LIFE_NOT_SET + HOT_NODE " + WARM_NODE " + COLD_NODE " +ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME +extension list " " + +-- buffered io +WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME +WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT +WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET +WRITE_LIFE_NONE " " +WRITE_LIFE_MEDIUM " " +WRITE_LIFE_LONG " " + +-- direct io +WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME +WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT +WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET +WRITE_LIFE_NONE " WRITE_LIFE_NONE +WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM +WRITE_LIFE_LONG " WRITE_LIFE_LONG +===================== ======================== =================== + +3) whint_mode=fs-based. F2FS passes down hints with its policy. + +===================== ======================== =================== +User F2FS Block +===================== ======================== =================== + META WRITE_LIFE_MEDIUM; + HOT_NODE WRITE_LIFE_NOT_SET + WARM_NODE " + COLD_NODE WRITE_LIFE_NONE +ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME +extension list " " + +-- buffered io +WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME +WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT +WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG +WRITE_LIFE_NONE " " +WRITE_LIFE_MEDIUM " " +WRITE_LIFE_LONG " " + +-- direct io +WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME +WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT +WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET +WRITE_LIFE_NONE " WRITE_LIFE_NONE +WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM +WRITE_LIFE_LONG " WRITE_LIFE_LONG +===================== ======================== =================== + +Fallocate(2) Policy +------------------- + +The default policy follows the below posix rule. 
+
+Allocating disk space
+    The default operation (i.e., mode is zero) of fallocate() allocates
+    the disk space within the range specified by offset and len. The
+    file size (as reported by stat(2)) will be changed if offset+len is
+    greater than the file size. Any subregion within the range specified
+    by offset and len that did not contain data before the call will be
+    initialized to zero. This default behavior closely resembles the
+    behavior of the posix_fallocate(3) library function, and is intended
+    as a method of optimally implementing that function.
+
+However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) prior to
+fallocate(fd, DEFAULT_MODE), it allocates on-disk block addresses holding
+zero or random data, which is useful in the following scenario:
+
+ 1. create(fd)
+ 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE)
+ 3. fallocate(fd, 0, 0, size)
+ 4. address = fibmap(fd, offset)
+ 5. open(blkdev)
+ 6. write(blkdev, address)
+
+Compression implementation
+--------------------------
+
+- A new term, cluster, is defined as the basic unit of compression. A file can
+  be divided into multiple clusters logically. One cluster includes 4 << n
+  (n >= 0) logical pages, the compression size is also the cluster size, and
+  each cluster can be compressed or not.
+
+- In the cluster metadata layout, one special block address indicates whether
+  a cluster is compressed or normal. For a compressed cluster, the following
+  metadata maps the cluster to [1, 4 << n - 1] physical blocks, in which f2fs
+  stores data including the compress header and compressed data.
+
+- In order to eliminate write amplification during overwrite, F2FS only
+  supports compression on write-once files. Data can be compressed only when
+  all logical blocks in the file are valid and the cluster compress ratio is
+  lower than the specified threshold.
+ +- To enable compression on regular inode, there are three ways: + + * chattr +c file + * chattr +c dir; touch dir/file + * mount w/ -o compress_extension=ext; touch file.ext + +Compress metadata layout:: + + [Dnode Structure] + +-----------------------------------------------+ + | cluster 1 | cluster 2 | ......... | cluster N | + +-----------------------------------------------+ + . . . . + . . . . + . Compressed Cluster . . Normal Cluster . + +----------+---------+---------+---------+ +---------+---------+---------+---------+ + |compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 | + +----------+---------+---------+---------+ +---------+---------+---------+---------+ + . . + . . + . . + +-------------+-------------+----------+----------------------------+ + | data length | data chksum | reserved | compressed data | + +-------------+-------------+----------+----------------------------+ diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt deleted file mode 100644 index 4eb3e2ddd00e..000000000000 --- a/Documentation/filesystems/f2fs.txt +++ /dev/null @@ -1,730 +0,0 @@ -================================================================================ -WHAT IS Flash-Friendly File System (F2FS)? -================================================================================ - -NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have -been equipped on a variety systems ranging from mobile to server systems. Since -they are known to have different characteristics from the conventional rotating -disks, a file system, an upper layer to the storage device, should adapt to the -changes from the sketch in the design level. - -F2FS is a file system exploiting NAND flash memory-based storage devices, which -is based on Log-structured File System (LFS). 
The design has been focused on -addressing the fundamental issues in LFS, which are snowball effect of wandering -tree and high cleaning overhead. - -Since a NAND flash memory-based storage device shows different characteristic -according to its internal geometry or flash memory management scheme, namely FTL, -F2FS and its tools support various parameters not only for configuring on-disk -layout, but also for selecting allocation and cleaning algorithms. - -The following git tree provides the file system formatting tool (mkfs.f2fs), -a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs). ->> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git - -For reporting bugs and sending patches, please use the following mailing list: ->> linux-f2fs-devel@lists.sourceforge.net - -================================================================================ -BACKGROUND AND DESIGN ISSUES -================================================================================ - -Log-structured File System (LFS) --------------------------------- -"A log-structured file system writes all modifications to disk sequentially in -a log-like structure, thereby speeding up both file writing and crash recovery. -The log is the only structure on disk; it contains indexing information so that -files can be read back from the log efficiently. In order to maintain large free -areas on disk for fast writing, we divide the log into segments and use a -segment cleaner to compress the live information from heavily fragmented -segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and -implementation of a log-structured file system", ACM Trans. Computer Systems -10, 1, 26–52. - -Wandering Tree Problem ----------------------- -In LFS, when a file data is updated and written to the end of log, its direct -pointer block is updated due to the changed location. Then the indirect pointer -block is also updated due to the direct pointer block update. 
In this manner, -the upper index structures such as inode, inode map, and checkpoint block are -also updated recursively. This problem is called as wandering tree problem [1], -and in order to enhance the performance, it should eliminate or relax the update -propagation as much as possible. - -[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/ - -Cleaning Overhead ------------------ -Since LFS is based on out-of-place writes, it produces so many obsolete blocks -scattered across the whole storage. In order to serve new empty log space, it -needs to reclaim these obsolete blocks seamlessly to users. This job is called -as a cleaning process. - -The process consists of three operations as follows. -1. A victim segment is selected through referencing segment usage table. -2. It loads parent index structures of all the data in the victim identified by - segment summary blocks. -3. It checks the cross-reference between the data and its parent index structure. -4. It moves valid data selectively. - -This cleaning job may cause unexpected long delays, so the most important goal -is to hide the latencies to users. And also definitely, it should reduce the -amount of valid data to be moved, and move them quickly as well. - -================================================================================ -KEY FEATURES -================================================================================ - -Flash Awareness ---------------- -- Enlarge the random write area for better performance, but provide the high - spatial locality -- Align FS data structures to the operational units in FTL as best efforts - -Wandering Tree Problem ----------------------- -- Use a term, “node”, that represents inodes as well as various pointer blocks -- Introduce Node Address Table (NAT) containing the locations of all the “node” - blocks; this will cut off the update propagation. 
- -Cleaning Overhead ------------------ -- Support a background cleaning process -- Support greedy and cost-benefit algorithms for victim selection policies -- Support multi-head logs for static/dynamic hot and cold data separation -- Introduce adaptive logging for efficient block allocation - -================================================================================ -MOUNT OPTIONS -================================================================================ - -background_gc=%s Turn on/off cleaning operations, namely garbage - collection, triggered in background when I/O subsystem is - idle. If background_gc=on, it will turn on the garbage - collection and if background_gc=off, garbage collection - will be turned off. If background_gc=sync, it will turn - on synchronous garbage collection running in background. - Default value for this option is on. So garbage - collection is on by default. -disable_roll_forward Disable the roll-forward recovery routine -norecovery Disable the roll-forward recovery routine, mounted read- - only (i.e., -o ro,disable_roll_forward) -discard/nodiscard Enable/disable real-time discard in f2fs, if discard is - enabled, f2fs will issue discard/TRIM commands when a - segment is cleaned. -no_heap Disable heap-style segment allocation which finds free - segments for data from the beginning of main area, while - for node from the end of main area. -nouser_xattr Disable Extended User Attributes. Note: xattr is enabled - by default if CONFIG_F2FS_FS_XATTR is selected. -noacl Disable POSIX Access Control List. Note: acl is enabled - by default if CONFIG_F2FS_FS_POSIX_ACL is selected. -active_logs=%u Support configuring the number of active logs. In the - current design, f2fs supports only 2, 4, and 6 logs. - Default number is 6. -disable_ext_identify Disable the extension list configured by mkfs, so f2fs - does not aware of cold files such as media files. -inline_xattr Enable the inline xattrs feature. 
-noinline_xattr Disable the inline xattrs feature. -inline_xattr_size=%u Support configuring inline xattr size, it depends on - flexible inline xattr feature. -inline_data Enable the inline data feature: New created small(<~3.4k) - files can be written into inode block. -inline_dentry Enable the inline dir feature: data in new created - directory entries can be written into inode block. The - space of inode block which is used to store inline - dentries is limited to ~3.4k. -noinline_dentry Disable the inline dentry feature. -flush_merge Merge concurrent cache_flush commands as much as possible - to eliminate redundant command issues. If the underlying - device handles the cache_flush command relatively slowly, - recommend to enable this option. -nobarrier This option can be used if underlying storage guarantees - its cached data should be written to the novolatile area. - If this option is set, no cache_flush commands are issued - but f2fs still guarantees the write ordering of all the - data writes. -fastboot This option is used when a system wants to reduce mount - time as much as possible, even though normal performance - can be sacrificed. -extent_cache Enable an extent cache based on rb-tree, it can cache - as many as extent which map between contiguous logical - address and physical address per inode, resulting in - increasing the cache hit ratio. Set by default. -noextent_cache Disable an extent cache based on rb-tree explicitly, see - the above extent_cache mount option. -noinline_data Disable the inline data feature, inline data feature is - enabled by default. -data_flush Enable data flushing before checkpoint in order to - persist data of regular and symlink. -reserve_root=%d Support configuring reserved space which is used for - allocation from a privileged user with specified uid or - gid, unit: 4KB, the default limit is 0.2% of user blocks. -resuid=%d The user ID which may use the reserved blocks. 
-resgid=%d The group ID which may use the reserved blocks. -fault_injection=%d Enable fault injection in all supported types with - specified injection rate. -fault_type=%d Support configuring fault injection type, should be - enabled with fault_injection option, fault type value - is shown below, it supports single or combined type. - Type_Name Type_Value - FAULT_KMALLOC 0x000000001 - FAULT_KVMALLOC 0x000000002 - FAULT_PAGE_ALLOC 0x000000004 - FAULT_PAGE_GET 0x000000008 - FAULT_ALLOC_BIO 0x000000010 - FAULT_ALLOC_NID 0x000000020 - FAULT_ORPHAN 0x000000040 - FAULT_BLOCK 0x000000080 - FAULT_DIR_DEPTH 0x000000100 - FAULT_EVICT_INODE 0x000000200 - FAULT_TRUNCATE 0x000000400 - FAULT_READ_IO 0x000000800 - FAULT_CHECKPOINT 0x000001000 - FAULT_DISCARD 0x000002000 - FAULT_WRITE_IO 0x000004000 -mode=%s Control block allocation mode which supports "adaptive" - and "lfs". In "lfs" mode, there should be no random - writes towards main area. -io_bits=%u Set the bit size of write IO requests. It should be set - with "mode=lfs". -usrquota Enable plain user disk quota accounting. -grpquota Enable plain group disk quota accounting. -prjquota Enable plain project quota accounting. -usrjquota= Appoint specified file and type during mount, so that quota -grpjquota= information can be properly updated during recovery flow, -prjjquota= : must be in root directory; -jqfmt= : [vfsold,vfsv0,vfsv1]. -offusrjquota Turn off user journelled quota. -offgrpjquota Turn off group journelled quota. -offprjjquota Turn off project journelled quota. -quota Enable plain user disk quota accounting. -noquota Disable all plain disk quota option. -whint_mode=%s Control which write hints are passed down to block - layer. This supports "off", "user-based", and - "fs-based". In "off" mode (default), f2fs does not pass - down hints. In "user-based" mode, f2fs tries to pass - down hints given by users. And in "fs-based" mode, f2fs - passes down hints with its policy. 
-alloc_mode=%s Adjust block allocation policy, which supports "reuse" - and "default". -fsync_mode=%s Control the policy of fsync. Currently supports "posix", - "strict", and "nobarrier". In "posix" mode, which is - default, fsync will follow POSIX semantics and does a - light operation to improve the filesystem performance. - In "strict" mode, fsync will be heavy and behaves in line - with xfs, ext4 and btrfs, where xfstest generic/342 will - pass, but the performance will regress. "nobarrier" is - based on "posix", but doesn't issue flush command for - non-atomic files likewise "nobarrier" mount option. -test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt - context. The fake fscrypt context is used by xfstests. -checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable" - to reenable checkpointing. Is enabled by default. While - disabled, any unmounting or unexpected shutdowns will cause - the filesystem contents to appear as they did when the - filesystem was mounted with that option. - While mounting with checkpoint=disabled, the filesystem must - run garbage collection to ensure that all available space can - be used. If this takes too much time, the mount may return - EAGAIN. You may optionally add a value to indicate how much - of the disk you would be willing to temporarily give up to - avoid additional garbage collection. This can be given as a - number of blocks, or as a percent. For instance, mounting - with checkpoint=disable:100% would always succeed, but it may - hide up to all remaining free space. The actual space that - would be unusable can be viewed at /sys/fs/f2fs//unusable - This space is reclaimed once checkpoint=enable. -compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo" - and "lz4" algorithm. -compress_log_size=%u Support configuring compress cluster size, the size will - be 4KB * (1 << %u), 16KB is minimum size, also it's - default size. 
-compress_extension=%s Support adding specified extension, so that f2fs can enable - compression on those corresponding files, e.g. if all files - with '.ext' has high compression rate, we can set the '.ext' - on compression extension list and enable compression on - these file by default rather than to enable it via ioctl. - For other files, we can still enable compression via ioctl. - -================================================================================ -DEBUGFS ENTRIES -================================================================================ - -/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as -f2fs. Each file shows the whole f2fs information. - -/sys/kernel/debug/f2fs/status includes: - - major file system information managed by f2fs currently - - average SIT information about whole segments - - current memory footprint consumed by f2fs. - -================================================================================ -SYSFS ENTRIES -================================================================================ - -Information about mounted f2fs file systems can be found in -/sys/fs/f2fs. Each mounted filesystem will have a directory in -/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda). -The files in each per-device directory are shown in table below. - -Files in /sys/fs/f2fs/ -(see also Documentation/ABI/testing/sysfs-fs-f2fs) - -================================================================================ -USAGE -================================================================================ - -1. Download userland tools and compile them. - -2. Skip, if f2fs was compiled statically inside kernel. - Otherwise, insert the f2fs.ko module. - # insmod f2fs.ko - -3. Create a directory trying to mount - # mkdir /mnt/f2fs - -4. 
Format the block device, and then mount as f2fs - # mkfs.f2fs -l label /dev/block_device - # mount -t f2fs /dev/block_device /mnt/f2fs - -mkfs.f2fs ---------- -The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem, -which builds a basic on-disk layout. - -The options consist of: --l [label] : Give a volume label, up to 512 unicode name. --a [0 or 1] : Split start location of each area for heap-based allocation. - 1 is set by default, which performs this. --o [int] : Set overprovision ratio in percent over volume size. - 5 is set by default. --s [int] : Set the number of segments per section. - 1 is set by default. --z [int] : Set the number of sections per zone. - 1 is set by default. --e [str] : Set basic extension list. e.g. "mp3,gif,mov" --t [0 or 1] : Disable discard command or not. - 1 is set by default, which conducts discard. - -fsck.f2fs ---------- -The fsck.f2fs is a tool to check the consistency of an f2fs-formatted -partition, which examines whether the filesystem metadata and user-made data -are cross-referenced correctly or not. -Note that, initial version of the tool does not fix any inconsistency. - -The options consist of: - -d debug level [default:0] - -dump.f2fs ---------- -The dump.f2fs shows the information of specific inode and dumps SSA and SIT to -file. Each file is dump_ssa and dump_sit. - -The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem. -It shows on-disk inode information recognized by a given inode number, and is -able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and -./dump_sit respectively. 
- -The options consist of: - -d debug level [default:0] - -i inode no (hex) - -s [SIT dump segno from #1~#2 (decimal), for all 0~-1] - -a [SSA dump segno from #1~#2 (decimal), for all 0~-1] - -Examples: -# dump.f2fs -i [ino] /dev/sdx -# dump.f2fs -s 0~-1 /dev/sdx (SIT dump) -# dump.f2fs -a 0~-1 /dev/sdx (SSA dump) - -================================================================================ -DESIGN -================================================================================ - -On-disk Layout --------------- - -F2FS divides the whole volume into a number of segments, each of which is fixed -to 2MB in size. A section is composed of consecutive segments, and a zone -consists of a set of sections. By default, section and zone sizes are set to one -segment size identically, but users can easily modify the sizes by mkfs. - -F2FS splits the entire volume into six areas, and all the areas except superblock -consists of multiple segments as described below. - - align with the zone size <-| - |-> align with the segment size - _________________________________________________________________________ - | | | Segment | Node | Segment | | - | Superblock | Checkpoint | Info. | Address | Summary | Main | - | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | | - |____________|_____2______|______N______|______N______|______N_____|__N___| - . . - . . - . . - ._________________________________________. - |_Segment_|_..._|_Segment_|_..._|_Segment_| - . . - ._________._________ - |_section_|__...__|_ - . . - .________. - |__zone__| - -- Superblock (SB) - : It is located at the beginning of the partition, and there exist two copies - to avoid file system crash. It contains basic partition information and some - default parameters of f2fs. - -- Checkpoint (CP) - : It contains file system information, bitmaps for valid NAT/SIT sets, orphan - inode lists, and summary entries of current active segments. 
- -- Segment Information Table (SIT) - : It contains segment information such as valid block count and bitmap for the - validity of all the blocks. - -- Node Address Table (NAT) - : It is composed of a block address table for all the node blocks stored in - Main area. - -- Segment Summary Area (SSA) - : It contains summary entries which contains the owner information of all the - data and node blocks stored in Main area. - -- Main Area - : It contains file and directory data including their indices. - -In order to avoid misalignment between file system and flash-based storage, F2FS -aligns the start block address of CP with the segment size. Also, it aligns the -start block address of Main area with the zone size by reserving some segments -in SSA area. - -Reference the following survey for additional technical details. -https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey - -File System Metadata Structure ------------------------------- - -F2FS adopts the checkpointing scheme to maintain file system consistency. At -mount time, F2FS first tries to find the last valid checkpoint data by scanning -CP area. In order to reduce the scanning time, F2FS uses only two copies of CP. -One of them always indicates the last valid data, which is called as shadow copy -mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism. - -For file system consistency, each CP points to which NAT and SIT copies are -valid, as shown as below. - - +--------+----------+---------+ - | CP | SIT | NAT | - +--------+----------+---------+ - . . . . - . . . . - . . . . - +-------+-------+--------+--------+--------+--------+ - | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 | - +-------+-------+--------+--------+--------+--------+ - | ^ ^ - | | | - `----------------------------------------' - -Index Structure ---------------- - -The key data structure to manage the data locations is a "node". 
Similar to -traditional file structures, F2FS has three types of node: inode, direct node, -indirect node. F2FS assigns 4KB to an inode block which contains 923 data block -indices, two direct node pointers, two indirect node pointers, and one double -indirect node pointer as described below. One direct node block contains 1018 -data blocks, and one indirect node block contains also 1018 node blocks. Thus, -one inode block (i.e., a file) covers: - - 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB. - - Inode block (4KB) - |- data (923) - |- direct node (2) - | `- data (1018) - |- indirect node (2) - | `- direct node (1018) - | `- data (1018) - `- double indirect node (1) - `- indirect node (1018) - `- direct node (1018) - `- data (1018) - -Note that, all the node blocks are mapped by NAT which means the location of -each node is translated by the NAT table. In the consideration of the wandering -tree problem, F2FS is able to cut off the propagation of node updates caused by -leaf data writes. - -Directory Structure -------------------- - -A directory entry occupies 11 bytes, which consists of the following attributes. - -- hash hash value of the file name -- ino inode number -- len the length of file name -- type file type such as directory, symlink, etc - -A dentry block consists of 214 dentry slots and file names. Therein a bitmap is -used to represent whether each dentry is valid or not. A dentry block occupies -4KB with the following composition. - - Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) + - dentries(11 * 214 bytes) + file name (8 * 214 bytes) - - [Bucket] - +--------------------------------+ - |dentry block 1 | dentry block 2 | - +--------------------------------+ - . . - . . - . [Dentry Block Structure: 4KB] . - +--------+----------+----------+------------+ - | bitmap | reserved | dentries | file names | - +--------+----------+----------+------------+ - [Dentry Block: 4KB] . . - . . - . . 
- +------+------+-----+------+ - | hash | ino | len | type | - +------+------+-----+------+ - [Dentry Structure: 11 bytes] - -F2FS implements multi-level hash tables for directory structure. Each level has -a hash table with dedicated number of hash buckets as shown below. Note that -"A(2B)" means a bucket includes 2 data blocks. - ----------------------- -A : bucket -B : block -N : MAX_DIR_HASH_DEPTH ----------------------- - -level #0 | A(2B) - | -level #1 | A(2B) - A(2B) - | -level #2 | A(2B) - A(2B) - A(2B) - A(2B) - . | . . . . -level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) - . | . . . . -level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B) - -The number of blocks and buckets are determined by, - - ,- 2, if n < MAX_DIR_HASH_DEPTH / 2, - # of blocks in level #n = | - `- 4, Otherwise - - ,- 2^(n + dir_level), - | if n + dir_level < MAX_DIR_HASH_DEPTH / 2, - # of buckets in level #n = | - `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), - Otherwise - -When F2FS finds a file name in a directory, at first a hash value of the file -name is calculated. Then, F2FS scans the hash table in level #0 to find the -dentry consisting of the file name and its inode number. If not found, F2FS -scans the next hash table in level #1. In this way, F2FS scans hash tables in -each levels incrementally from 1 to N. In each levels F2FS needs to scan only -one bucket determined by the following equation, which shows O(log(# of files)) -complexity. - - bucket number to scan in level #n = (hash value) % (# of buckets in level #n) - -In the case of file creation, F2FS finds empty consecutive slots that cover the -file name. F2FS searches the empty slots in the hash tables of whole levels from -1 to N in the same way as the lookup operation. - -The following figure shows an example of two cases holding children. 
- --------------> Dir <-------------- - | | - child child - - child - child [hole] - child - - child - child - child [hole] - [hole] - child - - Case 1: Case 2: - Number of children = 6, Number of children = 3, - File size = 7 File size = 7 - -Default Block Allocation ------------------------- - -At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node -and Hot/Warm/Cold data. - -- Hot node contains direct node blocks of directories. -- Warm node contains direct node blocks except hot node blocks. -- Cold node contains indirect node blocks -- Hot data contains dentry blocks -- Warm data contains data blocks except hot and cold data blocks -- Cold data contains multimedia data or migrated data blocks - -LFS has two schemes for free space management: threaded log and copy-and-compac- -tion. The copy-and-compaction scheme which is known as cleaning, is well-suited -for devices showing very good sequential write performance, since free segments -are served all the time for writing new data. However, it suffers from cleaning -overhead under high utilization. Contrarily, the threaded log scheme suffers -from random writes, but no cleaning process is needed. F2FS adopts a hybrid -scheme where the copy-and-compaction scheme is adopted by default, but the -policy is dynamically changed to the threaded log scheme according to the file -system status. - -In order to align F2FS with underlying flash-based storage, F2FS allocates a -segment in a unit of section. F2FS expects that the section size would be the -same as the unit size of garbage collection in FTL. Furthermore, with respect -to the mapping granularity in FTL, F2FS allocates each section of the active -logs from different zones as much as possible, since FTL can write the data in -the active logs into one allocation unit according to its mapping granularity. - -Cleaning process ----------------- - -F2FS does cleaning both on demand and in the background. 
On-demand cleaning is -triggered when there are not enough free segments to serve VFS calls. Background -cleaner is operated by a kernel thread, and triggers the cleaning job when the -system is idle. - -F2FS supports two victim selection policies: greedy and cost-benefit algorithms. -In the greedy algorithm, F2FS selects a victim segment having the smallest number -of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment -according to the segment age and the number of valid blocks in order to address -log block thrashing problem in the greedy algorithm. F2FS adopts the greedy -algorithm for on-demand cleaner, while background cleaner adopts cost-benefit -algorithm. - -In order to identify whether the data in the victim segment are valid or not, -F2FS manages a bitmap. Each bit represents the validity of a block, and the -bitmap is composed of a bit stream covering whole blocks in main area. - -Write-hint Policy ------------------ - -1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET. - -2) whint_mode=user-based. F2FS tries to pass down hints given by -users. - -User F2FS Block ----- ---- ----- - META WRITE_LIFE_NOT_SET - HOT_NODE " - WARM_NODE " - COLD_NODE " -*ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME -*extension list " " - --- buffered io -WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME -WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT -WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET -WRITE_LIFE_NONE " " -WRITE_LIFE_MEDIUM " " -WRITE_LIFE_LONG " " - --- direct io -WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME -WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT -WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET -WRITE_LIFE_NONE " WRITE_LIFE_NONE -WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM -WRITE_LIFE_LONG " WRITE_LIFE_LONG - -3) whint_mode=fs-based. F2FS passes down hints with its policy. 
- -User F2FS Block ----- ---- ----- - META WRITE_LIFE_MEDIUM; - HOT_NODE WRITE_LIFE_NOT_SET - WARM_NODE " - COLD_NODE WRITE_LIFE_NONE -ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME -extension list " " - --- buffered io -WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME -WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT -WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG -WRITE_LIFE_NONE " " -WRITE_LIFE_MEDIUM " " -WRITE_LIFE_LONG " " - --- direct io -WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME -WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT -WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET -WRITE_LIFE_NONE " WRITE_LIFE_NONE -WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM -WRITE_LIFE_LONG " WRITE_LIFE_LONG - -Fallocate(2) Policy -------------------- - -The default policy follows the below posix rule. - -Allocating disk space - The default operation (i.e., mode is zero) of fallocate() allocates - the disk space within the range specified by offset and len. The - file size (as reported by stat(2)) will be changed if offset+len is - greater than the file size. Any subregion within the range specified - by offset and len that did not contain data before the call will be - initialized to zero. This default behavior closely resembles the - behavior of the posix_fallocate(3) library function, and is intended - as a method of optimally implementing that function. - -However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to -fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having -zero or random data, which is useful to the below scenario where: - 1. create(fd) - 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE) - 3. fallocate(fd, 0, 0, size) - 4. address = fibmap(fd, offset) - 5. open(blkdev) - 6. write(blkdev, address) - -Compression implementation --------------------------- - -- New term named cluster is defined as basic unit of compression, file can -be divided into multiple clusters logically. 
One cluster includes 4 << n -(n >= 0) logical pages, compression size is also cluster size, each of -cluster can be compressed or not. - -- In cluster metadata layout, one special block address is used to indicate -cluster is compressed one or normal one, for compressed cluster, following -metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs -stores data including compress header and compressed data. - -- In order to eliminate write amplification during overwrite, F2FS only -support compression on write-once file, data can be compressed only when -all logical blocks in file are valid and cluster compress ratio is lower -than specified threshold. - -- To enable compression on regular inode, there are three ways: -* chattr +c file -* chattr +c dir; touch dir/file -* mount w/ -o compress_extension=ext; touch file.ext - -Compress metadata layout: - [Dnode Structure] - +-----------------------------------------------+ - | cluster 1 | cluster 2 | ......... | cluster N | - +-----------------------------------------------+ - . . . . - . . . . - . Compressed Cluster . . Normal Cluster . -+----------+---------+---------+---------+ +---------+---------+---------+---------+ -|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 | -+----------+---------+---------+---------+ +---------+---------+---------+---------+ - . . - . . - . . - +-------------+-------------+----------+----------------------------+ - | data length | data chksum | reserved | compressed data | - +-------------+-------------+----------+----------------------------+ diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index aa2c3d1de3de..f69d20406be0 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -64,6 +64,7 @@ Documentation for filesystem implementations. 
erofs ext2 ext3 + f2fs fuse overlayfs virtiofs -- cgit From 720c2fc1ec7cb36bfc5326603522bc3955534773 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:05 +0100 Subject: docs: filesystems: convert gfs2.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Bob Peterson Link: https://lore.kernel.org/r/6d7a296de025bcfed7a229da7f8cc1678944f304.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/gfs2.rst | 53 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/gfs2.txt | 45 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 54 insertions(+), 45 deletions(-) create mode 100644 Documentation/filesystems/gfs2.rst delete mode 100644 Documentation/filesystems/gfs2.txt diff --git a/Documentation/filesystems/gfs2.rst b/Documentation/filesystems/gfs2.rst new file mode 100644 index 000000000000..8d1ab589ce18 --- /dev/null +++ b/Documentation/filesystems/gfs2.rst @@ -0,0 +1,53 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +Global File System +================== + +https://fedorahosted.org/cluster/wiki/HomePage + +GFS is a cluster file system. It allows a cluster of computers to +simultaneously use a block device that is shared between them (with FC, +iSCSI, NBD, etc). GFS reads and writes to the block device like a local +file system, but also uses a lock module to allow the computers coordinate +their I/O so file system consistency is maintained. One of the nifty +features of GFS is perfect consistency -- changes made to the file system +on one machine show up immediately on all other machines in the cluster. 
+ +GFS uses interchangeable inter-node locking mechanisms, the currently +supported mechanisms are: + + lock_nolock + - allows gfs to be used as a local file system + + lock_dlm + - uses a distributed lock manager (dlm) for inter-node locking. + The dlm is found at linux/fs/dlm/ + +Lock_dlm depends on user space cluster management systems found +at the URL above. + +To use gfs as a local file system, no external clustering systems are +needed, simply:: + + $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device + $ mount -t gfs2 /dev/block_device /dir + +If you are using Fedora, you need to install the gfs2-utils package +and, for lock_dlm, you will also need to install the cman package +and write a cluster.conf as per the documentation. For F17 and above +cman has been replaced by the dlm package. + +GFS2 is not on-disk compatible with previous versions of GFS, but it +is pretty close. + +The following man pages can be found at the URL above: + + ============ ============================================= + fsck.gfs2 to repair a filesystem + gfs2_grow to expand a filesystem online + gfs2_jadd to add journals to a filesystem online + tunegfs2 to manipulate, examine and tune a filesystem + gfs2_convert to convert a gfs filesystem to gfs2 in-place + mkfs.gfs2 to make a filesystem + ============ ============================================= diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt deleted file mode 100644 index cc4f2306609e..000000000000 --- a/Documentation/filesystems/gfs2.txt +++ /dev/null @@ -1,45 +0,0 @@ -Global File System ------------------- - -https://fedorahosted.org/cluster/wiki/HomePage - -GFS is a cluster file system. It allows a cluster of computers to -simultaneously use a block device that is shared between them (with FC, -iSCSI, NBD, etc). 
GFS reads and writes to the block device like a local -file system, but also uses a lock module to allow the computers coordinate -their I/O so file system consistency is maintained. One of the nifty -features of GFS is perfect consistency -- changes made to the file system -on one machine show up immediately on all other machines in the cluster. - -GFS uses interchangeable inter-node locking mechanisms, the currently -supported mechanisms are: - - lock_nolock -- allows gfs to be used as a local file system - - lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking - The dlm is found at linux/fs/dlm/ - -Lock_dlm depends on user space cluster management systems found -at the URL above. - -To use gfs as a local file system, no external clustering systems are -needed, simply: - - $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device - $ mount -t gfs2 /dev/block_device /dir - -If you are using Fedora, you need to install the gfs2-utils package -and, for lock_dlm, you will also need to install the cman package -and write a cluster.conf as per the documentation. For F17 and above -cman has been replaced by the dlm package. - -GFS2 is not on-disk compatible with previous versions of GFS, but it -is pretty close. - -The following man pages can be found at the URL above: - fsck.gfs2 to repair a filesystem - gfs2_grow to expand a filesystem online - gfs2_jadd to add journals to a filesystem online - tunegfs2 to manipulate, examine and tune a filesystem - gfs2_convert to convert a gfs filesystem to gfs2 in-place - mkfs.gfs2 to make a filesystem diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f69d20406be0..f24befe78326 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -65,6 +65,7 @@ Documentation for filesystem implementations. 
ext2 ext3 f2fs + gfs2 fuse overlayfs virtiofs -- cgit From 5b7ac27a6e2c54cc09f479b616f1076afeae3c1b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:06 +0100 Subject: docs: filesystems: convert gfs2-uevents.txt to ReST This document is almost in ReST format: all it needs is to have the titles adjusted and add a SPDX header. In other words: - Add a SPDX header; - Add a document title; - Adjust section titles; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Bob Peterson Link: https://lore.kernel.org/r/1d1c46b7e86bd0a18d9abbea0de0bc2be84e5e2b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/gfs2-uevents.rst | 112 +++++++++++++++++++++++++++++ Documentation/filesystems/gfs2-uevents.txt | 100 -------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 113 insertions(+), 100 deletions(-) create mode 100644 Documentation/filesystems/gfs2-uevents.rst delete mode 100644 Documentation/filesystems/gfs2-uevents.txt diff --git a/Documentation/filesystems/gfs2-uevents.rst b/Documentation/filesystems/gfs2-uevents.rst new file mode 100644 index 000000000000..f162a2c76c69 --- /dev/null +++ b/Documentation/filesystems/gfs2-uevents.rst @@ -0,0 +1,112 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +uevents and GFS2 +================ + +During the lifetime of a GFS2 mount, a number of uevents are generated. +This document explains what the events are and what they are used +for (by gfs_controld in gfs2-utils). + +A list of GFS2 uevents +====================== + +1. ADD +------ + +The ADD event occurs at mount time. It will always be the first +uevent generated by the newly created filesystem. If the mount +is successful, an ONLINE uevent will follow. If it is not successful +then a REMOVE uevent will follow. 
+ +The ADD uevent has two environment variables: SPECTATOR=[0|1] +and RDONLY=[0|1] that specify the spectator status (a read-only mount +with no journal assigned), and read-only (with journal assigned) status +of the filesystem respectively. + +2. ONLINE +--------- + +The ONLINE uevent is generated after a successful mount or remount. It +has the same environment variables as the ADD uevent. The ONLINE +uevent, along with the two environment variables for spectator and +RDONLY are a relatively recent addition (2.6.32-rc+) and will not +be generated by older kernels. + +3. CHANGE +--------- + +The CHANGE uevent is used in two places. One is when reporting the +successful mount of the filesystem by the first node (FIRSTMOUNT=Done). +This is used as a signal by gfs_controld that it is then ok for other +nodes in the cluster to mount the filesystem. + +The other CHANGE uevent is used to inform of the completion +of journal recovery for one of the filesystems journals. It has +two environment variables, JID= which specifies the journal id which +has just been recovered, and RECOVERY=[Done|Failed] to indicate the +success (or otherwise) of the operation. These uevents are generated +for every journal recovered, whether it is during the initial mount +process or as the result of gfs_controld requesting a specific journal +recovery via the /sys/fs/gfs2//lock_module/recovery file. + +Because the CHANGE uevent was used (in early versions of gfs_controld) +without checking the environment variables to discover the state, we +cannot add any more functions to it without running the risk of +someone using an older version of the user tools and breaking their +cluster. For this reason the ONLINE uevent was used when adding a new +uevent for a successful mount or remount. + +4. OFFLINE +---------- + +The OFFLINE uevent is only generated due to filesystem errors and is used +as part of the "withdraw" mechanism. 
Currently this doesn't give any +information about what the error is, which is something that needs to +be fixed. + +5. REMOVE +--------- + +The REMOVE uevent is generated at the end of an unsuccessful mount +or at the end of a umount of the filesystem. All REMOVE uevents will +have been preceded by at least an ADD uevent for the same filesystem, +and unlike the other uevents is generated automatically by the kernel's +kobject subsystem. + + +Information common to all GFS2 uevents (uevent environment variables) +===================================================================== + +1. LOCKTABLE= +-------------- + +The LOCKTABLE is a string, as supplied on the mount command +line (locktable=) or via fstab. It is used as a filesystem label +as well as providing the information for a lock_dlm mount to be +able to join the cluster. + +2. LOCKPROTO= +------------- + +The LOCKPROTO is a string, and its value depends on what is set +on the mount command line, or via fstab. It will be either +lock_nolock or lock_dlm. In the future other lock managers +may be supported. + +3. JOURNALID= +------------- + +If a journal is in use by the filesystem (journals are not +assigned for spectator mounts) then this will give the +numeric journal id in all GFS2 uevents. + +4. UUID= +-------- + +With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID +into the filesystem superblock. If it exists, this will +be included in every uevent relating to the filesystem. + + + diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.txt deleted file mode 100644 index 19a19ebebc34..000000000000 --- a/Documentation/filesystems/gfs2-uevents.txt +++ /dev/null @@ -1,100 +0,0 @@ - uevents and GFS2 - ================== - -During the lifetime of a GFS2 mount, a number of uevents are generated. -This document explains what the events are and what they are used -for (by gfs_controld in gfs2-utils). - -A list of GFS2 uevents ------------------------ - -1. 
ADD - -The ADD event occurs at mount time. It will always be the first -uevent generated by the newly created filesystem. If the mount -is successful, an ONLINE uevent will follow. If it is not successful -then a REMOVE uevent will follow. - -The ADD uevent has two environment variables: SPECTATOR=[0|1] -and RDONLY=[0|1] that specify the spectator status (a read-only mount -with no journal assigned), and read-only (with journal assigned) status -of the filesystem respectively. - -2. ONLINE - -The ONLINE uevent is generated after a successful mount or remount. It -has the same environment variables as the ADD uevent. The ONLINE -uevent, along with the two environment variables for spectator and -RDONLY are a relatively recent addition (2.6.32-rc+) and will not -be generated by older kernels. - -3. CHANGE - -The CHANGE uevent is used in two places. One is when reporting the -successful mount of the filesystem by the first node (FIRSTMOUNT=Done). -This is used as a signal by gfs_controld that it is then ok for other -nodes in the cluster to mount the filesystem. - -The other CHANGE uevent is used to inform of the completion -of journal recovery for one of the filesystems journals. It has -two environment variables, JID= which specifies the journal id which -has just been recovered, and RECOVERY=[Done|Failed] to indicate the -success (or otherwise) of the operation. These uevents are generated -for every journal recovered, whether it is during the initial mount -process or as the result of gfs_controld requesting a specific journal -recovery via the /sys/fs/gfs2//lock_module/recovery file. - -Because the CHANGE uevent was used (in early versions of gfs_controld) -without checking the environment variables to discover the state, we -cannot add any more functions to it without running the risk of -someone using an older version of the user tools and breaking their -cluster. 
For this reason the ONLINE uevent was used when adding a new -uevent for a successful mount or remount. - -4. OFFLINE - -The OFFLINE uevent is only generated due to filesystem errors and is used -as part of the "withdraw" mechanism. Currently this doesn't give any -information about what the error is, which is something that needs to -be fixed. - -5. REMOVE - -The REMOVE uevent is generated at the end of an unsuccessful mount -or at the end of a umount of the filesystem. All REMOVE uevents will -have been preceded by at least an ADD uevent for the same filesystem, -and unlike the other uevents is generated automatically by the kernel's -kobject subsystem. - - -Information common to all GFS2 uevents (uevent environment variables) ----------------------------------------------------------------------- - -1. LOCKTABLE= - -The LOCKTABLE is a string, as supplied on the mount command -line (locktable=) or via fstab. It is used as a filesystem label -as well as providing the information for a lock_dlm mount to be -able to join the cluster. - -2. LOCKPROTO= - -The LOCKPROTO is a string, and its value depends on what is set -on the mount command line, or via fstab. It will be either -lock_nolock or lock_dlm. In the future other lock managers -may be supported. - -3. JOURNALID= - -If a journal is in use by the filesystem (journals are not -assigned for spectator mounts) then this will give the -numeric journal id in all GFS2 uevents. - -4. UUID= - -With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID -into the filesystem superblock. If it exists, this will -be included in every uevent relating to the filesystem. - - - diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f24befe78326..c16e517e37c5 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -66,6 +66,7 @@ Documentation for filesystem implementations. 
ext3 f2fs gfs2 + gfs2-uevents fuse overlayfs virtiofs -- cgit From cdded7db3625c98e66316911947bd3a1941992e2 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:07 +0100 Subject: docs: filesystems: convert hfsplus.txt to ReST Just trivial changes: - Add a SPDX header; - Add it to filesystems/index.rst. While here, adjust document title, just to make it use the same style of the other docs. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/4298409da951fbee000201a6c8d9c85e961b2b79.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/hfsplus.rst | 61 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/hfsplus.txt | 59 --------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 62 insertions(+), 59 deletions(-) create mode 100644 Documentation/filesystems/hfsplus.rst delete mode 100644 Documentation/filesystems/hfsplus.txt diff --git a/Documentation/filesystems/hfsplus.rst b/Documentation/filesystems/hfsplus.rst new file mode 100644 index 000000000000..f02f4f5fc020 --- /dev/null +++ b/Documentation/filesystems/hfsplus.rst @@ -0,0 +1,61 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +Macintosh HFSPlus Filesystem for Linux +====================================== + +HFSPlus is a filesystem first introduced in MacOS 8.1. +HFSPlus has several extensions to HFS, including 32-bit allocation +blocks, 255-character unicode filenames, and file sizes of 2^63 bytes. + + +Mount options +============= + +When mounting an HFSPlus filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystem + that have uninitialized permissions structures. + Default: user/group id of the mounting process. 
+ + umask=n + Specifies the umask (in octal) used for files and directories + that have uninitialized permissions structures. + Default: umask of the mounting process. + + session=n + Select the CDROM session to mount as HFSPlus filesystem. Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as underlying devices. + + part=n + Select partition number n from the devices. This option only makes + sense for CDROMs because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + decompose + Decompose file name characters. + + nodecompose + Do not decompose file name characters. + + force + Used to force write access to volumes that are marked as journalled + or locked. Use at your own risk. + + nls=cccc + Encoding to use when presenting file names. + + +References +========== + +kernel source: + +Apple Technote 1150 https://developer.apple.com/legacy/library/technotes/tn/tn1150.html diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.txt deleted file mode 100644 index 59f7569fc9ed..000000000000 --- a/Documentation/filesystems/hfsplus.txt +++ /dev/null @@ -1,59 +0,0 @@ - -Macintosh HFSPlus Filesystem for Linux -====================================== - -HFSPlus is a filesystem first introduced in MacOS 8.1. -HFSPlus has several extensions to HFS, including 32-bit allocation -blocks, 255-character unicode filenames, and file sizes of 2^63 bytes. - - -Mount options -============= - -When mounting an HFSPlus filesystem, the following options are accepted: - - creator=cccc, type=cccc - Specifies the creator/type values as shown by the MacOS finder - used for creating new files. Default values: '????'. - - uid=n, gid=n - Specifies the user/group that owns all files on the filesystem - that have uninitialized permissions structures. 
- Default: user/group id of the mounting process. - - umask=n - Specifies the umask (in octal) used for files and directories - that have uninitialized permissions structures. - Default: umask of the mounting process. - - session=n - Select the CDROM session to mount as HFSPlus filesystem. Defaults to - leaving that decision to the CDROM driver. This option will fail - with anything but a CDROM as underlying devices. - - part=n - Select partition number n from the devices. This option only makes - sense for CDROMs because they can't be partitioned under Linux. - For disk devices the generic partition parsing code does this - for us. Defaults to not parsing the partition table at all. - - decompose - Decompose file name characters. - - nodecompose - Do not decompose file name characters. - - force - Used to force write access to volumes that are marked as journalled - or locked. Use at your own risk. - - nls=cccc - Encoding to use when presenting file names. - - -References -========== - -kernel source: - -Apple Technote 1150 https://developer.apple.com/legacy/library/technotes/tn/tn1150.html diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c16e517e37c5..c351bc8a8c85 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -67,6 +67,7 @@ Documentation for filesystem implementations. f2fs gfs2 gfs2-uevents + hfsplus fuse overlayfs virtiofs -- cgit From 5040a0acc8f2300ef35a1d9cc1c50a25235e061d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:08 +0100 Subject: docs: filesystems: convert hfs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Use notes markups; - Add lists markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/8a625d6652d88809730020048d26c3b9333ddbdf.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/hfs.rst | 87 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/hfs.txt | 82 ---------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 88 insertions(+), 82 deletions(-) create mode 100644 Documentation/filesystems/hfs.rst delete mode 100644 Documentation/filesystems/hfs.txt diff --git a/Documentation/filesystems/hfs.rst b/Documentation/filesystems/hfs.rst new file mode 100644 index 000000000000..ab17a005e9b1 --- /dev/null +++ b/Documentation/filesystems/hfs.rst @@ -0,0 +1,87 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +Macintosh HFS Filesystem for Linux +================================== + + +.. Note:: This filesystem doesn't have a maintainer. + + +HFS stands for ``Hierarchical File System`` and is the filesystem used +by the Mac Plus and all later Macintosh models. Earlier Macintosh +models used MFS (``Macintosh File System``), which is not supported, +MacOS 8.1 and newer support a filesystem called HFS+ that's similar to +HFS but is extended in various areas. Use the hfsplus filesystem driver +to access such filesystems from Linux. + + +Mount options +============= + +When mounting an HFS filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystems. + Default: user/group id of the mounting process. + + dir_umask=n, file_umask=n, umask=n + Specifies the umask used for all files , all directories or all + files and directories. Defaults to the umask of the mounting process. + + session=n + Select the CDROM session to mount as HFS filesystem. 
Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as the underlying device. + + part=n + Select partition number n from the device. This option only makes + sense for CDROMs because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + quiet + Ignore invalid mount options instead of complaining. + + +Writing to HFS Filesystems +========================== + +HFS is not a UNIX filesystem, thus it does not have the usual features you'd +expect: + + * You can't modify the set-uid, set-gid, sticky or executable bits or the uid + and gid of files. + * You can't create hard- or symlinks, device files, sockets or FIFOs. + +HFS does, on the other hand, have the concept of multiple forks per file. These +non-standard forks are represented as hidden additional files in the normal +filesystem namespace, which is kind of a kludge and makes the semantics for +them a little strange: + + * You can't create, delete or rename resource forks of files or the + Finder's metadata. + * They are however created (with default values), deleted and renamed + along with the corresponding data fork or directory. + * Copying files to a different filesystem will lose those attributes + that are essential for MacOS to work. + + +Creating HFS filesystems +======================== + +The hfsutils package from Robert Leslie contains a program called +hformat that can be used to create an HFS filesystem. See + for details. + + +Credits +======= + +The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU). +Roman Zippel (roman@ardistech.com) rewrote large parts of the code and brought +in btree routines derived from Brad Boyer's hfsplus driver.
diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt deleted file mode 100644 index d096df6db07a..000000000000 --- a/Documentation/filesystems/hfs.txt +++ /dev/null @@ -1,82 +0,0 @@ -Note: This filesystem doesn't have a maintainer. - -Macintosh HFS Filesystem for Linux -================================== - -HFS stands for ``Hierarchical File System'' and is the filesystem used -by the Mac Plus and all later Macintosh models. Earlier Macintosh -models used MFS (``Macintosh File System''), which is not supported, -MacOS 8.1 and newer support a filesystem called HFS+ that's similar to -HFS but is extended in various areas. Use the hfsplus filesystem driver -to access such filesystems from Linux. - - -Mount options -============= - -When mounting an HFS filesystem, the following options are accepted: - - creator=cccc, type=cccc - Specifies the creator/type values as shown by the MacOS finder - used for creating new files. Default values: '????'. - - uid=n, gid=n - Specifies the user/group that owns all files on the filesystems. - Default: user/group id of the mounting process. - - dir_umask=n, file_umask=n, umask=n - Specifies the umask used for all files , all directories or all - files and directories. Defaults to the umask of the mounting process. - - session=n - Select the CDROM session to mount as HFS filesystem. Defaults to - leaving that decision to the CDROM driver. This option will fail - with anything but a CDROM as underlying devices. - - part=n - Select partition number n from the devices. Does only makes - sense for CDROMS because they can't be partitioned under Linux. - For disk devices the generic partition parsing code does this - for us. Defaults to not parsing the partition table at all. - - quiet - Ignore invalid mount options instead of complaining. 
- - -Writing to HFS Filesystems -========================== - -HFS is not a UNIX filesystem, thus it does not have the usual features you'd -expect: - - o You can't modify the set-uid, set-gid, sticky or executable bits or the uid - and gid of files. - o You can't create hard- or symlinks, device files, sockets or FIFOs. - -HFS does on the other have the concepts of multiple forks per file. These -non-standard forks are represented as hidden additional files in the normal -filesystems namespace which is kind of a cludge and makes the semantics for -the a little strange: - - o You can't create, delete or rename resource forks of files or the - Finder's metadata. - o They are however created (with default values), deleted and renamed - along with the corresponding data fork or directory. - o Copying files to a different filesystem will loose those attributes - that are essential for MacOS to work. - - -Creating HFS filesystems -=================================== - -The hfsutils package from Robert Leslie contains a program called -hformat that can be used to create HFS filesystem. See - for details. - - -Credits -======= - -The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU). -Roman Zippel (roman@ardistech.com) rewrote large parts of the code and brought -in btree routines derived from Brad Boyer's hfsplus driver. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c351bc8a8c85..f776411340cb 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -67,6 +67,7 @@ Documentation for filesystem implementations. 
f2fs gfs2 gfs2-uevents + hfs hfsplus fuse overlayfs -- cgit From a1ef4bcd1664a9c1ae5191598b769ab37b93aa57 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:09 +0100 Subject: docs: filesystems: convert hpfs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/581019c3120938118aa55ba28902b62083c3f37a.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/hpfs.rst | 353 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/hpfs.txt | 296 ------------------------------ Documentation/filesystems/index.rst | 1 + 3 files changed, 354 insertions(+), 296 deletions(-) create mode 100644 Documentation/filesystems/hpfs.rst delete mode 100644 Documentation/filesystems/hpfs.txt diff --git a/Documentation/filesystems/hpfs.rst b/Documentation/filesystems/hpfs.rst new file mode 100644 index 000000000000..0db152278572 --- /dev/null +++ b/Documentation/filesystems/hpfs.rst @@ -0,0 +1,353 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Read/Write HPFS 2.09 +==================== + +1998-2004, Mikulas Patocka + +:email: mikulas@artax.karlin.mff.cuni.cz +:homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi + +Credits +======= +Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file + is taken from it + +Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) + +Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion + +Mount options + +uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask) + Set owner/group/mode for files that do not have it specified in extended + attributes. 
Mode is inverted umask - for example umask 027 gives owner + all permission, group read permission and anybody else no access. Note + that for files mode is anded with 0666. If you want files to have 'x' + rights, you must use extended attributes. +case=lower,asis (default asis) + File name lowercasing in readdir. +conv=binary,text,auto (default binary) + CR/LF -> LF conversion, if auto, decision is made according to extension + - there is a list of text extensions (I think it's better to not convert + text file than to damage binary file). If you want to change that list, + change it in the source. Original readonly HPFS contained some strange + heuristic algorithm that I removed. I think it's dangerous to let the + computer decide whether file is text or binary. For example, DJGPP + binaries contain small text message at the beginning and they could be + misidentified and damaged under some circumstances. +check=none,normal,strict (default normal) + Check level. Selecting none will cause only little speedup and big + danger. I tried to write it so that it won't crash if check=normal on + corrupted filesystems. check=strict means many superfluous checks - + used for debugging (for example it checks if file is allocated in + bitmaps when accessing it). +errors=continue,remount-ro,panic (default remount-ro) + Behaviour when filesystem errors are found. +chkdsk=no,errors,always (default errors) + When to mark filesystem dirty so that OS/2 checks it. +eas=no,ro,rw (default rw) + What to do with extended attributes. 'no' - ignore them and always use the + values specified in uid/gid/mode options. 'ro' - read extended + attributes but do not create them. 'rw' - create extended attributes + when you use chmod/chown/chgrp/mknod/ln -s on the filesystem. +timeshift=(-)nnn (default 0) + Shifts the time by nnn seconds. For example, if the time under Linux + is one hour ahead of the time under OS/2, use timeshift=-3600. + + +File names +========== + +As in OS/2, filenames are case insensitive.
However, shell thinks that names +are case sensitive, so for example when you create a file FOO, you can use +'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you +also won't be able to compile linux kernel (and maybe other things) on HPFS +because kernel creates different files with names like bootsect.S and +bootsect.s. When searching for file thats name has characters >= 128, codepages +are used - see below. +OS/2 ignores dots and spaces at the end of file name, so this driver does as +well. If you create 'a. ...', the file 'a' will be created, but you can still +access it under names 'a.', 'a..', 'a . . . ' etc. + + +Extended attributes +=================== + +On HPFS partitions, OS/2 can associate to each file a special information called +extended attributes. Extended attributes are pairs of (key,value) where key is +an ascii string identifying that attribute and value is any string of bytes of +variable length. OS/2 stores window and icon positions and file types there. So +why not use it for unix-specific info like file owner or access rights? This +driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended +attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only +that extended attributes those value differs from defaults specified in mount +options are created. Once created, the extended attributes are never deleted, +they're just changed. It means that when your default uid=0 and you type +something like 'chown luser file; chown root file' the file will contain +extended attribute UID=0. And when you umount the fs and mount it again with +uid=luser_uid, the file will be still owned by root! If you chmod file to 444, +extended attribute "MODE" will not be set, this special case is done by setting +read-only flag. When you mknod a block or char device, besides "MODE", the +special 4-byte extended attribute "DEV" will be created containing the device +number. 
Currently this driver cannot resize extended attributes - it means +that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV" +attributes with different sizes, they won't be rewritten and changing these +values doesn't work. + + +Symlinks +======== + +You can do symlinks on HPFS partition, symlinks are achieved by setting extended +attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and +chgrp symlinks but I don't know what is it good for. chmoding symlink results +in chmoding file where symlink points. These symlinks are just for Linux use and +incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are +stored in very crazy way. They tried to do it so that link changes when file is +moved ... sometimes it works. But the link is partly stored in directory +extended attributes and partly in OS2SYS.INI. I don't want (and don't know how) +to analyze or change OS2SYS.INI. + + +Codepages +========= + +HPFS can contain several uppercasing tables for several codepages and each +file has a pointer to codepage its name is in. However OS/2 was created in +America where people don't care much about codepages and so multiple codepages +support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. +Once I booted English OS/2 working in cp 850 and I created a file on my 852 +partition. It marked file name codepage as 850 - good. But when I again booted +Czech OS/2, the file was completely inaccessible under any name. It seems that +OS/2 uppercases the search pattern with its system code page (852) and file +name it's comparing to with its code page (850). These could never match. Is it +really what IBM developers wanted? But problems continued. When I created in +Czech OS/2 another file in that directory, that file was inaccessible too. 
OS/2 +probably uses different uppercasing method when searching where to place a file +(note, that files in HPFS directory must be sorted) and when searching for +a file. Finally when I opened this directory in PmShell, PmShell crashed (the +funny thing was that, when rebooted, PmShell tried to reopen this directory +again :-). chkdsk happily ignores these errors and only low-level disk +modification saved me. Never mix different language versions of OS/2 on one +system although HPFS was designed to allow that. +OK, I could implement complex codepage support to this driver but I think it +would cause more problems than benefit with such buggy implementation in OS/2. +So this driver simply uses first codepage it finds for uppercasing and +lowercasing no matter what's file codepage index. Usually all file names are in +this codepage - if you don't try to do what I described above :-) + + +Known bugs +========== + +HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client +should work. If you have OS/2 server, use only read-only mode. I don't know how +to handle some HPFS386 structures like access control list or extended perm +list, I don't know how to delete them when file is deleted and how to not +overwrite them with extended attributes. Send me some info on these structures +and I'll make it. However, this driver should detect presence of HPFS386 +structures, remount read-only and not destroy them (I hope). + +When there's not enough space for extended attributes, they will be truncated +and no error is returned. + +OS/2 can't access files if the path is longer than about 256 chars but this +driver allows you to do it. chkdsk ignores such errors. + +Sometimes you won't be able to delete some files on a very full filesystem +(returning error ENOSPC). That's because file in non-leaf node in directory tree +(one directory, if it's large, has dirents in tree on HPFS) must be replaced +with another node when deleted. 
And that new file might have larger name than +the old one so the new name doesn't fit in directory node (dnode). And that +would result in directory tree splitting, that takes disk space. Workaround is +to delete other files that are leaf (probability that the file is non-leaf is +about 1/50) or to truncate file first to make some space. +You encounter this problem only if you have many directories so that +preallocated directory band is full i.e.:: + + number_of_directories / size_of_filesystem_in_mb > 4. + +You can't delete open directories. + +You can't rename over directories (what is it good for?). + +Renaming files so that only case changes doesn't work. This driver supports it +but vfs doesn't. Something like 'mv file FILE' won't work. + +All atimes and directory mtimes are not updated. That's because of performance +reasons. If you extremely wish to update them, let me know, I'll write it (but +it will be slow). + +When the system is out of memory and swap, it may slightly corrupt filesystem +(lost files, unbalanced directories). (I guess all filesystem may do it). + +When compiled, you get warning: function declaration isn't a prototype. Does +anybody know what does it mean? + + +What does "unbalanced tree" message mean? +========================================= + +Old versions of this driver created sometimes unbalanced dnode trees. OS/2 +chkdsk doesn't scream if the tree is unbalanced (and sometimes creates +unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely +crashes when the tree is not balanced. This driver handles unbalanced trees +correctly and writes warning if it finds them. If you see this message, this is +probably because of directories created with old version of this driver. +Workaround is to move all files from that directory to another and then back +again. Do it in Linux, not OS/2! If you see this message in directory that is +whole created by this driver, it is BUG - let me know about it. 
+ + +Bugs in OS/2 +============ + +When you have two (or more) lost directories pointing each to other, chkdsk +locks up when repairing filesystem. + +Sometimes (I think it's random) when you create a file with one-char name under +OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs +error corrected". + +File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and +marks them as short (and writes "minor fs error corrected"). This bug is not in +HPFS386. + +Codepage bugs described above +============================= + +If you don't install fixpacks, there are many, many more... + + +History +======= + +====== ========================================================================= +0.90 First public release +0.91 Fixed bug that caused shooting to memory when write_inode was called on + open inode (rarely happened) +0.92 Fixed a little memory leak in freeing directory inodes +0.93 Fixed bug that locked up the machine when there were too many filenames + with first 15 characters same + Fixed write_file to zero file when writing behind file end +0.94 Fixed a little memory leak when trying to delete busy file or directory +0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files +1.90 First version for 2.1.1xx kernels +1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk + Fixed a race-condition when write_inode is called while deleting file + Fixed a bug that could possibly happen (with very low probability) when + using 0xff in filenames. + + Rewritten locking to avoid race-conditions + + Mount option 'eas' now works + + Fsync no longer returns error + + Files beginning with '.' 
are marked hidden + + Remount support added + + Alloc is not so slow when filesystem becomes full + + Atimes are no more updated because it slows down operation + + Code cleanup (removed all commented debug prints) +1.92 Corrected a bug when sync was called just before closing file +1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it + works with previous versions + + Fixed a possible problem with disks > 64G (but I don't have one, so I can't + test it) + + Fixed a file overflow at 2G + + Added new option 'timeshift' + + Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in + read-only mode + + Fixed a bug that slowed down alloc and prevented allocating 100% space + (this bug was not destructive) +1.94 Added workaround for one bug in Linux + + Fixed one buffer leak + + Fixed some incompatibilities with large extended attributes (but it's still + not 100% ok, I have no info on it and OS/2 doesn't want to create them) + + Rewritten allocation + + Fixed a bug with i_blocks (du sometimes didn't display correct values) + + Directories have no longer archive attribute set (some programs don't like + it) + + Fixed a bug that it set badly one flag in large anode tree (it was not + destructive) +1.95 Fixed one buffer leak, that could happen on corrupted filesystem + + Fixed one bug in allocation in 1.94 +1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported + error sometimes when opening directories in PMSHELL) + + Fixed a possible bitmap race + + Fixed possible problem on large disks + + You can now delete open files + + Fixed a nondestructive race in rename +1.97 Support for HPFS v3 (on large partitions) + + Fixed a bug that it didn't allow creation of files > 128M + (it should be 2G) +1.97.1 Changed names of global symbols + + Fixed a bug when chmoding or chowning root directory +1.98 Fixed a deadlock when using old_readdir + Better directory handling; workaround for "unbalanced tree" bug in OS/2 +1.99
Corrected a possible problem when there's not enough space while deleting + file + + Now it tries to truncate the file if there's not enough space when + deleting + + Removed a lot of redundant code +2.00 Fixed a bug in rename (it was there since 1.96) + Better anti-fragmentation strategy +2.01 Fixed problem with directory listing over NFS + + Directory lseek now checks for proper parameters + + Fixed race-condition in buffer code - it is in all filesystems in Linux; + when reading device (cat /dev/hda) while creating files on it, files + could be damaged +2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond + end of partition +2.03 Char, block devices and pipes are correctly created + + Fixed non-crashing race in unlink (Alexander Viro) + + Now it works with Japanese version of OS/2 +2.04 Fixed error when ftruncate used to extend file +2.05 Fixed crash when got mount parameters without = + + Fixed crash when allocation of anode failed due to full disk + + Fixed some crashes when block io or inode allocation failed +2.06 Fixed some crash on corrupted disk structures + + Better allocation strategy + + Reschedule points added so that it doesn't lock CPU long time + + It should work in read-only mode on Warp Server +2.07 More fixes for Warp Server. 
Now it really works +2.08 Creating new files is not so slow on large disks + + An attempt to sync deleted file does not generate filesystem error +2.09 Fixed error on extremely fragmented files +====== ========================================================================= diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt deleted file mode 100644 index 74630bd504fb..000000000000 --- a/Documentation/filesystems/hpfs.txt +++ /dev/null @@ -1,296 +0,0 @@ -Read/Write HPFS 2.09 -1998-2004, Mikulas Patocka - -email: mikulas@artax.karlin.mff.cuni.cz -homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi - -CREDITS: -Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file - is taken from it -Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) -Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion - -Mount options - -uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask) - Set owner/group/mode for files that do not have it specified in extended - attributes. Mode is inverted umask - for example umask 027 gives owner - all permission, group read permission and anybody else no access. Note - that for files mode is anded with 0666. If you want files to have 'x' - rights, you must use extended attributes. -case=lower,asis (default asis) - File name lowercasing in readdir. -conv=binary,text,auto (default binary) - CR/LF -> LF conversion, if auto, decision is made according to extension - - there is a list of text extensions (I thing it's better to not convert - text file than to damage binary file). If you want to change that list, - change it in the source. Original readonly HPFS contained some strange - heuristic algorithm that I removed. I thing it's danger to let the - computer decide whether file is text or binary. 
For example, DJGPP - binaries contain small text message at the beginning and they could be - misidentified and damaged under some circumstances. -check=none,normal,strict (default normal) - Check level. Selecting none will cause only little speedup and big - danger. I tried to write it so that it won't crash if check=normal on - corrupted filesystems. check=strict means many superfluous checks - - used for debugging (for example it checks if file is allocated in - bitmaps when accessing it). -errors=continue,remount-ro,panic (default remount-ro) - Behaviour when filesystem errors found. -chkdsk=no,errors,always (default errors) - When to mark filesystem dirty so that OS/2 checks it. -eas=no,ro,rw (default rw) - What to do with extended attributes. 'no' - ignore them and use always - values specified in uid/gid/mode options. 'ro' - read extended - attributes but do not create them. 'rw' - create extended attributes - when you use chmod/chown/chgrp/mknod/ln -s on the filesystem. -timeshift=(-)nnn (default 0) - Shifts the time by nnn seconds. For example, if you see under linux - one hour more, than under os/2, use timeshift=-3600. - - -File names - -As in OS/2, filenames are case insensitive. However, shell thinks that names -are case sensitive, so for example when you create a file FOO, you can use -'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you -also won't be able to compile linux kernel (and maybe other things) on HPFS -because kernel creates different files with names like bootsect.S and -bootsect.s. When searching for file thats name has characters >= 128, codepages -are used - see below. -OS/2 ignores dots and spaces at the end of file name, so this driver does as -well. If you create 'a. ...', the file 'a' will be created, but you can still -access it under names 'a.', 'a..', 'a . . . ' etc. - - -Extended attributes - -On HPFS partitions, OS/2 can associate to each file a special information called -extended attributes. 
Extended attributes are pairs of (key,value) where key is -an ascii string identifying that attribute and value is any string of bytes of -variable length. OS/2 stores window and icon positions and file types there. So -why not use it for unix-specific info like file owner or access rights? This -driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended -attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only -that extended attributes those value differs from defaults specified in mount -options are created. Once created, the extended attributes are never deleted, -they're just changed. It means that when your default uid=0 and you type -something like 'chown luser file; chown root file' the file will contain -extended attribute UID=0. And when you umount the fs and mount it again with -uid=luser_uid, the file will be still owned by root! If you chmod file to 444, -extended attribute "MODE" will not be set, this special case is done by setting -read-only flag. When you mknod a block or char device, besides "MODE", the -special 4-byte extended attribute "DEV" will be created containing the device -number. Currently this driver cannot resize extended attributes - it means -that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV" -attributes with different sizes, they won't be rewritten and changing these -values doesn't work. - - -Symlinks - -You can do symlinks on HPFS partition, symlinks are achieved by setting extended -attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and -chgrp symlinks but I don't know what is it good for. chmoding symlink results -in chmoding file where symlink points. These symlinks are just for Linux use and -incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are -stored in very crazy way. They tried to do it so that link changes when file is -moved ... sometimes it works. 
But the link is partly stored in directory -extended attributes and partly in OS2SYS.INI. I don't want (and don't know how) -to analyze or change OS2SYS.INI. - - -Codepages - -HPFS can contain several uppercasing tables for several codepages and each -file has a pointer to codepage its name is in. However OS/2 was created in -America where people don't care much about codepages and so multiple codepages -support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. -Once I booted English OS/2 working in cp 850 and I created a file on my 852 -partition. It marked file name codepage as 850 - good. But when I again booted -Czech OS/2, the file was completely inaccessible under any name. It seems that -OS/2 uppercases the search pattern with its system code page (852) and file -name it's comparing to with its code page (850). These could never match. Is it -really what IBM developers wanted? But problems continued. When I created in -Czech OS/2 another file in that directory, that file was inaccessible too. OS/2 -probably uses different uppercasing method when searching where to place a file -(note, that files in HPFS directory must be sorted) and when searching for -a file. Finally when I opened this directory in PmShell, PmShell crashed (the -funny thing was that, when rebooted, PmShell tried to reopen this directory -again :-). chkdsk happily ignores these errors and only low-level disk -modification saved me. Never mix different language versions of OS/2 on one -system although HPFS was designed to allow that. -OK, I could implement complex codepage support to this driver but I think it -would cause more problems than benefit with such buggy implementation in OS/2. -So this driver simply uses first codepage it finds for uppercasing and -lowercasing no matter what's file codepage index. Usually all file names are in -this codepage - if you don't try to do what I described above :-) - - -Known bugs - -HPFS386 on OS/2 server is not supported. 
HPFS386 installed on normal OS/2 client -should work. If you have OS/2 server, use only read-only mode. I don't know how -to handle some HPFS386 structures like access control list or extended perm -list, I don't know how to delete them when file is deleted and how to not -overwrite them with extended attributes. Send me some info on these structures -and I'll make it. However, this driver should detect presence of HPFS386 -structures, remount read-only and not destroy them (I hope). - -When there's not enough space for extended attributes, they will be truncated -and no error is returned. - -OS/2 can't access files if the path is longer than about 256 chars but this -driver allows you to do it. chkdsk ignores such errors. - -Sometimes you won't be able to delete some files on a very full filesystem -(returning error ENOSPC). That's because file in non-leaf node in directory tree -(one directory, if it's large, has dirents in tree on HPFS) must be replaced -with another node when deleted. And that new file might have larger name than -the old one so the new name doesn't fit in directory node (dnode). And that -would result in directory tree splitting, that takes disk space. Workaround is -to delete other files that are leaf (probability that the file is non-leaf is -about 1/50) or to truncate file first to make some space. -You encounter this problem only if you have many directories so that -preallocated directory band is full i.e. - number_of_directories / size_of_filesystem_in_mb > 4. - -You can't delete open directories. - -You can't rename over directories (what is it good for?). - -Renaming files so that only case changes doesn't work. This driver supports it -but vfs doesn't. Something like 'mv file FILE' won't work. - -All atimes and directory mtimes are not updated. That's because of performance -reasons. If you extremely wish to update them, let me know, I'll write it (but -it will be slow). 
- -When the system is out of memory and swap, it may slightly corrupt filesystem -(lost files, unbalanced directories). (I guess all filesystem may do it). - -When compiled, you get warning: function declaration isn't a prototype. Does -anybody know what does it mean? - - -What does "unbalanced tree" message mean? - -Old versions of this driver created sometimes unbalanced dnode trees. OS/2 -chkdsk doesn't scream if the tree is unbalanced (and sometimes creates -unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely -crashes when the tree is not balanced. This driver handles unbalanced trees -correctly and writes warning if it finds them. If you see this message, this is -probably because of directories created with old version of this driver. -Workaround is to move all files from that directory to another and then back -again. Do it in Linux, not OS/2! If you see this message in directory that is -whole created by this driver, it is BUG - let me know about it. - - -Bugs in OS/2 - -When you have two (or more) lost directories pointing each to other, chkdsk -locks up when repairing filesystem. - -Sometimes (I think it's random) when you create a file with one-char name under -OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs -error corrected". - -File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and -marks them as short (and writes "minor fs error corrected"). This bug is not in -HPFS386. - -Codepage bugs described above. - -If you don't install fixpacks, there are many, many more... 
- - -History - -0.90 First public release -0.91 Fixed bug that caused shooting to memory when write_inode was called on - open inode (rarely happened) -0.92 Fixed a little memory leak in freeing directory inodes -0.93 Fixed bug that locked up the machine when there were too many filenames - with first 15 characters same - Fixed write_file to zero file when writing behind file end -0.94 Fixed a little memory leak when trying to delete busy file or directory -0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files -1.90 First version for 2.1.1xx kernels -1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk - Fixed a race-condition when write_inode is called while deleting file - Fixed a bug that could possibly happen (with very low probability) when - using 0xff in filenames - Rewritten locking to avoid race-conditions - Mount option 'eas' now works - Fsync no longer returns error - Files beginning with '.' are marked hidden - Remount support added - Alloc is not so slow when filesystem becomes full - Atimes are no more updated because it slows down operation - Code cleanup (removed all commented debug prints) -1.92 Corrected a bug when sync was called just before closing file -1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it - works with previous versions - Fixed a possible problem with disks > 64G (but I don't have one, so I can't - test it) - Fixed a file overflow at 2G - Added new option 'timeshift' - Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in - read-only mode - Fixed a bug that slowed down alloc and prevented allocating 100% space - (this bug was not destructive) -1.94 Added workaround for one bug in Linux - Fixed one buffer leak - Fixed some incompatibilities with large extended attributes (but it's still - not 100% ok, I have no info on it and OS/2 doesn't want to create them) - Rewritten allocation - Fixed a bug with i_blocks (du sometimes didn't display 
correct values) - Directories have no longer archive attribute set (some programs don't like - it) - Fixed a bug that it set badly one flag in large anode tree (it was not - destructive) -1.95 Fixed one buffer leak, that could happen on corrupted filesystem - Fixed one bug in allocation in 1.94 -1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported - error sometimes when opening directories in PMSHELL) - Fixed a possible bitmap race - Fixed possible problem on large disks - You can now delete open files - Fixed a nondestructive race in rename -1.97 Support for HPFS v3 (on large partitions) - Fixed a bug that it didn't allow creation of files > 128M (it should be 2G) -1.97.1 Changed names of global symbols - Fixed a bug when chmoding or chowning root directory -1.98 Fixed a deadlock when using old_readdir - Better directory handling; workaround for "unbalanced tree" bug in OS/2 -1.99 Corrected a possible problem when there's not enough space while deleting - file - Now it tries to truncate the file if there's not enough space when deleting - Removed a lot of redundant code -2.00 Fixed a bug in rename (it was there since 1.96) - Better anti-fragmentation strategy -2.01 Fixed problem with directory listing over NFS - Directory lseek now checks for proper parameters - Fixed race-condition in buffer code - it is in all filesystems in Linux; - when reading device (cat /dev/hda) while creating files on it, files - could be damaged -2.02 Workaround for bug in breada in Linux. 
breada could cause accesses beyond - end of partition -2.03 Char, block devices and pipes are correctly created - Fixed non-crashing race in unlink (Alexander Viro) - Now it works with Japanese version of OS/2 -2.04 Fixed error when ftruncate used to extend file -2.05 Fixed crash when got mount parameters without = - Fixed crash when allocation of anode failed due to full disk - Fixed some crashes when block io or inode allocation failed -2.06 Fixed some crash on corrupted disk structures - Better allocation strategy - Reschedule points added so that it doesn't lock CPU long time - It should work in read-only mode on Warp Server -2.07 More fixes for Warp Server. Now it really works -2.08 Creating new files is not so slow on large disks - An attempt to sync deleted file does not generate filesystem error -2.09 Fixed error on extremely fragmented files - - - vim: set textwidth=80: diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f776411340cb..3fbe2fa0b5c5 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -69,6 +69,7 @@ Documentation for filesystem implementations. gfs2-uevents hfs hfsplus + hpfs fuse overlayfs virtiofs -- cgit From de389cf08d4708d0a03516e5ce0e193f49f0b358 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:10 +0100 Subject: docs: filesystems: convert inotify.txt to ReST - Add a SPDX header; - Add a document title; - Adjust document title; - Fix list markups; - Some whitespace fixes and new line breaks; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Acked-by: Jan Kara Link: https://lore.kernel.org/r/8f846843ecf1914988feb4d001e3a53d27dc1a65.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/inotify.rst | 90 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/inotify.txt | 79 ------------------------------ 3 files changed, 91 insertions(+), 79 deletions(-) create mode 100644 Documentation/filesystems/inotify.rst delete mode 100644 Documentation/filesystems/inotify.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 3fbe2fa0b5c5..5a737722652c 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -70,6 +70,7 @@ Documentation for filesystem implementations. hfs hfsplus hpfs + inotify fuse overlayfs virtiofs diff --git a/Documentation/filesystems/inotify.rst b/Documentation/filesystems/inotify.rst new file mode 100644 index 000000000000..7f7ef8af0e1e --- /dev/null +++ b/Documentation/filesystems/inotify.rst @@ -0,0 +1,90 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================================== +Inotify - A Powerful yet Simple File Change Notification System +=============================================================== + + + +Document started 15 Mar 2005 by Robert Love + +Document updated 4 Jan 2015 by Zhang Zhen + + - Deleted obsoleted interface, just refer to manpages for user interface. + +(i) Rationale + +Q: + What is the design decision behind not tying the watch to the open fd of + the watched object? + +A: + Watches are associated with an open inotify device, not an open file. + This solves the primary problem with dnotify: keeping the file open pins + the file and thus, worse, pins the mount. Dnotify is therefore infeasible + for use on a desktop system with removable media as the media cannot be + unmounted. 
Watching a file should not require that it be open. + +Q: + What is the design decision behind using an-fd-per-instance as opposed to + an fd-per-watch? + +A: + An fd-per-watch quickly consumes more file descriptors than are allowed, + more fd's than are feasible to manage, and more fd's than are optimally + select()-able. Yes, root can bump the per-process fd limit and yes, users + can use epoll, but requiring both is a silly and extraneous requirement. + A watch consumes less memory than an open file, separating the number + spaces is thus sensible. The current design is what user-space developers + want: Users initialize inotify, once, and add n watches, requiring but one + fd and no twiddling with fd limits. Initializing an inotify instance two + thousand times is silly. If we can implement user-space's preferences + cleanly--and we can, the idr layer makes stuff like this trivial--then we + should. + + There are other good arguments. With a single fd, there is a single + item to block on, which is mapped to a single queue of events. The single + fd returns all watch events and also any potential out-of-band data. If + every fd was a separate watch, + + - There would be no way to get event ordering. Events on file foo and + file bar would pop poll() on both fd's, but there would be no way to tell + which happened first. A single queue trivially gives you ordering. Such + ordering is crucial to existing applications such as Beagle. Imagine + "mv a b ; mv b a" events without ordering. + + - We'd have to maintain n fd's and n internal queues with state, + versus just one. It is a lot messier in the kernel. A single, linear + queue is the data structure that makes sense. + + - User-space developers prefer the current API. The Beagle guys, for + example, love it. Trust me, I asked. It is not a surprise: Who'd want + to manage and block on 1000 fd's via select? + + - No way to get out of band data. + + - 1024 is still too low. 
;-) + + When you talk about designing a file change notification system that + scales to 1000s of directories, juggling 1000s of fd's just does not seem + the right interface. It is too heavy. + + Additionally, it _is_ possible to have more than one instance and + juggle more than one queue and thus more than one associated fd. There + need not be a one-fd-per-process mapping; it is one-fd-per-queue and a + process can easily want more than one queue. + +Q: + Why the system call approach? + +A: + The poor user-space interface is the second biggest problem with dnotify. + Signals are a terrible, terrible interface for file notification. Or for + anything, for that matter. The ideal solution, from all perspectives, is a + file descriptor-based one that allows basic file I/O and poll/select. + Obtaining the fd and managing the watches could have been done either via a + device file or a family of new system calls. We decided to implement a + family of system calls because that is the preferred approach for new kernel + interfaces. The only real difference was whether we wanted to use open(2) + and ioctl(2) or a couple of new system calls. System calls beat ioctls. + diff --git a/Documentation/filesystems/inotify.txt b/Documentation/filesystems/inotify.txt deleted file mode 100644 index 51f61db787fb..000000000000 --- a/Documentation/filesystems/inotify.txt +++ /dev/null @@ -1,79 +0,0 @@ - inotify - a powerful yet simple file change notification system - - - -Document started 15 Mar 2005 by Robert Love -Document updated 4 Jan 2015 by Zhang Zhen - --Deleted obsoleted interface, just refer to manpages for user interface. - -(i) Rationale - -Q: What is the design decision behind not tying the watch to the open fd of - the watched object? - -A: Watches are associated with an open inotify device, not an open file. - This solves the primary problem with dnotify: keeping the file open pins - the file and thus, worse, pins the mount.
Dnotify is therefore infeasible - for use on a desktop system with removable media as the media cannot be - unmounted. Watching a file should not require that it be open. - -Q: What is the design decision behind using an-fd-per-instance as opposed to - an fd-per-watch? - -A: An fd-per-watch quickly consumes more file descriptors than are allowed, - more fd's than are feasible to manage, and more fd's than are optimally - select()-able. Yes, root can bump the per-process fd limit and yes, users - can use epoll, but requiring both is a silly and extraneous requirement. - A watch consumes less memory than an open file, separating the number - spaces is thus sensible. The current design is what user-space developers - want: Users initialize inotify, once, and add n watches, requiring but one - fd and no twiddling with fd limits. Initializing an inotify instance two - thousand times is silly. If we can implement user-space's preferences - cleanly--and we can, the idr layer makes stuff like this trivial--then we - should. - - There are other good arguments. With a single fd, there is a single - item to block on, which is mapped to a single queue of events. The single - fd returns all watch events and also any potential out-of-band data. If - every fd was a separate watch, - - - There would be no way to get event ordering. Events on file foo and - file bar would pop poll() on both fd's, but there would be no way to tell - which happened first. A single queue trivially gives you ordering. Such - ordering is crucial to existing applications such as Beagle. Imagine - "mv a b ; mv b a" events without ordering. - - - We'd have to maintain n fd's and n internal queues with state, - versus just one. It is a lot messier in the kernel. A single, linear - queue is the data structure that makes sense. - - - User-space developers prefer the current API. The Beagle guys, for - example, love it. Trust me, I asked. 
It is not a surprise: Who'd want - to manage and block on 1000 fd's via select? - - - No way to get out of band data. - - - 1024 is still too low. ;-) - - When you talk about designing a file change notification system that - scales to 1000s of directories, juggling 1000s of fd's just does not seem - the right interface. It is too heavy. - - Additionally, it _is_ possible to more than one instance and - juggle more than one queue and thus more than one associated fd. There - need not be a one-fd-per-process mapping; it is one-fd-per-queue and a - process can easily want more than one queue. - -Q: Why the system call approach? - -A: The poor user-space interface is the second biggest problem with dnotify. - Signals are a terrible, terrible interface for file notification. Or for - anything, for that matter. The ideal solution, from all perspectives, is a - file descriptor-based one that allows basic file I/O and poll/select. - Obtaining the fd and managing the watches could have been done either via a - device file or a family of new system calls. We decided to implement a - family of system calls because that is the preferred approach for new kernel - interfaces. The only real difference was whether we wanted to use open(2) - and ioctl(2) or a couple of new system calls. System calls beat ioctls. - -- cgit From 76f216855b6bd1027e236b29cd7fece7336c37eb Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:11 +0100 Subject: docs: filesystems: convert isofs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Add table markups; - Add lists markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/ec16dc09d0c23bb0c1af3d3f33a96896083a1d36.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/isofs.rst | 64 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/isofs.txt | 48 ---------------------------- 3 files changed, 65 insertions(+), 48 deletions(-) create mode 100644 Documentation/filesystems/isofs.rst delete mode 100644 Documentation/filesystems/isofs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 5a737722652c..8c8813ada53f 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -71,6 +71,7 @@ Documentation for filesystem implementations. hfsplus hpfs inotify + isofs fuse overlayfs virtiofs diff --git a/Documentation/filesystems/isofs.rst b/Documentation/filesystems/isofs.rst new file mode 100644 index 000000000000..08fd469091d4 --- /dev/null +++ b/Documentation/filesystems/isofs.rst @@ -0,0 +1,64 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +ISO9660 Filesystem +================== + +Mount options that are the same as for msdos and vfat partitions. + + ========= ======================================================== + gid=nnn All files in the partition will be in group nnn. + uid=nnn All files in the partition will be owned by user id nnn. + umask=nnn The permission mask (see umask(1)) for the partition. + ========= ======================================================== + +Mount options that are the same as vfat partitions. These are only useful +when using discs encoded using Microsoft's Joliet extensions. + + ============== ============================================================= + iocharset=name Character set to use for converting from Unicode to + ASCII. Joliet filenames are stored in Unicode format, but + Unix for the most part doesn't know how to deal with Unicode. 
+ There is also an option of doing UTF-8 translations with the + utf8 option. + utf8 Encode Unicode names in UTF-8 format. Default is no. + ============== ============================================================= + +Mount options unique to the isofs filesystem. + + ================= ============================================================ + block=512 Set the block size for the disk to 512 bytes + block=1024 Set the block size for the disk to 1024 bytes + block=2048 Set the block size for the disk to 2048 bytes + check=relaxed Matches filenames with different cases + check=strict Matches only filenames with the exact same case + cruft Try to handle badly formatted CDs. + map=off Do not map non-Rock Ridge filenames to lower case + map=normal Map non-Rock Ridge filenames to lower case + map=acorn As map=normal but also apply Acorn extensions if present + mode=xxx Sets the permissions on files to xxx unless Rock Ridge + extensions set the permissions otherwise + dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge + extensions set the permissions otherwise + overriderockperm Set permissions on files and directories according to + 'mode' and 'dmode' even though Rock Ridge extensions are + present. + nojoliet Ignore Joliet extensions if they are present. + norock Ignore Rock Ridge extensions if they are present. + hide Completely strip hidden files from the file system. 
+ showassoc Show files marked with the 'associated' bit + unhide Deprecated; showing hidden files is now default; + If given, it is a synonym for 'showassoc' which will + recreate previous unhide behavior + session=x Select number of session on multisession CD + sbsector=xxx Session begins from sector xxx + ================= ============================================================ + +Recommended documents about ISO 9660 standard are located at: + +- http://www.y-adagio.com/ +- ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf + +Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically +identical with ISO 9660.", so it is a valid and gratis substitute of the +official ISO specification. diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt deleted file mode 100644 index ba0a93384de0..000000000000 --- a/Documentation/filesystems/isofs.txt +++ /dev/null @@ -1,48 +0,0 @@ -Mount options that are the same as for msdos and vfat partitions. - - gid=nnn All files in the partition will be in group nnn. - uid=nnn All files in the partition will be owned by user id nnn. - umask=nnn The permission mask (see umask(1)) for the partition. - -Mount options that are the same as vfat partitions. These are only useful -when using discs encoded using Microsoft's Joliet extensions. - iocharset=name Character set to use for converting from Unicode to - ASCII. Joliet filenames are stored in Unicode format, but - Unix for the most part doesn't know how to deal with Unicode. - There is also an option of doing UTF-8 translations with the - utf8 option. - utf8 Encode Unicode names in UTF-8 format. Default is no. - -Mount options unique to the isofs filesystem. 
- block=512 Set the block size for the disk to 512 bytes - block=1024 Set the block size for the disk to 1024 bytes - block=2048 Set the block size for the disk to 2048 bytes - check=relaxed Matches filenames with different cases - check=strict Matches only filenames with the exact same case - cruft Try to handle badly formatted CDs. - map=off Do not map non-Rock Ridge filenames to lower case - map=normal Map non-Rock Ridge filenames to lower case - map=acorn As map=normal but also apply Acorn extensions if present - mode=xxx Sets the permissions on files to xxx unless Rock Ridge - extensions set the permissions otherwise - dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge - extensions set the permissions otherwise - overriderockperm Set permissions on files and directories according to - 'mode' and 'dmode' even though Rock Ridge extensions are - present. - nojoliet Ignore Joliet extensions if they are present. - norock Ignore Rock Ridge extensions if they are present. - hide Completely strip hidden files from the file system. - showassoc Show files marked with the 'associated' bit - unhide Deprecated; showing hidden files is now default; - If given, it is a synonym for 'showassoc' which will - recreate previous unhide behavior - session=x Select number of session on multisession CD - sbsector=xxx Session begins from sector xxx - -Recommended documents about ISO 9660 standard are located at: -http://www.y-adagio.com/ -ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf -Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically -identical with ISO 9660.", so it is a valid and gratis substitute of the -official ISO specification. 
-- cgit From 2640c19dcab0f6530007dfb4ee5870f5d61b0772 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:12 +0100 Subject: docs: filesystems: convert nilfs2.txt to ReST - Add a SPDX header; - Add a document title; - Adjust document title; - Mark literal blocks as such; - use :field: markup; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/f7989ca501585f5990fffd2d365cfca4fe9fdd6f.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 3 +- Documentation/filesystems/nilfs2.rst | 286 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/nilfs2.txt | 276 --------------------------------- 3 files changed, 288 insertions(+), 277 deletions(-) create mode 100644 Documentation/filesystems/nilfs2.rst delete mode 100644 Documentation/filesystems/nilfs2.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 8c8813ada53f..01587704fcc9 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -70,9 +70,10 @@ Documentation for filesystem implementations. hfs hfsplus hpfs + fuse inotify isofs - fuse + nilfs2 overlayfs virtiofs vfat diff --git a/Documentation/filesystems/nilfs2.rst b/Documentation/filesystems/nilfs2.rst new file mode 100644 index 000000000000..6c49f04e9e0a --- /dev/null +++ b/Documentation/filesystems/nilfs2.rst @@ -0,0 +1,286 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== +NILFS2 +====== + +NILFS2 is a log-structured file system (LFS) supporting continuous +snapshotting. In addition to versioning capability of the entire file +system, users can even restore files mistakenly overwritten or +destroyed just a few seconds ago. Since NILFS2 can keep consistency +like conventional LFS, it achieves quick recovery after system +crashes. 
+ +NILFS2 creates a number of checkpoints every few seconds or per +synchronous write basis (unless there is no change). Users can select +significant versions among continuously created checkpoints, and can +change them into snapshots which will be preserved until they are +changed back to checkpoints. + +There is no limit on the number of snapshots until the volume gets +full. Each snapshot is mountable as a read-only file system +concurrently with its writable mount, and this feature is convenient +for online backup. + +The userland tools are included in nilfs-utils package, which is +available from the following download page. At least "mkfs.nilfs2", +"mount.nilfs2", "umount.nilfs2", and "nilfs_cleanerd" (so called +cleaner or garbage collector) are required. Details on the tools are +described in the man pages included in the package. + +:Project web page: https://nilfs.sourceforge.io/ +:Download page: https://nilfs.sourceforge.io/en/download.html +:List info: http://vger.kernel.org/vger-lists.html#linux-nilfs + +Caveats +======= + +Features which NILFS2 does not support yet: + + - atime + - extended attributes + - POSIX ACLs + - quotas + - fsck + - defragmentation + +Mount options +============= + +NILFS2 supports the following mount options: +(*) == default + +======================= ======================================================= +barrier(*) This enables/disables the use of write barriers. This +nobarrier requires an IO stack which can support barriers, and + if nilfs gets an error on a barrier write, it will + disable again with a warning. +errors=continue Keep going on a filesystem error. +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. +cp=n Specify the checkpoint-number of the snapshot to be + mounted. Checkpoints and snapshots are listed by lscp + user command. Only the checkpoints marked as snapshot + are mountable with this option. 
Snapshot is read-only, + so a read-only mount option must be specified together. +order=relaxed(*) Apply relaxed order semantics that allows modified data + blocks to be written to disk without making a + checkpoint if no metadata update is going. This mode + is equivalent to the ordered data mode of the ext3 + filesystem except for the updates on data blocks still + conserve atomicity. This will improve synchronous + write performance for overwriting. +order=strict Apply strict in-order semantics that preserves sequence + of all file operations including overwriting of data + blocks. That means, it is guaranteed that no + overtaking of events occurs in the recovered file + system after a crash. +norecovery Disable recovery of the filesystem on mount. + This disables every write access on the device for + read-only mounts or snapshots. This option will fail + for r/w mounts on an unclean volume. +discard This enables/disables the use of discard/TRIM commands. +nodiscard(*) The discard/TRIM commands are sent to the underlying + block device when blocks are freed. This is useful + for SSD devices and sparse/thinly-provisioned LUNs. +======================= ======================================================= + +Ioctls +====== + +There is some NILFS2 specific functionality which can be accessed by applications +through the system call interfaces. The list of all NILFS2 specific ioctls is +shown in the table below. + +Table of NILFS2 specific ioctls: + + ============================== =============================================== + Ioctl Description + ============================== =============================================== + NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between + checkpoint and snapshot state. This ioctl is + used in chcp and mkcp utilities. + + NILFS_IOCTL_DELETE_CHECKPOINT Remove checkpoint from NILFS2 file system. + This ioctl is used in rmcp utility. + + NILFS_IOCTL_GET_CPINFO Return info about requested checkpoints.
This + ioctl is used in lscp utility and by + nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_CPSTAT Return checkpoints statistics. This ioctl is + used by lscp, rmcp utilities and by + nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_SUINFO Return segment usage info about requested + segments. This ioctl is used in lssu, + nilfs_resize utilities and by nilfs_cleanerd + daemon. + + NILFS_IOCTL_SET_SUINFO Modify segment usage info of requested + segments. This ioctl is used by + nilfs_cleanerd daemon to skip unnecessary + cleaning operation of segments and reduce + performance penalty or wear of flash device + due to redundant move of in-use blocks. + + NILFS_IOCTL_GET_SUSTAT Return segment usage statistics. This ioctl + is used in lssu, nilfs_resize utilities and + by nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_VINFO Return information on virtual block addresses. + This ioctl is used by nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_BDESCS Return information about descriptors of disk + block numbers. This ioctl is used by + nilfs_cleanerd daemon. + + NILFS_IOCTL_CLEAN_SEGMENTS Do garbage collection operation in the + environment of requested parameters from + userspace. This ioctl is used by + nilfs_cleanerd daemon. + + NILFS_IOCTL_SYNC Make a checkpoint. This ioctl is used in + mkcp utility. + + NILFS_IOCTL_RESIZE Resize NILFS2 volume. This ioctl is used + by nilfs_resize utility. + + NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and + upper limit of segments in bytes. This ioctl + is used by nilfs_resize utility. + ============================== =============================================== + +NILFS2 usage +============ + +To use nilfs2 as a local file system, simply:: + + # mkfs -t nilfs2 /dev/block_device + # mount -t nilfs2 /dev/block_device /dir + +This will also invoke the cleaner through the mount helper program +(mount.nilfs2). + +Checkpoints and snapshots are managed by the following commands. 
+Their manpages are included in the nilfs-utils package above. + + ==== =========================================================== + lscp list checkpoints or snapshots. + mkcp make a checkpoint or a snapshot. + chcp change an existing checkpoint to a snapshot or vice versa. + rmcp invalidate specified checkpoint(s). + ==== =========================================================== + +To mount a snapshot:: + + # mount -t nilfs2 -r -o cp= /dev/block_device /snap_dir + +where is the checkpoint number of the snapshot. + +To unmount the NILFS2 mount point or snapshot, simply:: + + # umount /dir + +Then, the cleaner daemon is automatically shut down by the umount +helper program (umount.nilfs2). + +Disk format +=========== + +A nilfs2 volume is equally divided into a number of segments except +for the super block (SB) and segment #0. A segment is the container +of logs. Each log is composed of summary information blocks, payload +blocks, and an optional super root block (SR):: + + ______________________________________________________ + | |SB| | Segment | Segment | Segment | ... | Segment | | + |_|__|_|____0____|____1____|____2____|_____|____N____|_| + 0 +1K +4K +8M +16M +24M +(8MB x N) + . . (Typical offsets for 4KB-block) + . . + .______________________. + | log | log |... | log | + |__1__|__2__|____|__m__| + . . + . . + . . + .______________________________. + | Summary | Payload blocks |SR| + |_blocks__|_________________|__| + +The payload blocks are organized per file, and each file consists of +data blocks and B-tree node blocks:: + + |<--- File-A --->|<--- File-B --->| + _______________________________________________________________ + | Data blocks | B-tree blocks | Data blocks | B-tree blocks | ... + _|_____________|_______________|_____________|_______________|_ + + +Since only the modified blocks are written in the log, it may have +files without data blocks or B-tree node blocks. 
+ +The organization of the blocks is recorded in the summary information +blocks, which contain a header structure (nilfs_segment_summary), per +file structures (nilfs_finfo), and per block structures (nilfs_binfo):: + + _________________________________________________________________________ + | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |... + |_blocks__|___A___|_(A,1)_|_____|(A,Na)_|___B___|_(B,1)_|_____|(B,Nb)_|___ + + +The logs include regular files, directory files, symbolic link files +and several meta data files. The meta data files are the files used +to maintain file system meta data. The current version of NILFS2 uses +the following meta data files:: + + 1) Inode file (ifile) -- Stores on-disk inodes + 2) Checkpoint file (cpfile) -- Stores checkpoints + 3) Segment usage file (sufile) -- Stores allocation state of segments + 4) Data address translation file -- Maps virtual block numbers to usual + (DAT) block numbers. This file serves to + make on-disk blocks relocatable. + +The following figure shows a typical organization of the logs:: + + _________________________________________________________________________ + | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR| + |_blocks__|_or_directory_|_______|_____|_______|________|________|_____|__| + + +To stride over segment boundaries, this sequence of files may be split +into multiple logs. The sequence of logs that should be treated as +logically one log, is delimited with flags marked in the segment +summary. The recovery code of nilfs2 looks at this boundary information +to ensure atomicity of updates. + +The super root block is inserted for every checkpoint. It includes +three special inodes, inodes for the DAT, cpfile, and sufile. Inodes +of regular files, directories, symlinks and other special files, are +included in the ifile. The inode of ifile itself is included in the +corresponding checkpoint entry in the cpfile.
Thus, the hierarchy +among NILFS2 files can be depicted as follows:: + + Super block (SB) + | + v + Super root block (the latest cno=xx) + |-- DAT + |-- sufile + `-- cpfile + |-- ifile (cno=c1) + |-- ifile (cno=c2) ---- file (ino=i1) + : : |-- file (ino=i2) + `-- ifile (cno=xx) |-- file (ino=i3) + : : + `-- file (ino=yy) + ( regular file, directory, or symlink ) + +For detail on the format of each file, please see nilfs2_ondisk.h +located at include/uapi/linux directory. + +There are no patents or other intellectual property that we protect +with regard to the design of NILFS2. It is allowed to replicate the +design in hopes that other operating systems could share (mount, read, +write, etc.) data stored in this format. diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt deleted file mode 100644 index f2f3f8592a6f..000000000000 --- a/Documentation/filesystems/nilfs2.txt +++ /dev/null @@ -1,276 +0,0 @@ -NILFS2 ------- - -NILFS2 is a log-structured file system (LFS) supporting continuous -snapshotting. In addition to versioning capability of the entire file -system, users can even restore files mistakenly overwritten or -destroyed just a few seconds ago. Since NILFS2 can keep consistency -like conventional LFS, it achieves quick recovery after system -crashes. - -NILFS2 creates a number of checkpoints every few seconds or per -synchronous write basis (unless there is no change). Users can select -significant versions among continuously created checkpoints, and can -change them into snapshots which will be preserved until they are -changed back to checkpoints. - -There is no limit on the number of snapshots until the volume gets -full. Each snapshot is mountable as a read-only file system -concurrently with its writable mount, and this feature is convenient -for online backup. - -The userland tools are included in nilfs-utils package, which is -available from the following download page. 
At least "mkfs.nilfs2", -"mount.nilfs2", "umount.nilfs2", and "nilfs_cleanerd" (so called -cleaner or garbage collector) are required. Details on the tools are -described in the man pages included in the package. - -Project web page: https://nilfs.sourceforge.io/ -Download page: https://nilfs.sourceforge.io/en/download.html -List info: http://vger.kernel.org/vger-lists.html#linux-nilfs - -Caveats -======= - -Features which NILFS2 does not support yet: - - - atime - - extended attributes - - POSIX ACLs - - quotas - - fsck - - defragmentation - -Mount options -============= - -NILFS2 supports the following mount options: -(*) == default - -barrier(*) This enables/disables the use of write barriers. This -nobarrier requires an IO stack which can support barriers, and - if nilfs gets an error on a barrier write, it will - disable again with a warning. -errors=continue Keep going on a filesystem error. -errors=remount-ro(*) Remount the filesystem read-only on an error. -errors=panic Panic and halt the machine if an error occurs. -cp=n Specify the checkpoint-number of the snapshot to be - mounted. Checkpoints and snapshots are listed by lscp - user command. Only the checkpoints marked as snapshot - are mountable with this option. Snapshot is read-only, - so a read-only mount option must be specified together. -order=relaxed(*) Apply relaxed order semantics that allows modified data - blocks to be written to disk without making a - checkpoint if no metadata update is going. This mode - is equivalent to the ordered data mode of the ext3 - filesystem except for the updates on data blocks still - conserve atomicity. This will improve synchronous - write performance for overwriting. -order=strict Apply strict in-order semantics that preserves sequence - of all file operations including overwriting of data - blocks. That means, it is guaranteed that no - overtaking of events occurs in the recovered file - system after a crash. 
-norecovery Disable recovery of the filesystem on mount. - This disables every write access on the device for - read-only mounts or snapshots. This option will fail - for r/w mounts on an unclean volume. -discard This enables/disables the use of discard/TRIM commands. -nodiscard(*) The discard/TRIM commands are sent to the underlying - block device when blocks are freed. This is useful - for SSD devices and sparse/thinly-provisioned LUNs. - -Ioctls -====== - -There is some NILFS2 specific functionality which can be accessed by applications -through the system call interfaces. The list of all NILFS2 specific ioctls are -shown in the table below. - -Table of NILFS2 specific ioctls -.............................................................................. - Ioctl Description - NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between - checkpoint and snapshot state. This ioctl is - used in chcp and mkcp utilities. - - NILFS_IOCTL_DELETE_CHECKPOINT Remove checkpoint from NILFS2 file system. - This ioctl is used in rmcp utility. - - NILFS_IOCTL_GET_CPINFO Return info about requested checkpoints. This - ioctl is used in lscp utility and by - nilfs_cleanerd daemon. - - NILFS_IOCTL_GET_CPSTAT Return checkpoints statistics. This ioctl is - used by lscp, rmcp utilities and by - nilfs_cleanerd daemon. - - NILFS_IOCTL_GET_SUINFO Return segment usage info about requested - segments. This ioctl is used in lssu, - nilfs_resize utilities and by nilfs_cleanerd - daemon. - - NILFS_IOCTL_SET_SUINFO Modify segment usage info of requested - segments. This ioctl is used by - nilfs_cleanerd daemon to skip unnecessary - cleaning operation of segments and reduce - performance penalty or wear of flash device - due to redundant move of in-use blocks. - - NILFS_IOCTL_GET_SUSTAT Return segment usage statistics. This ioctl - is used in lssu, nilfs_resize utilities and - by nilfs_cleanerd daemon. - - NILFS_IOCTL_GET_VINFO Return information on virtual block addresses. 
- This ioctl is used by nilfs_cleanerd daemon. - - NILFS_IOCTL_GET_BDESCS Return information about descriptors of disk - block numbers. This ioctl is used by - nilfs_cleanerd daemon. - - NILFS_IOCTL_CLEAN_SEGMENTS Do garbage collection operation in the - environment of requested parameters from - userspace. This ioctl is used by - nilfs_cleanerd daemon. - - NILFS_IOCTL_SYNC Make a checkpoint. This ioctl is used in - mkcp utility. - - NILFS_IOCTL_RESIZE Resize NILFS2 volume. This ioctl is used - by nilfs_resize utility. - - NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and - upper limit of segments in bytes. This ioctl - is used by nilfs_resize utility. - -NILFS2 usage -============ - -To use nilfs2 as a local file system, simply: - - # mkfs -t nilfs2 /dev/block_device - # mount -t nilfs2 /dev/block_device /dir - -This will also invoke the cleaner through the mount helper program -(mount.nilfs2). - -Checkpoints and snapshots are managed by the following commands. -Their manpages are included in the nilfs-utils package above. - - lscp list checkpoints or snapshots. - mkcp make a checkpoint or a snapshot. - chcp change an existing checkpoint to a snapshot or vice versa. - rmcp invalidate specified checkpoint(s). - -To mount a snapshot, - - # mount -t nilfs2 -r -o cp= /dev/block_device /snap_dir - -where is the checkpoint number of the snapshot. - -To unmount the NILFS2 mount point or snapshot, simply: - - # umount /dir - -Then, the cleaner daemon is automatically shut down by the umount -helper program (umount.nilfs2). - -Disk format -=========== - -A nilfs2 volume is equally divided into a number of segments except -for the super block (SB) and segment #0. A segment is the container -of logs. Each log is composed of summary information blocks, payload -blocks, and an optional super root block (SR): - - ______________________________________________________ - | |SB| | Segment | Segment | Segment | ... 
| Segment | | - |_|__|_|____0____|____1____|____2____|_____|____N____|_| - 0 +1K +4K +8M +16M +24M +(8MB x N) - . . (Typical offsets for 4KB-block) - . . - .______________________. - | log | log |... | log | - |__1__|__2__|____|__m__| - . . - . . - . . - .______________________________. - | Summary | Payload blocks |SR| - |_blocks__|_________________|__| - -The payload blocks are organized per file, and each file consists of -data blocks and B-tree node blocks: - - |<--- File-A --->|<--- File-B --->| - _______________________________________________________________ - | Data blocks | B-tree blocks | Data blocks | B-tree blocks | ... - _|_____________|_______________|_____________|_______________|_ - - -Since only the modified blocks are written in the log, it may have -files without data blocks or B-tree node blocks. - -The organization of the blocks is recorded in the summary information -blocks, which contains a header structure (nilfs_segment_summary), per -file structures (nilfs_finfo), and per block structures (nilfs_binfo): - - _________________________________________________________________________ - | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |... - |_blocks__|___A___|_(A,1)_|_____|(A,Na)_|___B___|_(B,1)_|_____|(B,Nb)_|___ - - -The logs include regular files, directory files, symbolic link files -and several meta data files. The mata data files are the files used -to maintain file system meta data. The current version of NILFS2 uses -the following meta data files: - - 1) Inode file (ifile) -- Stores on-disk inodes - 2) Checkpoint file (cpfile) -- Stores checkpoints - 3) Segment usage file (sufile) -- Stores allocation state of segments - 4) Data address translation file -- Maps virtual block numbers to usual - (DAT) block numbers. This file serves to - make on-disk blocks relocatable. 
- -The following figure shows a typical organization of the logs: - - _________________________________________________________________________ - | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR| - |_blocks__|_or_directory_|_______|_____|_______|________|________|_____|__| - - -To stride over segment boundaries, this sequence of files may be split -into multiple logs. The sequence of logs that should be treated as -logically one log, is delimited with flags marked in the segment -summary. The recovery code of nilfs2 looks this boundary information -to ensure atomicity of updates. - -The super root block is inserted for every checkpoints. It includes -three special inodes, inodes for the DAT, cpfile, and sufile. Inodes -of regular files, directories, symlinks and other special files, are -included in the ifile. The inode of ifile itself is included in the -corresponding checkpoint entry in the cpfile. Thus, the hierarchy -among NILFS2 files can be depicted as follows: - - Super block (SB) - | - v - Super root block (the latest cno=xx) - |-- DAT - |-- sufile - `-- cpfile - |-- ifile (cno=c1) - |-- ifile (cno=c2) ---- file (ino=i1) - : : |-- file (ino=i2) - `-- ifile (cno=xx) |-- file (ino=i3) - : : - `-- file (ino=yy) - ( regular file, directory, or symlink ) - -For detail on the format of each file, please see nilfs2_ondisk.h -located at include/uapi/linux directory. - -There are no patents or other intellectual property that we protect -with regard to the design of NILFS2. It is allowed to replicate the -design in hopes that other operating systems could share (mount, read, -write, etc.) data stored in this format. 
-- cgit From 461f2c8f13fcc0d349e4acac46aacf63dbeb34ca Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:13 +0100 Subject: docs: filesystems: convert ntfs.txt to ReST - Add a SPDX header; - Adjust document title; - Comment out text-only ToC; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/f09ca6c9bdd4e7aa7208f3dba0b8753080b38d03.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 3 +- Documentation/filesystems/ntfs.rst | 466 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ntfs.txt | 451 ---------------------------------- 3 files changed, 468 insertions(+), 452 deletions(-) create mode 100644 Documentation/filesystems/ntfs.rst delete mode 100644 Documentation/filesystems/ntfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 01587704fcc9..62be53c4755d 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -74,7 +74,8 @@ Documentation for filesystem implementations. inotify isofs nilfs2 + nfs/index + ntfs overlayfs virtiofs vfat - nfs/index diff --git a/Documentation/filesystems/ntfs.rst b/Documentation/filesystems/ntfs.rst new file mode 100644 index 000000000000..5bb093a26485 --- /dev/null +++ b/Documentation/filesystems/ntfs.rst @@ -0,0 +1,466 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +The Linux NTFS filesystem driver +================================ + + +.. 
Table of contents + + - Overview + - Web site + - Features + - Supported mount options + - Known bugs and (mis-)features + - Using NTFS volume and stripe sets + - The Device-Mapper driver + - The Software RAID / MD driver + - Limitations when using the MD driver + + +Overview +======== + +Linux-NTFS comes with a number of user-space programs known as ntfsprogs. +These include mkntfs, a full-featured ntfs filesystem format utility, +ntfsundelete used for recovering files that were unintentionally deleted +from an NTFS volume and ntfsresize which is used to resize an NTFS partition. +See the web site for more information. + +To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file +system type 'ntfs'. The driver currently supports read-only mode (with no +fault-tolerance, encryption or journalling) and very limited, but safe, write +support. + +For fault tolerance and raid support (i.e. volume and stripe sets), you can +use the kernel's Software RAID / MD driver. See section "Using Software RAID +with NTFS" for details. + + +Web site +======== + +There is plenty of additional information on the linux-ntfs web site +at http://www.linux-ntfs.org/ + +The web site has a lot of additional information, such as a comprehensive +FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS +userspace utilities, etc. + + +Features +======== + +- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and + earlier kernels. This new driver implements NTFS read support and is + functionally equivalent to the old ntfs driver and it also implements limited + write support. The biggest limitation at present is that files/directories + cannot be created or deleted. See below for the list of write features that + are so far supported. Another limitation is that writing to compressed files + is not implemented at all. Also, neither read nor write access to encrypted + files is so far implemented. 
+- The new driver has full support for sparse files on NTFS 3.x volumes which + the old driver isn't happy with. +- The new driver supports execution of binaries due to mmap() now being + supported. +- The new driver supports loopback mounting of files on NTFS which is used by + some Linux distributions to enable the user to run Linux from an NTFS + partition by creating a large file while in Windows and then loopback + mounting the file while in Linux and creating a Linux filesystem on it that + is used to install Linux on it. +- A comparison of the two drivers using:: + + time find . -type f -exec md5sum "{}" \; + + run three times in sequence with each driver (after a reboot) on a 1.4GiB + NTFS partition, showed the new driver to be 20% faster in total time elapsed + (from 9:43 minutes on average down to 7:53). The time spent in user space + was unchanged but the time spent in the kernel was decreased by a factor of + 2.5 (from 85 CPU seconds down to 33). +- The driver does not support short file names in general. For backwards + compatibility, we implement access to files using their short file names if + they exist. The driver will not create short file names however, and a + rename will discard any existing short file name. +- The new driver supports exporting of mounted NTFS volumes via NFS. +- The new driver supports async io (aio). +- The new driver supports fsync(2), fdatasync(2), and msync(2). +- The new driver supports readv(2) and writev(2). +- The new driver supports access time updates (including mtime and ctime). +- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present + only very limited support for highly fragmented files, i.e. ones which have + their data attribute split across multiple extents, is included. Another + limitation is that at present truncate(2) will never create sparse files, + since to mark a file sparse we need to modify the directory entry for the + file and we do not implement directory modifications yet. 
+- The new driver supports write(2) which can both overwrite existing data and + extend the file size so that you can write beyond the existing data. Also, + writing into sparse regions is supported and the holes are filled in with + clusters. But at present only limited support for highly fragmented files, + i.e. ones which have their data attribute split across multiple extents, is + included. Another limitation is that write(2) will never create sparse + files, since to mark a file sparse we need to modify the directory entry for + the file and we do not implement directory modifications yet. + +Supported mount options +======================= + +In addition to the generic mount options described by the manual page for the +mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the +following mount options: + +======================= ======================================================= +iocharset=name Deprecated option. Still supported but please use + nls=name in the future. See description for nls=name. + +nls=name Character set to use when returning file names. + Unlike VFAT, NTFS suppresses names that contain + unconvertible characters. Note that most character + sets contain insufficient characters to represent all + possible Unicode characters that can exist on NTFS. + To be sure you are not missing any files, you are + advised to use nls=utf8 which is capable of + representing all Unicode characters. + +utf8= Option no longer supported. Currently mapped to + nls=utf8 but please use nls=utf8 in the future and + make sure utf8 is compiled either as module or into + the kernel. See description for nls=name. + +uid= +gid= +umask= Provide default owner, group, and access mode mask. + These options work as documented in mount(8). By + default, the files/directories are owned by root and + he/she has read and write permissions, as well as + browse permission for directories. No one else has any + access permissions. I.e. 
the mode on all files is by + default rw------- and for directories rwx------, a + consequence of the default fmask=0177 and dmask=0077. + Using a umask of zero will grant all permissions to + everyone, i.e. all files and directories will have mode + rwxrwxrwx. + +fmask= +dmask= Instead of specifying umask which applies both to + files and directories, fmask applies only to files and + dmask only to directories. + +sloppy= If sloppy is specified, ignore unknown mount options. + Otherwise the default behaviour is to abort mount if + any unknown options are found. + +show_sys_files= If show_sys_files is specified, show the system files + in directory listings. Otherwise the default behaviour + is to hide the system files. + Note that even when show_sys_files is specified, "$MFT" + will not be visible due to bugs/mis-features in glibc. + Further, note that irrespective of show_sys_files, all + files are accessible by name, i.e. you can always do + "ls -l \$UpCase" for example to specifically show the + system file containing the Unicode upcase table. + +case_sensitive= If case_sensitive is specified, treat all file names as + case sensitive and create file names in the POSIX + namespace. Otherwise the default behaviour is to treat + file names as case insensitive and to create file names + in the WIN32/LONG name space. Note, the Linux NTFS + driver will never create short file names and will + remove them on rename/delete of the corresponding long + file name. + Note that files remain accessible via their short file + name, if it exists. If case_sensitive, you will need + to provide the correct case of the short file name. + +disable_sparse= If disable_sparse is specified, creation of sparse + regions, i.e. holes, inside files is disabled for the + volume (for the duration of this mount only). By + default, creation of sparse regions is enabled, which + is consistent with the behaviour of traditional Unix + filesystems. 
+ +errors=opt What to do when critical filesystem errors are found. + Following values can be used for "opt": + + ======== ========================================= + continue DEFAULT, try to clean-up as much as + possible, e.g. marking a corrupt inode as + bad so it is no longer accessed, and then + continue. + recover At present only supported is recovery of + the boot sector from the backup copy. + If read-only mount, the recovery is done + in memory only and not written to disk. + ======== ========================================= + + Note that the options are additive, i.e. specifying:: + + errors=continue,errors=recover + + means the driver will attempt to recover and if that + fails it will clean-up as much as possible and + continue. + +mft_zone_multiplier= Set the MFT zone multiplier for the volume (this + setting is not persistent across mounts and can be + changed from mount to mount but cannot be changed on + remount). Values of 1 to 4 are allowed, 1 being the + default. The MFT zone multiplier determines how much + space is reserved for the MFT on the volume. If all + other space is used up, then the MFT zone will be + shrunk dynamically, so this has no impact on the + amount of free space. However, it can have an impact + on performance by affecting fragmentation of the MFT. + In general use the default. If you have a lot of small + files then use a higher value. The values have the + following meaning: + + ===== ================================= + Value MFT zone size (% of volume size) + ===== ================================= + 1 12.5% + 2 25% + 3 37.5% + 4 50% + ===== ================================= + + Note this option is irrelevant for read-only mounts. +======================= ======================================================= + + +Known bugs and (mis-)features +============================= + +- The link count on each directory inode entry is set to 1, due to Linux not + supporting directory hard links. 
This may well confuse some user space + applications, since the directory names will have the same inode numbers. + This also speeds up ntfs_read_inode() immensely. And we haven't found any + problems with this approach so far. If you find a problem with this, please + let us know. + + +Please send bug reports/comments/feedback/abuse to the Linux-NTFS development +list at sourceforge: linux-ntfs-dev@lists.sourceforge.net + + +Using NTFS volume and stripe sets +================================= + +For support of volume and stripe sets, you can either use the kernel's +Device-Mapper driver or the kernel's Software RAID / MD driver. The former is +the recommended one to use for linear raid. But the latter is required for +raid level 5. For striping and mirroring, either driver should work fine. + + +The Device-Mapper driver +------------------------ + +You will need to create a table of the components of the volume/stripe set and +how they fit together and load this into the kernel using the dmsetup utility +(see man 8 dmsetup). + +Linear volume sets, i.e. linear raid, has been tested and works fine. Even +though untested, there is no reason why stripe sets, i.e. raid level 0, and +mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e. +raid level 5, unfortunately cannot work yet because the current version of the +Device-Mapper driver does not support raid level 5. You may be able to use the +Software RAID / MD driver for raid level 5, see the next section for details. + +To create the table describing your volume you will need to know each of its +components and their sizes in sectors, i.e. multiples of 512-byte blocks. + +For NT4 fault tolerant volumes you can obtain the sizes using fdisk. 
So for +example if one of your partitions is /dev/hda2 you would do:: + + $ fdisk -ul /dev/hda + + Disk /dev/hda: 81.9 GB, 81964302336 bytes + 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors + Units = sectors of 1 * 512 = 512 bytes + + Device Boot Start End Blocks Id System + /dev/hda1 * 63 4209029 2104483+ 83 Linux + /dev/hda2 4209030 37768814 16779892+ 86 NTFS + /dev/hda3 37768815 46170809 4200997+ 83 Linux + +And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = +33559785 sectors. + +For Win2k and later dynamic disks, you can for example use the ldminfo utility +which is part of the Linux LDM tools (the latest version at the time of +writing is linux-ldm-0.0.8.tar.bz2). You can download it from: + + http://www.linux-ntfs.org/ + +Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go +into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You +will find the precompiled (i386) ldminfo utility there. NOTE: You will not be +able to compile this yourself easily so use the binary version! + +Then you would use ldminfo in dump mode to obtain the necessary information:: + + $ ./ldminfo --dump /dev/hda + +This would dump the LDM database found on /dev/hda which describes all of your +dynamic disks and all the volumes on them. At the bottom you will see the +VOLUME DEFINITIONS section which is all you really need. You may need to look +further above to determine which of the disks in the volume definitions is +which device in Linux. Hint: Run ldminfo on each of your dynamic disks and +look at the Disk Id close to the top of the output for each (the PRIVATE HEADER +section). You can then find these Disk Ids in the VBLK DATABASE section in the + components where you will get the LDM Name for the disk that is found in +the VOLUME DEFINITIONS section. + +Note you will also need to enable the LDM driver in the Linux kernel. 
If your +distribution did not enable it, you will need to recompile the kernel with it +enabled. This will create the LDM partitions on each device at boot time. You +would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc) +in the Device-Mapper table. + +You can also bypass using the LDM driver by using the main device (e.g. +/dev/hda) and then using the offsets of the LDM partitions into this device as +the "Start sector of device" when creating the table. Once again ldminfo would +give you the correct information to do this. + +Assuming you know all your devices and their sizes things are easy. + +For a linear raid the table would look like this (note all values are in +512-byte sectors):: + + # Offset into Size of this Raid type Device Start sector + # volume device of device + 0 1028161 linear /dev/hda1 0 + 1028161 3903762 linear /dev/hdb2 0 + 4931923 2103211 linear /dev/hdc1 0 + +For a striped volume, i.e. raid level 0, you will need to know the chunk size +you used when creating the volume. Windows uses 64kiB as the default, so it +will probably be this unless you changed the defaults when creating the array. + +For a raid level 0 the table would look like this (note all values are in +512-byte sectors):: + + # Offset Size Raid Number Chunk 1st Start 2nd Start + # into of the type of size Device in Device in + # volume volume stripes device device + 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 + +If there are more than two devices, just add each of them to the end of the +line. + +Finally, for a mirrored volume, i.e. raid level 1, the table would look like +this (note all values are in 512-byte sectors):: + + # Ofs Size Raid Log Number Region Should Number Source Start Target Start + # in of the type type of log size sync?
of Device in Device in + # vol volume params mirrors Device Device + 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 + +If you are mirroring to multiple devices you can specify further targets at the +end of the line. + +Note the "Should sync?" parameter "nosync" means that the two mirrors are +already in sync which will be the case on a clean shutdown of Windows. If the +mirrors are not clean, you can specify the "sync" option instead of "nosync" +and the Device-Mapper driver will then copy the entirety of the "Source Device" +to the "Target Device" or if you specified multiple target devices to all of +them. + +Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), +and hand it over to dmsetup to work with, like so:: + + $ dmsetup create myvolume1 /etc/ntfsvolume1 + +You can obviously replace "myvolume1" with whatever name you like. + +If it all worked, you will now have the device /dev/device-mapper/myvolume1 +which you can then just use as an argument to the mount command as usual to +mount the ntfs volume. For example:: + + $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 + +(You need to create the directory /mnt/myvol1 first and of course you can use +anything you like instead of /mnt/myvol1 as long as it is an existing +directory.) + +It is advisable to do the mount read-only to see if the volume has been setup +correctly to avoid the possibility of causing damage to the data on the ntfs +volume. + + +The Software RAID / MD driver +----------------------------- + +An alternative to using the Device-Mapper driver is to use the kernel's +Software RAID / MD driver. For which you need to set up your /etc/raidtab +appropriately (see man 5 raidtab). + +Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level +0, have been tested and work fine (though see section "Limitations when using +the MD driver with NTFS volumes" especially if you want to use linear raid). 
+Even though untested, there is no reason why mirrors, i.e. raid level 1, and +stripes with parity, i.e. raid level 5, should not work, too. + +You have to use the "persistent-superblock 0" option for each raid-disk in the +NTFS volume/stripe you are configuring in /etc/raidtab as the persistent +superblock used by the MD driver would damage the NTFS volume. + +Windows by default uses a stripe chunk size of 64k, so you probably want the +"chunk-size 64k" option for each raid-disk, too. + +For example, if you have a stripe set consisting of two partitions /dev/hda5 +and /dev/hdb1 your /etc/raidtab would look like this:: + + raiddev /dev/md0 + raid-level 0 + nr-raid-disks 2 + nr-spare-disks 0 + persistent-superblock 0 + chunk-size 64k + device /dev/hda5 + raid-disk 0 + device /dev/hdb1 + raid-disk 1 + +For linear raid, just change the raid-level above to "raid-level linear", for +mirrors, change it to "raid-level 1", and for stripe sets with parity, change +it to "raid-level 5". + +Note for stripe sets with parity you will also need to tell the MD driver +which parity algorithm to use by specifying the option "parity-algorithm +which", where you need to replace "which" with the name of the algorithm to +use (see man 5 raidtab for available algorithms) and you will have to try the +different available algorithms until you find one that works. Make sure you +are working read-only when playing with this as you may damage your data +otherwise. If you find which algorithm works please let us know (email the +linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on +IRC in channel #ntfs on the irc.freenode.net network) so we can update this +documentation. + +Once the raidtab is setup, run for example raid0run -a to start all devices or +raid0run /dev/md0 to start a particular md device, in this case /dev/md0. 
+ +Then just use the mount command as usual to mount the ntfs volume using for +example:: + + mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume + +It is advisable to do the mount read-only to see if the md volume has been +setup correctly to avoid the possibility of causing damage to the data on the +ntfs volume. + + +Limitations when using the Software RAID / MD driver +----------------------------------------------------- + +Using the md driver will not work properly if any of your NTFS partitions have +an odd number of sectors. This is especially important for linear raid as all +data after the first partition with an odd number of sectors will be offset by +one or more sectors so if you mount such a partition with write support you +will cause massive damage to the data on the volume which will only become +apparent when you try to use the volume again under Windows. + +So when using linear raid, make sure that all your partitions have an even +number of sectors BEFORE attempting to use it. You have been warned! + +Even better is to simply use the Device-Mapper for linear raid and then you do +not have this problem with odd numbers of sectors. diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt deleted file mode 100644 index 553f10d03076..000000000000 --- a/Documentation/filesystems/ntfs.txt +++ /dev/null @@ -1,451 +0,0 @@ -The Linux NTFS filesystem driver -================================ - - -Table of contents -================= - -- Overview -- Web site -- Features -- Supported mount options -- Known bugs and (mis-)features -- Using NTFS volume and stripe sets - - The Device-Mapper driver - - The Software RAID / MD driver - - Limitations when using the MD driver - - -Overview -======== - -Linux-NTFS comes with a number of user-space programs known as ntfsprogs. 
-These include mkntfs, a full-featured ntfs filesystem format utility, -ntfsundelete used for recovering files that were unintentionally deleted -from an NTFS volume and ntfsresize which is used to resize an NTFS partition. -See the web site for more information. - -To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file -system type 'ntfs'. The driver currently supports read-only mode (with no -fault-tolerance, encryption or journalling) and very limited, but safe, write -support. - -For fault tolerance and raid support (i.e. volume and stripe sets), you can -use the kernel's Software RAID / MD driver. See section "Using Software RAID -with NTFS" for details. - - -Web site -======== - -There is plenty of additional information on the linux-ntfs web site -at http://www.linux-ntfs.org/ - -The web site has a lot of additional information, such as a comprehensive -FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS -userspace utilities, etc. - - -Features -======== - -- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and - earlier kernels. This new driver implements NTFS read support and is - functionally equivalent to the old ntfs driver and it also implements limited - write support. The biggest limitation at present is that files/directories - cannot be created or deleted. See below for the list of write features that - are so far supported. Another limitation is that writing to compressed files - is not implemented at all. Also, neither read nor write access to encrypted - files is so far implemented. -- The new driver has full support for sparse files on NTFS 3.x volumes which - the old driver isn't happy with. -- The new driver supports execution of binaries due to mmap() now being - supported. 
-- The new driver supports loopback mounting of files on NTFS which is used by - some Linux distributions to enable the user to run Linux from an NTFS - partition by creating a large file while in Windows and then loopback - mounting the file while in Linux and creating a Linux filesystem on it that - is used to install Linux on it. -- A comparison of the two drivers using: - time find . -type f -exec md5sum "{}" \; - run three times in sequence with each driver (after a reboot) on a 1.4GiB - NTFS partition, showed the new driver to be 20% faster in total time elapsed - (from 9:43 minutes on average down to 7:53). The time spent in user space - was unchanged but the time spent in the kernel was decreased by a factor of - 2.5 (from 85 CPU seconds down to 33). -- The driver does not support short file names in general. For backwards - compatibility, we implement access to files using their short file names if - they exist. The driver will not create short file names however, and a - rename will discard any existing short file name. -- The new driver supports exporting of mounted NTFS volumes via NFS. -- The new driver supports async io (aio). -- The new driver supports fsync(2), fdatasync(2), and msync(2). -- The new driver supports readv(2) and writev(2). -- The new driver supports access time updates (including mtime and ctime). -- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present - only very limited support for highly fragmented files, i.e. ones which have - their data attribute split across multiple extents, is included. Another - limitation is that at present truncate(2) will never create sparse files, - since to mark a file sparse we need to modify the directory entry for the - file and we do not implement directory modifications yet. -- The new driver supports write(2) which can both overwrite existing data and - extend the file size so that you can write beyond the existing data. 
Also, - writing into sparse regions is supported and the holes are filled in with - clusters. But at present only limited support for highly fragmented files, - i.e. ones which have their data attribute split across multiple extents, is - included. Another limitation is that write(2) will never create sparse - files, since to mark a file sparse we need to modify the directory entry for - the file and we do not implement directory modifications yet. - -Supported mount options -======================= - -In addition to the generic mount options described by the manual page for the -mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the -following mount options: - -iocharset=name Deprecated option. Still supported but please use - nls=name in the future. See description for nls=name. - -nls=name Character set to use when returning file names. - Unlike VFAT, NTFS suppresses names that contain - unconvertible characters. Note that most character - sets contain insufficient characters to represent all - possible Unicode characters that can exist on NTFS. - To be sure you are not missing any files, you are - advised to use nls=utf8 which is capable of - representing all Unicode characters. - -utf8= Option no longer supported. Currently mapped to - nls=utf8 but please use nls=utf8 in the future and - make sure utf8 is compiled either as module or into - the kernel. See description for nls=name. - -uid= -gid= -umask= Provide default owner, group, and access mode mask. - These options work as documented in mount(8). By - default, the files/directories are owned by root and - he/she has read and write permissions, as well as - browse permission for directories. No one else has any - access permissions. I.e. the mode on all files is by - default rw------- and for directories rwx------, a - consequence of the default fmask=0177 and dmask=0077. - Using a umask of zero will grant all permissions to - everyone, i.e. 
all files and directories will have mode - rwxrwxrwx. - -fmask= -dmask= Instead of specifying umask which applies both to - files and directories, fmask applies only to files and - dmask only to directories. - -sloppy= If sloppy is specified, ignore unknown mount options. - Otherwise the default behaviour is to abort mount if - any unknown options are found. - -show_sys_files= If show_sys_files is specified, show the system files - in directory listings. Otherwise the default behaviour - is to hide the system files. - Note that even when show_sys_files is specified, "$MFT" - will not be visible due to bugs/mis-features in glibc. - Further, note that irrespective of show_sys_files, all - files are accessible by name, i.e. you can always do - "ls -l \$UpCase" for example to specifically show the - system file containing the Unicode upcase table. - -case_sensitive= If case_sensitive is specified, treat all file names as - case sensitive and create file names in the POSIX - namespace. Otherwise the default behaviour is to treat - file names as case insensitive and to create file names - in the WIN32/LONG name space. Note, the Linux NTFS - driver will never create short file names and will - remove them on rename/delete of the corresponding long - file name. - Note that files remain accessible via their short file - name, if it exists. If case_sensitive, you will need - to provide the correct case of the short file name. - -disable_sparse= If disable_sparse is specified, creation of sparse - regions, i.e. holes, inside files is disabled for the - volume (for the duration of this mount only). By - default, creation of sparse regions is enabled, which - is consistent with the behaviour of traditional Unix - filesystems. - -errors=opt What to do when critical filesystem errors are found. - Following values can be used for "opt": - continue: DEFAULT, try to clean-up as much as - possible, e.g. 
marking a corrupt inode as - bad so it is no longer accessed, and then - continue. - recover: At present only supported is recovery of - the boot sector from the backup copy. - If read-only mount, the recovery is done - in memory only and not written to disk. - Note that the options are additive, i.e. specifying: - errors=continue,errors=recover - means the driver will attempt to recover and if that - fails it will clean-up as much as possible and - continue. - -mft_zone_multiplier= Set the MFT zone multiplier for the volume (this - setting is not persistent across mounts and can be - changed from mount to mount but cannot be changed on - remount). Values of 1 to 4 are allowed, 1 being the - default. The MFT zone multiplier determines how much - space is reserved for the MFT on the volume. If all - other space is used up, then the MFT zone will be - shrunk dynamically, so this has no impact on the - amount of free space. However, it can have an impact - on performance by affecting fragmentation of the MFT. - In general use the default. If you have a lot of small - files then use a higher value. The values have the - following meaning: - Value MFT zone size (% of volume size) - 1 12.5% - 2 25% - 3 37.5% - 4 50% - Note this option is irrelevant for read-only mounts. - - -Known bugs and (mis-)features -============================= - -- The link count on each directory inode entry is set to 1, due to Linux not - supporting directory hard links. This may well confuse some user space - applications, since the directory names will have the same inode numbers. - This also speeds up ntfs_read_inode() immensely. And we haven't found any - problems with this approach so far. If you find a problem with this, please - let us know. 
- - -Please send bug reports/comments/feedback/abuse to the Linux-NTFS development -list at sourceforge: linux-ntfs-dev@lists.sourceforge.net - - -Using NTFS volume and stripe sets -================================= - -For support of volume and stripe sets, you can either use the kernel's -Device-Mapper driver or the kernel's Software RAID / MD driver. The former is -the recommended one to use for linear raid. But the latter is required for -raid level 5. For striping and mirroring, either driver should work fine. - - -The Device-Mapper driver ------------------------- - -You will need to create a table of the components of the volume/stripe set and -how they fit together and load this into the kernel using the dmsetup utility -(see man 8 dmsetup). - -Linear volume sets, i.e. linear raid, has been tested and works fine. Even -though untested, there is no reason why stripe sets, i.e. raid level 0, and -mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e. -raid level 5, unfortunately cannot work yet because the current version of the -Device-Mapper driver does not support raid level 5. You may be able to use the -Software RAID / MD driver for raid level 5, see the next section for details. - -To create the table describing your volume you will need to know each of its -components and their sizes in sectors, i.e. multiples of 512-byte blocks. - -For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for -example if one of your partitions is /dev/hda2 you would do: - -$ fdisk -ul /dev/hda - -Disk /dev/hda: 81.9 GB, 81964302336 bytes -255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors -Units = sectors of 1 * 512 = 512 bytes - - Device Boot Start End Blocks Id System - /dev/hda1 * 63 4209029 2104483+ 83 Linux - /dev/hda2 4209030 37768814 16779892+ 86 NTFS - /dev/hda3 37768815 46170809 4200997+ 83 Linux - -And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = -33559785 sectors. 
- -For Win2k and later dynamic disks, you can for example use the ldminfo utility -which is part of the Linux LDM tools (the latest version at the time of -writing is linux-ldm-0.0.8.tar.bz2). You can download it from: - http://www.linux-ntfs.org/ -Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go -into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You -will find the precompiled (i386) ldminfo utility there. NOTE: You will not be -able to compile this yourself easily so use the binary version! - -Then you would use ldminfo in dump mode to obtain the necessary information: - -$ ./ldminfo --dump /dev/hda - -This would dump the LDM database found on /dev/hda which describes all of your -dynamic disks and all the volumes on them. At the bottom you will see the -VOLUME DEFINITIONS section which is all you really need. You may need to look -further above to determine which of the disks in the volume definitions is -which device in Linux. Hint: Run ldminfo on each of your dynamic disks and -look at the Disk Id close to the top of the output for each (the PRIVATE HEADER -section). You can then find these Disk Ids in the VBLK DATABASE section in the -<Disk> components where you will get the LDM Name for the disk that is found in -the VOLUME DEFINITIONS section. - -Note you will also need to enable the LDM driver in the Linux kernel. If your -distribution did not enable it, you will need to recompile the kernel with it -enabled. This will create the LDM partitions on each device at boot time. You -would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc) -in the Device-Mapper table. - -You can also bypass using the LDM driver by using the main device (e.g. -/dev/hda) and then using the offsets of the LDM partitions into this device as -the "Start sector of device" when creating the table. Once again ldminfo would -give you the correct information to do this.
- -Assuming you know all your devices and their sizes things are easy. - -For a linear raid the table would look like this (note all values are in -512-byte sectors): - ---- cut here --- -# Offset into Size of this Raid type Device Start sector -# volume device of device -0 1028161 linear /dev/hda1 0 -1028161 3903762 linear /dev/hdb2 0 -4931923 2103211 linear /dev/hdc1 0 ---- cut here --- - -For a striped volume, i.e. raid level 0, you will need to know the chunk size -you used when creating the volume. Windows uses 64kiB as the default, so it -will probably be this unless you changes the defaults when creating the array. - -For a raid level 0 the table would look like this (note all values are in -512-byte sectors): - ---- cut here --- -# Offset Size Raid Number Chunk 1st Start 2nd Start -# into of the type of size Device in Device in -# volume volume stripes device device -0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 ---- cut here --- - -If there are more than two devices, just add each of them to the end of the -line. - -Finally, for a mirrored volume, i.e. raid level 1, the table would look like -this (note all values are in 512-byte sectors): - ---- cut here --- -# Ofs Size Raid Log Number Region Should Number Source Start Target Start -# in of the type type of log size sync? of Device in Device in -# vol volume params mirrors Device Device -0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 ---- cut here --- - -If you are mirroring to multiple devices you can specify further targets at the -end of the line. - -Note the "Should sync?" parameter "nosync" means that the two mirrors are -already in sync which will be the case on a clean shutdown of Windows. If the -mirrors are not clean, you can specify the "sync" option instead of "nosync" -and the Device-Mapper driver will then copy the entirety of the "Source Device" -to the "Target Device" or if you specified multiple target devices to all of -them. 
- -Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), -and hand it over to dmsetup to work with, like so: - -$ dmsetup create myvolume1 /etc/ntfsvolume1 - -You can obviously replace "myvolume1" with whatever name you like. - -If it all worked, you will now have the device /dev/device-mapper/myvolume1 -which you can then just use as an argument to the mount command as usual to -mount the ntfs volume. For example: - -$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 - -(You need to create the directory /mnt/myvol1 first and of course you can use -anything you like instead of /mnt/myvol1 as long as it is an existing -directory.) - -It is advisable to do the mount read-only to see if the volume has been setup -correctly to avoid the possibility of causing damage to the data on the ntfs -volume. - - -The Software RAID / MD driver ------------------------------ - -An alternative to using the Device-Mapper driver is to use the kernel's -Software RAID / MD driver. For which you need to set up your /etc/raidtab -appropriately (see man 5 raidtab). - -Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level -0, have been tested and work fine (though see section "Limitations when using -the MD driver with NTFS volumes" especially if you want to use linear raid). -Even though untested, there is no reason why mirrors, i.e. raid level 1, and -stripes with parity, i.e. raid level 5, should not work, too. - -You have to use the "persistent-superblock 0" option for each raid-disk in the -NTFS volume/stripe you are configuring in /etc/raidtab as the persistent -superblock used by the MD driver would damage the NTFS volume. - -Windows by default uses a stripe chunk size of 64k, so you probably want the -"chunk-size 64k" option for each raid-disk, too. 
- -For example, if you have a stripe set consisting of two partitions /dev/hda5 -and /dev/hdb1 your /etc/raidtab would look like this: - -raiddev /dev/md0 - raid-level 0 - nr-raid-disks 2 - nr-spare-disks 0 - persistent-superblock 0 - chunk-size 64k - device /dev/hda5 - raid-disk 0 - device /dev/hdb1 - raid-disk 1 - -For linear raid, just change the raid-level above to "raid-level linear", for -mirrors, change it to "raid-level 1", and for stripe sets with parity, change -it to "raid-level 5". - -Note for stripe sets with parity you will also need to tell the MD driver -which parity algorithm to use by specifying the option "parity-algorithm -which", where you need to replace "which" with the name of the algorithm to -use (see man 5 raidtab for available algorithms) and you will have to try the -different available algorithms until you find one that works. Make sure you -are working read-only when playing with this as you may damage your data -otherwise. If you find which algorithm works please let us know (email the -linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on -IRC in channel #ntfs on the irc.freenode.net network) so we can update this -documentation. - -Once the raidtab is setup, run for example raid0run -a to start all devices or -raid0run /dev/md0 to start a particular md device, in this case /dev/md0. - -Then just use the mount command as usual to mount the ntfs volume using for -example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume - -It is advisable to do the mount read-only to see if the md volume has been -setup correctly to avoid the possibility of causing damage to the data on the -ntfs volume. - - -Limitations when using the Software RAID / MD driver ------------------------------------------------------ - -Using the md driver will not work properly if any of your NTFS partitions have -an odd number of sectors. 
This is especially important for linear raid as all -data after the first partition with an odd number of sectors will be offset by -one or more sectors so if you mount such a partition with write support you -will cause massive damage to the data on the volume which will only become -apparent when you try to use the volume again under Windows. - -So when using linear raid, make sure that all your partitions have an even -number of sectors BEFORE attempting to use it. You have been warned! - -Even better is to simply use the Device-Mapper for linear raid and then you do -not have this problem with odd numbers of sectors. -- cgit From 3d0c60d004644630f1431ce486e76adcc829e288 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:14 +0100 Subject: docs: filesystems: convert ocfs2-online-filecheck.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/6007166acc3252697755836354bd29b5a5fb82aa.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + .../filesystems/ocfs2-online-filecheck.rst | 99 ++++++++++++++++++++++ .../filesystems/ocfs2-online-filecheck.txt | 94 -------------------- 3 files changed, 100 insertions(+), 94 deletions(-) create mode 100644 Documentation/filesystems/ocfs2-online-filecheck.rst delete mode 100644 Documentation/filesystems/ocfs2-online-filecheck.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 62be53c4755d..f3a26fdbd04f 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -76,6 +76,7 @@ Documentation for filesystem implementations. 
nilfs2 nfs/index ntfs + ocfs2-online-filecheck overlayfs virtiofs vfat diff --git a/Documentation/filesystems/ocfs2-online-filecheck.rst b/Documentation/filesystems/ocfs2-online-filecheck.rst new file mode 100644 index 000000000000..2257bb53edc1 --- /dev/null +++ b/Documentation/filesystems/ocfs2-online-filecheck.rst @@ -0,0 +1,99 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +OCFS2 file system - online file check +===================================== + +This document will describe OCFS2 online file check feature. + +Introduction +============ +OCFS2 is often used in high-availability systems. However, OCFS2 usually +converts the filesystem to read-only when encounters an error. This may not be +necessary, since turning the filesystem read-only would affect other running +processes as well, decreasing availability. +Then, a mount option (errors=continue) is introduced, which would return the +-EIO errno to the calling process and terminate further processing so that the +filesystem is not corrupted further. The filesystem is not converted to +read-only, and the problematic file's inode number is reported in the kernel +log. The user can try to check/fix this file via online filecheck feature. + +Scope +===== +This effort is to check/fix small issues which may hinder day-to-day operations +of a cluster filesystem by turning the filesystem read-only. The scope of +checking/fixing is at the file level, initially for regular files and eventually +to all files (including system files) of the filesystem. + +In case of directory to file links is incorrect, the directory inode is +reported as erroneous. + +This feature is not suited for extravagant checks which involve dependency of +other components of the filesystem, such as but not limited to, checking if the +bits for file blocks in the allocation has been set. In case of such an error, +the offline fsck should/would be recommended. 
+ +Finally, such an operation/feature should not be automated lest the filesystem +may end up with more damage than before the repair attempt. So, this has to +be performed using user interaction and consent. + +User interface +============== +When there are errors in the OCFS2 filesystem, they are usually accompanied +by the inode number which caused the error. This inode number would be the +input to check/fix the file. + +There is a sysfs directory for each OCFS2 file system mounting:: + + /sys/fs/ocfs2/<devname>/filecheck + +Here, <devname> indicates the name of OCFS2 volume device which has been already +mounted. The file above would accept inode numbers. This could be used to +communicate with kernel space, tell which file(inode number) will be checked or +fixed. Currently, three operations are supported, which includes checking +inode, fixing inode and setting the size of result record history. + +1. If you want to know what error exactly happened to <inode> before fixing, do:: + + # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check + # cat /sys/fs/ocfs2/<devname>/filecheck/check + +The output is like this:: + + INO DONE ERROR + 39502 1 GENERATION + + <INO> lists the inode numbers. + <DONE> indicates whether the operation has been finished. + <ERROR> says what kind of errors was found. For the detailed error numbers, + please refer to the file linux/fs/ocfs2/filecheck.h. + +2. If you determine to fix this inode, do:: + + # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix + # cat /sys/fs/ocfs2/<devname>/filecheck/fix + +The output is like this::: + + INO DONE ERROR + 39502 1 SUCCESS + +This time, the <ERROR> column indicates whether this fix is successful or not. + +3. The record cache is used to store the history of check/fix results. It's +default size is 10, and can be adjust between the range of 10 ~ 100. You can +adjust the size like this:: + + # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set + +Fixing stuff +============ +On receiving the inode, the filesystem would read the inode and the +file metadata.
In case of errors, the filesystem would fix the errors +and report the problems it fixed in the kernel log. As a precautionary measure, +the inode must first be checked for errors before performing a final fix. + +The inode and the result history will be maintained temporarily in a +small linked list buffer which would contain the last (N) inodes +fixed/checked, the detailed errors which were fixed/checked are printed in the +kernel log. diff --git a/Documentation/filesystems/ocfs2-online-filecheck.txt b/Documentation/filesystems/ocfs2-online-filecheck.txt deleted file mode 100644 index 139fab175c8a..000000000000 --- a/Documentation/filesystems/ocfs2-online-filecheck.txt +++ /dev/null @@ -1,94 +0,0 @@ - OCFS2 online file check - ----------------------- - -This document will describe OCFS2 online file check feature. - -Introduction -============ -OCFS2 is often used in high-availability systems. However, OCFS2 usually -converts the filesystem to read-only when encounters an error. This may not be -necessary, since turning the filesystem read-only would affect other running -processes as well, decreasing availability. -Then, a mount option (errors=continue) is introduced, which would return the --EIO errno to the calling process and terminate further processing so that the -filesystem is not corrupted further. The filesystem is not converted to -read-only, and the problematic file's inode number is reported in the kernel -log. The user can try to check/fix this file via online filecheck feature. - -Scope -===== -This effort is to check/fix small issues which may hinder day-to-day operations -of a cluster filesystem by turning the filesystem read-only. The scope of -checking/fixing is at the file level, initially for regular files and eventually -to all files (including system files) of the filesystem. - -In case of directory to file links is incorrect, the directory inode is -reported as erroneous. 
- -This feature is not suited for extravagant checks which involve dependency of -other components of the filesystem, such as but not limited to, checking if the -bits for file blocks in the allocation has been set. In case of such an error, -the offline fsck should/would be recommended. - -Finally, such an operation/feature should not be automated lest the filesystem -may end up with more damage than before the repair attempt. So, this has to -be performed using user interaction and consent. - -User interface -============== -When there are errors in the OCFS2 filesystem, they are usually accompanied -by the inode number which caused the error. This inode number would be the -input to check/fix the file. - -There is a sysfs directory for each OCFS2 file system mounting: - - /sys/fs/ocfs2//filecheck - -Here, indicates the name of OCFS2 volume device which has been already -mounted. The file above would accept inode numbers. This could be used to -communicate with kernel space, tell which file(inode number) will be checked or -fixed. Currently, three operations are supported, which includes checking -inode, fixing inode and setting the size of result record history. - -1. If you want to know what error exactly happened to before fixing, do - - # echo "" > /sys/fs/ocfs2//filecheck/check - # cat /sys/fs/ocfs2//filecheck/check - -The output is like this: - INO DONE ERROR -39502 1 GENERATION - - lists the inode numbers. - indicates whether the operation has been finished. - says what kind of errors was found. For the detailed error numbers, -please refer to the file linux/fs/ocfs2/filecheck.h. - -2. If you determine to fix this inode, do - - # echo "" > /sys/fs/ocfs2//filecheck/fix - # cat /sys/fs/ocfs2//filecheck/fix - -The output is like this: - INO DONE ERROR -39502 1 SUCCESS - -This time, the column indicates whether this fix is successful or not. - -3. The record cache is used to store the history of check/fix results. 
It's -default size is 10, and can be adjust between the range of 10 ~ 100. You can -adjust the size like this: - - # echo "" > /sys/fs/ocfs2//filecheck/set - -Fixing stuff -============ -On receiving the inode, the filesystem would read the inode and the -file metadata. In case of errors, the filesystem would fix the errors -and report the problems it fixed in the kernel log. As a precautionary measure, -the inode must first be checked for errors before performing a final fix. - -The inode and the result history will be maintained temporarily in a -small linked list buffer which would contain the last (N) inodes -fixed/checked, the detailed errors which were fixed/checked are printed in the -kernel log. -- cgit From fa95e087ff69468b4e452c50c3f4c59a45846b8d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:15 +0100 Subject: docs: filesystems: convert ocfs2.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Joseph Qi Link: https://lore.kernel.org/r/e29a8120bf1d847f23fb68e915f10a7d43bed9e3.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/ocfs2.rst | 117 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ocfs2.txt | 106 -------------------------------- 3 files changed, 118 insertions(+), 106 deletions(-) create mode 100644 Documentation/filesystems/ocfs2.rst delete mode 100644 Documentation/filesystems/ocfs2.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f3a26fdbd04f..3b2b07491c98 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -76,6 +76,7 @@ Documentation for filesystem implementations. 
nilfs2 nfs/index ntfs + ocfs2 ocfs2-online-filecheck overlayfs virtiofs diff --git a/Documentation/filesystems/ocfs2.rst b/Documentation/filesystems/ocfs2.rst new file mode 100644 index 000000000000..412386bc6506 --- /dev/null +++ b/Documentation/filesystems/ocfs2.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +OCFS2 filesystem +================ + +OCFS2 is a general purpose extent based shared disk cluster file +system with many similarities to ext3. It supports 64 bit inode +numbers, and has automatically extending metadata groups which may +also make it attractive for non-clustered use. + +You'll want to install the ocfs2-tools package in order to at least +get "mount.ocfs2" and "ocfs2_hb_ctl". + +Project web page: http://ocfs2.wiki.kernel.org +Tools git tree: https://github.com/markfasheh/ocfs2-tools +OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ + +All code copyright 2005 Oracle except when otherwise noted. + +Credits +======= + +Lots of code taken from ext3 and other projects. + +Authors in alphabetical order: + +- Joel Becker +- Zach Brown +- Mark Fasheh +- Kurt Hackel +- Tao Ma +- Sunil Mushran +- Manish Singh +- Tiger Yang + +Caveats +======= +Features which OCFS2 does not support yet: + + - Directory change notification (F_NOTIFY) + - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) + +Mount options +============= + +OCFS2 supports the following mount options: + +(*) == default + +======================= ======================================================== +barrier=1 This enables/disables barriers. barrier=0 disables it, + barrier=1 enables it. +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. +intr (*) Allow signals to interrupt cluster operations. +nointr Do not allow signals to interrupt cluster + operations. +noatime Do not update access time. 
+relatime(*) Update atime if the previous atime is older than + mtime or ctime +strictatime Always update atime, but the minimum update interval + is specified by atime_quantum. +atime_quantum=60(*) OCFS2 will not update atime unless this number + of seconds has passed since the last update. + Set to zero to always update atime. This option need + work with strictatime. +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to the + journal. +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. +preferred_slot=0(*) During mount, try to use this filesystem slot first. If + it is in use by another node, the first empty one found + will be chosen. Invalid values will be ignored. +commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). + Setting it to very large values will improve + performance. +localalloc=8(*) Allows custom localalloc size in MB. If the value is too + large, the fs will silently revert it to the default. +localflocks This disables cluster aware flock. +inode64 Indicates that Ocfs2 is allowed to create inodes at + any location in the filesystem, including those which + will result in inode numbers occupying more than 32 + bits of significance. +user_xattr (*) Enables Extended User Attributes. +nouser_xattr Disables Extended User Attributes. +acl Enables POSIX Access Control Lists support. +noacl (*) Disables POSIX Access Control Lists support. 
+resv_level=2 (*) Set how aggressive allocation reservations will be. + Valid values are between 0 (reservations off) to 8 + (maximum space for reservations). +dir_resv_level= (*) By default, directory reservations will scale with file + reservations - users should rarely need to change this + value. If allocation reservations are turned off, this + option will have no effect. +coherency=full (*) Disallow concurrent O_DIRECT writes, cluster inode + lock will be taken to force other nodes drop cache, + therefore full cluster coherency is guaranteed even + for O_DIRECT writes. +coherency=buffered Allow concurrent O_DIRECT writes without EX lock among + nodes, which gains high performance at risk of getting + stale data on other nodes. +journal_async_commit Commit block can be written to disk without waiting + for descriptor blocks. If enabled older kernels cannot + mount the device. This will enable 'journal_checksum' + internally. +======================= ======================================================== diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt deleted file mode 100644 index 4c49e5410595..000000000000 --- a/Documentation/filesystems/ocfs2.txt +++ /dev/null @@ -1,106 +0,0 @@ -OCFS2 filesystem -================== -OCFS2 is a general purpose extent based shared disk cluster file -system with many similarities to ext3. It supports 64 bit inode -numbers, and has automatically extending metadata groups which may -also make it attractive for non-clustered use. - -You'll want to install the ocfs2-tools package in order to at least -get "mount.ocfs2" and "ocfs2_hb_ctl". - -Project web page: http://ocfs2.wiki.kernel.org -Tools git tree: https://github.com/markfasheh/ocfs2-tools -OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ - -All code copyright 2005 Oracle except when otherwise noted. - -CREDITS: -Lots of code taken from ext3 and other projects. 
- -Authors in alphabetical order: -Joel Becker -Zach Brown -Mark Fasheh -Kurt Hackel -Tao Ma -Sunil Mushran -Manish Singh -Tiger Yang - -Caveats -======= -Features which OCFS2 does not support yet: - - Directory change notification (F_NOTIFY) - - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) - -Mount options -============= - -OCFS2 supports the following mount options: -(*) == default - -barrier=1 This enables/disables barriers. barrier=0 disables it, - barrier=1 enables it. -errors=remount-ro(*) Remount the filesystem read-only on an error. -errors=panic Panic and halt the machine if an error occurs. -intr (*) Allow signals to interrupt cluster operations. -nointr Do not allow signals to interrupt cluster - operations. -noatime Do not update access time. -relatime(*) Update atime if the previous atime is older than - mtime or ctime -strictatime Always update atime, but the minimum update interval - is specified by atime_quantum. -atime_quantum=60(*) OCFS2 will not update atime unless this number - of seconds has passed since the last update. - Set to zero to always update atime. This option need - work with strictatime. -data=ordered (*) All data are forced directly out to the main file - system prior to its metadata being committed to the - journal. -data=writeback Data ordering is not preserved, data may be written - into the main file system after its metadata has been - committed to the journal. -preferred_slot=0(*) During mount, try to use this filesystem slot first. If - it is in use by another node, the first empty one found - will be chosen. Invalid values will be ignored. -commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata - every 'nrsec' seconds. The default value is 5 seconds. - This means that if you lose your power, you will lose - as much as the latest 5 seconds of work (your - filesystem will not be damaged though, thanks to the - journaling). 
This default value (or any low value) - will hurt performance, but it's good for data-safety. - Setting it to 0 will have the same effect as leaving - it at the default (5 seconds). - Setting it to very large values will improve - performance. -localalloc=8(*) Allows custom localalloc size in MB. If the value is too - large, the fs will silently revert it to the default. -localflocks This disables cluster aware flock. -inode64 Indicates that Ocfs2 is allowed to create inodes at - any location in the filesystem, including those which - will result in inode numbers occupying more than 32 - bits of significance. -user_xattr (*) Enables Extended User Attributes. -nouser_xattr Disables Extended User Attributes. -acl Enables POSIX Access Control Lists support. -noacl (*) Disables POSIX Access Control Lists support. -resv_level=2 (*) Set how aggressive allocation reservations will be. - Valid values are between 0 (reservations off) to 8 - (maximum space for reservations). -dir_resv_level= (*) By default, directory reservations will scale with file - reservations - users should rarely need to change this - value. If allocation reservations are turned off, this - option will have no effect. -coherency=full (*) Disallow concurrent O_DIRECT writes, cluster inode - lock will be taken to force other nodes drop cache, - therefore full cluster coherency is guaranteed even - for O_DIRECT writes. -coherency=buffered Allow concurrent O_DIRECT writes without EX lock among - nodes, which gains high performance at risk of getting - stale data on other nodes. -journal_async_commit Commit block can be written to disk without waiting - for descriptor blocks. If enabled older kernels cannot - mount the device. This will enable 'journal_checksum' - internally. 
-- cgit From 7cbb468f0c70878fe64d324790ee049c1881af7c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:16 +0100 Subject: docs: filesystems: convert omfs.txt to ReST - Add a SPDX header; - Adjust document title; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Bob Copeland Link: https://lore.kernel.org/r/0c125c7c971d81a557ca954992b8d770a9d1e3e8.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/omfs.rst | 112 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/omfs.txt | 106 ---------------------------------- 3 files changed, 113 insertions(+), 106 deletions(-) create mode 100644 Documentation/filesystems/omfs.rst delete mode 100644 Documentation/filesystems/omfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 3b2b07491c98..fbee77175840 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -78,6 +78,7 @@ Documentation for filesystem implementations. ntfs ocfs2 ocfs2-online-filecheck + omfs overlayfs virtiofs vfat diff --git a/Documentation/filesystems/omfs.rst b/Documentation/filesystems/omfs.rst new file mode 100644 index 000000000000..4c8bb3074169 --- /dev/null +++ b/Documentation/filesystems/omfs.rst @@ -0,0 +1,112 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +Optimized MPEG Filesystem (OMFS) +================================ + +Overview +======== + +OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR +and Rio Karma MP3 player. The filesystem is extent-based, utilizing +block sizes from 2k to 8k, with hash-based directories. This +filesystem driver may be used to read and write disks from these +devices. 
+ +Note, it is not recommended that this FS be used in place of a general +filesystem for your own streaming media device. Native Linux filesystems +will likely perform better. + +More information is available at: + + http://linux-karma.sf.net/ + +Various utilities, including mkomfs and omfsck, are included with +omfsprogs, available at: + + http://bobcopeland.com/karma/ + +Instructions are included in its README. + +Options +======= + +OMFS supports the following mount-time options: + + ============ ======================================== + uid=n make all files owned by specified user + gid=n make all files owned by specified group + umask=xxx set permission umask to xxx + fmask=xxx set umask to xxx for files + dmask=xxx set umask to xxx for directories + ============ ======================================== + +Disk format +=========== + +OMFS discriminates between "sysblocks" and normal data blocks. The sysblock +group consists of super block information, file metadata, directory structures, +and extents. Each sysblock has a header containing CRCs of the entire +sysblock, and may be mirrored in successive blocks on the disk. A sysblock may +have a smaller size than a data block, but since they are both addressed by the +same 64-bit block number, any remaining space in the smaller sysblock is +unused. 
+ +Sysblock header information:: + + struct omfs_header { + __be64 h_self; /* FS block where this is located */ + __be32 h_body_size; /* size of useful data after header */ + __be16 h_crc; /* crc-ccitt of body_size bytes */ + char h_fill1[2]; + u8 h_version; /* version, always 1 */ + char h_type; /* OMFS_INODE_X */ + u8 h_magic; /* OMFS_IMAGIC */ + u8 h_check_xor; /* XOR of header bytes before this */ + __be32 h_fill2; + }; + +Files and directories are both represented by omfs_inode:: + + struct omfs_inode { + struct omfs_header i_head; /* header */ + __be64 i_parent; /* parent containing this inode */ + __be64 i_sibling; /* next inode in hash bucket */ + __be64 i_ctime; /* ctime, in milliseconds */ + char i_fill1[35]; + char i_type; /* OMFS_[DIR,FILE] */ + __be32 i_fill2; + char i_fill3[64]; + char i_name[OMFS_NAMELEN]; /* filename */ + __be64 i_size; /* size of file, in bytes */ + }; + +Directories in OMFS are implemented as a large hash table. Filenames are +hashed then prepended into the bucket list beginning at OMFS_DIR_START. +Lookup requires hashing the filename, then seeking across i_sibling pointers +until a match is found on i_name. Empty buckets are represented by block +pointers with all-1s (~0). + +A file is an omfs_inode structure followed by an extent table beginning at +OMFS_EXTENT_START:: + + struct omfs_extent_entry { + __be64 e_cluster; /* start location of a set of blocks */ + __be64 e_blocks; /* number of blocks after e_cluster */ + }; + + struct omfs_extent { + __be64 e_next; /* next extent table location */ + __be32 e_extent_count; /* total # extents in this table */ + __be32 e_fill; + struct omfs_extent_entry e_entry; /* start of extent entries */ + }; + +Each extent holds the block offset followed by number of blocks allocated to +the extent. The final extent in each table is a terminator with e_cluster +being ~0 and e_blocks being ones'-complement of the total number of blocks +in the table. 
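The terminator rule just described can be checked mechanically. The following is a minimal sketch in C, not the driver's own code: `struct extent_entry` and `extent_table_ok()` are hypothetical names introduced for illustration, the entries are assumed to have already been converted from big-endian to host byte order, and the entry count is assumed to include the terminator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Host-endian copy of one extent entry (hypothetical helper type; the
 * on-disk fields are __be64 and must be converted before use).
 */
struct extent_entry {
	uint64_t cluster;	/* start location of a set of blocks */
	uint64_t blocks;	/* number of blocks after cluster */
};

/*
 * Check the terminator rule for a table of `count` entries, where the
 * last entry is assumed to be the terminator: its cluster must be ~0
 * and its blocks field must be the ones'-complement of the sum of the
 * block counts of all preceding entries.
 *
 * Returns true and stores that sum in *total_out when the rule holds.
 */
bool extent_table_ok(const struct extent_entry *tab, size_t count,
		     uint64_t *total_out)
{
	uint64_t total = 0;
	size_t i;

	assert(tab != NULL && total_out != NULL);

	if (count == 0)
		return false;

	/* Sum every entry except the final (terminator) one. */
	for (i = 0; i + 1 < count; i++)
		total += tab[i].blocks;

	if (tab[count - 1].cluster != ~(uint64_t)0)
		return false;
	if (tab[count - 1].blocks != ~total)
		return false;

	*total_out = total;
	return true;
}
```

For a table holding runs of 8 and 4 blocks, the terminator would carry e_cluster = ~0 and e_blocks = ~12 under this reading of the rule.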
+ +If this table overflows, a continuation inode is written and pointed to by +e_next. These have a header but lack the rest of the inode structure. + diff --git a/Documentation/filesystems/omfs.txt b/Documentation/filesystems/omfs.txt deleted file mode 100644 index 1d0d41ff5c65..000000000000 --- a/Documentation/filesystems/omfs.txt +++ /dev/null @@ -1,106 +0,0 @@ -Optimized MPEG Filesystem (OMFS) - -Overview -======== - -OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR -and Rio Karma MP3 player. The filesystem is extent-based, utilizing -block sizes from 2k to 8k, with hash-based directories. This -filesystem driver may be used to read and write disks from these -devices. - -Note, it is not recommended that this FS be used in place of a general -filesystem for your own streaming media device. Native Linux filesystems -will likely perform better. - -More information is available at: - - http://linux-karma.sf.net/ - -Various utilities, including mkomfs and omfsck, are included with -omfsprogs, available at: - - http://bobcopeland.com/karma/ - -Instructions are included in its README. - -Options -======= - -OMFS supports the following mount-time options: - - uid=n - make all files owned by specified user - gid=n - make all files owned by specified group - umask=xxx - set permission umask to xxx - fmask=xxx - set umask to xxx for files - dmask=xxx - set umask to xxx for directories - -Disk format -=========== - -OMFS discriminates between "sysblocks" and normal data blocks. The sysblock -group consists of super block information, file metadata, directory structures, -and extents. Each sysblock has a header containing CRCs of the entire -sysblock, and may be mirrored in successive blocks on the disk. A sysblock may -have a smaller size than a data block, but since they are both addressed by the -same 64-bit block number, any remaining space in the smaller sysblock is -unused. 
- -Sysblock header information: - -struct omfs_header { - __be64 h_self; /* FS block where this is located */ - __be32 h_body_size; /* size of useful data after header */ - __be16 h_crc; /* crc-ccitt of body_size bytes */ - char h_fill1[2]; - u8 h_version; /* version, always 1 */ - char h_type; /* OMFS_INODE_X */ - u8 h_magic; /* OMFS_IMAGIC */ - u8 h_check_xor; /* XOR of header bytes before this */ - __be32 h_fill2; -}; - -Files and directories are both represented by omfs_inode: - -struct omfs_inode { - struct omfs_header i_head; /* header */ - __be64 i_parent; /* parent containing this inode */ - __be64 i_sibling; /* next inode in hash bucket */ - __be64 i_ctime; /* ctime, in milliseconds */ - char i_fill1[35]; - char i_type; /* OMFS_[DIR,FILE] */ - __be32 i_fill2; - char i_fill3[64]; - char i_name[OMFS_NAMELEN]; /* filename */ - __be64 i_size; /* size of file, in bytes */ -}; - -Directories in OMFS are implemented as a large hash table. Filenames are -hashed then prepended into the bucket list beginning at OMFS_DIR_START. -Lookup requires hashing the filename, then seeking across i_sibling pointers -until a match is found on i_name. Empty buckets are represented by block -pointers with all-1s (~0). - -A file is an omfs_inode structure followed by an extent table beginning at -OMFS_EXTENT_START: - -struct omfs_extent_entry { - __be64 e_cluster; /* start location of a set of blocks */ - __be64 e_blocks; /* number of blocks after e_cluster */ -}; - -struct omfs_extent { - __be64 e_next; /* next extent table location */ - __be32 e_extent_count; /* total # extents in this table */ - __be32 e_fill; - struct omfs_extent_entry e_entry; /* start of extent entries */ -}; - -Each extent holds the block offset followed by number of blocks allocated to -the extent. The final extent in each table is a terminator with e_cluster -being ~0 and e_blocks being ones'-complement of the total number of blocks -in the table. 
- -If this table overflows, a continuation inode is written and pointed to by -e_next. These have a header but lack the rest of the inode structure. - -- cgit From 18ccb2233fc5f7c27b5be17f5b6585c2fa62d919 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:17 +0100 Subject: docs: filesystems: convert orangefs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/6f438eeff5b029d229197a602bd9b74004fe9b63.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/orangefs.rst | 554 +++++++++++++++++++++++++++++++++ Documentation/filesystems/orangefs.txt | 529 ------------------------------- 3 files changed, 555 insertions(+), 529 deletions(-) create mode 100644 Documentation/filesystems/orangefs.rst delete mode 100644 Documentation/filesystems/orangefs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index fbee77175840..fed53f831192 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -79,6 +79,7 @@ Documentation for filesystem implementations. ocfs2 ocfs2-online-filecheck omfs + orangefs overlayfs virtiofs vfat diff --git a/Documentation/filesystems/orangefs.rst b/Documentation/filesystems/orangefs.rst new file mode 100644 index 000000000000..7d6d4cad73c4 --- /dev/null +++ b/Documentation/filesystems/orangefs.rst @@ -0,0 +1,554 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======== +ORANGEFS +======== + +OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal +for large storage problems faced by HPC, BigData, Streaming Video, +Genomics, Bioinformatics. 
+ +Orangefs, originally called PVFS, was first developed in 1993 by +Walt Ligon and Eric Blumer as a parallel file system for Parallel +Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns +of parallel programs. + +Orangefs features include: + + * Distributes file data among multiple file servers + * Supports simultaneous access by multiple clients + * Stores file data and metadata on servers using local file system + and access methods + * Userspace implementation is easy to install and maintain + * Direct MPI support + * Stateless + + +Mailing List Archives +===================== + +http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/ + + +Mailing List Submissions +======================== + +devel@lists.orangefs.org + + +Documentation +============= + +http://www.orangefs.org/documentation/ + + +Userspace Filesystem Source +=========================== + +http://www.orangefs.org/download + +Orangefs versions prior to 2.9.3 would not be compatible with the +upstream version of the kernel client. + + +Running ORANGEFS On a Single Server +=================================== + +OrangeFS is usually run in large installations with multiple servers and +clients, but a complete filesystem can be run on a single machine for +development and testing. + +On Fedora, install orangefs and orangefs-server:: + + dnf -y install orangefs orangefs-server + +There is an example server configuration file in +/etc/orangefs/orangefs.conf. Change localhost to your hostname if +necessary. + +To generate a filesystem to run xfstests against, see below. + +There is an example client configuration file in /etc/pvfs2tab. It is a +single line. Uncomment it and change the hostname if necessary. This +controls clients which use libpvfs2. This does not control the +pvfs2-client-core. 
+ +Create the filesystem:: + + pvfs2-server -f /etc/orangefs/orangefs.conf + +Start the server:: + + systemctl start orangefs-server + +Test the server:: + + pvfs2-ping -m /pvfsmnt + +Start the client. The module must be compiled in or loaded before this +point:: + + systemctl start orangefs-client + +Mount the filesystem:: + + mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt + + +Building ORANGEFS on a Single Server +==================================== + +Where OrangeFS cannot be installed from distribution packages, it may be +built from source. + +You can omit --prefix if you don't care that things are sprinkled around +in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by +default; we will probably be changing the default to LMDB soon. + +:: + + ./configure --prefix=/opt/ofs --with-db-backend=lmdb + + make + + make install + +Create an orangefs config file:: + + /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf + +Create an /etc/pvfs2tab file:: + + echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ + /etc/pvfs2tab + +Create the mount point you specified in the tab file if needed:: + + mkdir /pvfsmnt + +Bootstrap the server:: + + /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf + +Start the server:: + + /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf + +Now the server should be running. Pvfs2-ls is a simple +test to verify that the server is running:: + + /opt/ofs/bin/pvfs2-ls /pvfsmnt + +If everything seems to be working, load the kernel module and +turn on the client core:: + + /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core + +Mount your filesystem:: + + mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt + + +Running xfstests +================ + +It is useful to use a scratch filesystem with xfstests. This can be +done with only one server. + +Make a second copy of the FileSystem section in the server configuration +file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
+Change the ID to something other than the ID of the first FileSystem +section (2 is usually a good choice). + +Then there are two FileSystem sections: orangefs and scratch. + +This change should be made before creating the filesystem. + +:: + + pvfs2-server -f /etc/orangefs/orangefs.conf + +To run xfstests, create /etc/xfsqa.config:: + + TEST_DIR=/orangefs + TEST_DEV=tcp://localhost:3334/orangefs + SCRATCH_MNT=/scratch + SCRATCH_DEV=tcp://localhost:3334/scratch + +Then xfstests can be run:: + + ./check -pvfs2 + + +Options +======= + +The following mount options are accepted: + + acl + Allow the use of Access Control Lists on files and directories. + + intr + Some operations between the kernel client and the user space + filesystem can be interruptible, such as changes in debug levels + and the setting of tunable parameters. + + local_lock + Enable posix locking from the perspective of "this" kernel. The + default file_operations lock action is to return ENOSYS. Posix + locking kicks in if the filesystem is mounted with -o local_lock. + Distributed locking is being worked on for the future. + + +Debugging +========= + +If you want the debug (GOSSIP) statements in a particular +source file (inode.c for example) go to syslog:: + + echo inode > /sys/kernel/debug/orangefs/kernel-debug + +No debugging (the default):: + + echo none > /sys/kernel/debug/orangefs/kernel-debug + +Debugging from several source files:: + + echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug + +All debugging:: + + echo all > /sys/kernel/debug/orangefs/kernel-debug + +Get a list of all debugging keywords:: + + cat /sys/kernel/debug/orangefs/debug-help + + +Protocol between Kernel Module and Userspace +============================================ + +Orangefs is a user space filesystem and an associated kernel module. +We'll just refer to the user space part of Orangefs as "userspace" +from here on out. 
Orangefs descends from PVFS, and userspace code +still uses PVFS for function and variable names. Userspace typedefs +many of the important structures. Function and variable names in +the kernel module have been transitioned to "orangefs", and The Linux +Coding Style avoids typedefs, so kernel module structures that +correspond to userspace structures are not typedefed. + +The kernel module implements a pseudo device that userspace +can read from and write to. Userspace can also manipulate the +kernel module through the pseudo device with ioctl. + +The Bufmap +---------- + +At startup userspace allocates two page-size-aligned (posix_memalign) +mlocked memory buffers, one is used for IO and one is used for readdir +operations. The IO buffer is 41943040 bytes and the readdir buffer is +4194304 bytes. Each buffer contains logical chunks, or partitions, and +a pointer to each buffer is added to its own PVFS_dev_map_desc structure +which also describes its total size, as well as the size and number of +the partitions. + +A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a +mapping routine in the kernel module with an ioctl. The structure is +copied from user space to kernel space with copy_from_user and is used +to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which +then contains: + + * refcnt + - a reference counter + * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's + partition size, which represents the filesystem's block size and + is used for s_blocksize in super blocks. + * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of + partitions in the IO buffer. + * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. + * total_size - the total size of the IO buffer. + * page_count - the number of 4096 byte pages in the IO buffer. + * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes + of kcalloced memory. 
This memory is used as an array of pointers + to each of the pages in the IO buffer through a call to get_user_pages. + * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))`` + bytes of kcalloced memory. This memory is further initialized: + + user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc + structure. user_desc->ptr points to the IO buffer. + + :: + + pages_per_desc = bufmap->desc_size / PAGE_SIZE + offset = 0 + + bufmap->desc_array[0].page_array = &bufmap->page_array[offset] + bufmap->desc_array[0].array_count = pages_per_desc = 1024 + bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) + offset += 1024 + . + . + . + bufmap->desc_array[9].page_array = &bufmap->page_array[offset] + bufmap->desc_array[9].array_count = pages_per_desc = 1024 + bufmap->desc_array[9].uaddr = (user_desc->ptr) + + (9 * 1024 * 4096) + offset += 1024 + + * buffer_index_array - a desc_count sized array of ints, used to + indicate which of the IO buffer's partitions are available to use. + * buffer_index_lock - a spinlock to protect buffer_index_array during update. + * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element + int array used to indicate which of the readdir buffer's partitions are + available to use. + * readdir_index_lock - a spinlock to protect readdir_index_array during + update. + +Operations +---------- + +The kernel module builds an "op" (struct orangefs_kernel_op_s) when it +needs to communicate with userspace. Part of the op contains the "upcall" +which expresses the request to userspace. Part of the op eventually +contains the "downcall" which expresses the results of the request. + +The slab allocator is used to keep a cache of op structures handy. + +At init time the kernel module defines and initializes a request list +and an in_progress hash table to keep track of all the ops that are +in flight at any given time.
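As an aside on the IO buffer geometry given under "The Bufmap", the quoted figures (a 4194304 byte partition size, 10 partitions, 4096 byte pages) determine the remaining fields by simple arithmetic. The sketch below is illustrative only and is not the kernel module's code: `struct bufmap_geometry`, `bufmap_compute()` and `desc_offset()` are hypothetical names, and the partition size is assumed to be a power of two and a multiple of the page size.

```c
#include <assert.h>
#include <stdint.h>

/* Derived sizes for the IO buffer (hypothetical helper type). */
struct bufmap_geometry {
	uint32_t total_size;		/* desc_size * desc_count */
	uint32_t pages_per_desc;	/* pages in one partition */
	uint32_t page_count;		/* pages in the whole IO buffer */
	uint32_t desc_shift;		/* log2(desc_size) */
};

/* Compute the derived fields from the partition size and count. */
struct bufmap_geometry bufmap_compute(uint32_t desc_size, uint32_t desc_count,
				      uint32_t page_size)
{
	struct bufmap_geometry g;

	/* Assumptions of this sketch: power-of-two, page-aligned partitions. */
	assert(desc_size != 0 && (desc_size & (desc_size - 1)) == 0);
	assert(page_size != 0 && desc_size % page_size == 0);

	g.total_size = desc_size * desc_count;
	g.pages_per_desc = desc_size / page_size;
	g.page_count = g.pages_per_desc * desc_count;

	/* Smallest shift whose power of two reaches desc_size. */
	g.desc_shift = 0;
	while ((UINT32_C(1) << g.desc_shift) < desc_size)
		g.desc_shift++;

	return g;
}

/* Byte offset of partition `index` from the start of the IO buffer. */
uint64_t desc_offset(const struct bufmap_geometry *g, uint32_t index,
		     uint32_t page_size)
{
	return (uint64_t)index * g->pages_per_desc * page_size;
}
```

With the default values this gives a total size of 41943040 bytes, 1024 pages per partition and 10240 pages overall, which agrees with the figures and the `(9 * 1024 * 4096)` offset shown in the Bufmap section.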
+ +Ops are stateful: + + * unknown + - op was just initialized + * waiting + - op is on request_list (upward bound) + * inprogr + - op is in progress (waiting for downcall) + * serviced + - op has matching downcall; ok + * purged + - op has to start a timer since client-core + exited uncleanly before servicing op + * given up + - submitter has given up waiting for it + +When some arbitrary userspace program needs to perform a +filesystem operation on Orangefs (readdir, I/O, create, whatever) +an op structure is initialized and tagged with a distinguishing ID +number. The upcall part of the op is filled out, and the op is +passed to the "service_operation" function. + +Service_operation changes the op's state to "waiting", puts +it on the request list, and signals the Orangefs file_operations.poll +function through a wait queue. Userspace is polling the pseudo-device +and thus becomes aware of the upcall request that needs to be read. + +When the Orangefs file_operations.read function is triggered, the +request list is searched for an op that seems ready-to-process. +The op is removed from the request list. The tag from the op and +the filled-out upcall struct are copy_to_user'ed back to userspace. + +If any of these (and some additional protocol) copy_to_users fail, +the op's state is set to "waiting" and the op is added back to +the request list. Otherwise, the op's state is changed to "in progress", +and the op is hashed on its tag and put onto the end of a list in the +in_progress hash table at the index the tag hashed to. + +When userspace has assembled the response to the upcall, it +writes the response, which includes the distinguishing tag, back to +the pseudo device in a series of io_vecs. This triggers the Orangefs +file_operations.write_iter function to find the op with the associated +tag and remove it from the in_progress hash table. As long as the op's +state is not "canceled" or "given up", its state is set to "serviced". 
+The file_operations.write_iter function returns to the waiting vfs, +and back to service_operation through wait_for_matching_downcall. + +Service_operation returns to its caller with the op's downcall +part (the response to the upcall) filled out. + +The "client-core" is the bridge between the kernel module and +userspace. The client-core is a daemon. The client-core has an +associated watchdog daemon. If the client-core is ever signaled +to die, the watchdog daemon restarts the client-core. Even though +the client-core is restarted "right away", there is a period of +time during such an event that the client-core is dead. A dead client-core +can't be triggered by the Orangefs file_operations.poll function. +Ops that pass through service_operation during a "dead spell" can time out +on the wait queue and one attempt is made to recycle them. Obviously, +if the client-core stays dead too long, the arbitrary userspace processes +trying to use Orangefs will be negatively affected. Waiting ops +that can't be serviced will be removed from the request list and +have their states set to "given up". In-progress ops that can't +be serviced will be removed from the in_progress hash table and +have their states set to "given up". + +Readdir and I/O ops are atypical with respect to their payloads. + + - readdir ops use the smaller of the two pre-allocated pre-partitioned + memory buffers. The readdir buffer is only available to userspace. + The kernel module obtains an index to a free partition before launching + a readdir op. Userspace deposits the results into the indexed partition + and then writes them back to the pvfs device. + + - io (read and write) ops use the larger of the two pre-allocated + pre-partitioned memory buffers. The IO buffer is accessible from + both userspace and the kernel module. The kernel module obtains an + index to a free partition before launching an io op.
The kernel module + deposits write data into the indexed partition, to be consumed + directly by userspace. Userspace deposits the results of read + requests into the indexed partition, to be consumed directly + by the kernel module. + +Responses to kernel requests are all packaged in pvfs2_downcall_t +structs. Besides a few other members, pvfs2_downcall_t contains a +union of structs, each of which is associated with a particular +response type. + +The several members outside of the union are: + + ``int32_t type`` + - type of operation. + ``int32_t status`` + - return code for the operation. + ``int64_t trailer_size`` + - 0 unless readdir operation. + ``char *trailer_buf`` + - initialized to NULL, used during readdir operations. + +The appropriate member inside the union is filled out for any +particular response. + + PVFS2_VFS_OP_FILE_IO + fill a pvfs2_io_response_t + + PVFS2_VFS_OP_LOOKUP + fill a PVFS_object_kref + + PVFS2_VFS_OP_CREATE + fill a PVFS_object_kref + + PVFS2_VFS_OP_SYMLINK + fill a PVFS_object_kref + + PVFS2_VFS_OP_GETATTR + fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) + fill in a string with the link target when the object is a symlink. + + PVFS2_VFS_OP_MKDIR + fill a PVFS_object_kref + + PVFS2_VFS_OP_STATFS + fill a pvfs2_statfs_response_t with useless info. It is hard for + us to know, in a timely fashion, these statistics about our + distributed network filesystem. + + PVFS2_VFS_OP_FS_MOUNT + fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref + except its members are in a different order and "__pad1" is replaced + with "id".
+ + PVFS2_VFS_OP_GETXATTR + fill a pvfs2_getxattr_response_t + + PVFS2_VFS_OP_LISTXATTR + fill a pvfs2_listxattr_response_t + + PVFS2_VFS_OP_PARAM + fill a pvfs2_param_response_t + + PVFS2_VFS_OP_PERF_COUNT + fill a pvfs2_perf_count_response_t + + PVFS2_VFS_OP_FSKEY + fill a pvfs2_fs_key_response_t + + PVFS2_VFS_OP_READDIR + jam everything needed to represent a pvfs2_readdir_response_t into + the readdir buffer descriptor specified in the upcall. + +Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests +made by the kernel side. + +A buffer_list containing: + + - a pointer to the prepared response to the request from the + kernel (struct pvfs2_downcall_t). + - and also, in the case of a readdir request, a pointer to a + buffer containing descriptors for the objects in the target + directory. + +... is sent to the function (PINT_dev_write_list) which performs +the writev. + +PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; + +The first four elements of io_array are initialized like this for all +responses:: + + io_array[0].iov_base = address of local variable "proto_ver" (int32_t) + io_array[0].iov_len = sizeof(int32_t) + + io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) + io_array[1].iov_len = sizeof(int32_t) + + io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) + io_array[2].iov_len = sizeof(int64_t) + + io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) + of global variable vfs_request (vfs_request_t) + io_array[3].iov_len = sizeof(pvfs2_downcall_t) + +Readdir responses initialize the fifth element of io_array like this:: + + io_array[4].iov_base = contents of member trailer_buf (char *) + from out_downcall member of global variable + vfs_request + io_array[4].iov_len = contents of member trailer_size (PVFS_size) + from out_downcall member of global variable + vfs_request + +Orangefs exploits the dcache in order to avoid sending redundant +requests to
userspace. We keep object inode attributes up-to-date with +orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to +help it decide whether or not to update an inode: "new" and "bypass". +Orangefs keeps private data in an object's inode that includes a short +timeout value, getattr_time, which allows any iteration of +orangefs_inode_getattr to know how long it has been since the inode was +updated. When the object is not new (new == 0) and the bypass flag is not +set (bypass == 0) orangefs_inode_getattr returns without updating the inode +if getattr_time has not timed out. Getattr_time is updated each time the +inode is updated. + +Creation of a new object (file, dir, sym-link) includes the evaluation of +its pathname, resulting in a negative directory entry for the object. +A new inode is allocated and associated with the dentry, turning it from +a negative dentry into a "productive full member of society". Orangefs +obtains the new inode from Linux with new_inode() and associates +the inode with the dentry by sending the pair back to Linux with +d_instantiate(). + +The evaluation of a pathname for an object resolves to its corresponding +dentry. If there is no corresponding dentry, one is created for it in +the dcache. Whenever a dentry is modified or verified Orangefs stores a +short timeout value in the dentry's d_time, and the dentry will be trusted +for that amount of time. Orangefs is a network filesystem, and objects +can potentially change out-of-band with any particular Orangefs kernel module +instance, so trusting a dentry is risky. The alternative to trusting +dentries is to always obtain the needed information from userspace - at +least a trip to the client-core, maybe to the servers. Obtaining information +from a dentry is cheap, obtaining it from userspace is relatively expensive, +hence the motivation to use the dentry when possible. 
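Trusting a dentry "for that amount of time" comes down to comparing two tick counts that may straddle a counter wrap. The sketch below shows one common wrap-safe comparison under stated assumptions: a 32-bit tick counter, two times that are less than half the counter range apart, and hypothetical helper names (tick_after, entry_is_trusted). It is similar in spirit to the kernel's time_after() and time_before() macros, which operate on unsigned long jiffies, but it is not taken from the Orangefs sources.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when tick count a is later than tick count b.
 * The unsigned subtraction wraps modulo 2^32, so the result is also
 * right when the counter wrapped between b and a, provided the two
 * values are less than 2^31 ticks apart.
 */
static bool tick_after(uint32_t a, uint32_t b)
{
	return (uint32_t)(b - a) >= 0x80000000u;
}

/*
 * Hypothetical check: a cached entry stays trusted until the current
 * tick count has moved past its expiry stamp.
 */
static bool entry_is_trusted(uint32_t now, uint32_t expires)
{
	return !tick_after(now, expires);
}
```

A plain `now <= expires` comparison would wrongly reject an entry whose expiry stamp lies just past the wrap point. The subtraction form gives the intended answer as long as timeouts are short relative to the counter range.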
+ +The timeout values d_time and getattr_time are jiffy based, and the +code is designed to avoid the jiffy-wrap problem:: + + "In general, if the clock may have wrapped around more than once, there + is no way to tell how much time has elapsed. However, if the times t1 + and t2 are known to be fairly close, we can reliably compute the + difference in a way that takes into account the possibility that the + clock may have wrapped between times." + +from course notes by instructor Andy Wang + diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt deleted file mode 100644 index f4ba94950e3f..000000000000 --- a/Documentation/filesystems/orangefs.txt +++ /dev/null @@ -1,529 +0,0 @@ -ORANGEFS -======== - -OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal -for large storage problems faced by HPC, BigData, Streaming Video, -Genomics, Bioinformatics. - -Orangefs, originally called PVFS, was first developed in 1993 by -Walt Ligon and Eric Blumer as a parallel file system for Parallel -Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns -of parallel programs. - -Orangefs features include: - - * Distributes file data among multiple file servers - * Supports simultaneous access by multiple clients - * Stores file data and metadata on servers using local file system - and access methods - * Userspace implementation is easy to install and maintain - * Direct MPI support - * Stateless - - -MAILING LIST ARCHIVES -===================== - -http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/ - - -MAILING LIST SUBMISSIONS -======================== - -devel@lists.orangefs.org - - -DOCUMENTATION -============= - -http://www.orangefs.org/documentation/ - - -USERSPACE FILESYSTEM SOURCE -=========================== - -http://www.orangefs.org/download - -Orangefs versions prior to 2.9.3 would not be compatible with the -upstream version of the kernel client. 
- - -RUNNING ORANGEFS ON A SINGLE SERVER -=================================== - -OrangeFS is usually run in large installations with multiple servers and -clients, but a complete filesystem can be run on a single machine for -development and testing. - -On Fedora, install orangefs and orangefs-server. - -dnf -y install orangefs orangefs-server - -There is an example server configuration file in -/etc/orangefs/orangefs.conf. Change localhost to your hostname if -necessary. - -To generate a filesystem to run xfstests against, see below. - -There is an example client configuration file in /etc/pvfs2tab. It is a -single line. Uncomment it and change the hostname if necessary. This -controls clients which use libpvfs2. This does not control the -pvfs2-client-core. - -Create the filesystem. - -pvfs2-server -f /etc/orangefs/orangefs.conf - -Start the server. - -systemctl start orangefs-server - -Test the server. - -pvfs2-ping -m /pvfsmnt - -Start the client. The module must be compiled in or loaded before this -point. - -systemctl start orangefs-client - -Mount the filesystem. - -mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt - - -BUILDING ORANGEFS ON A SINGLE SERVER -==================================== - -Where OrangeFS cannot be installed from distribution packages, it may be -built from source. - -You can omit --prefix if you don't care that things are sprinkled around -in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by -default, we will probably be changing the default to LMDB soon. - -./configure --prefix=/opt/ofs --with-db-backend=lmdb - -make - -make install - -Create an orangefs config file. - -/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf - -Create an /etc/pvfs2tab file. - -echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ - /etc/pvfs2tab - -Create the mount point you specified in the tab file if needed. - -mkdir /pvfsmnt - -Bootstrap the server. - -/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf - -Start the server. 
- -/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf - -Now the server should be running. Pvfs2-ls is a simple -test to verify that the server is running. - -/opt/ofs/bin/pvfs2-ls /pvfsmnt - -If stuff seems to be working, load the kernel module and -turn on the client core. - -/opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core - -Mount your filesystem. - -mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt - - -RUNNING XFSTESTS -================ - -It is useful to use a scratch filesystem with xfstests. This can be -done with only one server. - -Make a second copy of the FileSystem section in the server configuration -file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch. -Change the ID to something other than the ID of the first FileSystem -section (2 is usually a good choice). - -Then there are two FileSystem sections: orangefs and scratch. - -This change should be made before creating the filesystem. - -pvfs2-server -f /etc/orangefs/orangefs.conf - -To run xfstests, create /etc/xfsqa.config. - -TEST_DIR=/orangefs -TEST_DEV=tcp://localhost:3334/orangefs -SCRATCH_MNT=/scratch -SCRATCH_DEV=tcp://localhost:3334/scratch - -Then xfstests can be run - -./check -pvfs2 - - -OPTIONS -======= - -The following mount options are accepted: - - acl - Allow the use of Access Control Lists on files and directories. - - intr - Some operations between the kernel client and the user space - filesystem can be interruptible, such as changes in debug levels - and the setting of tunable parameters. - - local_lock - Enable posix locking from the perspective of "this" kernel. The - default file_operations lock action is to return ENOSYS. Posix - locking kicks in if the filesystem is mounted with -o local_lock. - Distributed locking is being worked on for the future. 
- - -DEBUGGING -========= - -If you want the debug (GOSSIP) statements in a particular -source file (inode.c for example) go to syslog: - - echo inode > /sys/kernel/debug/orangefs/kernel-debug - -No debugging (the default): - - echo none > /sys/kernel/debug/orangefs/kernel-debug - -Debugging from several source files: - - echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug - -All debugging: - - echo all > /sys/kernel/debug/orangefs/kernel-debug - -Get a list of all debugging keywords: - - cat /sys/kernel/debug/orangefs/debug-help - - -PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE -============================================ - -Orangefs is a user space filesystem and an associated kernel module. -We'll just refer to the user space part of Orangefs as "userspace" -from here on out. Orangefs descends from PVFS, and userspace code -still uses PVFS for function and variable names. Userspace typedefs -many of the important structures. Function and variable names in -the kernel module have been transitioned to "orangefs", and The Linux -Coding Style avoids typedefs, so kernel module structures that -correspond to userspace structures are not typedefed. - -The kernel module implements a pseudo device that userspace -can read from and write to. Userspace can also manipulate the -kernel module through the pseudo device with ioctl. - -THE BUFMAP: - -At startup userspace allocates two page-size-aligned (posix_memalign) -mlocked memory buffers, one is used for IO and one is used for readdir -operations. The IO buffer is 41943040 bytes and the readdir buffer is -4194304 bytes. Each buffer contains logical chunks, or partitions, and -a pointer to each buffer is added to its own PVFS_dev_map_desc structure -which also describes its total size, as well as the size and number of -the partitions. - -A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a -mapping routine in the kernel module with an ioctl. 
The structure is -copied from user space to kernel space with copy_from_user and is used -to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which -then contains: - - * refcnt - a reference counter - * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's - partition size, which represents the filesystem's block size and - is used for s_blocksize in super blocks. - * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of - partitions in the IO buffer. - * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. - * total_size - the total size of the IO buffer. - * page_count - the number of 4096 byte pages in the IO buffer. - * page_array - a pointer to page_count * (sizeof(struct page*)) bytes - of kcalloced memory. This memory is used as an array of pointers - to each of the pages in the IO buffer through a call to get_user_pages. - * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc)) - bytes of kcalloced memory. This memory is further intialized: - - user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc - structure. user_desc->ptr points to the IO buffer. - - pages_per_desc = bufmap->desc_size / PAGE_SIZE - offset = 0 - - bufmap->desc_array[0].page_array = &bufmap->page_array[offset] - bufmap->desc_array[0].array_count = pages_per_desc = 1024 - bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) - offset += 1024 - . - . - . - bufmap->desc_array[9].page_array = &bufmap->page_array[offset] - bufmap->desc_array[9].array_count = pages_per_desc = 1024 - bufmap->desc_array[9].uaddr = (user_desc->ptr) + - (9 * 1024 * 4096) - offset += 1024 - - * buffer_index_array - a desc_count sized array of ints, used to - indicate which of the IO buffer's partitions are available to use. - * buffer_index_lock - a spinlock to protect buffer_index_array during update. 
- * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element - int array used to indicate which of the readdir buffer's partitions are - available to use. - * readdir_index_lock - a spinlock to protect readdir_index_array during - update. - -OPERATIONS: - -The kernel module builds an "op" (struct orangefs_kernel_op_s) when it -needs to communicate with userspace. Part of the op contains the "upcall" -which expresses the request to userspace. Part of the op eventually -contains the "downcall" which expresses the results of the request. - -The slab allocator is used to keep a cache of op structures handy. - -At init time the kernel module defines and initializes a request list -and an in_progress hash table to keep track of all the ops that are -in flight at any given time. - -Ops are stateful: - - * unknown - op was just initialized - * waiting - op is on request_list (upward bound) - * inprogr - op is in progress (waiting for downcall) - * serviced - op has matching downcall; ok - * purged - op has to start a timer since client-core - exited uncleanly before servicing op - * given up - submitter has given up waiting for it - -When some arbitrary userspace program needs to perform a -filesystem operation on Orangefs (readdir, I/O, create, whatever) -an op structure is initialized and tagged with a distinguishing ID -number. The upcall part of the op is filled out, and the op is -passed to the "service_operation" function. - -Service_operation changes the op's state to "waiting", puts -it on the request list, and signals the Orangefs file_operations.poll -function through a wait queue. Userspace is polling the pseudo-device -and thus becomes aware of the upcall request that needs to be read. - -When the Orangefs file_operations.read function is triggered, the -request list is searched for an op that seems ready-to-process. -The op is removed from the request list. 
The tag from the op and -the filled-out upcall struct are copy_to_user'ed back to userspace. - -If any of these (and some additional protocol) copy_to_users fail, -the op's state is set to "waiting" and the op is added back to -the request list. Otherwise, the op's state is changed to "in progress", -and the op is hashed on its tag and put onto the end of a list in the -in_progress hash table at the index the tag hashed to. - -When userspace has assembled the response to the upcall, it -writes the response, which includes the distinguishing tag, back to -the pseudo device in a series of io_vecs. This triggers the Orangefs -file_operations.write_iter function to find the op with the associated -tag and remove it from the in_progress hash table. As long as the op's -state is not "canceled" or "given up", its state is set to "serviced". -The file_operations.write_iter function returns to the waiting vfs, -and back to service_operation through wait_for_matching_downcall. - -Service operation returns to its caller with the op's downcall -part (the response to the upcall) filled out. - -The "client-core" is the bridge between the kernel module and -userspace. The client-core is a daemon. The client-core has an -associated watchdog daemon. If the client-core is ever signaled -to die, the watchdog daemon restarts the client-core. Even though -the client-core is restarted "right away", there is a period of -time during such an event that the client-core is dead. A dead client-core -can't be triggered by the Orangefs file_operations.poll function. -Ops that pass through service_operation during a "dead spell" can timeout -on the wait queue and one attempt is made to recycle them. Obviously, -if the client-core stays dead too long, the arbitrary userspace processes -trying to use Orangefs will be negatively affected. Waiting ops -that can't be serviced will be removed from the request list and -have their states set to "given up". 
In-progress ops that can't -be serviced will be removed from the in_progress hash table and -have their states set to "given up". - -Readdir and I/O ops are atypical with respect to their payloads. - - - readdir ops use the smaller of the two pre-allocated pre-partitioned - memory buffers. The readdir buffer is only available to userspace. - The kernel module obtains an index to a free partition before launching - a readdir op. Userspace deposits the results into the indexed partition - and then writes them to back to the pvfs device. - - - io (read and write) ops use the larger of the two pre-allocated - pre-partitioned memory buffers. The IO buffer is accessible from - both userspace and the kernel module. The kernel module obtains an - index to a free partition before launching an io op. The kernel module - deposits write data into the indexed partition, to be consumed - directly by userspace. Userspace deposits the results of read - requests into the indexed partition, to be consumed directly - by the kernel module. - -Responses to kernel requests are all packaged in pvfs2_downcall_t -structs. Besides a few other members, pvfs2_downcall_t contains a -union of structs, each of which is associated with a particular -response type. - -The several members outside of the union are: - - int32_t type - type of operation. - - int32_t status - return code for the operation. - - int64_t trailer_size - 0 unless readdir operation. - - char *trailer_buf - initialized to NULL, used during readdir operations. - -The appropriate member inside the union is filled out for any -particular response. - - PVFS2_VFS_OP_FILE_IO - fill a pvfs2_io_response_t - - PVFS2_VFS_OP_LOOKUP - fill a PVFS_object_kref - - PVFS2_VFS_OP_CREATE - fill a PVFS_object_kref - - PVFS2_VFS_OP_SYMLINK - fill a PVFS_object_kref - - PVFS2_VFS_OP_GETATTR - fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) - fill in a string with the link target when the object is a symlink. 
- - PVFS2_VFS_OP_MKDIR - fill a PVFS_object_kref - - PVFS2_VFS_OP_STATFS - fill a pvfs2_statfs_response_t with useless info . It is hard for - us to know, in a timely fashion, these statistics about our - distributed network filesystem. - - PVFS2_VFS_OP_FS_MOUNT - fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref - except its members are in a different order and "__pad1" is replaced - with "id". - - PVFS2_VFS_OP_GETXATTR - fill a pvfs2_getxattr_response_t - - PVFS2_VFS_OP_LISTXATTR - fill a pvfs2_listxattr_response_t - - PVFS2_VFS_OP_PARAM - fill a pvfs2_param_response_t - - PVFS2_VFS_OP_PERF_COUNT - fill a pvfs2_perf_count_response_t - - PVFS2_VFS_OP_FSKEY - file a pvfs2_fs_key_response_t - - PVFS2_VFS_OP_READDIR - jamb everything needed to represent a pvfs2_readdir_response_t into - the readdir buffer descriptor specified in the upcall. - -Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests -made by the kernel side. - -A buffer_list containing: - - a pointer to the prepared response to the request from the - kernel (struct pvfs2_downcall_t). - - and also, in the case of a readdir request, a pointer to a - buffer containing descriptors for the objects in the target - directory. -... is sent to the function (PINT_dev_write_list) which performs -the writev. 
- -PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; - -The first four elements of io_array are initialized like this for all -responses: - - io_array[0].iov_base = address of local variable "proto_ver" (int32_t) - io_array[0].iov_len = sizeof(int32_t) - - io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) - io_array[1].iov_len = sizeof(int32_t) - - io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) - io_array[2].iov_len = sizeof(int64_t) - - io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) - of global variable vfs_request (vfs_request_t) - io_array[3].iov_len = sizeof(pvfs2_downcall_t) - -Readdir responses initialize the fifth element io_array like this: - - io_array[4].iov_base = contents of member trailer_buf (char *) - from out_downcall member of global variable - vfs_request - io_array[4].iov_len = contents of member trailer_size (PVFS_size) - from out_downcall member of global variable - vfs_request - -Orangefs exploits the dcache in order to avoid sending redundant -requests to userspace. We keep object inode attributes up-to-date with -orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to -help it decide whether or not to update an inode: "new" and "bypass". -Orangefs keeps private data in an object's inode that includes a short -timeout value, getattr_time, which allows any iteration of -orangefs_inode_getattr to know how long it has been since the inode was -updated. When the object is not new (new == 0) and the bypass flag is not -set (bypass == 0) orangefs_inode_getattr returns without updating the inode -if getattr_time has not timed out. Getattr_time is updated each time the -inode is updated. - -Creation of a new object (file, dir, sym-link) includes the evaluation of -its pathname, resulting in a negative directory entry for the object. 
-A new inode is allocated and associated with the dentry, turning it from -a negative dentry into a "productive full member of society". Orangefs -obtains the new inode from Linux with new_inode() and associates -the inode with the dentry by sending the pair back to Linux with -d_instantiate(). - -The evaluation of a pathname for an object resolves to its corresponding -dentry. If there is no corresponding dentry, one is created for it in -the dcache. Whenever a dentry is modified or verified Orangefs stores a -short timeout value in the dentry's d_time, and the dentry will be trusted -for that amount of time. Orangefs is a network filesystem, and objects -can potentially change out-of-band with any particular Orangefs kernel module -instance, so trusting a dentry is risky. The alternative to trusting -dentries is to always obtain the needed information from userspace - at -least a trip to the client-core, maybe to the servers. Obtaining information -from a dentry is cheap, obtaining it from userspace is relatively expensive, -hence the motivation to use the dentry when possible. - -The timeout values d_time and getattr_time are jiffy based, and the -code is designed to avoid the jiffy-wrap problem: - -"In general, if the clock may have wrapped around more than once, there -is no way to tell how much time has elapsed. However, if the times t1 -and t2 are known to be fairly close, we can reliably compute the -difference in a way that takes into account the possibility that the -clock may have wrapped between times." - - from course notes by instructor Andy Wang - -- cgit From c33e97efa9d9de538e5f0afe6cb07f83afcd5b68 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:18 +0100 Subject: docs: filesystems: convert proc.txt to ReST This document has a nice format! Unfortunately, not exactly ReST. 
So, several adjustments were required: - Add a SPDX header; - Adjust document and section titles; - Whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add table captions; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/1d113d860188de416ca3b0b97371dc2195433d5b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/proc.rst | 2169 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/proc.txt | 2047 --------------------------------- 3 files changed, 2170 insertions(+), 2047 deletions(-) create mode 100644 Documentation/filesystems/proc.rst delete mode 100644 Documentation/filesystems/proc.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index fed53f831192..671906e2fee6 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -81,5 +81,6 @@ Documentation for filesystem implementations. omfs orangefs overlayfs + proc virtiofs vfat diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst new file mode 100644 index 000000000000..38b606991065 --- /dev/null +++ b/Documentation/filesystems/proc.rst @@ -0,0 +1,2169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +The /proc Filesystem +==================== + +===================== ======================================= ================ +/proc/sys Terrehon Bowden , October 7 1999 + Bodo Bauer +2.4.x update Jorge Nerin November 14 2000 +move /proc/sys Shen Feng April 1 2009 +fixes/update part 1.1 Stefani Seibold June 9 2009 +===================== ======================================= ================ + + + +.. 
Table of Contents + + 0 Preface + 0.1 Introduction/Credits + 0.2 Legal Stuff + + 1 Collecting System Information + 1.1 Process-Specific Subdirectories + 1.2 Kernel data + 1.3 IDE devices in /proc/ide + 1.4 Networking info in /proc/net + 1.5 SCSI info + 1.6 Parallel port info in /proc/parport + 1.7 TTY info in /proc/tty + 1.8 Miscellaneous kernel statistics in /proc/stat + 1.9 Ext4 file system parameters + + 2 Modifying System Parameters + + 3 Per-Process Parameters + 3.1 /proc//oom_adj & /proc//oom_score_adj - Adjust the oom-killer + score + 3.2 /proc//oom_score - Display current oom-killer score + 3.3 /proc//io - Display the IO accounting fields + 3.4 /proc//coredump_filter - Core dump filtering settings + 3.5 /proc//mountinfo - Information about mounts + 3.6 /proc//comm & /proc//task//comm + 3.7 /proc//task//children - Information about task children + 3.8 /proc//fdinfo/ - Information about opened file + 3.9 /proc//map_files - Information about memory mapped files + 3.10 /proc//timerslack_ns - Task timerslack value + 3.11 /proc//patch_state - Livepatch patch operation state + 3.12 /proc//arch_status - Task architecture specific information + + 4 Configuring procfs + 4.1 Mount options + +Preface +======= + +0.1 Introduction/Credits +------------------------ + +This documentation is part of a soon (or so we hope) to be released book on +the SuSE Linux distribution. As there is no complete documentation for the +/proc file system and we've used many freely available sources to write these +chapters, it seems only fair to give the work back to the Linux community. +This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm +afraid it's still far from complete, but we hope it will be useful. As far as +we know, it is the first 'all-in-one' document about the /proc file system. It +is focused on the Intel x86 hardware, so if you are looking for PPC, ARM, +SPARC, AXP, etc., features, you probably won't find what you are looking for. 
+It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But +additions and patches are welcome and will be added to this document if you +mail them to Bodo. + +We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of +other people for help compiling this documentation. We'd also like to extend a +special thank you to Andi Kleen for documentation, which we relied on heavily +to create this document, as well as the additional information he provided. +Thanks to everybody else who contributed source or docs to the Linux kernel +and helped create a great piece of software... :) + +If you have any comments, corrections or additions, please don't hesitate to +contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this +document. + +The latest version of this document is available online at +http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html + +If the above link does not work for you, you could try the kernel +mailing list at linux-kernel@vger.kernel.org and/or try to reach me at +comandante@zaralinux.com. + +0.2 Legal Stuff +--------------- + +We don't guarantee the correctness of this document, and if you come to us +complaining about how you screwed up your system because of incorrect +documentation, we won't feel responsible... + +Chapter 1: Collecting System Information +======================================== + +In This Chapter +--------------- +* Investigating the properties of the pseudo file system /proc and its + ability to provide information on the running Linux system +* Examining /proc's structure +* Uncovering various information about the kernel and the processes running + on the system + +------------------------------------------------------------------------------ + +The proc file system acts as an interface to internal data structures in the +kernel. It can be used to obtain information about the system and to change +certain kernel parameters at runtime (sysctl). 
+ +First, we'll take a look at the read-only parts of /proc. In Chapter 2, we +show you how you can use /proc/sys to change settings. + +1.1 Process-Specific Subdirectories +----------------------------------- + +The directory /proc contains (among other things) one subdirectory for each +process running on the system, which is named after the process ID (PID). + +The link self points to the process reading the file system. Each process +subdirectory has the entries listed in Table 1-1. + +Note that an open file descriptor to /proc/<pid> or to any of its +contained files or subdirectories does not prevent <pid> being reused +for some other process in the event that <pid> exits. Operations on +open /proc/<pid> file descriptors corresponding to dead processes +never act on any new process that the kernel may, through chance, have +also assigned the process ID <pid>. Instead, operations on these FDs +usually fail with ESRCH. + +.. table:: Table 1-1: Process specific entries in /proc + + ============= =============================================================== + File Content + ============= =============================================================== + clear_refs Clears page referenced bits shown in smaps output + cmdline Command line arguments + cpu Current and last cpu in which it was executed (2.4)(smp) + cwd Link to the current working directory + environ Values of environment variables + exe Link to the executable of this process + fd Directory, which contains all file descriptors + maps Memory maps to executables and library files (2.4) + mem Memory held by this process + root Link to the root directory of this process + stat Process status + statm Process memory status information + status Process status in human readable form + wchan Present with CONFIG_KALLSYMS=y: it shows the kernel function + symbol the task is blocked in - or "0" if not blocked. 
+ pagemap Page table + stack Report full stack trace, enable via CONFIG_STACKTRACE + smaps An extension based on maps, showing the memory consumption of + each mapping and flags associated with it + smaps_rollup Accumulated smaps stats for all mappings of the process. This + can be derived from smaps, but is faster and more convenient + numa_maps An extension based on maps, showing the memory locality and + binding policy as well as mem usage (in pages) of each mapping. + ============= =============================================================== + +For example, to get the status information of a process, all you have to do is +read the file /proc/PID/status:: + + >cat /proc/self/status + Name: cat + State: R (running) + Tgid: 5452 + Pid: 5452 + PPid: 743 + TracerPid: 0 (2.4) + Uid: 501 501 501 501 + Gid: 100 100 100 100 + FDSize: 256 + Groups: 100 14 16 + VmPeak: 5004 kB + VmSize: 5004 kB + VmLck: 0 kB + VmHWM: 476 kB + VmRSS: 476 kB + RssAnon: 352 kB + RssFile: 120 kB + RssShmem: 4 kB + VmData: 156 kB + VmStk: 88 kB + VmExe: 68 kB + VmLib: 1412 kB + VmPTE: 20 kB + VmSwap: 0 kB + HugetlbPages: 0 kB + CoreDumping: 0 + THP_enabled: 1 + Threads: 1 + SigQ: 0/28578 + SigPnd: 0000000000000000 + ShdPnd: 0000000000000000 + SigBlk: 0000000000000000 + SigIgn: 0000000000000000 + SigCgt: 0000000000000000 + CapInh: 00000000fffffeff + CapPrm: 0000000000000000 + CapEff: 0000000000000000 + CapBnd: ffffffffffffffff + CapAmb: 0000000000000000 + NoNewPrivs: 0 + Seccomp: 0 + Speculation_Store_Bypass: thread vulnerable + voluntary_ctxt_switches: 0 + nonvoluntary_ctxt_switches: 1 + +This shows you nearly the same information you would get if you viewed it with +the ps command. In fact, ps uses the proc file system to obtain its +information. But you get a more detailed view of the process by reading the +file /proc/PID/status. Its fields are described in Table 1-2. + +The statm file contains more detailed information about the process +memory usage. 
Its seven fields are explained in Table 1-3. The stat file +contains detailed information about the process itself. Its fields are +explained in Table 1-4. + +(for SMP CONFIG users) + +To make accounting scalable, RSS-related information is handled in an +asynchronous manner and the value may not be very precise. To see a precise +snapshot of a moment, you can read the /proc/<pid>/smaps file, which scans the +page table. It's slow but very precise. + +.. table:: Table 1-2: Contents of the status files (as of 4.19) + + ========================== =================================================== + Field Content + ========================== =================================================== + Name filename of the executable + Umask file mode creation mask + State state (R is running, S is sleeping, D is sleeping + in an uninterruptible wait, Z is zombie, + T is traced or stopped) + Tgid thread group ID + Ngid NUMA group ID (0 if none) + Pid process id + PPid process id of the parent process + TracerPid PID of process tracing this process (0 if not) + Uid Real, effective, saved set, and file system UIDs + Gid Real, effective, saved set, and file system GIDs + FDSize number of file descriptor slots currently allocated + Groups supplementary group list + NStgid descendant namespace thread group ID hierarchy + NSpid descendant namespace process ID hierarchy + NSpgid descendant namespace process group ID hierarchy + NSsid descendant namespace session ID hierarchy + VmPeak peak virtual memory size + VmSize total program size + VmLck locked memory size + VmPin pinned memory size + VmHWM peak resident set size ("high water mark") + VmRSS size of memory portions. 
It contains the three + following parts + (VmRSS = RssAnon + RssFile + RssShmem) + RssAnon size of resident anonymous memory + RssFile size of resident file mappings + RssShmem size of resident shmem memory (includes SysV shm, + mapping of tmpfs and shared anonymous mappings) + VmData size of private data segments + VmStk size of stack segments + VmExe size of text segment + VmLib size of shared library code + VmPTE size of page table entries + VmSwap amount of swap used by anonymous private data + (shmem swap usage is not included) + HugetlbPages size of hugetlb memory portions + CoreDumping process's memory is currently being dumped + (killing the process may lead to a corrupted core) + THP_enabled process is allowed to use THP (returns 0 when + PR_SET_THP_DISABLE is set on the process) + Threads number of threads + SigQ number of signals queued/max. number for queue + SigPnd bitmap of pending signals for the thread + ShdPnd bitmap of shared pending signals for the process + SigBlk bitmap of blocked signals + SigIgn bitmap of ignored signals + SigCgt bitmap of caught signals + CapInh bitmap of inheritable capabilities + CapPrm bitmap of permitted capabilities + CapEff bitmap of effective capabilities + CapBnd bitmap of capabilities bounding set + CapAmb bitmap of ambient capabilities + NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIVS, ...) + Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) + Speculation_Store_Bypass speculative store bypass mitigation status + Cpus_allowed mask of CPUs on which this process may run + Cpus_allowed_list Same as previous, but in "list format" + Mems_allowed mask of memory nodes allowed to this process + Mems_allowed_list Same as previous, but in "list format" + voluntary_ctxt_switches number of voluntary context switches + nonvoluntary_ctxt_switches number of non voluntary context switches + ========================== =================================================== + + +.. 
table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3) + + ======== =============================== ============================== + Field Content + ======== =============================== ============================== + size total program size (pages) (same as VmSize in status) + resident size of memory portions (pages) (same as VmRSS in status) + shared number of pages that are shared (i.e. backed by a file, same + as RssFile+RssShmem in status) + trs number of pages that are 'code' (not including libs; broken, + includes data segment) + lrs number of pages of library (always 0 on 2.6) + drs number of pages of data/stack (including libs; broken, + includes library text) + dt number of dirty pages (always 0 on 2.6) + ======== =============================== ============================== + + +.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7) + + ============= =============================================================== + Field Content + ============= =============================================================== + pid process id + tcomm filename of the executable + state state (R is running, S is sleeping, D is sleeping in an + uninterruptible wait, Z is zombie, T is traced or stopped) + ppid process id of the parent process + pgrp pgrp of the process + sid session id + tty_nr tty the process uses + tty_pgrp pgrp of the tty + flags task flags + min_flt number of minor faults + cmin_flt number of minor faults with child's + maj_flt number of major faults + cmaj_flt number of major faults with child's + utime user mode jiffies + stime kernel mode jiffies + cutime user mode jiffies with child's + cstime kernel mode jiffies with child's + priority priority level + nice nice level + num_threads number of threads + it_real_value (obsolete, always 0) + start_time time the process started after system boot + vsize virtual memory size + rss resident set memory size + rsslim current limit in bytes on the rss + start_code address above which 
program text can run + end_code address below which program text can run + start_stack address of the start of the main process stack + esp current value of ESP + eip current value of EIP + pending bitmap of pending signals + blocked bitmap of blocked signals + sigign bitmap of ignored signals + sigcatch bitmap of caught signals + 0 (place holder, used to be the wchan address, + use /proc/PID/wchan instead) + 0 (place holder) + 0 (place holder) + exit_signal signal to send to parent thread on exit + task_cpu which CPU the task is scheduled on + rt_priority realtime priority + policy scheduling policy (man sched_setscheduler) + blkio_ticks time spent waiting for block IO + gtime guest time of the task in jiffies + cgtime guest time of the task children in jiffies + start_data address above which program data+bss is placed + end_data address below which program data+bss is placed + start_brk address above which program heap can be expanded with brk() + arg_start address above which program command line is placed + arg_end address below which program command line is placed + env_start address above which program environment is placed + env_end address below which program environment is placed + exit_code the thread's exit_code in the form reported by the waitpid + system call + ============= =============================================================== + +The /proc/PID/maps file contains the currently mapped memory regions and +their access permissions. 
+ +The format is:: + + address perms offset dev inode pathname + + 08048000-08049000 r-xp 00000000 03:00 8312 /opt/test + 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test + 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] + a7cb1000-a7cb2000 ---p 00000000 00:00 0 + a7cb2000-a7eb2000 rw-p 00000000 00:00 0 + a7eb2000-a7eb3000 ---p 00000000 00:00 0 + a7eb3000-a7ed5000 rw-p 00000000 00:00 0 + a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 + a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 + a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 + a800b000-a800e000 rw-p 00000000 00:00 0 + a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 + a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 + a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 + a8024000-a8027000 rw-p 00000000 00:00 0 + a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 + a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 + a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 + aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] + ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] + +where "address" is the range of the process's address space that the mapping +occupies, "perms" is a set of permissions:: + + r = read + w = write + x = execute + s = shared + p = private (copy on write) + +"offset" is the offset into the mapping, "dev" is the device (major:minor), and +"inode" is the inode on that device. 0 indicates that no inode is associated +with the memory region, as the case would be with BSS (uninitialized data). +The "pathname" shows the name of the file associated with this mapping. If the +mapping is not associated with a file, it is one of: + + ======= ==================================== + [heap] the heap of the program + [stack] the stack of the main process + [vdso] the "virtual dynamic shared object", + the kernel system call handler + ======= ==================================== + + or if empty, the mapping is anonymous. 
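The six-column format described above can be split mechanically. The following is a minimal sketch in Python, not part of the kernel documentation; the names MapEntry and parse_maps_line are hypothetical and chosen only for illustration. It assumes the pathname column is optional and may contain spaces, so it splits at most five times and keeps the remainder as the pathname.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    """One line of /proc/PID/maps (hypothetical helper type)."""
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    pathname: str  # empty string for anonymous mappings


def parse_maps_line(line: str) -> MapEntry:
    """Split one maps line into the six documented fields."""
    # The pathname is optional and may itself contain spaces, so split
    # at most five times and treat whatever remains as the pathname.
    parts = line.split(None, 5)
    addr, perms, offset, dev, inode = parts[:5]
    pathname = parts[5].strip() if len(parts) > 5 else ""
    # Addresses and the offset are hexadecimal; the inode is decimal.
    start, end = (int(x, 16) for x in addr.split("-"))
    return MapEntry(start, end, perms, int(offset, 16), dev, int(inode), pathname)


# Sample line taken from the listing above.
entry = parse_maps_line("08048000-08049000 r-xp 00000000 03:00 8312 /opt/test")
print(hex(entry.start), hex(entry.end), entry.perms, entry.pathname)
```

On a live system the same function could be applied to each line read from /proc/self/maps; as noted elsewhere in this document, such reads are inherently racy while the memory map is being modified.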
+ +The /proc/PID/smaps is an extension based on maps, showing the memory +consumption for each of the process's mappings. For each mapping (aka Virtual +Memory Area, or VMA) there is a series of lines such as the following:: + + 08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash + + Size: 1084 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB + Rss: 892 kB + Pss: 374 kB + Shared_Clean: 892 kB + Shared_Dirty: 0 kB + Private_Clean: 0 kB + Private_Dirty: 0 kB + Referenced: 892 kB + Anonymous: 0 kB + LazyFree: 0 kB + AnonHugePages: 0 kB + ShmemPmdMapped: 0 kB + Shared_Hugetlb: 0 kB + Private_Hugetlb: 0 kB + Swap: 0 kB + SwapPss: 0 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB + Locked: 0 kB + THPeligible: 0 + VmFlags: rd ex mr mw me dw + +The first of these lines shows the same information as is displayed for the +mapping in /proc/PID/maps. Following lines show the size of the mapping +(size); the size of each page allocated when backing a VMA (KernelPageSize), +which is usually the same as the size in the page table entries; the page size +used by the MMU when backing a VMA (in most cases, the same as KernelPageSize); +the amount of the mapping that is currently resident in RAM (RSS); the +process' proportional share of this mapping (PSS); and the number of clean and +dirty shared and private pages in the mapping. + +The "proportional set size" (PSS) of a process is the count of pages it has +in memory, where each page is divided by the number of processes sharing it. +So if a process has 1000 pages all to itself, and 1000 shared with one other +process, its PSS will be 1500. + +Note that even a page which is part of a MAP_SHARED mapping, but has only +a single pte mapped, i.e. is currently used by only one process, is accounted +as private and not as shared. + +"Referenced" indicates the amount of memory currently marked as referenced or +accessed. + +"Anonymous" shows the amount of memory that does not belong to any file. 
Even +a mapping associated with a file may contain anonymous pages: when a mapping is +MAP_PRIVATE and a page is modified, the file page is replaced by a private +anonymous copy. + +"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE). +The memory isn't freed immediately with madvise(). It's freed under memory +pressure if the memory is clean. Please note that the printed value might +be lower than the real value due to optimizations used in the current +implementation. If this is not desirable please file a bug report. + +"AnonHugePages" shows the amount of memory backed by transparent hugepage. + +"ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by +huge pages. + +"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by +hugetlbfs pages, which are *not* counted in the "RSS" or "PSS" fields for +historical reasons. They are also not included in the +{Shared,Private}_{Clean,Dirty} fields. + +"Swap" shows how much would-be-anonymous memory is also used, but out on swap. + +For shmem mappings, "Swap" includes also the size of the mapped (and not +replaced by copy-on-write) part of the underlying shmem object out on swap. +"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this +does not take into account swapped out pages of underlying shmem objects. +"Locked" indicates whether the mapping is locked in memory or not. +"THPeligible" indicates whether the mapping is eligible for allocating THP +pages - 1 if true, 0 otherwise. It just shows the current status. + +"VmFlags" field deserves a separate description. This member represents the +kernel flags associated with the particular virtual memory area in two letter +encoded manner. 
The codes are the following: + + == ======================================= + rd readable + wr writeable + ex executable + sh shared + mr may read + mw may write + me may execute + ms may share + gd stack segment grows down + pf pure PFN range + dw disabled write to the mapped file + lo pages are locked in memory + io memory mapped I/O area + sr sequential read advise provided + rr random read advise provided + dc do not copy area on fork + de do not expand area on remapping + ac area is accountable + nr swap space is not reserved for the area + ht area uses huge tlb pages + ar architecture specific flag + dd do not include area into core dump + sd soft dirty flag + mm mixed map area + hg huge page advise flag + nh no huge page advise flag + mg mergeable advise flag + == ======================================= + +Note that there is no guarantee that every flag and associated mnemonic will +be present in all further kernel releases. Things change: flags may +vanish or, conversely, new ones may be added. Interpretation of their meaning +might change in future as well. So each consumer of these flags has to +follow each specific kernel version for the exact semantic. + +This file is only present if the CONFIG_MMU kernel configuration option is +enabled. + +Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent +output can be achieved only in the single read call). + +This typically manifests when doing partial reads of these files while the +memory map is being modified. Despite the races, we do provide the following +guarantees: + +1) The mapped addresses never go backwards, which implies no two + regions will ever overlap. +2) If there is something at a given vaddr during the entirety of the + life of the smaps/maps walk, there will be some output for it. + +The /proc/PID/smaps_rollup file includes the same fields as /proc/PID/smaps, +but their values are the sums of the corresponding values for all mappings of +the process. 
Additionally, it contains these fields: + +- Pss_Anon +- Pss_File +- Pss_Shmem + +They represent the proportional shares of anonymous, file, and shmem pages, as +described for smaps above. These fields are omitted in smaps since each +mapping identifies the type (anon, file, or shmem) of all pages it contains. +Thus all information in smaps_rollup can be derived from smaps, but at a +significantly higher cost. + +The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG +bits on both physical and virtual pages associated with a process, and the +soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst +for details). +To clear the bits for all the pages associated with the process:: + + > echo 1 > /proc/PID/clear_refs + +To clear the bits for the anonymous pages associated with the process:: + + > echo 2 > /proc/PID/clear_refs + +To clear the bits for the file mapped pages associated with the process:: + + > echo 3 > /proc/PID/clear_refs + +To clear the soft-dirty bit:: + + > echo 4 > /proc/PID/clear_refs + +To reset the peak resident set size ("high water mark") to the process's +current value:: + + > echo 5 > /proc/PID/clear_refs + +Any other value written to /proc/PID/clear_refs will have no effect. + +The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags +using /proc/kpageflags and number of times a page is mapped using +/proc/kpagecount. For detailed explanation, see +Documentation/admin-guide/mm/pagemap.rst. + +The /proc/pid/numa_maps is an extension based on maps, showing the memory +locality and binding policy, as well as the memory usage (in pages) of +each mapping. 
The output follows a general format where mapping details get +summarized separated by blank spaces, one mapping per each file line:: + + address policy mapping details + + 00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4 + 00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4 + 320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4 + 320698b000 default file=/lib64/libc-2.12.so + 3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4 + 3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4 + 7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4 + 7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4 + 7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048 + 7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4 + 7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4 + +Where: + +"address" is the starting address for the mapping; + +"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst); + +"mapping details" summarizes mapping data such as mapping type, page usage counters, +node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page +size, in KB, that is backing the mapping up. + +1.2 Kernel data +--------------- + +Similar to the process entries, the kernel data files give information about +the running kernel. 
The files used to obtain this information are contained in +/proc and are listed in Table 1-5. Not all of these will be present in your +system. Which files are present depends on the kernel configuration and the +loaded modules. + +.. table:: Table 1-5: Kernel info in /proc + + ============ =============================================================== + File Content + ============ =============================================================== + apm Advanced power management info + buddyinfo Kernel memory allocator information (see text) (2.5) + bus Directory containing bus specific information + cmdline Kernel command line + cpuinfo Info about the CPU + devices Available devices (block and character) + dma Used DMA channels + filesystems Supported filesystems + driver Various drivers grouped here, currently rtc (2.4) + execdomains Execdomains, related to security (2.4) + fb Frame Buffer devices (2.4) + fs File system parameters, currently nfs/exports (2.4) + ide Directory containing info about the IDE subsystem + interrupts Interrupt usage + iomem Memory map (2.4) + ioports I/O port usage + irq Masks for irq to cpu affinity (2.4)(smp?) 
+ isapnp ISA PnP (Plug&Play) Info (2.4) + kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4)) + kmsg Kernel messages + ksyms Kernel symbol table + loadavg Load average of last 1, 5 & 15 minutes + locks Kernel locks + meminfo Memory info + misc Miscellaneous + modules List of loaded modules + mounts Mounted filesystems + net Networking info (see text) + pagetypeinfo Additional page allocator information (see text) (2.5) + partitions Table of partitions known to the system + pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, + decoupled by lspci) (2.4) + rtc Real time clock + scsi SCSI info (see text) + slabinfo Slab pool info + softirqs softirq usage + stat Overall statistics + swaps Swap space utilization + sys See chapter 2 + sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) + tty Info of tty drivers + uptime Wall clock since boot, combined idle time of all cpus + version Kernel version + video bttv info of video resources (2.4) + vmallocinfo Show vmalloced areas + ============ =============================================================== + +You can, for example, check which interrupts are currently in use and what +they are used for by looking in the file /proc/interrupts:: + + > cat /proc/interrupts + CPU0 + 0: 8728810 XT-PIC timer + 1: 895 XT-PIC keyboard + 2: 0 XT-PIC cascade + 3: 531695 XT-PIC aha152x + 4: 2014133 XT-PIC serial + 5: 44401 XT-PIC pcnet_cs + 8: 2 XT-PIC rtc + 11: 8 XT-PIC i82365 + 12: 182918 XT-PIC PS/2 Mouse + 13: 1 XT-PIC fpu + 14: 1232265 XT-PIC ide0 + 15: 7 XT-PIC ide1 + NMI: 0 + +In 2.4.* a couple of lines were added to this file, LOC & ERR (this time it is +the output of an SMP machine):: + + > cat /proc/interrupts + + CPU0 CPU1 + 0: 1243498 1214548 IO-APIC-edge timer + 1: 8949 8958 IO-APIC-edge keyboard + 2: 0 0 XT-PIC cascade + 5: 11286 10161 IO-APIC-edge soundblaster + 8: 1 0 IO-APIC-edge rtc + 9: 27422 27407 IO-APIC-edge 3c503 + 12: 113645 113873 IO-APIC-edge PS/2 Mouse + 13: 0 0 XT-PIC fpu + 14: 22491 24012 
IO-APIC-edge ide0 + 15: 2183 2415 IO-APIC-edge ide1 + 17: 30564 30414 IO-APIC-level eth0 + 18: 177 164 IO-APIC-level bttv + NMI: 2457961 2457959 + LOC: 2457882 2457881 + ERR: 2155 + +NMI is incremented in this case because every timer interrupt generates an NMI +(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups. + +LOC is the local interrupt counter of the internal APIC of every CPU. + +ERR is incremented in the case of errors in the IO-APIC bus (the bus that +connects the CPUs in an SMP system). This means that an error has been detected; +the IO-APIC automatically retries the transmission, so it should not be a big +problem, but you should read the SMP-FAQ. + +In 2.6.2* /proc/interrupts was expanded again. This time the goal was for +/proc/interrupts to display every IRQ vector in use by the system, not +just those considered 'most important'. The new vectors are: + +THR + interrupt raised when a machine check threshold counter + (typically counting ECC corrected errors of memory or cache) exceeds + a configurable threshold. Only available on some systems. + +TRM + a thermal event interrupt occurs when a temperature threshold + has been exceeded for the CPU. This interrupt may also be generated + when the temperature drops back to normal. + +SPU + a spurious interrupt is some interrupt that was raised then lowered + by some IO device before it could be fully processed by the APIC. Hence + the APIC sees the interrupt but does not know what device it came from. + For this case the APIC will generate the interrupt with an IRQ vector + of 0xff. This might also be generated by chipset bugs. + +RES, CAL, TLB + rescheduling, call and TLB flush interrupts are + sent from one CPU to another per the needs of the OS. Typically, + their statistics are used by kernel developers and interested users to + determine the occurrence of interrupts of the given type. + +The above IRQ vectors are displayed only when relevant. 
For example, +the threshold vector does not exist on x86_64 platforms. Others are +suppressed when the system is a uniprocessor. As of this writing, only +i386 and x86_64 platforms support the new IRQ vector displays. + +Of some interest is the introduction of the /proc/irq directory to 2.4. +It can be used to set IRQ to CPU affinity. This means that you can "hook" an +IRQ to only one CPU, or exclude a CPU from handling IRQs. The irq subdir +contains one subdir for each IRQ, and two files: default_smp_affinity and +prof_cpu_mask. + +For example:: + + > ls /proc/irq/ + 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask + 1 11 13 15 17 19 3 5 7 9 default_smp_affinity + > ls /proc/irq/0/ + smp_affinity + +smp_affinity is a bitmask, in which you can specify which CPUs can handle the +IRQ. You can set it by doing:: + + > echo 1 > /proc/irq/10/smp_affinity + +This means that only the first CPU will handle the IRQ, but you can also echo +5 which means that only the first and third CPU can handle the IRQ. + +The contents of each smp_affinity file are the same by default:: + + > cat /proc/irq/0/smp_affinity + ffffffff + +There is an alternate interface, smp_affinity_list which allows specifying +a cpu range instead of a bitmask:: + + > cat /proc/irq/0/smp_affinity_list + 1024-1031 + +The default_smp_affinity mask applies to all non-active IRQs, which are the +IRQs which have not yet been allocated/activated, and hence which lack a +/proc/irq/[0-9]* directory. + +The node file on an SMP system shows the node to which the device using the IRQ +reports itself as being attached. This hardware locality information does not +include information about any possible driver locality preference. + +prof_cpu_mask specifies which CPUs are to be profiled by the system wide +profiler. Default value is ffffffff (all cpus if there are only 32 of them). + +The way IRQs are routed is handled by the IO-APIC, and it's Round Robin +between all the CPUs which are allowed to handle it. 
As usual the kernel has +more info than you and does a better job than you, so the defaults are the +best choice for almost everyone. [Note this applies only to those IO-APIC's +that support "Round Robin" interrupt distribution.] + +There are three more important subdirectories in /proc: net, scsi, and sys. +The general rule is that the contents, or even the existence of these +directories, depend on your kernel configuration. If SCSI is not enabled, the +directory scsi may not exist. The same is true with the net, which is there +only when networking support is present in the running kernel. + +The slabinfo file gives information about memory usage at the slab level. +Linux uses slab pools for memory management above page level in version 2.2. +Commonly used objects have their own slab pool (such as network buffers, +directory cache, and so on). + +:: + + > cat /proc/buddyinfo + + Node 0, zone DMA 0 4 5 4 4 3 ... + Node 0, zone Normal 1 0 0 1 101 8 ... + Node 0, zone HighMem 2 0 0 1 1 0 ... + +External fragmentation is a problem under some workloads, and buddyinfo is a +useful tool for helping diagnose these problems. Buddyinfo will give you a +clue as to how big an area you can safely allocate, or why a previous +allocation failed. + +Each column represents the number of pages of a certain order which are +available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in +ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE +available in ZONE_NORMAL, etc... 
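The per-order arithmetic above can be checked with a short script. This is a minimal sketch in Python, not part of the kernel documentation; parse_buddyinfo_line and free_bytes are hypothetical helper names, and the 4096-byte page size is an assumption (the real PAGE_SIZE depends on the architecture and configuration). Only the columns actually shown in the truncated sample line are used.

```python
def parse_buddyinfo_line(line):
    """Return (node, zone, counts) for one line of /proc/buddyinfo."""
    # Line layout: "Node 0, zone DMA 0 4 5 4 4 3"
    fields = line.split()
    node = int(fields[1].rstrip(","))
    zone = fields[3]
    # Column i holds the number of free chunks of order i.
    counts = [int(n) for n in fields[4:]]
    return node, zone, counts


def free_bytes(counts, page_size=4096):
    """Sum the free memory: each order-n chunk spans 2**n pages."""
    # page_size=4096 is an assumed value, not read from the system.
    return sum(count * (2 ** order) * page_size
               for order, count in enumerate(counts))


# First six columns of the ZONE_DMA sample line shown above.
node, zone, counts = parse_buddyinfo_line("Node 0, zone DMA 0 4 5 4 4 3")
print(node, zone, counts, free_bytes(counts))
```

The highest order with a non-zero count gives a rough upper bound on the largest contiguous allocation that can currently succeed in that zone.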
+ +More information relevant to external fragmentation can be found in +pagetypeinfo:: + + > cat /proc/pagetypeinfo + Page block order: 9 + Pages per block: 512 + + Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 + Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 + Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 + Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 + Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 + Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 + Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 + Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 + Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 + Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 + Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 + + Number of blocks type Unmovable Reclaimable Movable Reserve Isolate + Node 0, zone DMA 2 0 5 1 0 + Node 0, zone DMA32 41 6 967 2 0 + +Fragmentation avoidance in the kernel works by grouping pages of different +migrate types into the same contiguous regions of memory called page blocks. +A page block is typically the size of the default hugepage, e.g. 2MB on +x86-64. By keeping pages grouped based on their ability to move, the kernel +can reclaim pages within a page block to satisfy a high-order allocation. + +The pagetypeinfo file begins with information on the size of a page block. It +then gives the same type of information as buddyinfo except broken down +by migrate-type and finishes with details on how many page blocks of each +type exist. + +If min_free_kbytes has been tuned correctly (recommendations made by hugeadm +from libhugetlbfs https://github.com/libhugetlbfs/libhugetlbfs/), one can +make an estimate of the likely number of huge pages that can be allocated +at a given point in time. All the "Movable" blocks should be allocatable +unless memory has been mlock()'d.
Some of the Reclaimable blocks should +also be allocatable although a lot of filesystem metadata may have to be +reclaimed to achieve this. + + +meminfo +~~~~~~~ + +Provides information about distribution and utilization of memory. This +varies by architecture and compile options. The following is from a +16GB PIII, which has highmem enabled. You may not have all of these fields. + +:: + + > cat /proc/meminfo + + MemTotal: 16344972 kB + MemFree: 13634064 kB + MemAvailable: 14836172 kB + Buffers: 3656 kB + Cached: 1195708 kB + SwapCached: 0 kB + Active: 891636 kB + Inactive: 1077224 kB + HighTotal: 15597528 kB + HighFree: 13629632 kB + LowTotal: 747444 kB + LowFree: 4432 kB + SwapTotal: 0 kB + SwapFree: 0 kB + Dirty: 968 kB + Writeback: 0 kB + AnonPages: 861800 kB + Mapped: 280372 kB + Shmem: 644 kB + KReclaimable: 168048 kB + Slab: 284364 kB + SReclaimable: 159856 kB + SUnreclaim: 124508 kB + PageTables: 24448 kB + NFS_Unstable: 0 kB + Bounce: 0 kB + WritebackTmp: 0 kB + CommitLimit: 7669796 kB + Committed_AS: 100056 kB + VmallocTotal: 112216 kB + VmallocUsed: 428 kB + VmallocChunk: 111088 kB + Percpu: 62080 kB + HardwareCorrupted: 0 kB + AnonHugePages: 49152 kB + ShmemHugePages: 0 kB + ShmemPmdMapped: 0 kB + +MemTotal + Total usable ram (i.e. physical ram minus a few reserved + bits and the kernel binary code) +MemFree + The sum of LowFree+HighFree +MemAvailable + An estimate of how much memory is available for starting new + applications, without swapping. Calculated from MemFree, + SReclaimable, the size of the file LRU lists, and the low + watermarks in each zone. + The estimate takes into account that the system needs some + page cache to function well, and that not all reclaimable + slab will be reclaimable, due to items being in use. The + impact of those factors will vary from system to system. 
+Buffers + Relatively temporary storage for raw disk blocks; + shouldn't get tremendously large (20MB or so) +Cached + in-memory cache for files read from the disk (the + pagecache). Doesn't include SwapCached +SwapCached + Memory that once was swapped out, is swapped back in but + still also is in the swapfile (if memory is needed it + doesn't need to be swapped out AGAIN because it is already + in the swapfile. This saves I/O) +Active + Memory that has been used more recently and usually not + reclaimed unless absolutely necessary. +Inactive + Memory which has been less recently used. It is more + eligible to be reclaimed for other purposes +HighTotal, HighFree + Highmem is all memory above ~860MB of physical memory. + Highmem areas are for use by userspace programs, or + for the pagecache. The kernel must use tricks to access + this memory, making it slower to access than lowmem. +LowTotal, LowFree + Lowmem is memory which can be used for everything that + highmem can be used for, but it is also available for the + kernel's use for its own data structures. Among many + other things, it is where everything from the Slab is + allocated. Bad things happen when you're out of lowmem. +SwapTotal + total amount of swap space available +SwapFree + amount of swap space that is currently unused, i.e. + SwapTotal minus the memory which has been evicted from RAM + and is temporarily on the disk +Dirty + Memory which is waiting to get written back to the disk +Writeback + Memory which is actively being written back to the disk +AnonPages + Non-file backed pages mapped into userspace page tables +HardwareCorrupted + The amount of RAM, in kB, that the kernel has identified as + corrupted.
+AnonHugePages + Non-file backed huge pages mapped into userspace page tables +Mapped + files which have been mmaped, such as libraries +Shmem + Total memory used by shared memory (shmem) and tmpfs +ShmemHugePages + Memory used by shared memory (shmem) and tmpfs allocated + with huge pages +ShmemPmdMapped + Shared memory mapped into userspace with huge pages +KReclaimable + Kernel allocations that the kernel will attempt to reclaim + under memory pressure. Includes SReclaimable (below), and other + direct allocations with a shrinker. +Slab + in-kernel data structures cache +SReclaimable + Part of Slab, that might be reclaimed, such as caches +SUnreclaim + Part of Slab, that cannot be reclaimed on memory pressure +PageTables + amount of memory dedicated to the lowest level of page + tables. +NFS_Unstable + NFS pages sent to the server, but not yet committed to stable + storage +Bounce + Memory used for block device "bounce buffers" +WritebackTmp + Memory used by FUSE for temporary writeback buffers +CommitLimit + Based on the overcommit ratio ('vm.overcommit_ratio'), + this is the total amount of memory currently available to + be allocated on the system. This limit is only adhered to + if strict overcommit accounting is enabled (mode 2 in + 'vm.overcommit_memory'). + + The CommitLimit is calculated with the following formula:: + + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * + overcommit_ratio / 100 + [total swap pages] + + For example, on a system with 1G of physical RAM and 7G + of swap with a `vm.overcommit_ratio` of 30 it would + yield a CommitLimit of 7.3G. + + For more details, see the memory overcommit documentation + in vm/overcommit-accounting. +Committed_AS + The amount of memory presently allocated on the system. + The committed memory is a sum of all of the memory which + has been allocated by processes, even if it has not been + "used" by them as of yet. 
A process which malloc()'s 1G + of memory, but only touches 300M of it will show up as + using 1G. This 1G is memory which has been "committed" to + by the VM and can be used at any time by the allocating + application. With strict overcommit enabled on the system + (mode 2 in 'vm.overcommit_memory'), allocations which would + exceed the CommitLimit (detailed above) will not be permitted. + This is useful if one needs to guarantee that processes will + not fail due to lack of memory once that memory has been + successfully allocated. +VmallocTotal + total size of vmalloc memory area +VmallocUsed + amount of vmalloc area which is used +VmallocChunk + largest contiguous block of vmalloc area which is free +Percpu + Memory allocated to the percpu allocator used to back percpu + allocations. This stat excludes the cost of metadata. + +vmallocinfo +~~~~~~~~~~~ + +Provides information about vmalloced/vmapped areas. One line per area, +containing the virtual address range of the area, size in bytes, +caller information of the creator, and optional information depending +on the kind of area: + + ========== =================================================== + pages=nr number of pages + phys=addr if a physical address was specified + ioremap I/O mapping (ioremap() and friends) + vmalloc vmalloc() area + vmap vmap()ed pages + user VM_USERMAP area + vpages buffer for pages pointers was vmalloced (huge area) + N=nr (Only on NUMA kernels) + Number of pages allocated on memory node + ========== =================================================== + +:: + + > cat /proc/vmallocinfo + 0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... + /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 + 0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... + /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 + 0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f...
+ phys=7fee8000 ioremap + 0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... + phys=7fee7000 ioremap + 0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 + 0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... + /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 + 0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... + pages=2 vmalloc N1=2 + 0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... + /0x130 [x_tables] pages=4 vmalloc N0=4 + 0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... + pages=14 vmalloc N2=14 + 0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... + pages=4 vmalloc N1=4 + 0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... + pages=2 vmalloc N1=2 + 0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... + pages=10 vmalloc N0=10 + + +softirqs +~~~~~~~~ + +Provides counts of softirq handlers serviced since boot time, for each cpu. + +:: + + > cat /proc/softirqs + CPU0 CPU1 CPU2 CPU3 + HI: 0 0 0 0 + TIMER: 27166 27120 27097 27034 + NET_TX: 0 0 0 17 + NET_RX: 42 0 0 39 + BLOCK: 0 0 107 1121 + TASKLET: 0 0 0 290 + SCHED: 27035 26983 26971 26746 + HRTIMER: 0 0 0 0 + RCU: 1678 1769 2178 2250 + + +1.3 IDE devices in /proc/ide +---------------------------- + +The subdirectory /proc/ide contains information about all IDE devices of which +the kernel is aware. There is one subdirectory for each IDE controller, the +file drivers and a link for each IDE device, pointing to the device directory +in the controller specific subtree. + +The file drivers contains general information about the drivers used for the +IDE devices:: + + > cat /proc/ide/drivers + ide-cdrom version 4.53 + ide-disk version 1.08 + +More detailed information can be found in the controller specific +subdirectories. These are named ide0, ide1 and so on. 
Each of these +directories contains the files shown in table 1-6. + + +.. table:: Table 1-6: IDE controller info in /proc/ide/ide? + + ======= ======================================= + File Content + ======= ======================================= + channel IDE channel (0 or 1) + config Configuration (only for PCI/IDE bridge) + mate Mate name + model Type/Chipset of IDE controller + ======= ======================================= + +Each device connected to a controller has a separate subdirectory in the +controllers directory. The files listed in table 1-7 are contained in these +directories. + + +.. table:: Table 1-7: IDE device information + + ================ ========================================== + File Content + ================ ========================================== + cache The cache + capacity Capacity of the medium (in 512Byte blocks) + driver driver and version + geometry physical and logical geometry + identify device identify block + media media type + model device identifier + settings device setup + smart_thresholds IDE disk management thresholds + smart_values IDE disk management values + ================ ========================================== + +The most interesting file is ``settings``. This file contains a nice +overview of the drive parameters:: + + # cat /proc/ide/ide0/hda/settings + name value min max mode + ---- ----- --- --- ---- + bios_cyl 526 0 65535 rw + bios_head 255 0 255 rw + bios_sect 63 0 63 rw + breada_readahead 4 0 127 rw + bswap 0 0 1 r + file_readahead 72 0 2097151 rw + io_32bit 0 0 3 rw + keepsettings 0 0 1 rw + max_kb_per_request 122 1 127 rw + multcount 0 0 8 rw + nice1 1 0 1 rw + nowerr 0 0 1 rw + pio_mode write-only 0 255 w + slow 0 0 1 rw + unmaskirq 0 0 1 rw + using_dma 0 0 1 rw + + +1.4 Networking info in /proc/net +-------------------------------- + +The subdirectory /proc/net follows the usual pattern. 
Table 1-8 shows the +additional values you get for IP version 6 if you configure the kernel to +support this. Table 1-9 lists the files and their meaning. + + +.. table:: Table 1-8: IPv6 info in /proc/net + + ========== ===================================================== + File Content + ========== ===================================================== + udp6 UDP sockets (IPv6) + tcp6 TCP sockets (IPv6) + raw6 Raw device statistics (IPv6) + igmp6 IP multicast addresses, which this host joined (IPv6) + if_inet6 List of IPv6 interface addresses + ipv6_route Kernel routing table for IPv6 + rt6_stats Global IPv6 routing tables statistics + sockstat6 Socket statistics (IPv6) + snmp6 Snmp data (IPv6) + ========== ===================================================== + +.. table:: Table 1-9: Network info in /proc/net + + ============= ================================================================ + File Content + ============= ================================================================ + arp Kernel ARP table + dev network devices with statistics + dev_mcast the Layer2 multicast groups a device is listening to + (interface index, label, number of references, number of bound + addresses). + dev_stat network device status + ip_fwchains Firewall chain linkage + ip_fwnames Firewall chain names + ip_masq Directory containing the masquerading tables + ip_masquerade Major masquerading table + netstat Network statistics + raw raw device statistics + route Kernel routing table + rpc Directory containing rpc info + rt_cache Routing cache + snmp SNMP data + sockstat Socket statistics + tcp TCP sockets + udp UDP sockets + unix UNIX domain sockets + wireless Wireless interface data (Wavelan etc) + igmp IP multicast addresses, which this host joined + psched Global packet scheduler parameters.
+ netlink List of PF_NETLINK sockets + ip_mr_vifs List of multicast virtual interfaces + ip_mr_cache List of multicast routing cache + ============= ================================================================ + +You can use this information to see which network devices are available in +your system and how much traffic was routed over those devices:: + + > cat /proc/net/dev + Inter-|Receive |[... + face |bytes packets errs drop fifo frame compressed multicast|[... + lo: 908188 5596 0 0 0 0 0 0 [... + ppp0:15475140 20721 410 0 0 410 0 0 [... + eth0: 614530 7085 0 0 0 0 0 1 [... + + ...] Transmit + ...] bytes packets errs drop fifo colls carrier compressed + ...] 908188 5596 0 0 0 0 0 0 + ...] 1375103 17405 0 0 0 0 0 0 + ...] 1703981 5535 0 0 0 3 0 0 + +In addition, each Channel Bond interface has its own directory. For +example, the bond0 device will have a directory called /proc/net/bond0/. +It will contain information that is specific to that bond, such as the +current slaves of the bond, the link status of the slaves, and how +many times each slave's link has failed. + +1.5 SCSI info +------------- + +If you have a SCSI host adapter in your system, you'll find a subdirectory +named after the driver for this adapter in /proc/scsi. You'll also see a list +of all recognized SCSI devices in /proc/scsi:: + + > cat /proc/scsi/scsi + Attached devices: + Host: scsi0 Channel: 00 Id: 00 Lun: 00 + Vendor: IBM Model: DGHS09U Rev: 03E0 + Type: Direct-Access ANSI SCSI revision: 03 + Host: scsi0 Channel: 00 Id: 06 Lun: 00 + Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 + Type: CD-ROM ANSI SCSI revision: 02 + + +The directory named after the driver has one file for each adapter found in +the system. These files contain information about the controller, including +the IRQ used and the I/O address range. The amount of information shown is +dependent on the adapter you use.
The example shows the output for an Adaptec +AHA-2940 SCSI adapter:: + + > cat /proc/scsi/aic7xxx/0 + + Adaptec AIC7xxx driver version: 5.1.19/3.2.4 + Compile Options: + TCQ Enabled By Default : Disabled + AIC7XXX_PROC_STATS : Disabled + AIC7XXX_RESET_DELAY : 5 + Adapter Configuration: + SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter + Ultra Wide Controller + PCI MMAPed I/O Base: 0xeb001000 + Adapter SEEPROM Config: SEEPROM found and used. + Adaptec SCSI BIOS: Enabled + IRQ: 10 + SCBs: Active 0, Max Active 2, + Allocated 15, HW 16, Page 255 + Interrupts: 160328 + BIOS Control Word: 0x18b6 + Adapter Control Word: 0x005b + Extended Translation: Enabled + Disconnect Enable Flags: 0xffff + Ultra Enable Flags: 0x0001 + Tag Queue Enable Flags: 0x0000 + Ordered Queue Tag Flags: 0x0000 + Default Tag Queue Depth: 8 + Tagged Queue By Device array for aic7xxx host instance 0: + {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} + Actual queue depth per device for aic7xxx host instance 0: + {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} + Statistics: + (scsi0:0:0:0) + Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 + Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) + Total transfers 160151 (74577 reads and 85574 writes) + (scsi0:0:6:0) + Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 + Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) + Total transfers 0 (0 reads and 0 writes) + + +1.6 Parallel port info in /proc/parport +--------------------------------------- + +The directory /proc/parport contains information about the parallel ports of +your system. It has one subdirectory for each port, named after the port +number (0,1,2,...). + +These directories contain the four files shown in Table 1-10. + + +.. 
table:: Table 1-10: Files in /proc/parport + + ========= ==================================================================== + File Content + ========= ==================================================================== + autoprobe Any IEEE-1284 device ID information that has been acquired. + devices list of the device drivers using that port. A + will appear by the + name of the device currently using the port (it might not appear + against any). + hardware Parallel port's base address, IRQ line and DMA channel. + irq IRQ that parport is using for that port. This is in a separate + file to allow you to alter it by writing a new value in (IRQ + number or none). + ========= ==================================================================== + +1.7 TTY info in /proc/tty +------------------------- + +Information about the available and actually used tty's can be found in the +directory /proc/tty. You'll find entries for drivers and line disciplines in +this directory, as shown in Table 1-11. + + +..
table:: Table 1-11: Files in /proc/tty + + ============= ============================================== + File Content + ============= ============================================== + drivers list of drivers and their usage + ldiscs registered line disciplines + driver/serial usage statistic and status of single tty lines + ============= ============================================== + +To see which tty's are currently in use, you can simply look into the file +/proc/tty/drivers:: + + > cat /proc/tty/drivers + pty_slave /dev/pts 136 0-255 pty:slave + pty_master /dev/ptm 128 0-255 pty:master + pty_slave /dev/ttyp 3 0-255 pty:slave + pty_master /dev/pty 2 0-255 pty:master + serial /dev/cua 5 64-67 serial:callout + serial /dev/ttyS 4 64-67 serial + /dev/tty0 /dev/tty0 4 0 system:vtmaster + /dev/ptmx /dev/ptmx 5 2 system + /dev/console /dev/console 5 1 system:console + /dev/tty /dev/tty 5 0 system:/dev/tty + unknown /dev/tty 4 1-63 console + + +1.8 Miscellaneous kernel statistics in /proc/stat +------------------------------------------------- + +Various pieces of information about kernel activity are available in the +/proc/stat file. All of the numbers reported in this file are aggregates +since the system first booted. For a quick look, simply cat the file:: + + > cat /proc/stat + cpu 2255 34 2290 22625563 6290 127 456 0 0 0 + cpu0 1132 34 1441 11311718 3675 127 438 0 0 0 + cpu1 1123 0 849 11313845 2614 0 18 0 0 0 + intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] + ctxt 1990473 + btime 1062191376 + processes 2915 + procs_running 1 + procs_blocked 0 + softirq 183433 0 21755 12 39 1137 231 21459 2263 + +The very first "cpu" line aggregates the numbers in all of the other "cpuN" +lines. These numbers identify the amount of time the CPU has spent performing +different kinds of work. Time units are in USER_HZ (typically hundredths of a +second). 
The meanings of the columns are as follows, from left to right: + +- user: normal processes executing in user mode +- nice: niced processes executing in user mode +- system: processes executing in kernel mode +- idle: twiddling thumbs +- iowait: In a word, iowait stands for waiting for I/O to complete. But there + are several problems: + + 1. Cpu will not wait for I/O to complete, iowait is the time that a task is + waiting for I/O to complete. When cpu goes into idle state for + outstanding task io, another task will be scheduled on this CPU. + 2. In a multi-core CPU, the task waiting for I/O to complete is not running + on any CPU, so the iowait of each CPU is difficult to calculate. + 3. The value of iowait field in /proc/stat will decrease in certain + conditions. + + So, the iowait is not reliable by reading from /proc/stat. +- irq: servicing interrupts +- softirq: servicing softirqs +- steal: involuntary wait +- guest: running a normal guest +- guest_nice: running a niced guest + +The "intr" line gives counts of interrupts serviced since boot time, for each +of the possible system interrupts. The first column is the total of all +interrupts serviced including unnumbered architecture specific interrupts; +each subsequent column is the total for that particular numbered interrupt. +Unnumbered interrupts are not shown, only summed into the total. + +The "ctxt" line gives the total number of context switches across all CPUs. + +The "btime" line gives the time at which the system booted, in seconds since +the Unix epoch. + +The "processes" line gives the number of processes and threads created, which +includes (but is not limited to) those created by calls to the fork() and +clone() system calls. + +The "procs_running" line gives the total number of threads that are +running or ready to run (i.e., the total number of runnable threads). + +The "procs_blocked" line gives the number of processes currently blocked, +waiting for I/O to complete. 
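The "cpu" columns described above can be given names and summed with a short Python sketch. The helper names are hypothetical, and leaving guest and guest_nice out of the total rests on the assumption that the kernel already counts guest time within user and nice; treat both as illustration rather than a definitive accounting.

```python
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Map the numeric columns of a "cpu"/"cpuN" line to field names."""
    name, *values = line.split()
    return name, dict(zip(CPU_FIELDS, (int(v) for v in values)))


def total_ticks(times):
    """Sum the first eight columns (in USER_HZ ticks).

    Assumption: guest and guest_nice are already included in user and
    nice, so adding them again would count that time twice.
    """
    return sum(times[field] for field in CPU_FIELDS[:8])


def idle_share(times):
    """Fraction of accounted time spent in the idle column."""
    return times["idle"] / total_ticks(times)


# First line of the sample /proc/stat output shown earlier.
name, times = parse_cpu_line("cpu 2255 34 2290 22625563 6290 127 456 0 0 0")
print(name, total_ticks(times), round(idle_share(times), 4))
```

Because the counters are aggregates since boot, a monitoring tool would normally sample the file twice and work with the differences between the two samples.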
+ +The "softirq" line gives counts of softirqs serviced since boot time, for each +of the possible system softirqs. The first column is the total of all +softirqs serviced; each subsequent column is the total for that particular +softirq. + + +1.9 Ext4 file system parameters +------------------------------- + +Information about mounted ext4 file systems can be found in +/proc/fs/ext4. Each mounted filesystem will have a directory in +/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or +/proc/fs/ext4/dm-0). The files in each per-device directory are shown +in Table 1-12, below. + +.. table:: Table 1-12: Files in /proc/fs/ext4/ + + ============== ========================================================== + File Content + ============== ========================================================== + mb_groups details of multiblock allocator buddy cache of free blocks + ============== ========================================================== + +2.0 /proc/consoles +------------------ +Shows registered system console lines. + +To see which character device lines are currently used for the system console +/dev/console, you may simply look into the file /proc/consoles:: + + > cat /proc/consoles + tty0 -WU (ECp) 4:7 + ttyS0 -W- (Ep) 4:64 + +The columns are: + ++--------------------+-------------------------------------------------------+ +| device | name of the device | ++====================+=======================================================+ +| operations | * R = can do read operations | +| | * W = can do write operations | +| | * U = can do unblank | ++--------------------+-------------------------------------------------------+ +| flags | * E = it is enabled | +| | * C = it is preferred console | +| | * B = it is primary boot console | +| | * p = it is used for printk buffer | +| | * b = it is not a TTY but a Braille device | +| | * a = it is safe to use when cpu is offline | ++--------------------+-------------------------------------------------------+ +| major:minor | major and minor number of the device separated by a |
| colon | ++--------------------+-------------------------------------------------------+ + +Summary +------- + +The /proc file system serves information about the running system. It not only +allows access to process data but also allows you to request the kernel status +by reading files in the hierarchy. + +The directory structure of /proc reflects the types of information and makes +it easy, if not obvious, where to look for specific data. + +Chapter 2: Modifying System Parameters +====================================== + +In This Chapter +--------------- + +* Modifying kernel parameters by writing into files found in /proc/sys +* Exploring the files which modify certain parameters +* Review of the /proc/sys file tree + +------------------------------------------------------------------------------ + +A very interesting part of /proc is the directory /proc/sys. This is not only +a source of information, it also allows you to change parameters within the +kernel. Be very careful when attempting this. You can optimize your system, +but you can also cause it to crash. Never alter kernel parameters on a +production system. Set up a development machine and test to make sure that +everything works the way you want it to. You may have no alternative but to +reboot the machine once an error has been made. + +To change a value, simply echo the new value into the file. An example is +given below in the section on the file system data. You need to be root to do +this. You can create your own boot script to perform this every time your +system boots. + +The files in /proc/sys can be used to fine tune and monitor miscellaneous and +general things in the operation of the Linux kernel. Since some of the files +can inadvertently disrupt your system, it is advisable to read both +documentation and source before actually making adjustments. In any case, be +very careful when writing to any of these files. 
The entries in /proc may +change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt +review the kernel documentation in the directory /usr/src/linux/Documentation. +This chapter is heavily based on the documentation included in the pre 2.2 +kernels, and became part of it in version 2.2.1 of the Linux kernel. + +Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these +entries. + +Summary +------- + +Certain aspects of kernel behavior can be modified at runtime, without the +need to recompile the kernel, or even to reboot the system. The files in the +/proc/sys tree can not only be read, but also modified. You can use the echo +command to write values into these files, thereby changing the default settings +of the kernel. + + +Chapter 3: Per-process Parameters +================================= + +3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score +-------------------------------------------------------------------------------- + +These files can be used to adjust the badness heuristic used to select which +process gets killed in out of memory conditions. + +The badness heuristic assigns a value to each candidate task ranging from 0 +(never kill) to 1000 (always kill) to determine which process is targeted. The +units are roughly a proportion along that range of allowed memory the process +may allocate from based on an estimation of its current memory and swap use. +For example, if a task is using all allowed memory, its badness score will be +1000. If it is using half of its allowed memory, its score will be 500. + +There is an additional factor included in the badness score: the current memory +and swap usage is discounted by 3% for root processes. + +The amount of "allowed" memory depends on the context in which the oom killer +was called. If it is due to the memory assigned to the allocating task's cpuset +being exhausted, the allowed memory represents the set of mems assigned to that +cpuset.
If it is due to a mempolicy's node(s) being exhausted, the allowed +memory represents the set of mempolicy nodes. If it is due to a memory +limit (or swap limit) being reached, the allowed memory is that configured +limit. Finally, if it is due to the entire system being out of memory, the +allowed memory represents all allocatable resources. + +The value of /proc/<pid>/oom_score_adj is added to the badness score before it +is used to determine which task to kill. Acceptable values range from -1000 +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to +polarize the preference for oom killing either by always preferring a certain +task or completely disabling it. The lowest possible value, -1000, is +equivalent to disabling oom killing entirely for that task since it will always +report a badness score of 0. + +Consequently, it is very simple for userspace to define the amount of memory to +consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for +example, is roughly equivalent to allowing the remainder of tasks sharing the +same system, cpuset, mempolicy, or memory controller resources to use at least +50% more memory. A value of -500, on the other hand, would be roughly +equivalent to discounting 50% of the task's allowed memory from being considered +as scoring against the task. + +For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also +be used to tune the badness score. Its acceptable values range from -16 +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 +(OOM_DISABLE) to disable oom killing entirely for that task. Its value is +scaled linearly with /proc/<pid>/oom_score_adj. + +The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last +value set by a CAP_SYS_RESOURCE process. To reduce the value any lower +requires CAP_SYS_RESOURCE.
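The scoring rules described in this section can be illustrated with a small Python model. It is a simplified sketch of the text above and not the kernel's implementation: the function name, the integer arithmetic and the clamping of the result to the 0..1000 range are assumptions made for the example.

```python
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def badness(used, allowed, oom_score_adj=0, is_root=False):
    """Toy model of the badness score described in the text.

    used / allowed: memory plus swap in use, and the "allowed" memory
    for the context in which the oom killer was called (same units).
    """
    # The lowest adjustment always reports a score of 0 (never kill).
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0
    # Root processes have their usage discounted by 3%.
    if is_root:
        used -= used * 3 // 100
    # Proportion of allowed memory in use, on a 0..1000 scale,
    # with the userspace adjustment added on top.
    points = 1000 * used // allowed + oom_score_adj
    # Assumption: keep the result inside the documented 0..1000 range.
    return max(0, min(1000, points))


print(badness(500, 1000))            # half of the allowed memory in use
print(badness(500, 1000, 500))       # same task with oom_score_adj = +500
print(badness(1000, 1000, -1000))    # oom killing disabled for the task
```

In this model a task using half of its allowed memory scores 500, and an adjustment of +500 or -500 moves it to either end of the range, which matches the "polarize the preference" wording above.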
+ +Caveat: when a parent task is selected, the oom killer will sacrifice any first +generation children with separate address spaces instead, if possible. This +prevents servers and important system daemons from being killed and loses the +minimal amount of work. + + +3.2 /proc/<pid>/oom_score - Display current oom-killer score +------------------------------------------------------------- + +This file can be used to check the current score used by the oom-killer for +any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which +process should be killed in an out-of-memory situation. + + +3.3 /proc/<pid>/io - Display the IO accounting fields +------------------------------------------------------- + +This file contains IO statistics for each running process. + +Example +~~~~~~~ + +:: + + test:/tmp # dd if=/dev/zero of=/tmp/test.dat & + [1] 3828 + + test:/tmp # cat /proc/3828/io + rchar: 323934931 + wchar: 323929600 + syscr: 632687 + syscw: 632675 + read_bytes: 0 + write_bytes: 323932160 + cancelled_write_bytes: 0 + + +Description +~~~~~~~~~~~ + +rchar +^^^^^ + +I/O counter: chars read +The number of bytes which this task has caused to be read from storage. This +is simply the sum of bytes which this process passed to read() and pread(). +It includes things like tty IO and it is unaffected by whether or not actual +physical disk IO was required (the read might have been satisfied from +pagecache). + + +wchar +^^^^^ + +I/O counter: chars written +The number of bytes which this task has caused, or shall cause to be written +to disk. Similar caveats apply here as with rchar. + + +syscr +^^^^^ + +I/O counter: read syscalls +Attempt to count the number of read I/O operations, i.e. syscalls like read() +and pread(). + + +syscw +^^^^^ + +I/O counter: write syscalls +Attempt to count the number of write I/O operations, i.e. syscalls like +write() and pwrite().
+ + +read_bytes +^^^^^^^^^^ + +I/O counter: bytes read +Attempt to count the number of bytes which this process really did cause to +be fetched from the storage layer. Done at the submit_bio() level, so it is +accurate for block-backed filesystems. + + +write_bytes +^^^^^^^^^^^ + +I/O counter: bytes written +Attempt to count the number of bytes which this process caused to be sent to +the storage layer. This is done at page-dirtying time. + + +cancelled_write_bytes +^^^^^^^^^^^^^^^^^^^^^ + +The big inaccuracy here is truncate. If a process writes 1MB to a file and +then deletes the file, it will in fact perform no writeout. But it will have +been accounted as having caused 1MB of write. +In other words: The number of bytes which this process caused to not happen, +by truncating pagecache. A task can cause "negative" IO too. If this task +truncates some dirty pagecache, some IO which another task has been accounted +for (in its write_bytes) will not be happening. We _could_ just subtract that +from the truncating task's write_bytes, but there is information loss in doing +that. + + +.. Note:: + + At its current implementation state, this is a bit racy on 32-bit machines: + if process A reads process B's /proc/pid/io while process B is updating one + of those 64-bit counters, process A could see an intermediate result. + + +More information about this can be found within the taskstats documentation in +Documentation/accounting. + +3.4 /proc//coredump_filter - Core dump filtering settings +--------------------------------------------------------------- +When a process is dumped, all anonymous memory is written to a core file as +long as the size of the core file isn't limited. But sometimes we don't want +to dump some memory segments, for example, huge shared memory or DAX. +Conversely, sometimes we want to save file-backed memory segments into a core +file, not only the individual files. 
+ +/proc//coredump_filter allows you to customize which memory segments +will be dumped when the process is dumped. coredump_filter is a bitmask +of memory types. If a bit of the bitmask is set, memory segments of the +corresponding memory type are dumped, otherwise they are not dumped. + +The following 9 memory types are supported: + + - (bit 0) anonymous private memory + - (bit 1) anonymous shared memory + - (bit 2) file-backed private memory + - (bit 3) file-backed shared memory + - (bit 4) ELF header pages in file-backed private memory areas (it is + effective only if bit 2 is cleared) + - (bit 5) hugetlb private memory + - (bit 6) hugetlb shared memory + - (bit 7) DAX private memory + - (bit 8) DAX shared memory + + Note that MMIO pages such as frame buffer are never dumped and vDSO pages + are always dumped regardless of the bitmask status. + + Note that bits 0-4 don't affect hugetlb or DAX memory. hugetlb memory is + only affected by bits 5-6, and DAX is only affected by bits 7-8. + +The default value of coredump_filter is 0x33; this means all anonymous memory +segments, ELF header pages and hugetlb private memory are dumped. + +If you don't want to dump all shared memory segments attached to pid 1234, +write 0x31 to the process's proc file:: + + $ echo 0x31 > /proc/1234/coredump_filter + +When a new process is created, the process inherits the bitmask status from its +parent. It is useful to set up coredump_filter before the program runs.
+For example:: + + $ echo 0x7 > /proc/self/coredump_filter + $ ./some_program + +3.5 /proc//mountinfo - Information about mounts +-------------------------------------------------------- + +This file contains lines of the form:: + + 36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue + (1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) + + (1) mount ID: unique identifier of the mount (may be reused after umount) + (2) parent ID: ID of parent (or of self for the top of the mount tree) + (3) major:minor: value of st_dev for files on filesystem + (4) root: root of the mount within the filesystem + (5) mount point: mount point relative to the process's root + (6) mount options: per mount options + (7) optional fields: zero or more fields of the form "tag[:value]" + (8) separator: marks the end of the optional fields + (9) filesystem type: name of filesystem of the form "type[.subtype]" + (10) mount source: filesystem specific information or "none" + (11) super options: per super block options + +Parsers should ignore all unrecognised optional fields. Currently the +possible optional fields are: + +================ ============================================================== +shared:X mount is shared in peer group X +master:X mount is slave to peer group X +propagate_from:X mount is slave and receives propagation from peer group X [#]_ +unbindable mount is unbindable +================ ============================================================== + +.. [#] X is the closest dominant peer group under the process's root. If + X is the immediate master of the mount, or if there's no dominant peer + group under the same root, then only the "master:X" field is present + and not the "propagate_from:X" field. 
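The eleven fields described above can be separated by splitting a line on whitespace and locating the "-" separator that ends the optional fields. The following is a minimal sketch of that idea; the function name parse_mountinfo_line and the dictionary keys are illustrative choices, not part of any kernel or library interface.

```python
def parse_mountinfo_line(line: str) -> dict:
    """Split one mountinfo line into its documented fields (sketch)."""
    fields = line.split()
    # Fields (1)-(6) are fixed; optional fields start at index 6 and run
    # up to the "-" separator, field (8).
    sep = fields.index("-", 6)
    major, minor = fields[2].split(":")
    optional = {}
    for item in fields[6:sep]:
        # Optional fields have the form "tag[:value]".
        tag, _, value = item.partition(":")
        optional[tag] = value or None
    return {
        "mount_id": int(fields[0]),
        "parent_id": int(fields[1]),
        "major": int(major),
        "minor": int(minor),
        "root": fields[3],
        "mount_point": fields[4],
        "mount_options": fields[5].split(","),
        "optional": optional,  # unrecognised tags are kept, not rejected
        "fs_type": fields[sep + 1],
        "mount_source": fields[sep + 2],
        "super_options": fields[sep + 3].split(","),
    }


sample = ("36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - "
          "ext3 /dev/root rw,errors=continue")
info = parse_mountinfo_line(sample)
print(info["mount_point"], info["fs_type"], info["optional"])
```

Searching for the separator rather than assuming a fixed column count keeps the parser tolerant of zero or more optional fields, in line with the advice that parsers should ignore optional fields they do not recognise.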
+ +For more information on mount propagation see: + + Documentation/filesystems/sharedsubtree.txt + + +3.6 /proc//comm & /proc//task//comm +-------------------------------------------------------- +These files provide a method to access a task's comm value. It also allows for +a task to set its own or one of its thread siblings' comm value. The comm value +is limited in size compared to the cmdline value, so writing anything longer +than the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated +comm value. + + +3.7 /proc//task//children - Information about task children +------------------------------------------------------------------------- +This file provides a fast way to retrieve first level children pids +of a task pointed by / pair. The format is a space separated +stream of pids. + +Note the "first level" here -- if a child has its own children they will +not be listed here, one needs to read /proc//task//children +to obtain the descendants. + +Since this interface is intended to be fast and cheap it doesn't +guarantee to provide precise results and some children might be +skipped, especially if they've exited right after we printed their +pids, so one needs to either stop or freeze processes being inspected +if precise results are needed. + + +3.8 /proc//fdinfo/ - Information about opened file +--------------------------------------------------------------- +This file provides information associated with an opened file. The regular +files have at least three fields -- 'pos', 'flags' and 'mnt_id'. The 'pos' +represents the current offset of the opened file in decimal form [see lseek(2) +for details], 'flags' denotes the octal O_xxx mask the file has been +created with [see open(2) for details] and 'mnt_id' represents the mount ID of +the file system containing the opened file [see 3.5 /proc//mountinfo +for details].
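The three common fields can be picked out of an fdinfo file with a few lines of code. The sketch below is only an illustration: the function name parse_fdinfo is hypothetical, it parses text that has already been read from the file, and it converts 'flags' from octal as described above.

```python
import os


def parse_fdinfo(text: str) -> dict:
    """Extract 'pos', 'flags' and 'mnt_id' from fdinfo text (sketch)."""
    raw = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # Later, type-specific lines (locks, epoll targets, ...) are
            # collected too but are not interpreted here.
            raw[key.strip()] = value.strip()
    return {
        "pos": int(raw["pos"]),         # decimal file offset
        "flags": int(raw["flags"], 8),  # octal O_xxx mask
        "mnt_id": int(raw["mnt_id"]),   # mount ID, see mountinfo
    }


sample = "pos:\t0\nflags:\t0100002\nmnt_id:\t19\n"
info = parse_fdinfo(sample)
# The access mode is the low bits of the mask selected by O_ACCMODE.
print(info, info["flags"] & os.O_ACCMODE == os.O_RDWR)
```

Masking with os.O_ACCMODE separates the access mode (read-only, write-only, read-write) from the remaining open flags in the octal value.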
+ +A typical output is:: + + pos: 0 + flags: 0100002 + mnt_id: 19 + +All locks associated with a file descriptor are shown in its fdinfo too:: + + lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF + +Besides the regular pos/flags pair, files such as eventfd, fsnotify, signalfd +and epoll provide additional information particular to the objects they represent. + +Eventfd files +~~~~~~~~~~~~~ + +:: + + pos: 0 + flags: 04002 + mnt_id: 9 + eventfd-count: 5a + +where 'eventfd-count' is the hex value of a counter. + +Signalfd files +~~~~~~~~~~~~~~ + +:: + + pos: 0 + flags: 04002 + mnt_id: 9 + sigmask: 0000000000000200 + +where 'sigmask' is the hex value of the signal mask associated +with a file. + +Epoll files +~~~~~~~~~~~ + +:: + + pos: 0 + flags: 02 + mnt_id: 9 + tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 + +where 'tfd' is a target file descriptor number in decimal form, +'events' is the events mask being watched and the 'data' is data +associated with a target [see epoll(7) for more details]. + +The 'pos' is the current offset of the target file in decimal form +[see lseek(2)], 'ino' and 'sdev' are inode and device numbers +where the target file resides, all in hex format. + +Fsnotify files +~~~~~~~~~~~~~~ +For inotify files the format is the following:: + + pos: 0 + flags: 02000000 + inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d + +where 'wd' is a watch descriptor in decimal form, i.e. a target file +descriptor number, 'ino' and 'sdev' are inode and device where the +target file resides and the 'mask' is the mask of events, all in hex +form [see inotify(7) for more details]. + +If the kernel was built with exportfs support, the path to the target +file is encoded as a file handle. The file handle is provided by three +fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex +format. + +If the kernel is built without exportfs support the file handle won't be +printed out.
+ +If there is no inotify mark attached yet the 'inotify' line will be omitted. + +For fanotify files the format is:: + + pos: 0 + flags: 02 + mnt_id: 9 + fanotify flags:10 event-flags:0 + fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 + fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 + +where fanotify 'flags' and 'event-flags' are values used in fanotify_init +call, 'mnt_id' is the mount point identifier, 'mflags' is the value of +flags associated with mark which are tracked separately from events +mask. 'ino', 'sdev' are target inode and device, 'mask' is the events +mask and 'ignored_mask' is the mask of events which are to be ignored. +All are in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask' +does provide information about flags and mask used in fanotify_mark +call [see fsnotify manpage for details]. + +While the first three lines are mandatory and always printed, the rest is +optional and may be omitted if no marks have been created yet. + +Timerfd files +~~~~~~~~~~~~~ + +:: + + pos: 0 + flags: 02 + mnt_id: 9 + clockid: 0 + ticks: 0 + settime flags: 01 + it_value: (0, 49406829) + it_interval: (1, 0) + +where 'clockid' is the clock type and 'ticks' is the number of timer expirations +that have occurred [see timerfd_create(2) for details]. 'settime flags' are +the flags, in octal form, that were used to set up the timer [see timerfd_settime(2) for +details]. 'it_value' is the remaining time until the timer expiration. +'it_interval' is the interval for the timer. Note the timer might be set up +with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value' +still shows the timer's remaining time. + +3.9 /proc//map_files - Information about memory mapped files +--------------------------------------------------------------------- +This directory contains symbolic links which represent memory mapped files +the process is maintaining.
Example output:: + + | lr-------- 1 root root 64 Jan 27 11:24 333c600000-333c620000 -> /usr/lib64/ld-2.18.so + | lr-------- 1 root root 64 Jan 27 11:24 333c81f000-333c820000 -> /usr/lib64/ld-2.18.so + | lr-------- 1 root root 64 Jan 27 11:24 333c820000-333c821000 -> /usr/lib64/ld-2.18.so + | ... + | lr-------- 1 root root 64 Jan 27 11:24 35d0421000-35d0422000 -> /usr/lib64/libselinux.so.1 + | lr-------- 1 root root 64 Jan 27 11:24 400000-41a000 -> /usr/bin/ls + +The name of a link represents the virtual memory bounds of a mapping, i.e. +vm_area_struct::vm_start-vm_area_struct::vm_end. + +The main purpose of the map_files is to retrieve a set of memory mapped +files in a fast way instead of parsing /proc//maps or +/proc//smaps, both of which contain many more records. At the same +time one can open(2) mappings from the listings of two processes and +compare their inode numbers to figure out which anonymous memory areas +are actually shared. + +3.10 /proc//timerslack_ns - Task timerslack value +--------------------------------------------------------- +This file provides the task's timerslack value in nanoseconds. +This value specifies an amount of time that normal timers may be deferred +in order to coalesce timers and avoid unnecessary wakeups. + +This allows a task's interactivity vs power consumption trade off to be +adjusted. + +Writing 0 to the file will set the task's timerslack to the default value. + +Valid values range from 0 to ULLONG_MAX. + +An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level +permissions on the task specified to change its timerslack_ns value. + +3.11 /proc//patch_state - Livepatch patch operation state +----------------------------------------------------------------- +When CONFIG_LIVEPATCH is enabled, this file displays the value of the +patch state for the task. + +A value of '-1' indicates that no patch is in transition.
+ +A value of '0' indicates that a patch is in transition and the task is +unpatched. If the patch is being enabled, then the task hasn't been +patched yet. If the patch is being disabled, then the task has already +been unpatched. + +A value of '1' indicates that a patch is in transition and the task is +patched. If the patch is being enabled, then the task has already been +patched. If the patch is being disabled, then the task hasn't been +unpatched yet. + +3.12 /proc//arch_status - task architecture specific status +------------------------------------------------------------------- +When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the +architecture specific status of the task. + +Example +~~~~~~~ + +:: + + $ cat /proc/6753/arch_status + AVX512_elapsed_ms: 8 + +Description +~~~~~~~~~~~ + +x86 specific entries: +~~~~~~~~~~~~~~~~~~~~~ + +AVX512_elapsed_ms: +^^^^^^^^^^^^^^^^^^ + + If AVX512 is supported on the machine, this entry shows the milliseconds + elapsed since the last time AVX512 usage was recorded. The recording + happens on a best effort basis when a task is scheduled out. This means + that the value depends on two factors: + + 1) The time which the task spent on the CPU without being scheduled + out. With CPU isolation and a single runnable task this can take + several seconds. + + 2) The time since the task was scheduled out last. Depending on the + reason for being scheduled out (time slice exhausted, syscall ...) + this can be an arbitrarily long time. + + As a consequence the value cannot be considered precise and authoritative + information. The application which uses this information has to be aware + of the overall scenario on the system in order to determine whether a + task is a real AVX512 user or not. Precise information can be obtained + with performance counters.
+ + A special value of '-1' indicates that no AVX512 usage was recorded, thus + the task is unlikely to be an AVX512 user; however, depending on the workload and the + scheduling scenario, it could also be a false negative as mentioned above. + +Configuring procfs +------------------ + +4.1 Mount options +--------------------- + +The following mount options are supported: + + ========= ======================================================== + hidepid= Set /proc// access mode. + gid= Set the group authorized to learn processes information. + ========= ======================================================== + +hidepid=0 means classic mode - everybody may access all /proc// directories +(default). + +hidepid=1 means users may not access any /proc// directories but their +own. Sensitive files like cmdline, sched*, status are now protected against +other users. This makes it impossible to learn whether any user runs a +specific program (given the program doesn't reveal itself by its behaviour). +As an additional bonus, as /proc//cmdline is inaccessible for other users, +poorly written programs passing sensitive information via program arguments are +now protected against local eavesdroppers. + +hidepid=2 means hidepid=1 plus all /proc// will be fully invisible to other +users. It doesn't mean that it hides whether a process with a specific +pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"), +but it hides the process' uid and gid, which may be learned by stat()'ing +/proc// otherwise. It greatly complicates an intruder's task of gathering +information about running processes, whether some daemon runs with elevated +privileges, whether another user runs some sensitive program, whether other users +run any program at all, etc. + +gid= defines a group authorized to learn processes information otherwise +prohibited by hidepid=. If you use some daemon like identd which needs to learn +information about processes, just add identd to this group.
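The interaction of hidepid= and gid= described above can be summarised as a small decision function. This is a simplified model written for illustration only: the function name and parameters are hypothetical, and it deliberately ignores privileged users and any other kernel checks not mentioned in the text.

```python
def proc_pid_visibility(hidepid: int, same_user: bool,
                        in_gid_group: bool = False) -> tuple:
    """Return (listed, accessible) for a /proc/PID/ directory (model).

    hidepid      -- value of the hidepid= mount option (0, 1 or 2)
    same_user    -- True if the viewer owns the process
    in_gid_group -- True if the viewer belongs to the gid= group
    """
    if hidepid not in (0, 1, 2):
        raise ValueError(f"unsupported hidepid value: {hidepid}")
    # Classic mode, the viewer's own processes, and members of the
    # authorized group all get full access.
    if hidepid == 0 or same_user or in_gid_group:
        return (True, True)
    # hidepid=1: the directory is still listed but cannot be accessed.
    # hidepid=2: the directory is not visible to other users at all.
    return (hidepid == 1, False)


for mode in (0, 1, 2):
    print(mode, proc_pid_visibility(mode, same_user=False))
```

The loop prints the outcome for a viewer looking at another user's process under each mode, which makes the difference between hidepid=1 (visible but inaccessible) and hidepid=2 (invisible) easy to see.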
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt deleted file mode 100644 index 99ca040e3f90..000000000000 --- a/Documentation/filesystems/proc.txt +++ /dev/null @@ -1,2047 +0,0 @@ ------------------------------------------------------------------------------- - T H E /proc F I L E S Y S T E M ------------------------------------------------------------------------------- -/proc/sys Terrehon Bowden October 7 1999 - Bodo Bauer - -2.4.x update Jorge Nerin November 14 2000 -move /proc/sys Shen Feng April 1 2009 ------------------------------------------------------------------------------- -Version 1.3 Kernel version 2.2.12 - Kernel version 2.4.0-test11-pre4 ------------------------------------------------------------------------------- -fixes/update part 1.1 Stefani Seibold June 9 2009 - -Table of Contents ------------------ - - 0 Preface - 0.1 Introduction/Credits - 0.2 Legal Stuff - - 1 Collecting System Information - 1.1 Process-Specific Subdirectories - 1.2 Kernel data - 1.3 IDE devices in /proc/ide - 1.4 Networking info in /proc/net - 1.5 SCSI info - 1.6 Parallel port info in /proc/parport - 1.7 TTY info in /proc/tty - 1.8 Miscellaneous kernel statistics in /proc/stat - 1.9 Ext4 file system parameters - - 2 Modifying System Parameters - - 3 Per-Process Parameters - 3.1 /proc//oom_adj & /proc//oom_score_adj - Adjust the oom-killer - score - 3.2 /proc//oom_score - Display current oom-killer score - 3.3 /proc//io - Display the IO accounting fields - 3.4 /proc//coredump_filter - Core dump filtering settings - 3.5 /proc//mountinfo - Information about mounts - 3.6 /proc//comm & /proc//task//comm - 3.7 /proc//task//children - Information about task children - 3.8 /proc//fdinfo/ - Information about opened file - 3.9 /proc//map_files - Information about memory mapped files - 3.10 /proc//timerslack_ns - Task timerslack value - 3.11 /proc//patch_state - Livepatch patch operation state - 3.12 /proc//arch_status - Task architecture 
specific information - - 4 Configuring procfs - 4.1 Mount options - ------------------------------------------------------------------------------- -Preface ------------------------------------------------------------------------------- - -0.1 Introduction/Credits ------------------------- - -This documentation is part of a soon (or so we hope) to be released book on -the SuSE Linux distribution. As there is no complete documentation for the -/proc file system and we've used many freely available sources to write these -chapters, it seems only fair to give the work back to the Linux community. -This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm -afraid it's still far from complete, but we hope it will be useful. As far as -we know, it is the first 'all-in-one' document about the /proc file system. It -is focused on the Intel x86 hardware, so if you are looking for PPC, ARM, -SPARC, AXP, etc., features, you probably won't find what you are looking for. -It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But -additions and patches are welcome and will be added to this document if you -mail them to Bodo. - -We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of -other people for help compiling this documentation. We'd also like to extend a -special thank you to Andi Kleen for documentation, which we relied on heavily -to create this document, as well as the additional information he provided. -Thanks to everybody else who contributed source or docs to the Linux kernel -and helped create a great piece of software... :) - -If you have any comments, corrections or additions, please don't hesitate to -contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this -document. 
- -The latest version of this document is available online at -http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html - -If the above direction does not works for you, you could try the kernel -mailing list at linux-kernel@vger.kernel.org and/or try to reach me at -comandante@zaralinux.com. - -0.2 Legal Stuff ---------------- - -We don't guarantee the correctness of this document, and if you come to us -complaining about how you screwed up your system because of incorrect -documentation, we won't feel responsible... - ------------------------------------------------------------------------------- -CHAPTER 1: COLLECTING SYSTEM INFORMATION ------------------------------------------------------------------------------- - ------------------------------------------------------------------------------- -In This Chapter ------------------------------------------------------------------------------- -* Investigating the properties of the pseudo file system /proc and its - ability to provide information on the running Linux system -* Examining /proc's structure -* Uncovering various information about the kernel and the processes running - on the system ------------------------------------------------------------------------------- - - -The proc file system acts as an interface to internal data structures in the -kernel. It can be used to obtain information about the system and to change -certain kernel parameters at runtime (sysctl). - -First, we'll take a look at the read-only parts of /proc. In Chapter 2, we -show you how you can use /proc/sys to change settings. - -1.1 Process-Specific Subdirectories ------------------------------------ - -The directory /proc contains (among other things) one subdirectory for each -process running on the system, which is named after the process ID (PID). - -The link self points to the process reading the file system. Each process -subdirectory has the entries listed in Table 1-1. 
- -Note that an open a file descriptor to /proc/ or to any of its -contained files or subdirectories does not prevent being reused -for some other process in the event that exits. Operations on -open /proc/ file descriptors corresponding to dead processes -never act on any new process that the kernel may, through chance, have -also assigned the process ID . Instead, operations on these FDs -usually fail with ESRCH. - -Table 1-1: Process specific entries in /proc -.............................................................................. - File Content - clear_refs Clears page referenced bits shown in smaps output - cmdline Command line arguments - cpu Current and last cpu in which it was executed (2.4)(smp) - cwd Link to the current working directory - environ Values of environment variables - exe Link to the executable of this process - fd Directory, which contains all file descriptors - maps Memory maps to executables and library files (2.4) - mem Memory held by this process - root Link to the root directory of this process - stat Process status - statm Process memory status information - status Process status in human readable form - wchan Present with CONFIG_KALLSYMS=y: it shows the kernel function - symbol the task is blocked in - or "0" if not blocked. - pagemap Page table - stack Report full stack trace, enable via CONFIG_STACKTRACE - smaps An extension based on maps, showing the memory consumption of - each mapping and flags associated with it - smaps_rollup Accumulated smaps stats for all mappings of the process. This - can be derived from smaps, but is faster and more convenient - numa_maps An extension based on maps, showing the memory locality and - binding policy as well as mem usage (in pages) of each mapping. -.............................................................................. 
- -For example, to get the status information of a process, all you have to do is -read the file /proc/PID/status: - - >cat /proc/self/status - Name: cat - State: R (running) - Tgid: 5452 - Pid: 5452 - PPid: 743 - TracerPid: 0 (2.4) - Uid: 501 501 501 501 - Gid: 100 100 100 100 - FDSize: 256 - Groups: 100 14 16 - VmPeak: 5004 kB - VmSize: 5004 kB - VmLck: 0 kB - VmHWM: 476 kB - VmRSS: 476 kB - RssAnon: 352 kB - RssFile: 120 kB - RssShmem: 4 kB - VmData: 156 kB - VmStk: 88 kB - VmExe: 68 kB - VmLib: 1412 kB - VmPTE: 20 kb - VmSwap: 0 kB - HugetlbPages: 0 kB - CoreDumping: 0 - THP_enabled: 1 - Threads: 1 - SigQ: 0/28578 - SigPnd: 0000000000000000 - ShdPnd: 0000000000000000 - SigBlk: 0000000000000000 - SigIgn: 0000000000000000 - SigCgt: 0000000000000000 - CapInh: 00000000fffffeff - CapPrm: 0000000000000000 - CapEff: 0000000000000000 - CapBnd: ffffffffffffffff - CapAmb: 0000000000000000 - NoNewPrivs: 0 - Seccomp: 0 - Speculation_Store_Bypass: thread vulnerable - voluntary_ctxt_switches: 0 - nonvoluntary_ctxt_switches: 1 - -This shows you nearly the same information you would get if you viewed it with -the ps command. In fact, ps uses the proc file system to obtain its -information. But you get a more detailed view of the process by reading the -file /proc/PID/status. It fields are described in table 1-2. - -The statm file contains more detailed information about the process -memory usage. Its seven fields are explained in Table 1-3. The stat file -contains details information about the process itself. Its fields are -explained in Table 1-4. - -(for SMP CONFIG users) -For making accounting scalable, RSS related information are handled in an -asynchronous manner and the value may not be very precise. To see a precise -snapshot of a moment, you can see /proc//smaps file and scan page table. -It's slow but very precise. - -Table 1-2: Contents of the status files (as of 4.19) -.............................................................................. 
- Field Content - Name filename of the executable - Umask file mode creation mask - State state (R is running, S is sleeping, D is sleeping - in an uninterruptible wait, Z is zombie, - T is traced or stopped) - Tgid thread group ID - Ngid NUMA group ID (0 if none) - Pid process id - PPid process id of the parent process - TracerPid PID of process tracing this process (0 if not) - Uid Real, effective, saved set, and file system UIDs - Gid Real, effective, saved set, and file system GIDs - FDSize number of file descriptor slots currently allocated - Groups supplementary group list - NStgid descendant namespace thread group ID hierarchy - NSpid descendant namespace process ID hierarchy - NSpgid descendant namespace process group ID hierarchy - NSsid descendant namespace session ID hierarchy - VmPeak peak virtual memory size - VmSize total program size - VmLck locked memory size - VmPin pinned memory size - VmHWM peak resident set size ("high water mark") - VmRSS size of memory portions. It contains the three - following parts (VmRSS = RssAnon + RssFile + RssShmem) - RssAnon size of resident anonymous memory - RssFile size of resident file mappings - RssShmem size of resident shmem memory (includes SysV shm, - mapping of tmpfs and shared anonymous mappings) - VmData size of private data segments - VmStk size of stack segments - VmExe size of text segment - VmLib size of shared library code - VmPTE size of page table entries - VmSwap amount of swap used by anonymous private data - (shmem swap usage is not included) - HugetlbPages size of hugetlb memory portions - CoreDumping process's memory is currently being dumped - (killing the process may lead to a corrupted core) - THP_enabled process is allowed to use THP (returns 0 when - PR_SET_THP_DISABLE is set on the process - Threads number of threads - SigQ number of signals queued/max. 
number for queue - SigPnd bitmap of pending signals for the thread - ShdPnd bitmap of shared pending signals for the process - SigBlk bitmap of blocked signals - SigIgn bitmap of ignored signals - SigCgt bitmap of caught signals - CapInh bitmap of inheritable capabilities - CapPrm bitmap of permitted capabilities - CapEff bitmap of effective capabilities - CapBnd bitmap of capabilities bounding set - CapAmb bitmap of ambient capabilities - NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...) - Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) - Speculation_Store_Bypass speculative store bypass mitigation status - Cpus_allowed mask of CPUs on which this process may run - Cpus_allowed_list Same as previous, but in "list format" - Mems_allowed mask of memory nodes allowed to this process - Mems_allowed_list Same as previous, but in "list format" - voluntary_ctxt_switches number of voluntary context switches - nonvoluntary_ctxt_switches number of non voluntary context switches -.............................................................................. - -Table 1-3: Contents of the statm files (as of 2.6.8-rc3) -.............................................................................. - Field Content - size total program size (pages) (same as VmSize in status) - resident size of memory portions (pages) (same as VmRSS in status) - shared number of pages that are shared (i.e. backed by a file, same - as RssFile+RssShmem in status) - trs number of pages that are 'code' (not including libs; broken, - includes data segment) - lrs number of pages of library (always 0 on 2.6) - drs number of pages of data/stack (including libs; broken, - includes library text) - dt number of dirty pages (always 0 on 2.6) -.............................................................................. - - -Table 1-4: Contents of the stat files (as of 2.6.30-rc7) -.............................................................................. 
- Field Content - pid process id - tcomm filename of the executable - state state (R is running, S is sleeping, D is sleeping in an - uninterruptible wait, Z is zombie, T is traced or stopped) - ppid process id of the parent process - pgrp pgrp of the process - sid session id - tty_nr tty the process uses - tty_pgrp pgrp of the tty - flags task flags - min_flt number of minor faults - cmin_flt number of minor faults with child's - maj_flt number of major faults - cmaj_flt number of major faults with child's - utime user mode jiffies - stime kernel mode jiffies - cutime user mode jiffies with child's - cstime kernel mode jiffies with child's - priority priority level - nice nice level - num_threads number of threads - it_real_value (obsolete, always 0) - start_time time the process started after system boot - vsize virtual memory size - rss resident set memory size - rsslim current limit in bytes on the rss - start_code address above which program text can run - end_code address below which program text can run - start_stack address of the start of the main process stack - esp current value of ESP - eip current value of EIP - pending bitmap of pending signals - blocked bitmap of blocked signals - sigign bitmap of ignored signals - sigcatch bitmap of caught signals - 0 (place holder, used to be the wchan address, use /proc/PID/wchan instead) - 0 (place holder) - 0 (place holder) - exit_signal signal to send to parent thread on exit - task_cpu which CPU the task is scheduled on - rt_priority realtime priority - policy scheduling policy (man sched_setscheduler) - blkio_ticks time spent waiting for block IO - gtime guest time of the task in jiffies - cgtime guest time of the task children in jiffies - start_data address above which program data+bss is placed - end_data address below which program data+bss is placed - start_brk address above which program heap can be expanded with brk() - arg_start address above which program command line is placed - arg_end address 
below which program command line is placed - env_start address above which program environment is placed - env_end address below which program environment is placed - exit_code the thread's exit_code in the form reported by the waitpid system call -.............................................................................. - -The /proc/PID/maps file contains the currently mapped memory regions and -their access permissions. - -The format is: - -address perms offset dev inode pathname - -08048000-08049000 r-xp 00000000 03:00 8312 /opt/test -08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test -0804a000-0806b000 rw-p 00000000 00:00 0 [heap] -a7cb1000-a7cb2000 ---p 00000000 00:00 0 -a7cb2000-a7eb2000 rw-p 00000000 00:00 0 -a7eb2000-a7eb3000 ---p 00000000 00:00 0 -a7eb3000-a7ed5000 rw-p 00000000 00:00 0 -a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 -a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 -a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 -a800b000-a800e000 rw-p 00000000 00:00 0 -a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 -a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 -a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 -a8024000-a8027000 rw-p 00000000 00:00 0 -a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 -a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 -a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 -aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] -ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] - -where "address" is the address space in the process that it occupies, "perms" -is a set of permissions: - - r = read - w = write - x = execute - s = shared - p = private (copy on write) - -"offset" is the offset into the mapping, "dev" is the device (major:minor), and -"inode" is the inode on that device. 0 indicates that no inode is associated -with the memory region, as the case would be with BSS (uninitialized data). 
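As an illustration of the column layout described above, the following Python sketch splits a single maps line into its fields. The function name parse_maps_line and the dictionary keys are hypothetical, chosen only for this example; the sample line is taken from the listing above. It assumes the dev column is printed as hexadecimal major:minor.

```python
def parse_maps_line(line):
    """Split one /proc/PID/maps line into its documented columns."""
    # The pathname column is optional and may contain spaces,
    # so split at most five times and keep the remainder intact.
    parts = line.split(None, 5)
    addr, perms, offset, dev, inode = parts[:5]
    pathname = parts[5].strip() if len(parts) > 5 else ""

    start, end = (int(x, 16) for x in addr.split("-"))
    major, minor = dev.split(":")
    return {
        "start": start,
        "end": end,
        "readable": perms[0] == "r",
        "writable": perms[1] == "w",
        "executable": perms[2] == "x",
        "private": perms[3] == "p",   # "s" would mean shared
        "offset": int(offset, 16),
        "dev": (int(major, 16), int(minor, 16)),  # assumed hex major:minor
        "inode": int(inode),          # 0 means no inode is associated
        "pathname": pathname,         # empty for anonymous mappings
    }


entry = parse_maps_line("08048000-08049000 r-xp 00000000 03:00 8312 /opt/test")
anon = parse_maps_line("a7cb1000-a7cb2000 ---p 00000000 00:00 0")
```

Reading every line of /proc/self/maps through such a function gives a list of regions; because the file is racy, a consumer would normally read it in one call before parsing.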
-The "pathname" shows the name associated file for this mapping. If the mapping -is not associated with a file: - - [heap] = the heap of the program - [stack] = the stack of the main process - [vdso] = the "virtual dynamic shared object", - the kernel system call handler - - or if empty, the mapping is anonymous. - -The /proc/PID/smaps is an extension based on maps, showing the memory -consumption for each of the process's mappings. For each mapping (aka Virtual -Memory Area, or VMA) there is a series of lines such as the following: - -08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash - -Size: 1084 kB -KernelPageSize: 4 kB -MMUPageSize: 4 kB -Rss: 892 kB -Pss: 374 kB -Shared_Clean: 892 kB -Shared_Dirty: 0 kB -Private_Clean: 0 kB -Private_Dirty: 0 kB -Referenced: 892 kB -Anonymous: 0 kB -LazyFree: 0 kB -AnonHugePages: 0 kB -ShmemPmdMapped: 0 kB -Shared_Hugetlb: 0 kB -Private_Hugetlb: 0 kB -Swap: 0 kB -SwapPss: 0 kB -KernelPageSize: 4 kB -MMUPageSize: 4 kB -Locked: 0 kB -THPeligible: 0 -VmFlags: rd ex mr mw me dw - -The first of these lines shows the same information as is displayed for the -mapping in /proc/PID/maps. Following lines show the size of the mapping -(size); the size of each page allocated when backing a VMA (KernelPageSize), -which is usually the same as the size in the page table entries; the page size -used by the MMU when backing a VMA (in most cases, the same as KernelPageSize); -the amount of the mapping that is currently resident in RAM (RSS); the -process' proportional share of this mapping (PSS); and the number of clean and -dirty shared and private pages in the mapping. - -The "proportional set size" (PSS) of a process is the count of pages it has -in memory, where each page is divided by the number of processes sharing it. -So if a process has 1000 pages all to itself, and 1000 shared with one other -process, its PSS will be 1500. -Note that even a page which is part of a MAP_SHARED mapping, but has only -a single pte mapped, i.e. 
is currently used by only one process, is accounted -as private and not as shared. -"Referenced" indicates the amount of memory currently marked as referenced or -accessed. -"Anonymous" shows the amount of memory that does not belong to any file. Even -a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE -and a page is modified, the file page is replaced by a private anonymous copy. -"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE). -The memory isn't freed immediately with madvise(). It's freed under memory -pressure if the memory is clean. Please note that the printed value might -be lower than the real value due to optimizations used in the current -implementation. If this is not desirable please file a bug report. -"AnonHugePages" shows the amount of memory backed by transparent hugepage. -"ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by -huge pages. -"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by -hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical -reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field. -"Swap" shows how much would-be-anonymous memory is also used, but out on swap. -For shmem mappings, "Swap" includes also the size of the mapped (and not -replaced by copy-on-write) part of the underlying shmem object out on swap. -"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this -does not take into account swapped out page of underlying shmem objects. -"Locked" indicates whether the mapping is locked in memory or not. -"THPeligible" indicates whether the mapping is eligible for allocating THP -pages - 1 if true, 0 otherwise. It just shows the current status. - -"VmFlags" field deserves a separate description. This member represents the kernel -flags associated with the particular virtual memory area in two letter encoded -manner.
The codes are the following: - rd - readable - wr - writeable - ex - executable - sh - shared - mr - may read - mw - may write - me - may execute - ms - may share - gd - stack segment grows down - pf - pure PFN range - dw - disabled write to the mapped file - lo - pages are locked in memory - io - memory mapped I/O area - sr - sequential read advice provided - rr - random read advice provided - dc - do not copy area on fork - de - do not expand area on remapping - ac - area is accountable - nr - swap space is not reserved for the area - ht - area uses huge tlb pages - ar - architecture specific flag - dd - do not include area into core dump - sd - soft-dirty flag - mm - mixed map area - hg - huge page advice flag - nh - no-huge page advice flag - mg - mergeable advice flag - -Note that there is no guarantee that every flag and associated mnemonic will -be present in all future kernel releases. Things change: existing flags may -vanish and new ones may be added. Interpretation of their meaning -might change in future as well. So each consumer of these flags has to -follow each specific kernel version for the exact semantics. - -This file is only present if the CONFIG_MMU kernel configuration option is -enabled. - -Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent -output can be achieved only in a single read call). -This typically manifests when doing partial reads of these files while the -memory map is being modified. Despite the races, we do provide the following -guarantees: - -1) The mapped addresses never go backwards, which implies no two - regions will ever overlap. -2) If there is something at a given vaddr during the entirety of the - life of the smaps/maps walk, there will be some output for it. - -The /proc/PID/smaps_rollup file includes the same fields as /proc/PID/smaps, -but their values are the sums of the corresponding values for all mappings of -the process.
Additionally, it contains these fields: - -Pss_Anon -Pss_File -Pss_Shmem - -They represent the proportional shares of anonymous, file, and shmem pages, as -described for smaps above. These fields are omitted in smaps since each -mapping identifies the type (anon, file, or shmem) of all pages it contains. -Thus all information in smaps_rollup can be derived from smaps, but at a -significantly higher cost. - -The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG -bits on both physical and virtual pages associated with a process, and the -soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst -for details). -To clear the bits for all the pages associated with the process - > echo 1 > /proc/PID/clear_refs - -To clear the bits for the anonymous pages associated with the process - > echo 2 > /proc/PID/clear_refs - -To clear the bits for the file mapped pages associated with the process - > echo 3 > /proc/PID/clear_refs - -To clear the soft-dirty bit - > echo 4 > /proc/PID/clear_refs - -To reset the peak resident set size ("high water mark") to the process's -current value: - > echo 5 > /proc/PID/clear_refs - -Any other value written to /proc/PID/clear_refs will have no effect. - -The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags -using /proc/kpageflags and number of times a page is mapped using -/proc/kpagecount. For detailed explanation, see -Documentation/admin-guide/mm/pagemap.rst. - -The /proc/pid/numa_maps is an extension based on maps, showing the memory -locality and binding policy, as well as the memory usage (in pages) of -each mapping. 
The output follows a general format where mapping details get -summarized separated by blank spaces, one mapping per each file line: - -address policy mapping details - -00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4 -00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4 -320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4 -320698b000 default file=/lib64/libc-2.12.so -3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4 -3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4 -7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4 -7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4 -7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048 -7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4 -7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4 - -Where: -"address" is the starting address for the mapping; -"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst); -"mapping details" summarizes mapping data such as mapping type, page usage counters, -node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page -size, in KB, that is backing the mapping up. - -1.2 Kernel data ---------------- - -Similar to the process entries, the kernel data files give information about -the running kernel. 
The files used to obtain this information are contained in -/proc and are listed in Table 1-5. Not all of these will be present in your -system. Which files are present depends on the kernel configuration and the -loaded modules. - -Table 1-5: Kernel info in /proc -.............................................................................. - File Content - apm Advanced power management info - buddyinfo Kernel memory allocator information (see text) (2.5) - bus Directory containing bus specific information - cmdline Kernel command line - cpuinfo Info about the CPU - devices Available devices (block and character) - dma Used DMA channels - filesystems Supported filesystems - driver Various drivers grouped here, currently rtc (2.4) - execdomains Execdomains, related to security (2.4) - fb Frame Buffer devices (2.4) - fs File system parameters, currently nfs/exports (2.4) - ide Directory containing info about the IDE subsystem - interrupts Interrupt usage - iomem Memory map (2.4) - ioports I/O port usage - irq Masks for irq to cpu affinity (2.4)(smp?)
- isapnp ISA PnP (Plug&Play) Info (2.4) - kcore Kernel core image (can be ELF or A.OUT (deprecated in 2.4)) - kmsg Kernel messages - ksyms Kernel symbol table - loadavg Load average of last 1, 5 & 15 minutes - locks Kernel locks - meminfo Memory info - misc Miscellaneous - modules List of loaded modules - mounts Mounted filesystems - net Networking info (see text) - pagetypeinfo Additional page allocator information (see text) (2.5) - partitions Table of partitions known to the system - pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, - decoupled by lspci) (2.4) - rtc Real time clock - scsi SCSI info (see text) - slabinfo Slab pool info - softirqs softirq usage - stat Overall statistics - swaps Swap space utilization - sys See chapter 2 - sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) - tty Info of tty drivers - uptime Wall clock since boot, combined idle time of all cpus - version Kernel version - video bttv info of video resources (2.4) - vmallocinfo Show vmalloced areas -..............................................................................
- -You can, for example, check which interrupts are currently in use and what -they are used for by looking in the file /proc/interrupts: - - > cat /proc/interrupts - CPU0 - 0: 8728810 XT-PIC timer - 1: 895 XT-PIC keyboard - 2: 0 XT-PIC cascade - 3: 531695 XT-PIC aha152x - 4: 2014133 XT-PIC serial - 5: 44401 XT-PIC pcnet_cs - 8: 2 XT-PIC rtc - 11: 8 XT-PIC i82365 - 12: 182918 XT-PIC PS/2 Mouse - 13: 1 XT-PIC fpu - 14: 1232265 XT-PIC ide0 - 15: 7 XT-PIC ide1 - NMI: 0 - -In 2.4.* a couple of lines were added to this file, LOC & ERR (this time the -output is from an SMP machine): - - > cat /proc/interrupts - - CPU0 CPU1 - 0: 1243498 1214548 IO-APIC-edge timer - 1: 8949 8958 IO-APIC-edge keyboard - 2: 0 0 XT-PIC cascade - 5: 11286 10161 IO-APIC-edge soundblaster - 8: 1 0 IO-APIC-edge rtc - 9: 27422 27407 IO-APIC-edge 3c503 - 12: 113645 113873 IO-APIC-edge PS/2 Mouse - 13: 0 0 XT-PIC fpu - 14: 22491 24012 IO-APIC-edge ide0 - 15: 2183 2415 IO-APIC-edge ide1 - 17: 30564 30414 IO-APIC-level eth0 - 18: 177 164 IO-APIC-level bttv - NMI: 2457961 2457959 - LOC: 2457882 2457881 - ERR: 2155 - -NMI is incremented in this case because every timer interrupt generates a NMI -(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups. - -LOC is the local interrupt counter of the internal APIC of every CPU. - -ERR is incremented in the case of errors in the IO-APIC bus (the bus that -connects the CPUs in a SMP system). This means that an error has been detected; -the IO-APIC automatically retries the transmission, so it should not be a big -problem, but you should read the SMP-FAQ. - -In 2.6.2* /proc/interrupts was expanded again. This time the goal was for -/proc/interrupts to display every IRQ vector in use by the system, not -just those considered 'most important'. The new vectors are: - - THR -- interrupt raised when a machine check threshold counter - (typically counting ECC corrected errors of memory or cache) exceeds - a configurable threshold.
Only available on some systems. - - TRM -- a thermal event interrupt occurs when a temperature threshold - has been exceeded for the CPU. This interrupt may also be generated - when the temperature drops back to normal. - - SPU -- a spurious interrupt is some interrupt that was raised then lowered - by some IO device before it could be fully processed by the APIC. Hence - the APIC sees the interrupt but does not know what device it came from. - For this case the APIC will generate the interrupt with an IRQ vector - of 0xff. This might also be generated by chipset bugs. - - RES, CAL, TLB -- rescheduling, call and TLB flush interrupts are - sent from one CPU to another per the needs of the OS. Typically, - their statistics are used by kernel developers and interested users to - determine the occurrence of interrupts of the given type. - -The above IRQ vectors are displayed only when relevant. For example, -the threshold vector does not exist on x86_64 platforms. Others are -suppressed when the system is a uniprocessor. As of this writing, only -i386 and x86_64 platforms support the new IRQ vector displays. - -Of some interest is the introduction of the /proc/irq directory to 2.4. -It can be used to set IRQ to CPU affinity; this means that you can "hook" an -IRQ to only one CPU, or exclude a CPU from handling IRQs. The irq subdir -contains one subdir for each IRQ, and two files: default_smp_affinity and -prof_cpu_mask. - -For example - > ls /proc/irq/ - 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask - 1 11 13 15 17 19 3 5 7 9 default_smp_affinity - > ls /proc/irq/0/ - smp_affinity - -smp_affinity is a bitmask in which you can specify which CPUs can handle the -IRQ. You can set it by doing: - - > echo 1 > /proc/irq/10/smp_affinity - -This means that only the first CPU will handle the IRQ, but you can also echo -5 which means that only the first and third CPU can handle the IRQ.
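The relation between CPU numbers and the smp_affinity bitmask can be sketched in a few lines of Python. This is only an illustration: the helper names cpus_to_mask and mask_to_cpus are hypothetical, and the sketch assumes a plain hexadecimal mask with CPU 0 in the least significant bit, as in the examples above.

```python
def cpus_to_mask(cpus):
    """Return the hex bitmask string with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu          # CPU n corresponds to bit n
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Return the sorted list of CPU numbers whose bits are set."""
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


first_and_third = cpus_to_mask([0, 2])   # first and third CPU
default_cpus = mask_to_cpus("ffffffff")  # the default mask shown in this text
```

Here the first and third CPU (numbers 0 and 2) set bits 0 and 2, which is binary 101, i.e. the value 5 mentioned above.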
- -The contents of each smp_affinity file is the same by default: - - > cat /proc/irq/0/smp_affinity - ffffffff - -There is an alternate interface, smp_affinity_list which allows specifying -a cpu range instead of a bitmask: - - > cat /proc/irq/0/smp_affinity_list - 1024-1031 - -The default_smp_affinity mask applies to all non-active IRQs, which are the -IRQs which have not yet been allocated/activated, and hence which lack a -/proc/irq/[0-9]* directory. - -The node file on an SMP system shows the node to which the device using the IRQ -reports itself as being attached. This hardware locality information does not -include information about any possible driver locality preference. - -prof_cpu_mask specifies which CPUs are to be profiled by the system wide -profiler. Default value is ffffffff (all cpus if there are only 32 of them). - -The way IRQs are routed is handled by the IO-APIC, and it's Round Robin -between all the CPUs which are allowed to handle it. As usual the kernel has -more info than you and does a better job than you, so the defaults are the -best choice for almost everyone. [Note this applies only to those IO-APIC's -that support "Round Robin" interrupt distribution.] - -There are three more important subdirectories in /proc: net, scsi, and sys. -The general rule is that the contents, or even the existence of these -directories, depend on your kernel configuration. If SCSI is not enabled, the -directory scsi may not exist. The same is true with the net, which is there -only when networking support is present in the running kernel. - -The slabinfo file gives information about memory usage at the slab level. -Linux uses slab pools for memory management above page level in version 2.2. -Commonly used objects have their own slab pool (such as network buffers, -directory cache, and so on). - -.............................................................................. - -> cat /proc/buddyinfo - -Node 0, zone DMA 0 4 5 4 4 3 ... 
-Node 0, zone Normal 1 0 0 1 101 8 ... -Node 0, zone HighMem 2 0 0 1 1 0 ... - -External fragmentation is a problem under some workloads, and buddyinfo is a -useful tool for helping diagnose these problems. Buddyinfo will give you a -clue as to how big an area you can safely allocate, or why a previous -allocation failed. - -Each column represents the number of pages of a certain order which are -available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in -ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE -available in ZONE_NORMAL, etc... - -More information relevant to external fragmentation can be found in -pagetypeinfo. - -> cat /proc/pagetypeinfo -Page block order: 9 -Pages per block: 512 - -Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 -Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 -Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 -Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 -Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 -Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 -Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 -Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 -Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 -Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 -Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 - -Number of blocks type Unmovable Reclaimable Movable Reserve Isolate -Node 0, zone DMA 2 0 5 1 0 -Node 0, zone DMA32 41 6 967 2 0 - -Fragmentation avoidance in the kernel works by grouping pages of different -migrate types into the same contiguous regions of memory called page blocks. -A page block is typically the size of the default hugepage size e.g. 2MB on -X86-64. By keeping pages grouped based on their ability to move, the kernel -can reclaim pages within a page block to satisfy a high-order allocation. - -The pagetypeinfo begins with information on the size of a page block.
It -then gives the same type of information as buddyinfo except broken down -by migrate-type and finishes with details on how many page blocks of each -type exist. - -If min_free_kbytes has been tuned correctly (recommendations made by hugeadm -from libhugetlbfs https://github.com/libhugetlbfs/libhugetlbfs/), one can -make an estimate of the likely number of huge pages that can be allocated -at a given point in time. All the "Movable" blocks should be allocatable -unless memory has been mlock()'d. Some of the Reclaimable blocks should -also be allocatable although a lot of filesystem metadata may have to be -reclaimed to achieve this. - -.............................................................................. - -meminfo: - -Provides information about distribution and utilization of memory. This -varies by architecture and compile options. The following is from a -16GB PIII, which has highmem enabled. You may not have all of these fields. - -> cat /proc/meminfo - -MemTotal: 16344972 kB -MemFree: 13634064 kB -MemAvailable: 14836172 kB -Buffers: 3656 kB -Cached: 1195708 kB -SwapCached: 0 kB -Active: 891636 kB -Inactive: 1077224 kB -HighTotal: 15597528 kB -HighFree: 13629632 kB -LowTotal: 747444 kB -LowFree: 4432 kB -SwapTotal: 0 kB -SwapFree: 0 kB -Dirty: 968 kB -Writeback: 0 kB -AnonPages: 861800 kB -Mapped: 280372 kB -Shmem: 644 kB -KReclaimable: 168048 kB -Slab: 284364 kB -SReclaimable: 159856 kB -SUnreclaim: 124508 kB -PageTables: 24448 kB -NFS_Unstable: 0 kB -Bounce: 0 kB -WritebackTmp: 0 kB -CommitLimit: 7669796 kB -Committed_AS: 100056 kB -VmallocTotal: 112216 kB -VmallocUsed: 428 kB -VmallocChunk: 111088 kB -Percpu: 62080 kB -HardwareCorrupted: 0 kB -AnonHugePages: 49152 kB -ShmemHugePages: 0 kB -ShmemPmdMapped: 0 kB - - - MemTotal: Total usable ram (i.e. 
physical ram minus a few reserved - bits and the kernel binary code) - MemFree: The sum of LowFree+HighFree -MemAvailable: An estimate of how much memory is available for starting new - applications, without swapping. Calculated from MemFree, - SReclaimable, the size of the file LRU lists, and the low - watermarks in each zone. - The estimate takes into account that the system needs some - page cache to function well, and that not all reclaimable - slab will be reclaimable, due to items being in use. The - impact of those factors will vary from system to system. - Buffers: Relatively temporary storage for raw disk blocks - shouldn't get tremendously large (20MB or so) - Cached: in-memory cache for files read from the disk (the - pagecache). Doesn't include SwapCached - SwapCached: Memory that once was swapped out, is swapped back in but - still also is in the swapfile (if memory is needed it - doesn't need to be swapped out AGAIN because it is already - in the swapfile. This saves I/O) - Active: Memory that has been used more recently and usually not - reclaimed unless absolutely necessary. - Inactive: Memory which has been less recently used. It is more - eligible to be reclaimed for other purposes - HighTotal: - HighFree: Highmem is all memory above ~860MB of physical memory - Highmem areas are for use by userspace programs, or - for the pagecache. The kernel must use tricks to access - this memory, making it slower to access than lowmem. - LowTotal: - LowFree: Lowmem is memory which can be used for everything that - highmem can be used for, but it is also available for the - kernel's use for its own data structures. Among many - other things, it is where everything from the Slab is - allocated. Bad things happen when you're out of lowmem. 
- SwapTotal: total amount of swap space available - SwapFree: Memory which has been evicted from RAM, and is temporarily - on the disk - Dirty: Memory which is waiting to get written back to the disk - Writeback: Memory which is actively being written back to the disk - AnonPages: Non-file backed pages mapped into userspace page tables -HardwareCorrupted: The amount of RAM/memory in KB, the kernel identifies as - corrupted. -AnonHugePages: Non-file backed huge pages mapped into userspace page tables - Mapped: files which have been mmaped, such as libraries - Shmem: Total memory used by shared memory (shmem) and tmpfs -ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated - with huge pages -ShmemPmdMapped: Shared memory mapped into userspace with huge pages -KReclaimable: Kernel allocations that the kernel will attempt to reclaim - under memory pressure. Includes SReclaimable (below), and other - direct allocations with a shrinker. - Slab: in-kernel data structures cache -SReclaimable: Part of Slab, that might be reclaimed, such as caches - SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure - PageTables: amount of memory dedicated to the lowest level of page - tables. -NFS_Unstable: NFS pages sent to the server, but not yet committed to stable - storage - Bounce: Memory used for block device "bounce buffers" -WritebackTmp: Memory used by FUSE for temporary writeback buffers - CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), - this is the total amount of memory currently available to - be allocated on the system. This limit is only adhered to - if strict overcommit accounting is enabled (mode 2 in - 'vm.overcommit_memory'). 
- The CommitLimit is calculated with the following formula: - CommitLimit = ([total RAM pages] - [total huge TLB pages]) * - overcommit_ratio / 100 + [total swap pages] - For example, on a system with 1G of physical RAM and 7G - of swap with a `vm.overcommit_ratio` of 30 it would - yield a CommitLimit of 7.3G. - For more details, see the memory overcommit documentation - in vm/overcommit-accounting. -Committed_AS: The amount of memory presently allocated on the system. - The committed memory is a sum of all of the memory which - has been allocated by processes, even if it has not been - "used" by them as of yet. A process which malloc()'s 1G - of memory, but only touches 300M of it will show up as - using 1G. This 1G is memory which has been "committed" to - by the VM and can be used at any time by the allocating - application. With strict overcommit enabled on the system - (mode 2 in 'vm.overcommit_memory'), allocations which would - exceed the CommitLimit (detailed above) will not be permitted. - This is useful if one needs to guarantee that processes will - not fail due to lack of memory once that memory has been - successfully allocated. -VmallocTotal: total size of vmalloc memory area - VmallocUsed: amount of vmalloc area which is used -VmallocChunk: largest contiguous block of vmalloc area which is free - Percpu: Memory allocated to the percpu allocator used to back percpu - allocations. This stat excludes the cost of metadata. - -.............................................................................. - -vmallocinfo: - -Provides information about vmalloced/vmapped areas.
One line per area, -containing the virtual address range of the area, size in bytes, -caller information of the creator, and optional information depending -on the kind of area : - - pages=nr number of pages - phys=addr if a physical address was specified - ioremap I/O mapping (ioremap() and friends) - vmalloc vmalloc() area - vmap vmap()ed pages - user VM_USERMAP area - vpages buffer for pages pointers was vmalloced (huge area) - N=nr (Only on NUMA kernels) - Number of pages allocated on memory node - -> cat /proc/vmallocinfo -0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... - /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 -0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... - /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 -0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f... - phys=7fee8000 ioremap -0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... - phys=7fee7000 ioremap -0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 -0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... - /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 -0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... - pages=2 vmalloc N1=2 -0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... - /0x130 [x_tables] pages=4 vmalloc N0=4 -0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... - pages=14 vmalloc N2=14 -0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... - pages=4 vmalloc N1=4 -0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... - pages=2 vmalloc N1=2 -0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... - pages=10 vmalloc N0=10 - -.............................................................................. - -softirqs: - -Provides counts of softirq handlers serviced since boot time, for each cpu. 
- -> cat /proc/softirqs - CPU0 CPU1 CPU2 CPU3 - HI: 0 0 0 0 - TIMER: 27166 27120 27097 27034 - NET_TX: 0 0 0 17 - NET_RX: 42 0 0 39 - BLOCK: 0 0 107 1121 - TASKLET: 0 0 0 290 - SCHED: 27035 26983 26971 26746 - HRTIMER: 0 0 0 0 - RCU: 1678 1769 2178 2250 - - -1.3 IDE devices in /proc/ide ----------------------------- - -The subdirectory /proc/ide contains information about all IDE devices of which -the kernel is aware. There is one subdirectory for each IDE controller, the -file drivers and a link for each IDE device, pointing to the device directory -in the controller specific subtree. - -The file drivers contains general information about the drivers used for the -IDE devices: - - > cat /proc/ide/drivers - ide-cdrom version 4.53 - ide-disk version 1.08 - -More detailed information can be found in the controller specific -subdirectories. These are named ide0, ide1 and so on. Each of these -directories contains the files shown in table 1-6. - - -Table 1-6: IDE controller info in /proc/ide/ide? -.............................................................................. - File Content - channel IDE channel (0 or 1) - config Configuration (only for PCI/IDE bridge) - mate Mate name - model Type/Chipset of IDE controller -.............................................................................. - -Each device connected to a controller has a separate subdirectory in the -controllers directory. The files listed in table 1-7 are contained in these -directories. - - -Table 1-7: IDE device information -.............................................................................. 
- File Content - cache The cache - capacity Capacity of the medium (in 512Byte blocks) - driver driver and version - geometry physical and logical geometry - identify device identify block - media media type - model device identifier - settings device setup - smart_thresholds IDE disk management thresholds - smart_values IDE disk management values -.............................................................................. - -The most interesting file is settings. This file contains a nice overview of -the drive parameters: - - # cat /proc/ide/ide0/hda/settings - name value min max mode - ---- ----- --- --- ---- - bios_cyl 526 0 65535 rw - bios_head 255 0 255 rw - bios_sect 63 0 63 rw - breada_readahead 4 0 127 rw - bswap 0 0 1 r - file_readahead 72 0 2097151 rw - io_32bit 0 0 3 rw - keepsettings 0 0 1 rw - max_kb_per_request 122 1 127 rw - multcount 0 0 8 rw - nice1 1 0 1 rw - nowerr 0 0 1 rw - pio_mode write-only 0 255 w - slow 0 0 1 rw - unmaskirq 0 0 1 rw - using_dma 0 0 1 rw - - -1.4 Networking info in /proc/net --------------------------------- - -The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the -additional values you get for IP version 6 if you configure the kernel to -support this. Table 1-9 lists the files and their meaning. - - -Table 1-8: IPv6 info in /proc/net -.............................................................................. - File Content - udp6 UDP sockets (IPv6) - tcp6 TCP sockets (IPv6) - raw6 Raw device statistics (IPv6) - igmp6 IP multicast addresses, which this host joined (IPv6) - if_inet6 List of IPv6 interface addresses - ipv6_route Kernel routing table for IPv6 - rt6_stats Global IPv6 routing tables statistics - sockstat6 Socket statistics (IPv6) - snmp6 Snmp data (IPv6) -.............................................................................. - - -Table 1-9: Network info in /proc/net -.............................................................................. 
- File Content - arp Kernel ARP table - dev network devices with statistics - dev_mcast the Layer2 multicast groups a device is listening too - (interface index, label, number of references, number of bound - addresses). - dev_stat network device status - ip_fwchains Firewall chain linkage - ip_fwnames Firewall chain names - ip_masq Directory containing the masquerading tables - ip_masquerade Major masquerading table - netstat Network statistics - raw raw device statistics - route Kernel routing table - rpc Directory containing rpc info - rt_cache Routing cache - snmp SNMP data - sockstat Socket statistics - tcp TCP sockets - udp UDP sockets - unix UNIX domain sockets - wireless Wireless interface data (Wavelan etc) - igmp IP multicast addresses, which this host joined - psched Global packet scheduler parameters. - netlink List of PF_NETLINK sockets - ip_mr_vifs List of multicast virtual interfaces - ip_mr_cache List of multicast routing cache -.............................................................................. - -You can use this information to see which network devices are available in -your system and how much traffic was routed over those devices: - - > cat /proc/net/dev - Inter-|Receive |[... - face |bytes packets errs drop fifo frame compressed multicast|[... - lo: 908188 5596 0 0 0 0 0 0 [... - ppp0:15475140 20721 410 0 0 410 0 0 [... - eth0: 614530 7085 0 0 0 0 0 1 [... - - ...] Transmit - ...] bytes packets errs drop fifo colls carrier compressed - ...] 908188 5596 0 0 0 0 0 0 - ...] 1375103 17405 0 0 0 0 0 0 - ...] 1703981 5535 0 0 0 3 0 0 - -In addition, each Channel Bond interface has its own directory. For -example, the bond0 device will have a directory called /proc/net/bond0/. -It will contain information that is specific to that bond, such as the -current slaves of the bond, the link status of the slaves, and how -many times the slaves link has failed. 
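
The per-interface counters in the /proc/net/dev listing above can also be read
programmatically. The following is a minimal sketch, not part of the original
text: the function name parse_net_dev and the SAMPLE string are illustrative
assumptions, and the sample simply restates the listing with the receive and
transmit halves joined on one line, as the kernel prints them.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev-style text."""
    stats = {}
    for line in text.splitlines()[2:]:       # skip the two header lines
        if ":" not in line:
            continue
        # Split on the colon first: the counter may follow it without a space
        # (see "ppp0:15475140" in the listing).
        name, _, rest = line.partition(":")
        fields = rest.split()
        # Eight receive columns come first, so transmit bytes is field 8.
        stats[name.strip()] = (int(fields[0]), int(fields[8]))
    return stats


SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:  908188    5596    0    0    0     0          0         0   908188    5596    0    0    0     0       0          0
  ppp0:15475140   20721  410    0    0   410          0         0  1375103   17405    0    0    0     0       0          0
  eth0:  614530    7085    0    0    0     0          0         1  1703981    5535    0    0    0     3       0          0
"""

for name, (rx, tx) in parse_net_dev(SAMPLE).items():
    print(name, "rx_bytes=%d" % rx, "tx_bytes=%d" % tx)
```

On a live system the same function could be fed the contents of /proc/net/dev
instead of SAMPLE.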
- -1.5 SCSI info -------------- - -If you have a SCSI host adapter in your system, you'll find a subdirectory -named after the driver for this adapter in /proc/scsi. You'll also see a list -of all recognized SCSI devices in /proc/scsi: - - >cat /proc/scsi/scsi - Attached devices: - Host: scsi0 Channel: 00 Id: 00 Lun: 00 - Vendor: IBM Model: DGHS09U Rev: 03E0 - Type: Direct-Access ANSI SCSI revision: 03 - Host: scsi0 Channel: 00 Id: 06 Lun: 00 - Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 - Type: CD-ROM ANSI SCSI revision: 02 - - -The directory named after the driver has one file for each adapter found in -the system. These files contain information about the controller, including -the used IRQ and the IO address range. The amount of information shown is -dependent on the adapter you use. The example shows the output for an Adaptec -AHA-2940 SCSI adapter: - - > cat /proc/scsi/aic7xxx/0 - - Adaptec AIC7xxx driver version: 5.1.19/3.2.4 - Compile Options: - TCQ Enabled By Default : Disabled - AIC7XXX_PROC_STATS : Disabled - AIC7XXX_RESET_DELAY : 5 - Adapter Configuration: - SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter - Ultra Wide Controller - PCI MMAPed I/O Base: 0xeb001000 - Adapter SEEPROM Config: SEEPROM found and used. 
- Adaptec SCSI BIOS: Enabled - IRQ: 10 - SCBs: Active 0, Max Active 2, - Allocated 15, HW 16, Page 255 - Interrupts: 160328 - BIOS Control Word: 0x18b6 - Adapter Control Word: 0x005b - Extended Translation: Enabled - Disconnect Enable Flags: 0xffff - Ultra Enable Flags: 0x0001 - Tag Queue Enable Flags: 0x0000 - Ordered Queue Tag Flags: 0x0000 - Default Tag Queue Depth: 8 - Tagged Queue By Device array for aic7xxx host instance 0: - {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} - Actual queue depth per device for aic7xxx host instance 0: - {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} - Statistics: - (scsi0:0:0:0) - Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 - Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) - Total transfers 160151 (74577 reads and 85574 writes) - (scsi0:0:6:0) - Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 - Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) - Total transfers 0 (0 reads and 0 writes) - - -1.6 Parallel port info in /proc/parport ---------------------------------------- - -The directory /proc/parport contains information about the parallel ports of -your system. It has one subdirectory for each port, named after the port -number (0,1,2,...). - -These directories contain the four files shown in Table 1-10. - - -Table 1-10: Files in /proc/parport -.............................................................................. - File Content - autoprobe Any IEEE-1284 device ID information that has been acquired. - devices list of the device drivers using that port. A + will appear by the - name of the device currently using the port (it might not appear - against any). - hardware Parallel port's base address, IRQ line and DMA channel. - irq IRQ that parport is using for that port. This is in a separate - file to allow you to alter it by writing a new value in (IRQ - number or none). 
-.............................................................................. - -1.7 TTY info in /proc/tty -------------------------- - -Information about the available and actually used tty's can be found in the -directory /proc/tty.You'll find entries for drivers and line disciplines in -this directory, as shown in Table 1-11. - - -Table 1-11: Files in /proc/tty -.............................................................................. - File Content - drivers list of drivers and their usage - ldiscs registered line disciplines - driver/serial usage statistic and status of single tty lines -.............................................................................. - -To see which tty's are currently in use, you can simply look into the file -/proc/tty/drivers: - - > cat /proc/tty/drivers - pty_slave /dev/pts 136 0-255 pty:slave - pty_master /dev/ptm 128 0-255 pty:master - pty_slave /dev/ttyp 3 0-255 pty:slave - pty_master /dev/pty 2 0-255 pty:master - serial /dev/cua 5 64-67 serial:callout - serial /dev/ttyS 4 64-67 serial - /dev/tty0 /dev/tty0 4 0 system:vtmaster - /dev/ptmx /dev/ptmx 5 2 system - /dev/console /dev/console 5 1 system:console - /dev/tty /dev/tty 5 0 system:/dev/tty - unknown /dev/tty 4 1-63 console - - -1.8 Miscellaneous kernel statistics in /proc/stat -------------------------------------------------- - -Various pieces of information about kernel activity are available in the -/proc/stat file. All of the numbers reported in this file are aggregates -since the system first booted. For a quick look, simply cat the file: - - > cat /proc/stat - cpu 2255 34 2290 22625563 6290 127 456 0 0 0 - cpu0 1132 34 1441 11311718 3675 127 438 0 0 0 - cpu1 1123 0 849 11313845 2614 0 18 0 0 0 - intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] 
- ctxt 1990473 - btime 1062191376 - processes 2915 - procs_running 1 - procs_blocked 0 - softirq 183433 0 21755 12 39 1137 231 21459 2263 - -The very first "cpu" line aggregates the numbers in all of the other "cpuN" -lines. These numbers identify the amount of time the CPU has spent performing -different kinds of work. Time units are in USER_HZ (typically hundredths of a -second). The meanings of the columns are as follows, from left to right: - -- user: normal processes executing in user mode -- nice: niced processes executing in user mode -- system: processes executing in kernel mode -- idle: twiddling thumbs -- iowait: In a word, iowait stands for waiting for I/O to complete. But there - are several problems: - 1. Cpu will not wait for I/O to complete, iowait is the time that a task is - waiting for I/O to complete. When cpu goes into idle state for - outstanding task io, another task will be scheduled on this CPU. - 2. In a multi-core CPU, the task waiting for I/O to complete is not running - on any CPU, so the iowait of each CPU is difficult to calculate. - 3. The value of iowait field in /proc/stat will decrease in certain - conditions. - So, the iowait is not reliable by reading from /proc/stat. -- irq: servicing interrupts -- softirq: servicing softirqs -- steal: involuntary wait -- guest: running a normal guest -- guest_nice: running a niced guest - -The "intr" line gives counts of interrupts serviced since boot time, for each -of the possible system interrupts. The first column is the total of all -interrupts serviced including unnumbered architecture specific interrupts; -each subsequent column is the total for that particular numbered interrupt. -Unnumbered interrupts are not shown, only summed into the total. - -The "ctxt" line gives the total number of context switches across all CPUs. - -The "btime" line gives the time at which the system booted, in seconds since -the Unix epoch. 
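
The column order of the "cpu" lines described earlier in this section can be
turned into named values with a few lines of code. This is a hedged sketch:
the names CPU_FIELDS and parse_cpu_line are assumptions made for illustration,
and the sample line is the aggregate "cpu" line from the example output.

```python
# Column names in the order given in the text, left to right.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Map a "cpu" or "cpuN" line from /proc/stat to {column: ticks}."""
    label, *values = line.split()
    if not label.startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    # zip() stops at the shorter sequence, so a line with fewer columns
    # still parses; values are in USER_HZ ticks.
    return dict(zip(CPU_FIELDS, map(int, values)))


sample = "cpu  2255 34 2290 22625563 6290 127 456 0 0 0"
ticks = parse_cpu_line(sample)
for name in CPU_FIELDS:
    print("%-10s %d" % (name, ticks[name]))
```

Because the counters are aggregates since boot, a monitoring tool would
normally sample the line twice and work with the difference between the two
readings.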
- -The "processes" line gives the number of processes and threads created, which -includes (but is not limited to) those created by calls to the fork() and -clone() system calls. - -The "procs_running" line gives the total number of threads that are -running or ready to run (i.e., the total number of runnable threads). - -The "procs_blocked" line gives the number of processes currently blocked, -waiting for I/O to complete. - -The "softirq" line gives counts of softirqs serviced since boot time, for each -of the possible system softirqs. The first column is the total of all -softirqs serviced; each subsequent column is the total for that particular -softirq. - - -1.9 Ext4 file system parameters -------------------------------- - -Information about mounted ext4 file systems can be found in -/proc/fs/ext4. Each mounted filesystem will have a directory in -/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or -/proc/fs/ext4/dm-0). The files in each per-device directory are shown -in Table 1-12, below. - -Table 1-12: Files in /proc/fs/ext4/ -.............................................................................. - File Content - mb_groups details of multiblock allocator buddy cache of free blocks -.............................................................................. - -2.0 /proc/consoles ------------------- -Shows registered system console lines. 
- -To see which character device lines are currently used for the system console -/dev/console, you may simply look into the file /proc/consoles: - - > cat /proc/consoles - tty0 -WU (ECp) 4:7 - ttyS0 -W- (Ep) 4:64 - -The columns are: - - device name of the device - operations R = can do read operations - W = can do write operations - U = can do unblank - flags E = it is enabled - C = it is preferred console - B = it is primary boot console - p = it is used for printk buffer - b = it is not a TTY but a Braille device - a = it is safe to use when cpu is offline - major:minor major and minor number of the device separated by a colon - ------------------------------------------------------------------------------- -Summary ------------------------------------------------------------------------------- -The /proc file system serves information about the running system. It not only -allows access to process data but also allows you to request the kernel status -by reading files in the hierarchy. - -The directory structure of /proc reflects the types of information and makes -it easy, if not obvious, where to look for specific data. ------------------------------------------------------------------------------- - ------------------------------------------------------------------------------- -CHAPTER 2: MODIFYING SYSTEM PARAMETERS ------------------------------------------------------------------------------- - ------------------------------------------------------------------------------- -In This Chapter ------------------------------------------------------------------------------- -* Modifying kernel parameters by writing into files found in /proc/sys -* Exploring the files which modify certain parameters -* Review of the /proc/sys file tree ------------------------------------------------------------------------------- - - -A very interesting part of /proc is the directory /proc/sys. 
This is not only -a source of information, it also allows you to change parameters within the -kernel. Be very careful when attempting this. You can optimize your system, -but you can also cause it to crash. Never alter kernel parameters on a -production system. Set up a development machine and test to make sure that -everything works the way you want it to. You may have no alternative but to -reboot the machine once an error has been made. - -To change a value, simply echo the new value into the file. An example is -given below in the section on the file system data. You need to be root to do -this. You can create your own boot script to perform this every time your -system boots. - -The files in /proc/sys can be used to fine tune and monitor miscellaneous and -general things in the operation of the Linux kernel. Since some of the files -can inadvertently disrupt your system, it is advisable to read both -documentation and source before actually making adjustments. In any case, be -very careful when writing to any of these files. The entries in /proc may -change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt -review the kernel documentation in the directory /usr/src/linux/Documentation. -This chapter is heavily based on the documentation included in the pre 2.2 -kernels, and became part of it in version 2.2.1 of the Linux kernel. - -Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these -entries. - ------------------------------------------------------------------------------- -Summary ------------------------------------------------------------------------------- -Certain aspects of kernel behavior can be modified at runtime, without the -need to recompile the kernel, or even to reboot the system. The files in the -/proc/sys tree can not only be read, but also modified. You can use the echo -command to write value into these files, thereby changing the default settings -of the kernel. 
------------------------------------------------------------------------------- - ------------------------------------------------------------------------------- -CHAPTER 3: PER-PROCESS PARAMETERS ------------------------------------------------------------------------------- - -3.1 /proc//oom_adj & /proc//oom_score_adj- Adjust the oom-killer score --------------------------------------------------------------------------------- - -These file can be used to adjust the badness heuristic used to select which -process gets killed in out of memory conditions. - -The badness heuristic assigns a value to each candidate task ranging from 0 -(never kill) to 1000 (always kill) to determine which process is targeted. The -units are roughly a proportion along that range of allowed memory the process -may allocate from based on an estimation of its current memory and swap use. -For example, if a task is using all allowed memory, its badness score will be -1000. If it is using half of its allowed memory, its score will be 500. - -There is an additional factor included in the badness score: the current memory -and swap usage is discounted by 3% for root processes. - -The amount of "allowed" memory depends on the context in which the oom killer -was called. If it is due to the memory assigned to the allocating task's cpuset -being exhausted, the allowed memory represents the set of mems assigned to that -cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed -memory represents the set of mempolicy nodes. If it is due to a memory -limit (or swap limit) being reached, the allowed memory is that configured -limit. Finally, if it is due to the entire system being out of memory, the -allowed memory represents all allocatable resources. - -The value of /proc//oom_score_adj is added to the badness score before it -is used to determine which task to kill. Acceptable values range from -1000 -(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). 
This allows userspace to -polarize the preference for oom killing either by always preferring a certain -task or completely disabling it. The lowest possible value, -1000, is -equivalent to disabling oom killing entirely for that task since it will always -report a badness score of 0. - -Consequently, it is very simple for userspace to define the amount of memory to -consider for each task. Setting a /proc//oom_score_adj value of +500, for -example, is roughly equivalent to allowing the remainder of tasks sharing the -same system, cpuset, mempolicy, or memory controller resources to use at least -50% more memory. A value of -500, on the other hand, would be roughly -equivalent to discounting 50% of the task's allowed memory from being considered -as scoring against the task. - -For backwards compatibility with previous kernels, /proc//oom_adj may also -be used to tune the badness score. Its acceptable values range from -16 -(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 -(OOM_DISABLE) to disable oom killing entirely for that task. Its value is -scaled linearly with /proc//oom_score_adj. - -The value of /proc//oom_score_adj may be reduced no lower than the last -value set by a CAP_SYS_RESOURCE process. To reduce the value any lower -requires CAP_SYS_RESOURCE. - -Caveat: when a parent task is selected, the oom killer will sacrifice any first -generation children with separate address spaces instead, if possible. This -avoids servers and important system daemons from being killed and loses the -minimal amount of work. - - -3.2 /proc//oom_score - Display current oom-killer score -------------------------------------------------------------- - -This file can be used to check the current score used by the oom-killer is for -any given . Use it together with /proc//oom_score_adj to tune which -process should be killed in an out-of-memory situation. 
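
The arithmetic described in section 3.1 can be illustrated with a toy model.
This is only a sketch of the documented description, not the kernel's
implementation: the function badness_score and its parameters are
hypothetical, memory is given in arbitrary but consistent units, and only the
lower bound of 0 is enforced.

```python
def badness_score(usage, allowed, oom_score_adj=0, is_root=False):
    """Toy model of the documented badness heuristic (not kernel code).

    usage and allowed must be in the same unit (pages, bytes, ...).
    """
    if oom_score_adj == -1000:
        return 0                       # OOM_SCORE_ADJ_MIN: always reports 0
    if is_root:
        usage *= 0.97                  # usage discounted by 3% for root
    points = 1000 * usage / allowed    # proportion of the allowed memory
    return max(0, round(points + oom_score_adj))


print(badness_score(4096, 4096))         # task using all of its allowed memory
print(badness_score(2048, 4096))         # task using half of it
print(badness_score(2048, 4096, -500))   # same task with oom_score_adj = -500
```

The first two calls reproduce the examples from the text (all of the allowed
memory and half of it); the third shows how an oom_score_adj of -500 offsets
half of the allowed memory.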
- - -3.3 /proc//io - Display the IO accounting fields -------------------------------------------------------- - -This file contains IO statistics for each running process - -Example -------- - -test:/tmp # dd if=/dev/zero of=/tmp/test.dat & -[1] 3828 - -test:/tmp # cat /proc/3828/io -rchar: 323934931 -wchar: 323929600 -syscr: 632687 -syscw: 632675 -read_bytes: 0 -write_bytes: 323932160 -cancelled_write_bytes: 0 - - -Description ------------ - -rchar ------ - -I/O counter: chars read -The number of bytes which this task has caused to be read from storage. This -is simply the sum of bytes which this process passed to read() and pread(). -It includes things like tty IO and it is unaffected by whether or not actual -physical disk IO was required (the read might have been satisfied from -pagecache) - - -wchar ------ - -I/O counter: chars written -The number of bytes which this task has caused, or shall cause to be written -to disk. Similar caveats apply here as with rchar. - - -syscr ------ - -I/O counter: read syscalls -Attempt to count the number of read I/O operations, i.e. syscalls like read() -and pread(). - - -syscw ------ - -I/O counter: write syscalls -Attempt to count the number of write I/O operations, i.e. syscalls like -write() and pwrite(). - - -read_bytes ----------- - -I/O counter: bytes read -Attempt to count the number of bytes which this process really did cause to -be fetched from the storage layer. Done at the submit_bio() level, so it is -accurate for block-backed filesystems. - - -write_bytes ------------ - -I/O counter: bytes written -Attempt to count the number of bytes which this process caused to be sent to -the storage layer. This is done at page-dirtying time. - - -cancelled_write_bytes ---------------------- - -The big inaccuracy here is truncate. If a process writes 1MB to a file and -then deletes the file, it will in fact perform no writeout. But it will have -been accounted as having caused 1MB of write. 
-In other words: The number of bytes which this process caused to not happen, -by truncating pagecache. A task can cause "negative" IO too. If this task -truncates some dirty pagecache, some IO which another task has been accounted -for (in its write_bytes) will not be happening. We _could_ just subtract that -from the truncating task's write_bytes, but there is information loss in doing -that. - - -Note ----- - -At its current implementation state, this is a bit racy on 32-bit machines: if -process A reads process B's /proc/pid/io while process B is updating one of -those 64-bit counters, process A could see an intermediate result. - - -More information about this can be found within the taskstats documentation in -Documentation/accounting. - -3.4 /proc//coredump_filter - Core dump filtering settings ---------------------------------------------------------------- -When a process is dumped, all anonymous memory is written to a core file as -long as the size of the core file isn't limited. But sometimes we don't want -to dump some memory segments, for example, huge shared memory or DAX. -Conversely, sometimes we want to save file-backed memory segments into a core -file, not only the individual files. - -/proc//coredump_filter allows you to customize which memory segments -will be dumped when the process is dumped. coredump_filter is a bitmask -of memory types. If a bit of the bitmask is set, memory segments of the -corresponding memory type are dumped, otherwise they are not dumped. 
- -The following 9 memory types are supported: - - (bit 0) anonymous private memory - - (bit 1) anonymous shared memory - - (bit 2) file-backed private memory - - (bit 3) file-backed shared memory - - (bit 4) ELF header pages in file-backed private memory areas (it is - effective only if the bit 2 is cleared) - - (bit 5) hugetlb private memory - - (bit 6) hugetlb shared memory - - (bit 7) DAX private memory - - (bit 8) DAX shared memory - - Note that MMIO pages such as frame buffer are never dumped and vDSO pages - are always dumped regardless of the bitmask status. - - Note that bits 0-4 don't affect hugetlb or DAX memory. hugetlb memory is - only affected by bit 5-6, and DAX is only affected by bits 7-8. - -The default value of coredump_filter is 0x33; this means all anonymous memory -segments, ELF header pages and hugetlb private memory are dumped. - -If you don't want to dump all shared memory segments attached to pid 1234, -write 0x31 to the process's proc file. - - $ echo 0x31 > /proc/1234/coredump_filter - -When a new process is created, the process inherits the bitmask status from its -parent. It is useful to set up coredump_filter before the program runs. 
-For example: - - $ echo 0x7 > /proc/self/coredump_filter - $ ./some_program - -3.5 /proc//mountinfo - Information about mounts --------------------------------------------------------- - -This file contains lines of the form: - -36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue -(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) - -(1) mount ID: unique identifier of the mount (may be reused after umount) -(2) parent ID: ID of parent (or of self for the top of the mount tree) -(3) major:minor: value of st_dev for files on filesystem -(4) root: root of the mount within the filesystem -(5) mount point: mount point relative to the process's root -(6) mount options: per mount options -(7) optional fields: zero or more fields of the form "tag[:value]" -(8) separator: marks the end of the optional fields -(9) filesystem type: name of filesystem of the form "type[.subtype]" -(10) mount source: filesystem specific information or "none" -(11) super options: per super block options - -Parsers should ignore all unrecognised optional fields. Currently the -possible optional fields are: - -shared:X mount is shared in peer group X -master:X mount is slave to peer group X -propagate_from:X mount is slave and receives propagation from peer group X (*) -unbindable mount is unbindable - -(*) X is the closest dominant peer group under the process's root. If -X is the immediate master of the mount, or if there's no dominant peer -group under the same root, then only the "master:X" field is present -and not the "propagate_from:X" field. - -For more information on mount propagation see: - - Documentation/filesystems/sharedsubtree.txt - - -3.6 /proc//comm & /proc//task//comm --------------------------------------------------------- -These files provide a method to access a tasks comm value. It also allows for -a task to set its own or one of its thread siblings comm value. 
The comm value -is limited in size compared to the cmdline value, so writing anything longer -then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated -comm value. - - -3.7 /proc//task//children - Information about task children -------------------------------------------------------------------------- -This file provides a fast way to retrieve first level children pids -of a task pointed by / pair. The format is a space separated -stream of pids. - -Note the "first level" here -- if a child has own children they will -not be listed here, one needs to read /proc//task//children -to obtain the descendants. - -Since this interface is intended to be fast and cheap it doesn't -guarantee to provide precise results and some children might be -skipped, especially if they've exited right after we printed their -pids, so one need to either stop or freeze processes being inspected -if precise results are needed. - - -3.8 /proc//fdinfo/ - Information about opened file ---------------------------------------------------------------- -This file provides information associated with an opened file. The regular -files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos' -represents the current offset of the opened file in decimal form [see lseek(2) -for details], 'flags' denotes the octal O_xxx mask the file has been -created with [see open(2) for details] and 'mnt_id' represents mount ID of -the file system containing the opened file [see 3.5 /proc//mountinfo -for details]. - -A typical output is - - pos: 0 - flags: 0100002 - mnt_id: 19 - -All locks associated with a file descriptor are shown in its fdinfo too. - -lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF - -The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags -pair provide additional information particular to the objects they represent. 
- - Eventfd files - ~~~~~~~~~~~~~ - pos: 0 - flags: 04002 - mnt_id: 9 - eventfd-count: 5a - - where 'eventfd-count' is hex value of a counter. - - Signalfd files - ~~~~~~~~~~~~~~ - pos: 0 - flags: 04002 - mnt_id: 9 - sigmask: 0000000000000200 - - where 'sigmask' is hex value of the signal mask associated - with a file. - - Epoll files - ~~~~~~~~~~~ - pos: 0 - flags: 02 - mnt_id: 9 - tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 - - where 'tfd' is a target file descriptor number in decimal form, - 'events' is events mask being watched and the 'data' is data - associated with a target [see epoll(7) for more details]. - - The 'pos' is current offset of the target file in decimal form - [see lseek(2)], 'ino' and 'sdev' are inode and device numbers - where target file resides, all in hex format. - - Fsnotify files - ~~~~~~~~~~~~~~ - For inotify files the format is the following - - pos: 0 - flags: 02000000 - inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d - - where 'wd' is a watch descriptor in decimal form, ie a target file - descriptor number, 'ino' and 'sdev' are inode and device where the - target file resides and the 'mask' is the mask of events, all in hex - form [see inotify(7) for more details]. - - If the kernel was built with exportfs support, the path to the target - file is encoded as a file handle. The file handle is provided by three - fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex - format. - - If the kernel is built without exportfs support the file handle won't be - printed out. - - If there is no inotify mark attached yet the 'inotify' line will be omitted. 
- - For fanotify files the format is - - pos: 0 - flags: 02 - mnt_id: 9 - fanotify flags:10 event-flags:0 - fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 - fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 - - where fanotify 'flags' and 'event-flags' are values used in fanotify_init - call, 'mnt_id' is the mount point identifier, 'mflags' is the value of - flags associated with mark which are tracked separately from events - mask. 'ino', 'sdev' are target inode and device, 'mask' is the events - mask and 'ignored_mask' is the mask of events which are to be ignored. - All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask' - does provide information about flags and mask used in fanotify_mark - call [see fsnotify manpage for details]. - - While the first three lines are mandatory and always printed, the rest is - optional and may be omitted if no marks created yet. - - Timerfd files - ~~~~~~~~~~~~~ - - pos: 0 - flags: 02 - mnt_id: 9 - clockid: 0 - ticks: 0 - settime flags: 01 - it_value: (0, 49406829) - it_interval: (1, 0) - - where 'clockid' is the clock type and 'ticks' is the number of the timer expirations - that have occurred [see timerfd_create(2) for details]. 'settime flags' are - flags in octal form been used to setup the timer [see timerfd_settime(2) for - details]. 'it_value' is remaining time until the timer exiration. - 'it_interval' is the interval for the timer. Note the timer might be set up - with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value' - still exhibits timer's remaining time. - -3.9 /proc//map_files - Information about memory mapped files ---------------------------------------------------------------------- -This directory contains symbolic links which represent memory mapped files -the process is maintaining. 
Example output: - - | lr-------- 1 root root 64 Jan 27 11:24 333c600000-333c620000 -> /usr/lib64/ld-2.18.so - | lr-------- 1 root root 64 Jan 27 11:24 333c81f000-333c820000 -> /usr/lib64/ld-2.18.so - | lr-------- 1 root root 64 Jan 27 11:24 333c820000-333c821000 -> /usr/lib64/ld-2.18.so - | ... - | lr-------- 1 root root 64 Jan 27 11:24 35d0421000-35d0422000 -> /usr/lib64/libselinux.so.1 - | lr-------- 1 root root 64 Jan 27 11:24 400000-41a000 -> /usr/bin/ls - -The name of a link represents the virtual memory bounds of a mapping, i.e. -vm_area_struct::vm_start-vm_area_struct::vm_end. - -The main purpose of the map_files is to retrieve a set of memory mapped -files in a fast way instead of parsing /proc//maps or -/proc//smaps, both of which contain many more records. At the same -time one can open(2) mappings from the listings of two processes and -comparing their inode numbers to figure out which anonymous memory areas -are actually shared. - -3.10 /proc//timerslack_ns - Task timerslack value ---------------------------------------------------------- -This file provides the value of the task's timerslack value in nanoseconds. -This value specifies a amount of time that normal timers may be deferred -in order to coalesce timers and avoid unnecessary wakeups. - -This allows a task's interactivity vs power consumption trade off to be -adjusted. - -Writing 0 to the file will set the tasks timerslack to the default value. - -Valid values are from 0 - ULLONG_MAX - -An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level -permissions on the task specified to change its timerslack_ns value. - -3.11 /proc//patch_state - Livepatch patch operation state ------------------------------------------------------------------ -When CONFIG_LIVEPATCH is enabled, this file displays the value of the -patch state for the task. - -A value of '-1' indicates that no patch is in transition. 
- -A value of '0' indicates that a patch is in transition and the task is -unpatched. If the patch is being enabled, then the task hasn't been -patched yet. If the patch is being disabled, then the task has already -been unpatched. - -A value of '1' indicates that a patch is in transition and the task is -patched. If the patch is being enabled, then the task has already been -patched. If the patch is being disabled, then the task hasn't been -unpatched yet. - -3.12 /proc//arch_status - task architecture specific status -------------------------------------------------------------------- -When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the -architecture specific status of the task. - -Example -------- - $ cat /proc/6753/arch_status - AVX512_elapsed_ms: 8 - -Description ------------ - -x86 specific entries: ---------------------- - AVX512_elapsed_ms: - ------------------ - If AVX512 is supported on the machine, this entry shows the milliseconds - elapsed since the last time AVX512 usage was recorded. The recording - happens on a best effort basis when a task is scheduled out. This means - that the value depends on two factors: - - 1) The time which the task spent on the CPU without being scheduled - out. With CPU isolation and a single runnable task this can take - several seconds. - - 2) The time since the task was scheduled out last. Depending on the - reason for being scheduled out (time slice exhausted, syscall ...) - this can be arbitrary long time. - - As a consequence the value cannot be considered precise and authoritative - information. The application which uses this information has to be aware - of the overall scenario on the system in order to determine whether a - task is a real AVX512 user or not. Precise information can be obtained - with performance counters. 
- - A special value of '-1' indicates that no AVX512 usage was recorded, thus - the task is unlikely an AVX512 user, but depends on the workload and the - scheduling scenario, it also could be a false negative mentioned above. - ------------------------------------------------------------------------------- -Configuring procfs ------------------------------------------------------------------------------- - -4.1 Mount options ---------------------- - -The following mount options are supported: - - hidepid= Set /proc/<pid>/ access mode. - gid= Set the group authorized to learn processes information. - -hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories -(default). - -hidepid=1 means users may not access any /proc/<pid>/ directories but their -own. Sensitive files like cmdline, sched*, status are now protected against -other users. This makes it impossible to learn whether any user runs -specific program (given the program doesn't reveal itself by its behaviour). -As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users, -poorly written programs passing sensitive information via program arguments are -now protected against local eavesdroppers. - -hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other -users. It doesn't mean that it hides a fact whether a process with a specific -pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"), -but it hides process' uid and gid, which may be learned by stat()'ing -/proc/<pid>/ otherwise. It greatly complicates an intruder's task of gathering -information about running processes, whether some daemon runs with elevated -privileges, whether other user runs some sensitive program, whether other users -run any program at all, etc. - -gid= defines a group authorized to learn processes information otherwise -prohibited by hidepid=. If you use some daemon like identd which needs to learn -information about processes information, just add identd to this group.
-- cgit From d5eefa2c5e567751df74d38d5b8cec7ed6e7a08c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:19 +0100 Subject: docs: filesystems: convert qnx6.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/ccd22c1e1426ce4cb30ece9a71c39ebb41844762.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/qnx6.rst | 196 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/qnx6.txt | 174 -------------------------------- 3 files changed, 197 insertions(+), 174 deletions(-) create mode 100644 Documentation/filesystems/qnx6.rst delete mode 100644 Documentation/filesystems/qnx6.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 671906e2fee6..08883a481a76 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -82,5 +82,6 @@ Documentation for filesystem implementations. orangefs overlayfs proc + qnx6 virtiofs vfat diff --git a/Documentation/filesystems/qnx6.rst b/Documentation/filesystems/qnx6.rst new file mode 100644 index 000000000000..b71308314070 --- /dev/null +++ b/Documentation/filesystems/qnx6.rst @@ -0,0 +1,196 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +The QNX6 Filesystem +=================== + +The qnx6fs is used by newer QNX operating system versions. (e.g. Neutrino) +It got introduced in QNX 6.4.0 and is used default since 6.4.1. + +Option +====== + +mmi_fs Mount filesystem as used for example by Audi MMI 3G system + +Specification +============= + +qnx6fs shares many properties with traditional Unix filesystems. It has the +concepts of blocks, inodes and directories. + +On QNX it is possible to create little endian and big endian qnx6 filesystems. 
+This feature makes it possible to create and use a different endianness fs +for the target (QNX is used on quite a range of embedded systems) platform +running on a different endianness. + +The Linux driver handles endianness transparently. (LE and BE) + +Blocks +------ + +The space in the device or file is split up into blocks. These are a fixed +size of 512, 1024, 2048 or 4096, which is decided when the filesystem is +created. + +Blockpointers are 32bit, so the maximum space that can be addressed is +2^32 * 4096 bytes or 16TB + +The superblocks +--------------- + +The superblock contains all global information about the filesystem. +Each qnx6fs got two superblocks, each one having a 64bit serial number. +That serial number is used to identify the "active" superblock. +In write mode with each new snapshot (after each synchronous write), the +serial of the new master superblock is increased (old superblock serial + 1) + +So basically the snapshot functionality is realized by an atomic final +update of the serial number. Before updating that serial, all modifications +are done by copying all modified blocks during that specific write request +(or period) and building up a new (stable) filesystem structure under the +inactive superblock. + +Each superblock holds a set of root inodes for the different filesystem +parts. (Inode, Bitmap and Longfilenames) +Each of these root nodes holds information like total size of the stored +data and the addressing levels in that specific tree. +If the level value is 0, up to 16 direct blocks can be addressed by each +node. + +Level 1 adds an additional indirect addressing level where each indirect +addressing block holds up to blocksize / 4 bytes pointers to data blocks. +Level 2 adds an additional indirect addressing block level (so, already up +to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree). + +Unused block pointers are always set to ~0 - regardless of root node, +indirect addressing blocks or inodes.
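The addressing levels described above can be sketched as a small calculation. This is an illustration only, not code from the Linux qnx6 driver: the function name `max_addressable_blocks` is hypothetical, and it assumes 4-byte (32bit) block pointers, so that one indirect addressing block holds blocksize / 4 of them. Under that assumption the 16 * 256 * 256 figure in the text corresponds to a 1024-byte block size.

```python
# Hypothetical helper, not part of any qnx6 tool or of the kernel driver.
VALID_BLOCKSIZES = (512, 1024, 2048, 4096)  # sizes listed in the text
DIRECT_POINTERS = 16                        # pointers held by a root node
POINTER_SIZE = 4                            # assumed: 32bit block pointers


def max_addressable_blocks(levels: int, blocksize: int = 1024) -> int:
    """Upper bound on data blocks reachable from one root node.

    levels == 0 means the 16 root pointers address data blocks directly;
    every further level multiplies the reach by the number of pointers
    that fit into one indirect addressing block.
    """
    if levels < 0:
        raise ValueError("levels must be >= 0")
    if blocksize not in VALID_BLOCKSIZES:
        raise ValueError("unsupported block size")
    pointers_per_block = blocksize // POINTER_SIZE
    return DIRECT_POINTERS * pointers_per_block ** levels


if __name__ == "__main__":
    for level in range(3):
        print(level, max_addressable_blocks(level, 1024))
```

With 4096-byte blocks, a level 1 tree would reach 16 * 1024 = 16384 blocks under the same assumptions.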
+ +Data leaves are always on the lowest level. So no data is stored on upper +tree levels. + +The first Superblock is located at 0x2000. (0x2000 is the bootblock size) +The Audi MMI 3G first superblock directly starts at byte 0. + +Second superblock position can either be calculated from the superblock +information (total number of filesystem blocks) or by taking the highest +device address, zeroing the last 3 bytes and then subtracting 0x1000 from +that address. + +0x1000 is the size reserved for each superblock - regardless of the +blocksize of the filesystem. + +Inodes +------ + +Each object in the filesystem is represented by an inode. (index node) +The inode structure contains pointers to the filesystem blocks which contain +the data held in the object and all of the metadata about an object except +its longname. (filenames longer than 27 characters) +The metadata about an object includes the permissions, owner, group, flags, +size, number of blocks used, access time, change time and modification time. + +Object mode field is POSIX format. (which makes things easier) + +There are also pointers to the first 16 blocks, if the object data can be +addressed with 16 direct blocks. + +For more than 16 blocks an indirect addressing in form of another tree is +used. (scheme is the same as the one used for the superblock root nodes) + +The filesize is stored 64bit. Inode counting starts with 1. (while long +filename inodes start with 0) + +Directories +----------- + +A directory is a filesystem object and has an inode just like a file. +It is a specially formatted file containing records which associate each +name with an inode number. + +'.' inode number points to the directory inode + +'..' inode number points to the parent directory inode + +Each filename record additionally got a filename length field. + +One special case are long filenames or subdirectory names.
+ +These got set a filename length field of 0xff in the corresponding directory +record plus the longfile inode number also stored in that record. + +With that longfilename inode number, the longfilename tree can be walked +starting with the superblock longfilename root node pointers. + +Special files +------------- + +Symbolic links are also filesystem objects with inodes. They got a specific +bit in the inode mode field identifying them as symbolic link. + +The directory entry file inode pointer points to the target file inode. + +Hard links got an inode, a directory entry, but a specific mode bit set, +no block pointers and the directory file record pointing to the target file +inode. + +Character and block special devices do not exist in QNX as those files +are handled by the QNX kernel/drivers and created in /dev independent of the +underlying filesystem. + +Long filenames +-------------- + +Long filenames are stored in a separate addressing tree. The starting point +is the longfilename root node in the active superblock. + +Each data block (tree leaves) holds one long filename. That filename is +limited to 510 bytes. The first two starting bytes are used as length field +for the actual filename. + +If that structure shall fit for all allowed blocksizes, it is clear why there +is a limit of 510 bytes for the actual filename stored. + +Bitmap +------ + +The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap +root node in the superblock and each bit in the bitmap represents one +filesystem block. + +The first block is block 0, which starts 0x1000 after superblock start. +So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical +address at which block 0 is located. + +Bits at the end of the last bitmap block are set to 1, if the device is +smaller than addressing space in the bitmap. + +Bitmap system area +------------------ + +The bitmap itself is divided into three parts. + +First the system area, that is split into two halves.
+ +Then userspace. + +The requirement for a static, fixed preallocated system area comes from how +qnx6fs deals with writes. + +Each superblock got its own half of the system area. So superblock #1 +always uses blocks from the lower half while superblock #2 just writes to +blocks represented by the upper half bitmap system area bits. + +Bitmap blocks, Inode blocks and indirect addressing blocks for those two +tree structures are treated as system blocks. + +The rationale behind that is that a write request can work on a new snapshot +(system area of the inactive - resp. lower serial numbered superblock) while +at the same time there is still a complete stable filesystem structure in the +other half of the system area. + +When finished with writing (a sync write is completed, the maximum sync leap +time or a filesystem sync is requested), serial of the previously inactive +superblock atomically is increased and the fs switches over to that - then +stable declared - superblock. + +For all data outside the system area, blocks are just copied while writing. diff --git a/Documentation/filesystems/qnx6.txt b/Documentation/filesystems/qnx6.txt deleted file mode 100644 index 48ea68f15845..000000000000 --- a/Documentation/filesystems/qnx6.txt +++ /dev/null @@ -1,174 +0,0 @@ -The QNX6 Filesystem -=================== - -The qnx6fs is used by newer QNX operating system versions. (e.g. Neutrino) -It got introduced in QNX 6.4.0 and is used default since 6.4.1. - -Option -====== - -mmi_fs Mount filesystem as used for example by Audi MMI 3G system - -Specification -============= - -qnx6fs shares many properties with traditional Unix filesystems. It has the -concepts of blocks, inodes and directories. -On QNX it is possible to create little endian and big endian qnx6 filesystems. -This feature makes it possible to create and use a different endianness fs -for the target (QNX is used on quite a range of embedded systems) platform -running on a different endianness.
-The Linux driver handles endianness transparently. (LE and BE) - -Blocks ------- - -The space in the device or file is split up into blocks. These are a fixed -size of 512, 1024, 2048 or 4096, which is decided when the filesystem is -created. -Blockpointers are 32bit, so the maximum space that can be addressed is -2^32 * 4096 bytes or 16TB - -The superblocks ---------------- - -The superblock contains all global information about the filesystem. -Each qnx6fs got two superblocks, each one having a 64bit serial number. -That serial number is used to identify the "active" superblock. -In write mode with reach new snapshot (after each synchronous write), the -serial of the new master superblock is increased (old superblock serial + 1) - -So basically the snapshot functionality is realized by an atomic final -update of the serial number. Before updating that serial, all modifications -are done by copying all modified blocks during that specific write request -(or period) and building up a new (stable) filesystem structure under the -inactive superblock. - -Each superblock holds a set of root inodes for the different filesystem -parts. (Inode, Bitmap and Longfilenames) -Each of these root nodes holds information like total size of the stored -data and the addressing levels in that specific tree. -If the level value is 0, up to 16 direct blocks can be addressed by each -node. -Level 1 adds an additional indirect addressing level where each indirect -addressing block holds up to blocksize / 4 bytes pointers to data blocks. -Level 2 adds an additional indirect addressing block level (so, already up -to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree). - -Unused block pointers are always set to ~0 - regardless of root node, -indirect addressing blocks or inodes. -Data leaves are always on the lowest level. So no data is stored on upper -tree levels. - -The first Superblock is located at 0x2000. 
(0x2000 is the bootblock size) -The Audi MMI 3G first superblock directly starts at byte 0. -Second superblock position can either be calculated from the superblock -information (total number of filesystem blocks) or by taking the highest -device address, zeroing the last 3 bytes and then subtracting 0x1000 from -that address. - -0x1000 is the size reserved for each superblock - regardless of the -blocksize of the filesystem. - -Inodes ------- - -Each object in the filesystem is represented by an inode. (index node) -The inode structure contains pointers to the filesystem blocks which contain -the data held in the object and all of the metadata about an object except -its longname. (filenames longer than 27 characters) -The metadata about an object includes the permissions, owner, group, flags, -size, number of blocks used, access time, change time and modification time. - -Object mode field is POSIX format. (which makes things easier) - -There are also pointers to the first 16 blocks, if the object data can be -addressed with 16 direct blocks. -For more than 16 blocks an indirect addressing in form of another tree is -used. (scheme is the same as the one used for the superblock root nodes) - -The filesize is stored 64bit. Inode counting starts with 1. (while long -filename inodes start with 0) - -Directories ------------ - -A directory is a filesystem object and has an inode just like a file. -It is a specially formatted file containing records which associate each -name with an inode number. -'.' inode number points to the directory inode -'..' inode number points to the parent directory inode -Eeach filename record additionally got a filename length field. - -One special case are long filenames or subdirectory names. -These got set a filename length field of 0xff in the corresponding directory -record plus the longfile inode number also stored in that record. 
-With that longfilename inode number, the longfilename tree can be walked -starting with the superblock longfilename root node pointers. - -Special files -------------- - -Symbolic links are also filesystem objects with inodes. They got a specific -bit in the inode mode field identifying them as symbolic link. -The directory entry file inode pointer points to the target file inode. - -Hard links got an inode, a directory entry, but a specific mode bit set, -no block pointers and the directory file record pointing to the target file -inode. - -Character and block special devices do not exist in QNX as those files -are handled by the QNX kernel/drivers and created in /dev independent of the -underlaying filesystem. - -Long filenames --------------- - -Long filenames are stored in a separate addressing tree. The staring point -is the longfilename root node in the active superblock. -Each data block (tree leaves) holds one long filename. That filename is -limited to 510 bytes. The first two starting bytes are used as length field -for the actual filename. -If that structure shall fit for all allowed blocksizes, it is clear why there -is a limit of 510 bytes for the actual filename stored. - -Bitmap ------- - -The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap -root node in the superblock and each bit in the bitmap represents one -filesystem block. -The first block is block 0, which starts 0x1000 after superblock start. -So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical -address at which block 0 is located. - -Bits at the end of the last bitmap block are set to 1, if the device is -smaller than addressing space in the bitmap. - -Bitmap system area ------------------- - -The bitmap itself is divided into three parts. -First the system area, that is split into two halves. -Then userspace. - -The requirement for a static, fixed preallocated system area comes from how -qnx6fs deals with writes. 
-Each superblock got it's own half of the system area. So superblock #1 -always uses blocks from the lower half while superblock #2 just writes to -blocks represented by the upper half bitmap system area bits. - -Bitmap blocks, Inode blocks and indirect addressing blocks for those two -tree structures are treated as system blocks. - -The rational behind that is that a write request can work on a new snapshot -(system area of the inactive - resp. lower serial numbered superblock) while -at the same time there is still a complete stable filesystem structer in the -other half of the system area. - -When finished with writing (a sync write is completed, the maximum sync leap -time or a filesystem sync is requested), serial of the previously inactive -superblock atomically is increased and the fs switches over to that - then -stable declared - superblock. - -For all data outside the system area, blocks are just copied while writing. -- cgit From 8979fc9a282441d086ead589528c711d9df3d94a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:20 +0100 Subject: docs: filesystems: convert ramfs-rootfs-initramfs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Use notes markups; - Add lists markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/89cbcc99a6371f3bff3ea1668fe497e8a15c226b.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + .../filesystems/ramfs-rootfs-initramfs.rst | 369 +++++++++++++++++++++ .../filesystems/ramfs-rootfs-initramfs.txt | 359 -------------------- 3 files changed, 370 insertions(+), 359 deletions(-) create mode 100644 Documentation/filesystems/ramfs-rootfs-initramfs.rst delete mode 100644 Documentation/filesystems/ramfs-rootfs-initramfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 08883a481a76..b8689d082911 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -83,5 +83,6 @@ Documentation for filesystem implementations. overlayfs proc qnx6 + ramfs-rootfs-initramfs virtiofs vfat diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst new file mode 100644 index 000000000000..6c576e241d86 --- /dev/null +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst @@ -0,0 +1,369 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +Ramfs, rootfs and initramfs +=========================== + +October 17, 2005 + +Rob Landley +============================= + +What is ramfs? +-------------- + +Ramfs is a very simple filesystem that exports Linux's disk caching +mechanisms (the page cache and dentry cache) as a dynamically resizable +RAM-based filesystem. + +Normally all files are cached in memory by Linux. Pages of data read from +backing store (usually the block device the filesystem is mounted on) are kept +around in case it's needed again, but marked as clean (freeable) in case the +Virtual Memory system needs the memory for something else. 
Similarly, data +written to files is marked clean as soon as it has been written to backing +store, but kept around for caching purposes until the VM reallocates the +memory. A similar mechanism (the dentry cache) greatly speeds up access to +directories. + +With ramfs, there is no backing store. Files written into ramfs allocate +dentries and page cache as usual, but there's nowhere to write them to. +This means the pages are never marked clean, so they can't be freed by the +VM when it's looking to recycle memory. + +The amount of code required to implement ramfs is tiny, because all the +work is done by the existing Linux caching infrastructure. Basically, +you're mounting the disk cache as a filesystem. Because of this, ramfs is not +an optional component removable via menuconfig, since there would be negligible +space savings. + +ramfs and ramdisk: +------------------ + +The older "ram disk" mechanism created a synthetic block device out of +an area of RAM and used it as backing store for a filesystem. This block +device was of fixed size, so the filesystem mounted on it was of fixed +size. Using a ram disk also required unnecessarily copying memory from the +fake block device into the page cache (and copying changes back out), as well +as creating and destroying dentries. Plus it needed a filesystem driver +(such as ext2) to format and interpret this data. + +Compared to ramfs, this wastes memory (and memory bus bandwidth), creates +unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks +to avoid this copying by playing with the page tables, but they're unpleasantly +complicated and turn out to be about as expensive as the copying anyway.) +More to the point, all the work ramfs is doing has to happen _anyway_, +since all file access goes through the page and dentry caches. The RAM +disk is simply unnecessary; ramfs is internally much simpler. 
+ +Another reason ramdisks are semi-obsolete is that the introduction of +loopback devices offered a more flexible and convenient way to create +synthetic block devices, now from files instead of from chunks of memory. +See losetup (8) for details. + +ramfs and tmpfs: +---------------- + +One downside of ramfs is you can keep writing data into it until you fill +up all memory, and the VM can't free it because the VM thinks that files +should get written to backing store (rather than swap space), but ramfs hasn't +got any backing store. Because of this, only root (or a trusted user) should +be allowed write access to a ramfs mount. + +A ramfs derivative called tmpfs was created to add size limits, and the ability +to write the data to swap space. Normal users can be allowed write access to +tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. + +What is rootfs? +--------------- + +Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is +always present in 2.6 systems. You can't unmount rootfs for approximately the +same reason you can't kill the init process; rather than having special code +to check for and handle an empty list, it's smaller and simpler for the kernel +to just make sure certain lists can't become empty. + +Most systems just mount another filesystem over rootfs and ignore it. The +amount of space an empty instance of ramfs takes up is tiny. + +If CONFIG_TMPFS is enabled, rootfs will use tmpfs instead of ramfs by +default. To force ramfs, add "rootfstype=ramfs" to the kernel command +line. + +What is initramfs? +------------------ + +All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is +extracted into rootfs when the kernel boots up. After extracting, the kernel +checks to see if rootfs contains a file "init", and if so it executes it as PID +1. 
If found, this init process is responsible for bringing the system the +rest of the way up, including locating and mounting the real root device (if +any). If rootfs does not contain an init program after the embedded cpio +archive is extracted into it, the kernel will fall through to the older code +to locate and mount a root partition, then exec some variant of /sbin/init +out of that. + +All this differs from the old initrd in several ways: + + - The old initrd was always a separate file, while the initramfs archive is + linked into the linux kernel image. (The directory ``linux-*/usr`` is + devoted to generating this archive during the build.) + + - The old initrd file was a gzipped filesystem image (in some file format, + such as ext2, that needed a driver built into the kernel), while the new + initramfs archive is a gzipped cpio archive (like tar only simpler, + see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). + The kernel's cpio extraction code is not only extremely small, it's also + __init text and data that can be discarded during the boot process. + + - The program run by the old initrd (which was called /initrd, not /init) did + some setup and then returned to the kernel, while the init program from + initramfs is not expected to return to the kernel. (If /init needs to hand + off control it can overmount / with a new root device and exec another init + program. See the switch_root utility, below.) + + - When switching another root device, initrd would pivot_root and then + umount the ramdisk. But initramfs is rootfs: you can neither pivot_root + rootfs, nor unmount it. Instead delete everything out of rootfs to + free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs + with the new root (cd /newmount; mount --move . /; chroot .), attach + stdin/stdout/stderr to the new /dev/console, and exec the new init. 
+ + Since this is a remarkably persnickety process (and involves deleting + commands before you can run them), the klibc package introduced a helper + program (utils/run_init.c) to do all this for you. Most other packages + (such as busybox) have named this command "switch_root". + +Populating initramfs: +--------------------- + +The 2.6 kernel build process always creates a gzipped cpio format initramfs +archive and links it into the resulting kernel binary. By default, this +archive is empty (consuming 134 bytes on x86). + +The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig, +and living in usr/Kconfig) can be used to specify a source for the +initramfs archive, which will automatically be incorporated into the +resulting binary. This option can point to an existing gzipped cpio +archive, a directory containing files to be archived, or a text file +specification such as the following example:: + + dir /dev 755 0 0 + nod /dev/console 644 0 0 c 5 1 + nod /dev/loop0 644 0 0 b 7 0 + dir /bin 755 1000 1000 + slink /bin/sh busybox 777 0 0 + file /bin/busybox initramfs/busybox 755 0 0 + dir /proc 755 0 0 + dir /sys 755 0 0 + dir /mnt 755 0 0 + file /init initramfs/init.sh 755 0 0 + +Run "usr/gen_init_cpio" (after the kernel build) to get a usage message +documenting the above file format. + +One advantage of the configuration file is that root access is not required to +set permissions or create device nodes in the new archive. (Note that those +two example "file" entries expect to find files named "init.sh" and "busybox" in +a directory called "initramfs", under the linux-2.6.* directory. See +Documentation/driver-api/early-userspace/early_userspace_support.rst for more details.) + +The kernel does not depend on external cpio tools. 
If you specify a +directory instead of a configuration file, the kernel's build infrastructure +creates a configuration file from that directory (usr/Makefile calls +usr/gen_initramfs_list.sh), and proceeds to package up that directory +using the config file (by feeding it to usr/gen_init_cpio, which is created +from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is +entirely self-contained, and the kernel's boot-time extractor is also +(obviously) self-contained. + +The one thing you might need external cpio utilities installed for is creating +or extracting your own preprepared cpio files to feed to the kernel build +(instead of a config file or directory). + +The following command line can extract a cpio image (either by the above script +or by the kernel build) back into its component files:: + + cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames + +The following shell script can create a prebuilt cpio archive you can +use in place of the above config file:: + + #!/bin/sh + + # Copyright 2006 Rob Landley and TimeSys Corporation. + # Licensed under GPL version 2 + + if [ $# -ne 2 ] + then + echo "usage: mkinitramfs directory imagename.cpio.gz" + exit 1 + fi + + if [ -d "$1" ] + then + echo "creating $2 from $1" + (cd "$1"; find . | cpio -o -H newc | gzip) > "$2" + else + echo "First argument must be a directory" + exit 1 + fi + +.. Note:: + + The cpio man page contains some bad advice that will break your initramfs + archive if you follow it. It says "A typical way to generate the list + of filenames is with the find command; you should give find the -depth + option to minimize problems with permissions on directories that are + unwritable or not searchable." Don't do this when creating + initramfs.cpio.gz images, it won't work. The Linux kernel cpio extractor + won't create files in a directory that doesn't exist, so the directory + entries must go before the files that go in those directories. 
+ The above script gets them in the right order. + +External initramfs images: +-------------------------- + +If the kernel has initrd support enabled, an external cpio.gz archive can also +be passed into a 2.6 kernel in place of an initrd. In this case, the kernel +will autodetect the type (initramfs, not initrd) and extract the external cpio +archive into rootfs before trying to run /init. + +This has the memory efficiency advantages of initramfs (no ramdisk block +device) but the separate packaging of initrd (which is nice if you have +non-GPL code you'd like to run from initramfs, without conflating it with +the GPL licensed Linux kernel binary). + +It can also be used to supplement the kernel's built-in initramfs image. The +files in the external archive will overwrite any conflicting files in +the built-in initramfs archive. Some distributors also prefer to customize +a single kernel image with task-specific initramfs images, without recompiling. + +Contents of initramfs: +---------------------- + +An initramfs archive is a complete self-contained root filesystem for Linux. +If you don't already understand what shared libraries, devices, and paths +you need to get a minimal root filesystem up and running, here are some +references: + +- http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ +- http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html +- http://www.linuxfromscratch.org/lfs/view/stable/ + +The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is +designed to be a tiny C library to statically link early userspace +code against, along with some related utilities. It is BSD licensed. + +I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) +myself. These are LGPL and GPL, respectively. (A self-contained initramfs +package is planned for the busybox 1.3 release.) + +In theory you could use glibc, but that's not well suited for small embedded +uses like this. 
(A "hello world" program statically linked against glibc is +over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do +name lookups, even when otherwise statically linked.) + +A good first step is to get initramfs to run a statically linked "hello world" +program as init, and test it under an emulator like qemu (www.qemu.org) or +User Mode Linux, like so:: + + cat > hello.c << EOF + #include <stdio.h> + #include <unistd.h> + + int main(int argc, char *argv[]) + { + printf("Hello world!\n"); + sleep(999999999); + } + EOF + gcc -static hello.c -o init + echo init | cpio -o -H newc | gzip > test.cpio.gz + # Testing external initramfs using the initrd loading mechanism. + qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero + +When debugging a normal root filesystem, it's nice to be able to boot with +"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's +just as useful. + +Why cpio rather than tar? +------------------------- + +This decision was made back in December, 2001. The discussion started here: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html + +And spawned a second thread (specifically on tar vs cpio), starting here: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html + +The quick and dirty summary version (which is no substitute for reading +the above threads) is: + +1) cpio is a standard. It's decades old (from the AT&T days), and already + widely used on Linux (inside RPM, Red Hat's device driver disks). Here's + a Linux Journal article about it from 1996: + + http://www.linuxjournal.com/article/1213 + + It's not as popular as tar because the traditional cpio command line tools + require _truly_hideous_ command line arguments.
But that says nothing + either way about the archive format, and there are alternative tools, + such as: + + http://freecode.com/projects/afio + +2) The cpio archive format chosen by the kernel is simpler and cleaner (and + thus easier to create and parse) than any of the (literally dozens of) + various tar archive formats. The complete initramfs archive format is + explained in buffer-format.txt, created in usr/gen_init_cpio.c, and + extracted in init/initramfs.c. All three together come to less than 26k + total of human-readable text. + +3) The GNU project standardizing on tar is approximately as relevant as + Windows standardizing on zip. Linux is not part of either, and is free + to make its own technical decisions. + +4) Since this is a kernel internal format, it could easily have been + something brand new. The kernel provides its own tools to create and + extract this format anyway. Using an existing standard was preferable, + but not essential. + +5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be + supported on the kernel side"): + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html + + explained his reasoning: + + - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html + - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html + + and, most importantly, designed and implemented the initramfs code. + +Future directions: +------------------ + +Today (2.6.16), initramfs is always compiled in, but not always used. The +kernel falls back to legacy boot code that is reached only if initramfs does +not contain an /init program. The fallback is legacy code, there to ensure a +smooth transition and allowing early boot functionality to gradually move to +"early userspace" (I.E. initramfs). + +The move to early userspace is necessary because finding and mounting the real +root device is complex. Root partitions can span multiple devices (raid or +separate journal). 
They can be out on the network (requiring dhcp, setting a +specific MAC address, logging into a server, etc). They can live on removable +media, with dynamically allocated major/minor numbers and persistent naming +issues requiring a full udev implementation to sort out. They can be +compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, +and so on. + +This kind of complexity (which inevitably includes policy) is rightly handled +in userspace. Both klibc and busybox/uClibc are working on simple initramfs +packages to drop into a kernel build. + +The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree. +The kernel's current early boot code (partition detection, etc) will probably +be migrated into a default initramfs, automatically created and used by the +kernel build. diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt deleted file mode 100644 index 97d42ccaa92d..000000000000 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt +++ /dev/null @@ -1,359 +0,0 @@ -ramfs, rootfs and initramfs -October 17, 2005 -Rob Landley -============================= - -What is ramfs? --------------- - -Ramfs is a very simple filesystem that exports Linux's disk caching -mechanisms (the page cache and dentry cache) as a dynamically resizable -RAM-based filesystem. - -Normally all files are cached in memory by Linux. Pages of data read from -backing store (usually the block device the filesystem is mounted on) are kept -around in case it's needed again, but marked as clean (freeable) in case the -Virtual Memory system needs the memory for something else. Similarly, data -written to files is marked clean as soon as it has been written to backing -store, but kept around for caching purposes until the VM reallocates the -memory. A similar mechanism (the dentry cache) greatly speeds up access to -directories. - -With ramfs, there is no backing store. 
Files written into ramfs allocate -dentries and page cache as usual, but there's nowhere to write them to. -This means the pages are never marked clean, so they can't be freed by the -VM when it's looking to recycle memory. - -The amount of code required to implement ramfs is tiny, because all the -work is done by the existing Linux caching infrastructure. Basically, -you're mounting the disk cache as a filesystem. Because of this, ramfs is not -an optional component removable via menuconfig, since there would be negligible -space savings. - -ramfs and ramdisk: ------------------- - -The older "ram disk" mechanism created a synthetic block device out of -an area of RAM and used it as backing store for a filesystem. This block -device was of fixed size, so the filesystem mounted on it was of fixed -size. Using a ram disk also required unnecessarily copying memory from the -fake block device into the page cache (and copying changes back out), as well -as creating and destroying dentries. Plus it needed a filesystem driver -(such as ext2) to format and interpret this data. - -Compared to ramfs, this wastes memory (and memory bus bandwidth), creates -unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks -to avoid this copying by playing with the page tables, but they're unpleasantly -complicated and turn out to be about as expensive as the copying anyway.) -More to the point, all the work ramfs is doing has to happen _anyway_, -since all file access goes through the page and dentry caches. The RAM -disk is simply unnecessary; ramfs is internally much simpler. - -Another reason ramdisks are semi-obsolete is that the introduction of -loopback devices offered a more flexible and convenient way to create -synthetic block devices, now from files instead of from chunks of memory. -See losetup (8) for details. 
- -ramfs and tmpfs: ----------------- - -One downside of ramfs is you can keep writing data into it until you fill -up all memory, and the VM can't free it because the VM thinks that files -should get written to backing store (rather than swap space), but ramfs hasn't -got any backing store. Because of this, only root (or a trusted user) should -be allowed write access to a ramfs mount. - -A ramfs derivative called tmpfs was created to add size limits, and the ability -to write the data to swap space. Normal users can be allowed write access to -tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. - -What is rootfs? ---------------- - -Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is -always present in 2.6 systems. You can't unmount rootfs for approximately the -same reason you can't kill the init process; rather than having special code -to check for and handle an empty list, it's smaller and simpler for the kernel -to just make sure certain lists can't become empty. - -Most systems just mount another filesystem over rootfs and ignore it. The -amount of space an empty instance of ramfs takes up is tiny. - -If CONFIG_TMPFS is enabled, rootfs will use tmpfs instead of ramfs by -default. To force ramfs, add "rootfstype=ramfs" to the kernel command -line. - -What is initramfs? ------------------- - -All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is -extracted into rootfs when the kernel boots up. After extracting, the kernel -checks to see if rootfs contains a file "init", and if so it executes it as PID -1. If found, this init process is responsible for bringing the system the -rest of the way up, including locating and mounting the real root device (if -any). If rootfs does not contain an init program after the embedded cpio -archive is extracted into it, the kernel will fall through to the older code -to locate and mount a root partition, then exec some variant of /sbin/init -out of that. 
- -All this differs from the old initrd in several ways: - - - The old initrd was always a separate file, while the initramfs archive is - linked into the linux kernel image. (The directory linux-*/usr is devoted - to generating this archive during the build.) - - - The old initrd file was a gzipped filesystem image (in some file format, - such as ext2, that needed a driver built into the kernel), while the new - initramfs archive is a gzipped cpio archive (like tar only simpler, - see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). The - kernel's cpio extraction code is not only extremely small, it's also - __init text and data that can be discarded during the boot process. - - - The program run by the old initrd (which was called /initrd, not /init) did - some setup and then returned to the kernel, while the init program from - initramfs is not expected to return to the kernel. (If /init needs to hand - off control it can overmount / with a new root device and exec another init - program. See the switch_root utility, below.) - - - When switching another root device, initrd would pivot_root and then - umount the ramdisk. But initramfs is rootfs: you can neither pivot_root - rootfs, nor unmount it. Instead delete everything out of rootfs to - free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs - with the new root (cd /newmount; mount --move . /; chroot .), attach - stdin/stdout/stderr to the new /dev/console, and exec the new init. - - Since this is a remarkably persnickety process (and involves deleting - commands before you can run them), the klibc package introduced a helper - program (utils/run_init.c) to do all this for you. Most other packages - (such as busybox) have named this command "switch_root". - -Populating initramfs: ---------------------- - -The 2.6 kernel build process always creates a gzipped cpio format initramfs -archive and links it into the resulting kernel binary. 
By default, this -archive is empty (consuming 134 bytes on x86). - -The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig, -and living in usr/Kconfig) can be used to specify a source for the -initramfs archive, which will automatically be incorporated into the -resulting binary. This option can point to an existing gzipped cpio -archive, a directory containing files to be archived, or a text file -specification such as the following example: - - dir /dev 755 0 0 - nod /dev/console 644 0 0 c 5 1 - nod /dev/loop0 644 0 0 b 7 0 - dir /bin 755 1000 1000 - slink /bin/sh busybox 777 0 0 - file /bin/busybox initramfs/busybox 755 0 0 - dir /proc 755 0 0 - dir /sys 755 0 0 - dir /mnt 755 0 0 - file /init initramfs/init.sh 755 0 0 - -Run "usr/gen_init_cpio" (after the kernel build) to get a usage message -documenting the above file format. - -One advantage of the configuration file is that root access is not required to -set permissions or create device nodes in the new archive. (Note that those -two example "file" entries expect to find files named "init.sh" and "busybox" in -a directory called "initramfs", under the linux-2.6.* directory. See -Documentation/driver-api/early-userspace/early_userspace_support.rst for more details.) - -The kernel does not depend on external cpio tools. If you specify a -directory instead of a configuration file, the kernel's build infrastructure -creates a configuration file from that directory (usr/Makefile calls -usr/gen_initramfs_list.sh), and proceeds to package up that directory -using the config file (by feeding it to usr/gen_init_cpio, which is created -from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is -entirely self-contained, and the kernel's boot-time extractor is also -(obviously) self-contained. 
- -The one thing you might need external cpio utilities installed for is creating -or extracting your own preprepared cpio files to feed to the kernel build -(instead of a config file or directory). - -The following command line can extract a cpio image (either by the above script -or by the kernel build) back into its component files: - - cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames - -The following shell script can create a prebuilt cpio archive you can -use in place of the above config file: - - #!/bin/sh - - # Copyright 2006 Rob Landley and TimeSys Corporation. - # Licensed under GPL version 2 - - if [ $# -ne 2 ] - then - echo "usage: mkinitramfs directory imagename.cpio.gz" - exit 1 - fi - - if [ -d "$1" ] - then - echo "creating $2 from $1" - (cd "$1"; find . | cpio -o -H newc | gzip) > "$2" - else - echo "First argument must be a directory" - exit 1 - fi - -Note: The cpio man page contains some bad advice that will break your initramfs -archive if you follow it. It says "A typical way to generate the list -of filenames is with the find command; you should give find the -depth option -to minimize problems with permissions on directories that are unwritable or not -searchable." Don't do this when creating initramfs.cpio.gz images, it won't -work. The Linux kernel cpio extractor won't create files in a directory that -doesn't exist, so the directory entries must go before the files that go in -those directories. The above script gets them in the right order. - -External initramfs images: --------------------------- - -If the kernel has initrd support enabled, an external cpio.gz archive can also -be passed into a 2.6 kernel in place of an initrd. In this case, the kernel -will autodetect the type (initramfs, not initrd) and extract the external cpio -archive into rootfs before trying to run /init. 
- -This has the memory efficiency advantages of initramfs (no ramdisk block -device) but the separate packaging of initrd (which is nice if you have -non-GPL code you'd like to run from initramfs, without conflating it with -the GPL licensed Linux kernel binary). - -It can also be used to supplement the kernel's built-in initramfs image. The -files in the external archive will overwrite any conflicting files in -the built-in initramfs archive. Some distributors also prefer to customize -a single kernel image with task-specific initramfs images, without recompiling. - -Contents of initramfs: ----------------------- - -An initramfs archive is a complete self-contained root filesystem for Linux. -If you don't already understand what shared libraries, devices, and paths -you need to get a minimal root filesystem up and running, here are some -references: -http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ -http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html -http://www.linuxfromscratch.org/lfs/view/stable/ - -The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is -designed to be a tiny C library to statically link early userspace -code against, along with some related utilities. It is BSD licensed. - -I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) -myself. These are LGPL and GPL, respectively. (A self-contained initramfs -package is planned for the busybox 1.3 release.) - -In theory you could use glibc, but that's not well suited for small embedded -uses like this. (A "hello world" program statically linked against glibc is -over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do -name lookups, even when otherwise statically linked.) 
- -A good first step is to get initramfs to run a statically linked "hello world" -program as init, and test it under an emulator like qemu (www.qemu.org) or -User Mode Linux, like so: - - cat > hello.c << EOF - #include <stdio.h> - #include <unistd.h> - - int main(int argc, char *argv[]) - { - printf("Hello world!\n"); - sleep(999999999); - } - EOF - gcc -static hello.c -o init - echo init | cpio -o -H newc | gzip > test.cpio.gz - # Testing external initramfs using the initrd loading mechanism. - qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero - -When debugging a normal root filesystem, it's nice to be able to boot with -"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's -just as useful. - -Why cpio rather than tar? -------------------------- - -This decision was made back in December, 2001. The discussion started here: - - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html - -And spawned a second thread (specifically on tar vs cpio), starting here: - - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html - -The quick and dirty summary version (which is no substitute for reading -the above threads) is: - -1) cpio is a standard. It's decades old (from the AT&T days), and already - widely used on Linux (inside RPM, Red Hat's device driver disks). Here's - a Linux Journal article about it from 1996: - - http://www.linuxjournal.com/article/1213 - - It's not as popular as tar because the traditional cpio command line tools - require _truly_hideous_ command line arguments. But that says nothing - either way about the archive format, and there are alternative tools, - such as: - - http://freecode.com/projects/afio - -2) The cpio archive format chosen by the kernel is simpler and cleaner (and - thus easier to create and parse) than any of the (literally dozens of) - various tar archive formats. The complete initramfs archive format is - explained in buffer-format.txt, created in usr/gen_init_cpio.c, and - extracted in init/initramfs.c.
All three together come to less than 26k - total of human-readable text. - -3) The GNU project standardizing on tar is approximately as relevant as - Windows standardizing on zip. Linux is not part of either, and is free - to make its own technical decisions. - -4) Since this is a kernel internal format, it could easily have been - something brand new. The kernel provides its own tools to create and - extract this format anyway. Using an existing standard was preferable, - but not essential. - -5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be - supported on the kernel side"): - - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html - - explained his reasoning: - - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html - - and, most importantly, designed and implemented the initramfs code. - -Future directions: ------------------- - -Today (2.6.16), initramfs is always compiled in, but not always used. The -kernel falls back to legacy boot code that is reached only if initramfs does -not contain an /init program. The fallback is legacy code, there to ensure a -smooth transition and allowing early boot functionality to gradually move to -"early userspace" (I.E. initramfs). - -The move to early userspace is necessary because finding and mounting the real -root device is complex. Root partitions can span multiple devices (raid or -separate journal). They can be out on the network (requiring dhcp, setting a -specific MAC address, logging into a server, etc). They can live on removable -media, with dynamically allocated major/minor numbers and persistent naming -issues requiring a full udev implementation to sort out. They can be -compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, -and so on. - -This kind of complexity (which inevitably includes policy) is rightly handled -in userspace. 
Both klibc and busybox/uClibc are working on simple initramfs -packages to drop into a kernel build. - -The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree. -The kernel's current early boot code (partition detection, etc) will probably -be migrated into a default initramfs, automatically created and used by the -kernel build. -- cgit From 56e6d5c0eb7b862b4c984107e665821722413008 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:21 +0100 Subject: docs: filesystems: convert relay.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Use notes markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/f48bb0fdf64d197f28c6f469adb61a7a091adb75.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/relay.rst | 501 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/relay.txt | 494 ----------------------------------- 3 files changed, 502 insertions(+), 494 deletions(-) create mode 100644 Documentation/filesystems/relay.rst delete mode 100644 Documentation/filesystems/relay.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index b8689d082911..0aade8146d4d 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -84,5 +84,6 @@ Documentation for filesystem implementations. proc qnx6 ramfs-rootfs-initramfs + relay virtiofs vfat diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst new file mode 100644 index 000000000000..04ad083cfe62 --- /dev/null +++ b/Documentation/filesystems/relay.rst @@ -0,0 +1,501 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +================================== +relay interface (formerly relayfs) +================================== + +The relay interface provides a means for kernel applications to +efficiently log and transfer large quantities of data from the kernel +to userspace via user-defined 'relay channels'. + +A 'relay channel' is a kernel->user data relay mechanism implemented +as a set of per-cpu kernel buffers ('channel buffers'), each +represented as a regular file ('relay file') in user space. Kernel +clients write into the channel buffers using efficient write +functions; these automatically log into the current cpu's channel +buffer. User space applications mmap() or read() from the relay files +and retrieve the data as it becomes available. The relay files +themselves are files created in a host filesystem, e.g. debugfs, and +are associated with the channel buffers using the API described below. + +The format of the data logged into the channel buffers is completely +up to the kernel client; the relay interface does however provide +hooks which allow kernel clients to impose some structure on the +buffer data. The relay interface doesn't implement any form of data +filtering - this also is left to the kernel client. The purpose is to +keep things as simple as possible. + +This document provides an overview of the relay interface API. The +details of the function parameters are documented along with the +functions in the relay interface code - please see that for details. + +Semantics +========= + +Each relay channel has one buffer per CPU, each buffer has one or more +sub-buffers. Messages are written to the first sub-buffer until it is +too full to contain a new message, in which case it is written to +the next (if available). Messages are never split across sub-buffers. +At this point, userspace can be notified so it empties the first +sub-buffer, while the kernel continues writing to the next. 
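The sub-buffer rule above (a message that does not fit closes the current sub-buffer and goes whole into the next one) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel's implementation: `struct model_buf`, `model_write()` and the `MODEL_*` sizes are hypothetical names invented for this example, and the model never wraps around or notifies a reader.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of one channel buffer; not kernel code. */
#define MODEL_SUBBUF_SIZE 16
#define MODEL_N_SUBBUFS   4

struct model_buf {
	char data[MODEL_N_SUBBUFS][MODEL_SUBBUF_SIZE];
	size_t used[MODEL_N_SUBBUFS];    /* valid bytes in each sub-buffer */
	size_t padding[MODEL_N_SUBBUFS]; /* unused tail of closed sub-buffers */
	size_t cur;                      /* sub-buffer currently written to */
};

/*
 * Append one message without ever splitting it across sub-buffers.
 * Returns the index of the sub-buffer that received the message, or -1
 * if the message is larger than a sub-buffer or no sub-buffer is left.
 */
static int model_write(struct model_buf *b, const void *msg, size_t len)
{
	assert(b != NULL && msg != NULL);

	if (len > MODEL_SUBBUF_SIZE || b->cur >= MODEL_N_SUBBUFS)
		return -1;

	if (b->used[b->cur] + len > MODEL_SUBBUF_SIZE) {
		/* Too full: record the unused tail and move to the next one. */
		b->padding[b->cur] = MODEL_SUBBUF_SIZE - b->used[b->cur];
		b->cur++;
		if (b->cur >= MODEL_N_SUBBUFS)
			return -1;
	}

	memcpy(&b->data[b->cur][b->used[b->cur]], msg, len);
	b->used[b->cur] += len;
	return (int)b->cur;
}
```

With a zero-initialized `struct model_buf`, writing a 10-byte message followed by an 8-byte message places the second message in sub-buffer 1 and leaves 6 unused bytes at the end of sub-buffer 0.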
+ +When notified that a sub-buffer is full, the kernel knows how many +bytes of it are padding i.e. unused space occurring because a complete +message couldn't fit into a sub-buffer. Userspace can use this +knowledge to copy only valid data. + +After copying it, userspace can notify the kernel that a sub-buffer +has been consumed. + +A relay channel can operate in a mode where it will overwrite data not +yet collected by userspace, and not wait for it to be consumed. + +The relay channel itself does not provide for communication of such +data between userspace and kernel, allowing the kernel side to remain +simple and not impose a single interface on userspace. It does +provide a set of examples and a separate helper though, described +below. + +The read() interface both removes padding and internally consumes the +read sub-buffers; thus in cases where read(2) is being used to drain +the channel buffers, special-purpose communication between kernel and +user isn't necessary for basic operation. + +One of the major goals of the relay interface is to provide a low +overhead mechanism for conveying kernel data to userspace. While the +read() interface is easy to use, it's not as efficient as the mmap() +approach; the example code attempts to make the tradeoff between the +two approaches as small as possible. + +klog and relay-apps example code +================================ + +The relay interface itself is ready to use, but to make things easier, +a couple simple utility functions and a set of examples are provided. + +The relay-apps example tarball, available on the relay sourceforge +site, contains a set of self-contained examples, each consisting of a +pair of .c files containing boilerplate code for each of the user and +kernel sides of a relay application. When combined these two sets of +boilerplate code provide glue to easily stream data to disk, without +having to bother with mundane housekeeping chores. 
+ +The 'klog debugging functions' patch (klog.patch in the relay-apps +tarball) provides a couple of high-level logging functions to the +kernel which allow writing formatted text or raw data to a channel, +regardless of whether a channel to write into exists or not, or even +whether the relay interface is compiled into the kernel or not. These +functions allow you to put unconditional 'trace' statements anywhere +in the kernel or kernel modules; only when there is a 'klog handler' +registered will data actually be logged (see the klog and kleak +examples for details). + +It is of course possible to use the relay interface from scratch, +i.e. without using any of the relay-apps example code or klog, but +you'll have to implement communication between userspace and kernel, +allowing both to convey the state of buffers (full, empty, amount of +padding). The read() interface both removes padding and internally +consumes the read sub-buffers; thus in cases where read(2) is being +used to drain the channel buffers, special-purpose communication +between kernel and user isn't necessary for basic operation. Things +such as buffer-full conditions would still need to be communicated via +some channel though. + +klog and the relay-apps examples can be found in the relay-apps +tarball on http://relayfs.sourceforge.net + +The relay interface user space API +================================== + +The relay interface implements basic file operations for user space +access to relay channel buffer data. Here are the file operations +that are available and some comments regarding their behavior: + +=========== ============================================================ +open() enables user to open an _existing_ channel buffer. + +mmap() results in channel buffer being mapped into the caller's + memory space. Note that you can't do a partial mmap - you + must map the entire file, which is NRBUF * SUBBUFSIZE. + +read() read the contents of a channel buffer. 
The bytes read are + 'consumed' by the reader, i.e. they won't be available + again to subsequent reads. If the channel is being used + in no-overwrite mode (the default), it can be read at any + time even if there's an active kernel writer. If the + channel is being used in overwrite mode and there are + active channel writers, results may be unpredictable - + users should make sure that all logging to the channel has + ended before using read() with overwrite mode. Sub-buffer + padding is automatically removed and will not be seen by + the reader. + +sendfile() transfer data from a channel buffer to an output file + descriptor. Sub-buffer padding is automatically removed + and will not be seen by the reader. + +poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are + notified when sub-buffer boundaries are crossed. + +close() decrements the channel buffer's refcount. When the refcount + reaches 0, i.e. when no process or kernel client has the + buffer open, the channel buffer is freed. +=========== ============================================================ + +In order for a user application to make use of relay files, the +host filesystem must be mounted. For example:: + + mount -t debugfs debugfs /sys/kernel/debug + +.. Note:: + + the host filesystem doesn't need to be mounted for kernel + clients to create or use channels - it only needs to be + mounted when user space applications need access to the buffer + data. + + +The relay interface kernel API +============================== + +Here's a summary of the API the relay interface provides to in-kernel clients: + +TBD(curr. 
line MT:/API/) + channel management functions:: + + relay_open(base_filename, parent, subbuf_size, n_subbufs, + callbacks, private_data) + relay_close(chan) + relay_flush(chan) + relay_reset(chan) + + channel management typically called on instigation of userspace:: + + relay_subbufs_consumed(chan, cpu, subbufs_consumed) + + write functions:: + + relay_write(chan, data, length) + __relay_write(chan, data, length) + relay_reserve(chan, length) + + callbacks:: + + subbuf_start(buf, subbuf, prev_subbuf, prev_padding) + buf_mapped(buf, filp) + buf_unmapped(buf, filp) + create_buf_file(filename, parent, mode, buf, is_global) + remove_buf_file(dentry) + + helper functions:: + + relay_buf_full(buf) + subbuf_start_reserve(buf, length) + + +Creating a channel +------------------ + +relay_open() is used to create a channel, along with its per-cpu +channel buffers. Each channel buffer will have an associated file +created for it in the host filesystem, which can be mmapped or +read from in user space. The files are named basename0...basenameN-1 +where N is the number of online cpus, and by default will be created +in the root of the filesystem (if the parent param is NULL). If you +want a directory structure to contain your relay files, you should +create it using the host filesystem's directory creation function, +e.g. debugfs_create_dir(), and pass the parent directory to +relay_open(). Users are responsible for cleaning up any directory +structure they create, when the channel is closed - again the host +filesystem's directory removal functions should be used for that, +e.g. debugfs_remove(). + +In order for a channel to be created and the host filesystem's files +associated with its channel buffers, the user must provide definitions +for two callback functions, create_buf_file() and remove_buf_file().
+create_buf_file() is called once for each per-cpu buffer from +relay_open() and allows the user to create the file which will be used +to represent the corresponding channel buffer. The callback should +return the dentry of the file created to represent the channel buffer. +remove_buf_file() must also be defined; it's responsible for deleting +the file(s) created in create_buf_file() and is called during +relay_close(). + +Here are some typical definitions for these callbacks, in this case +using debugfs:: + + /* + * create_buf_file() callback. Creates relay file in debugfs. + */ + static struct dentry *create_buf_file_handler(const char *filename, + struct dentry *parent, + umode_t mode, + struct rchan_buf *buf, + int *is_global) + { + return debugfs_create_file(filename, mode, parent, buf, + &relay_file_operations); + } + + /* + * remove_buf_file() callback. Removes relay file from debugfs. + */ + static int remove_buf_file_handler(struct dentry *dentry) + { + debugfs_remove(dentry); + + return 0; + } + + /* + * relay interface callbacks + */ + static struct rchan_callbacks relay_callbacks = + { + .create_buf_file = create_buf_file_handler, + .remove_buf_file = remove_buf_file_handler, + }; + +And an example relay_open() invocation using them:: + + chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL); + +If the create_buf_file() callback fails, or isn't defined, channel +creation and thus relay_open() will fail. + +The total size of each per-cpu buffer is calculated by multiplying the +number of sub-buffers by the sub-buffer size passed into relay_open(). +The idea behind sub-buffers is that they're basically an extension of +double-buffering to N buffers, and they also allow applications to +easily implement random-access-on-buffer-boundary schemes, which can +be important for some high-volume applications. 
The number and size +of sub-buffers is completely dependent on the application and even for +the same application, different conditions will warrant different +values for these parameters at different times. Typically, the right +values to use are best decided after some experimentation; in general, +though, it's safe to assume that having only 1 sub-buffer is a bad +idea - you're guaranteed to either overwrite data or lose events +depending on the channel mode being used. + +The create_buf_file() implementation can also be defined in such a way +as to allow the creation of a single 'global' buffer instead of the +default per-cpu set. This can be useful for applications interested +mainly in seeing the relative ordering of system-wide events without +the need to bother with saving explicit timestamps for the purpose of +merging/sorting per-cpu files in a postprocessing step. + +To have relay_open() create a global buffer, the create_buf_file() +implementation should set the value of the is_global outparam to a +non-zero value in addition to creating the file that will be used to +represent the single buffer. In the case of a global buffer, +create_buf_file() and remove_buf_file() will be called only once. The +normal channel-writing functions, e.g. relay_write(), can still be +used - writes from any cpu will transparently end up in the global +buffer - but since it is a global buffer, callers should make sure +they use the proper locking for such a buffer, either by wrapping +writes in a spinlock, or by copying a write function from relay.h and +creating a local version that internally does the proper locking. + +The private_data passed into relay_open() allows clients to associate +user-defined data with a channel, and is immediately available +(including in create_buf_file()) via chan->private_data or +buf->chan->private_data. 
+
+Buffer-only channels
+--------------------
+
+These channels have no files associated and can be created with
+relay_open(NULL, NULL, ...). Such channels are useful in scenarios such
+as when doing early tracing in the kernel, before the VFS is up. In these
+cases, one may open a buffer-only channel and then call
+relay_late_setup_files() when the kernel is ready to handle files,
+to expose the buffered data to userspace.
+
+Channel 'modes'
+---------------
+
+relay channels can be used in either of two modes - 'overwrite' or
+'no-overwrite'. The mode is entirely determined by the implementation
+of the subbuf_start() callback, as described below. The default if no
+subbuf_start() callback is defined is 'no-overwrite' mode. If the
+default mode suits your needs, and you plan to use the read()
+interface to retrieve channel data, you can ignore the details of this
+section, as it pertains mainly to mmap() implementations.
+
+In 'overwrite' mode, also known as 'flight recorder' mode, writes
+continuously cycle around the buffer and will never fail, but will
+unconditionally overwrite old data regardless of whether it's actually
+been consumed. In no-overwrite mode, writes will fail, i.e. data will
+be lost, if the number of unconsumed sub-buffers equals the total
+number of sub-buffers in the channel. It should be clear that if
+there is no consumer or if the consumer can't consume sub-buffers fast
+enough, data will be lost in either case; the only difference is
+whether data is lost from the beginning or the end of a buffer.
+
+As explained above, a relay channel is made up of one or more
+per-cpu channel buffers, each implemented as a circular buffer
+subdivided into one or more sub-buffers. Messages are written into
+the current sub-buffer of the channel's current per-cpu buffer via the
+write functions described below. Whenever a message can't fit into
+the current sub-buffer, because there's no room left for it, the
+client is notified via the subbuf_start() callback that a switch to a
+new sub-buffer is about to occur. The client uses this callback to 1)
+initialize the next sub-buffer if appropriate 2) finalize the previous
+sub-buffer if appropriate and 3) return a boolean value indicating
+whether or not to actually move on to the next sub-buffer.
+
+To implement 'no-overwrite' mode, the kernel client would provide
+an implementation of the subbuf_start() callback something like the
+following::
+
+    static int subbuf_start(struct rchan_buf *buf,
+                            void *subbuf,
+                            void *prev_subbuf,
+                            size_t prev_padding)
+    {
+            if (prev_subbuf)
+                    *((unsigned *)prev_subbuf) = prev_padding;
+
+            if (relay_buf_full(buf))
+                    return 0;
+
+            subbuf_start_reserve(buf, sizeof(unsigned int));
+
+            return 1;
+    }
+
+If the current buffer is full, i.e. all sub-buffers remain unconsumed,
+the callback returns 0 to indicate that the buffer switch should not
+occur yet, i.e. until the consumer has had a chance to read the
+current set of ready sub-buffers. For the relay_buf_full() function
+to make sense, the consumer is responsible for notifying the relay
+interface when sub-buffers have been consumed via
+relay_subbufs_consumed(). Any subsequent attempts to write into the
+buffer will again invoke the subbuf_start() callback with the same
+parameters; only when the consumer has consumed one or more of the
+ready sub-buffers will relay_buf_full() return 0, in which case the
+buffer switch can continue.
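The no-overwrite bookkeeping described above can be modelled in user space. The following is a minimal sketch, not kernel code: struct model_buf and the model_*() helpers are hypothetical stand-ins for struct rchan_buf, relay_buf_full(), relay_subbufs_consumed() and the sub-buffer switch, and the sketch assumes that fullness is tracked as the difference between a produced count and a consumed count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of one per-cpu channel buffer. */
struct model_buf {
	size_t n_subbufs;	/* total sub-buffers in this buffer */
	size_t produced;	/* sub-buffers completed by the writer */
	size_t consumed;	/* sub-buffers handed back by the consumer */
	int cur_open;		/* 1 while the writer has a sub-buffer open */
};

/* Models relay_buf_full(): full when every sub-buffer is unconsumed. */
static int model_buf_full(const struct model_buf *buf)
{
	return buf->produced - buf->consumed >= buf->n_subbufs;
}

/* Models relay_subbufs_consumed(): the consumer releases n sub-buffers. */
static void model_subbufs_consumed(struct model_buf *buf, size_t n)
{
	/* The model rejects releasing more than is ready. */
	assert(n <= buf->produced - buf->consumed);
	buf->consumed += n;
}

/*
 * Models a no-overwrite sub-buffer switch. The open sub-buffer is
 * counted as produced exactly once; the switch is then refused (0)
 * while the buffer is full and allowed (1) otherwise.
 */
static int model_subbuf_switch(struct model_buf *buf)
{
	if (buf->cur_open) {
		buf->produced++;
		buf->cur_open = 0;
	}
	if (model_buf_full(buf))
		return 0;
	buf->cur_open = 1;
	return 1;
}
```

With two sub-buffers, the first switch succeeds and the second is refused until the consumer releases at least one sub-buffer, which is the point at which writes start being dropped in no-overwrite mode.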
+
+The implementation of the subbuf_start() callback for 'overwrite' mode
+would be very similar::
+
+    static int subbuf_start(struct rchan_buf *buf,
+                            void *subbuf,
+                            void *prev_subbuf,
+                            size_t prev_padding)
+    {
+            if (prev_subbuf)
+                    *((unsigned *)prev_subbuf) = prev_padding;
+
+            subbuf_start_reserve(buf, sizeof(unsigned int));
+
+            return 1;
+    }
+
+In this case, the relay_buf_full() check is meaningless and the
+callback always returns 1, causing the buffer switch to occur
+unconditionally. It's also meaningless for the client to use the
+relay_subbufs_consumed() function in this mode, as it's never
+consulted.
+
+The default subbuf_start() implementation, used if the client doesn't
+define any callbacks, or doesn't define the subbuf_start() callback,
+implements the simplest possible 'no-overwrite' mode, i.e. it does
+nothing but return 0.
+
+Header information can be reserved at the beginning of each sub-buffer
+by calling the subbuf_start_reserve() helper function from within the
+subbuf_start() callback. This reserved area can be used to store
+whatever information the client wants. In the example above, room is
+reserved in each sub-buffer to store the padding count for that
+sub-buffer. This is filled in for the previous sub-buffer in the
+subbuf_start() implementation; the padding value for the previous
+sub-buffer is passed into the subbuf_start() callback along with a
+pointer to the previous sub-buffer, since the padding value isn't
+known until a sub-buffer is filled. The subbuf_start() callback is
+also called for the first sub-buffer when the channel is opened, to
+give the client a chance to reserve space in it. In this case the
+previous sub-buffer pointer passed into the callback will be NULL, so
+the client should check the value of the prev_subbuf pointer before
+writing into the previous sub-buffer.
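On the consumer side, the reserved header is what lets an mmap() reader skip the padding. The sketch below is a hypothetical user-space helper, not part of the relay API: subbuf_payload() is an invented name, and it assumes the layout used by the subbuf_start() examples above, i.e. that the first sizeof(unsigned int) bytes of a completed sub-buffer hold its padding count, written on the same machine that reads it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Given one completed sub-buffer, return the number of valid payload
 * bytes and point *payload at the first of them. Returns 0 and sets
 * *payload to NULL if the sub-buffer is too small to hold the header
 * or the stored padding count is implausible.
 */
static size_t subbuf_payload(const unsigned char *subbuf, size_t subbuf_size,
			     const unsigned char **payload)
{
	unsigned int padding;

	assert(subbuf != NULL && payload != NULL);
	*payload = NULL;

	if (subbuf_size < sizeof(padding))
		return 0;

	/* memcpy avoids assuming the sub-buffer is suitably aligned. */
	memcpy(&padding, subbuf, sizeof(padding));
	if (padding > subbuf_size - sizeof(padding))
		return 0;

	*payload = subbuf + sizeof(padding);
	return subbuf_size - sizeof(padding) - padding;
}
```

The padding slot is only filled in once the writer has moved on, so a helper like this is meaningful for sub-buffers that have already been switched away from, not for the one currently being written.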
+ +Writing to a channel +-------------------- + +Kernel clients write data into the current cpu's channel buffer using +relay_write() or __relay_write(). relay_write() is the main logging +function - it uses local_irqsave() to protect the buffer and should be +used if you might be logging from interrupt context. If you know +you'll never be logging from interrupt context, you can use +__relay_write(), which only disables preemption. These functions +don't return a value, so you can't determine whether or not they +failed - the assumption is that you wouldn't want to check a return +value in the fast logging path anyway, and that they'll always succeed +unless the buffer is full and no-overwrite mode is being used, in +which case you can detect a failed write in the subbuf_start() +callback by calling the relay_buf_full() helper function. + +relay_reserve() is used to reserve a slot in a channel buffer which +can be written to later. This would typically be used in applications +that need to write directly into a channel buffer without having to +stage data in a temporary buffer beforehand. Because the actual write +may not happen immediately after the slot is reserved, applications +using relay_reserve() can keep a count of the number of bytes actually +written, either in space reserved in the sub-buffers themselves or as +a separate array. See the 'reserve' example in the relay-apps tarball +at http://relayfs.sourceforge.net for an example of how this can be +done. Because the write is under control of the client and is +separated from the reserve, relay_reserve() doesn't protect the buffer +at all - it's up to the client to provide the appropriate +synchronization when using relay_reserve(). + +Closing a channel +----------------- + +The client calls relay_close() when it's finished using the channel. +The channel and its associated buffers are destroyed when there are no +longer any references to any of the channel buffers. 
relay_flush() +forces a sub-buffer switch on all the channel buffers, and can be used +to finalize and process the last sub-buffers before the channel is +closed. + +Misc +---- + +Some applications may want to keep a channel around and re-use it +rather than open and close a new channel for each use. relay_reset() +can be used for this purpose - it resets a channel to its initial +state without reallocating channel buffer memory or destroying +existing mappings. It should however only be called when it's safe to +do so, i.e. when the channel isn't currently being written to. + +Finally, there are a couple of utility callbacks that can be used for +different purposes. buf_mapped() is called whenever a channel buffer +is mmapped from user space and buf_unmapped() is called when it's +unmapped. The client can use this notification to trigger actions +within the kernel application, such as enabling/disabling logging to +the channel. + + +Resources +========= + +For news, example code, mailing list, etc. see the relay interface homepage: + + http://relayfs.sourceforge.net + + +Credits +======= + +The ideas and specs for the relay interface came about as a result of +discussions on tracing involving the following: + +Michel Dagenais +Richard Moore +Bob Wisniewski +Karim Yaghmour +Tom Zanussi + +Also thanks to Hubertus Franke for a lot of useful suggestions and bug +reports. diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.txt deleted file mode 100644 index cd709a94d054..000000000000 --- a/Documentation/filesystems/relay.txt +++ /dev/null @@ -1,494 +0,0 @@ -relay interface (formerly relayfs) -================================== - -The relay interface provides a means for kernel applications to -efficiently log and transfer large quantities of data from the kernel -to userspace via user-defined 'relay channels'. 
- -A 'relay channel' is a kernel->user data relay mechanism implemented -as a set of per-cpu kernel buffers ('channel buffers'), each -represented as a regular file ('relay file') in user space. Kernel -clients write into the channel buffers using efficient write -functions; these automatically log into the current cpu's channel -buffer. User space applications mmap() or read() from the relay files -and retrieve the data as it becomes available. The relay files -themselves are files created in a host filesystem, e.g. debugfs, and -are associated with the channel buffers using the API described below. - -The format of the data logged into the channel buffers is completely -up to the kernel client; the relay interface does however provide -hooks which allow kernel clients to impose some structure on the -buffer data. The relay interface doesn't implement any form of data -filtering - this also is left to the kernel client. The purpose is to -keep things as simple as possible. - -This document provides an overview of the relay interface API. The -details of the function parameters are documented along with the -functions in the relay interface code - please see that for details. - -Semantics -========= - -Each relay channel has one buffer per CPU, each buffer has one or more -sub-buffers. Messages are written to the first sub-buffer until it is -too full to contain a new message, in which case it is written to -the next (if available). Messages are never split across sub-buffers. -At this point, userspace can be notified so it empties the first -sub-buffer, while the kernel continues writing to the next. - -When notified that a sub-buffer is full, the kernel knows how many -bytes of it are padding i.e. unused space occurring because a complete -message couldn't fit into a sub-buffer. Userspace can use this -knowledge to copy only valid data. - -After copying it, userspace can notify the kernel that a sub-buffer -has been consumed. 
- -A relay channel can operate in a mode where it will overwrite data not -yet collected by userspace, and not wait for it to be consumed. - -The relay channel itself does not provide for communication of such -data between userspace and kernel, allowing the kernel side to remain -simple and not impose a single interface on userspace. It does -provide a set of examples and a separate helper though, described -below. - -The read() interface both removes padding and internally consumes the -read sub-buffers; thus in cases where read(2) is being used to drain -the channel buffers, special-purpose communication between kernel and -user isn't necessary for basic operation. - -One of the major goals of the relay interface is to provide a low -overhead mechanism for conveying kernel data to userspace. While the -read() interface is easy to use, it's not as efficient as the mmap() -approach; the example code attempts to make the tradeoff between the -two approaches as small as possible. - -klog and relay-apps example code -================================ - -The relay interface itself is ready to use, but to make things easier, -a couple simple utility functions and a set of examples are provided. - -The relay-apps example tarball, available on the relay sourceforge -site, contains a set of self-contained examples, each consisting of a -pair of .c files containing boilerplate code for each of the user and -kernel sides of a relay application. When combined these two sets of -boilerplate code provide glue to easily stream data to disk, without -having to bother with mundane housekeeping chores. - -The 'klog debugging functions' patch (klog.patch in the relay-apps -tarball) provides a couple of high-level logging functions to the -kernel which allow writing formatted text or raw data to a channel, -regardless of whether a channel to write into exists or not, or even -whether the relay interface is compiled into the kernel or not. 
These -functions allow you to put unconditional 'trace' statements anywhere -in the kernel or kernel modules; only when there is a 'klog handler' -registered will data actually be logged (see the klog and kleak -examples for details). - -It is of course possible to use the relay interface from scratch, -i.e. without using any of the relay-apps example code or klog, but -you'll have to implement communication between userspace and kernel, -allowing both to convey the state of buffers (full, empty, amount of -padding). The read() interface both removes padding and internally -consumes the read sub-buffers; thus in cases where read(2) is being -used to drain the channel buffers, special-purpose communication -between kernel and user isn't necessary for basic operation. Things -such as buffer-full conditions would still need to be communicated via -some channel though. - -klog and the relay-apps examples can be found in the relay-apps -tarball on http://relayfs.sourceforge.net - -The relay interface user space API -================================== - -The relay interface implements basic file operations for user space -access to relay channel buffer data. Here are the file operations -that are available and some comments regarding their behavior: - -open() enables user to open an _existing_ channel buffer. - -mmap() results in channel buffer being mapped into the caller's - memory space. Note that you can't do a partial mmap - you - must map the entire file, which is NRBUF * SUBBUFSIZE. - -read() read the contents of a channel buffer. The bytes read are - 'consumed' by the reader, i.e. they won't be available - again to subsequent reads. If the channel is being used - in no-overwrite mode (the default), it can be read at any - time even if there's an active kernel writer. 
If the - channel is being used in overwrite mode and there are - active channel writers, results may be unpredictable - - users should make sure that all logging to the channel has - ended before using read() with overwrite mode. Sub-buffer - padding is automatically removed and will not be seen by - the reader. - -sendfile() transfer data from a channel buffer to an output file - descriptor. Sub-buffer padding is automatically removed - and will not be seen by the reader. - -poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are - notified when sub-buffer boundaries are crossed. - -close() decrements the channel buffer's refcount. When the refcount - reaches 0, i.e. when no process or kernel client has the - buffer open, the channel buffer is freed. - -In order for a user application to make use of relay files, the -host filesystem must be mounted. For example, - - mount -t debugfs debugfs /sys/kernel/debug - -NOTE: the host filesystem doesn't need to be mounted for kernel - clients to create or use channels - it only needs to be - mounted when user space applications need access to the buffer - data. - - -The relay interface kernel API -============================== - -Here's a summary of the API the relay interface provides to in-kernel clients: - -TBD(curr. 
line MT:/API/) - channel management functions: - - relay_open(base_filename, parent, subbuf_size, n_subbufs, - callbacks, private_data) - relay_close(chan) - relay_flush(chan) - relay_reset(chan) - - channel management typically called on instigation of userspace: - - relay_subbufs_consumed(chan, cpu, subbufs_consumed) - - write functions: - - relay_write(chan, data, length) - __relay_write(chan, data, length) - relay_reserve(chan, length) - - callbacks: - - subbuf_start(buf, subbuf, prev_subbuf, prev_padding) - buf_mapped(buf, filp) - buf_unmapped(buf, filp) - create_buf_file(filename, parent, mode, buf, is_global) - remove_buf_file(dentry) - - helper functions: - - relay_buf_full(buf) - subbuf_start_reserve(buf, length) - - -Creating a channel ------------------- - -relay_open() is used to create a channel, along with its per-cpu -channel buffers. Each channel buffer will have an associated file -created for it in the host filesystem, which can be and mmapped or -read from in user space. The files are named basename0...basenameN-1 -where N is the number of online cpus, and by default will be created -in the root of the filesystem (if the parent param is NULL). If you -want a directory structure to contain your relay files, you should -create it using the host filesystem's directory creation function, -e.g. debugfs_create_dir(), and pass the parent directory to -relay_open(). Users are responsible for cleaning up any directory -structure they create, when the channel is closed - again the host -filesystem's directory removal functions should be used for that, -e.g. debugfs_remove(). - -In order for a channel to be created and the host filesystem's files -associated with its channel buffers, the user must provide definitions -for two callback functions, create_buf_file() and remove_buf_file(). 
-create_buf_file() is called once for each per-cpu buffer from -relay_open() and allows the user to create the file which will be used -to represent the corresponding channel buffer. The callback should -return the dentry of the file created to represent the channel buffer. -remove_buf_file() must also be defined; it's responsible for deleting -the file(s) created in create_buf_file() and is called during -relay_close(). - -Here are some typical definitions for these callbacks, in this case -using debugfs: - -/* - * create_buf_file() callback. Creates relay file in debugfs. - */ -static struct dentry *create_buf_file_handler(const char *filename, - struct dentry *parent, - umode_t mode, - struct rchan_buf *buf, - int *is_global) -{ - return debugfs_create_file(filename, mode, parent, buf, - &relay_file_operations); -} - -/* - * remove_buf_file() callback. Removes relay file from debugfs. - */ -static int remove_buf_file_handler(struct dentry *dentry) -{ - debugfs_remove(dentry); - - return 0; -} - -/* - * relay interface callbacks - */ -static struct rchan_callbacks relay_callbacks = -{ - .create_buf_file = create_buf_file_handler, - .remove_buf_file = remove_buf_file_handler, -}; - -And an example relay_open() invocation using them: - - chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL); - -If the create_buf_file() callback fails, or isn't defined, channel -creation and thus relay_open() will fail. - -The total size of each per-cpu buffer is calculated by multiplying the -number of sub-buffers by the sub-buffer size passed into relay_open(). -The idea behind sub-buffers is that they're basically an extension of -double-buffering to N buffers, and they also allow applications to -easily implement random-access-on-buffer-boundary schemes, which can -be important for some high-volume applications. 
The number and size -of sub-buffers is completely dependent on the application and even for -the same application, different conditions will warrant different -values for these parameters at different times. Typically, the right -values to use are best decided after some experimentation; in general, -though, it's safe to assume that having only 1 sub-buffer is a bad -idea - you're guaranteed to either overwrite data or lose events -depending on the channel mode being used. - -The create_buf_file() implementation can also be defined in such a way -as to allow the creation of a single 'global' buffer instead of the -default per-cpu set. This can be useful for applications interested -mainly in seeing the relative ordering of system-wide events without -the need to bother with saving explicit timestamps for the purpose of -merging/sorting per-cpu files in a postprocessing step. - -To have relay_open() create a global buffer, the create_buf_file() -implementation should set the value of the is_global outparam to a -non-zero value in addition to creating the file that will be used to -represent the single buffer. In the case of a global buffer, -create_buf_file() and remove_buf_file() will be called only once. The -normal channel-writing functions, e.g. relay_write(), can still be -used - writes from any cpu will transparently end up in the global -buffer - but since it is a global buffer, callers should make sure -they use the proper locking for such a buffer, either by wrapping -writes in a spinlock, or by copying a write function from relay.h and -creating a local version that internally does the proper locking. - -The private_data passed into relay_open() allows clients to associate -user-defined data with a channel, and is immediately available -(including in create_buf_file()) via chan->private_data or -buf->chan->private_data. 
- -Buffer-only channels --------------------- - -These channels have no files associated and can be created with -relay_open(NULL, NULL, ...). Such channels are useful in scenarios such -as when doing early tracing in the kernel, before the VFS is up. In these -cases, one may open a buffer-only channel and then call -relay_late_setup_files() when the kernel is ready to handle files, -to expose the buffered data to the userspace. - -Channel 'modes' ---------------- - -relay channels can be used in either of two modes - 'overwrite' or -'no-overwrite'. The mode is entirely determined by the implementation -of the subbuf_start() callback, as described below. The default if no -subbuf_start() callback is defined is 'no-overwrite' mode. If the -default mode suits your needs, and you plan to use the read() -interface to retrieve channel data, you can ignore the details of this -section, as it pertains mainly to mmap() implementations. - -In 'overwrite' mode, also known as 'flight recorder' mode, writes -continuously cycle around the buffer and will never fail, but will -unconditionally overwrite old data regardless of whether it's actually -been consumed. In no-overwrite mode, writes will fail, i.e. data will -be lost, if the number of unconsumed sub-buffers equals the total -number of sub-buffers in the channel. It should be clear that if -there is no consumer or if the consumer can't consume sub-buffers fast -enough, data will be lost in either case; the only difference is -whether data is lost from the beginning or the end of a buffer. - -As explained above, a relay channel is made of up one or more -per-cpu channel buffers, each implemented as a circular buffer -subdivided into one or more sub-buffers. Messages are written into -the current sub-buffer of the channel's current per-cpu buffer via the -write functions described below. 
Whenever a message can't fit into -the current sub-buffer, because there's no room left for it, the -client is notified via the subbuf_start() callback that a switch to a -new sub-buffer is about to occur. The client uses this callback to 1) -initialize the next sub-buffer if appropriate 2) finalize the previous -sub-buffer if appropriate and 3) return a boolean value indicating -whether or not to actually move on to the next sub-buffer. - -To implement 'no-overwrite' mode, the userspace client would provide -an implementation of the subbuf_start() callback something like the -following: - -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - unsigned int prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; - - if (relay_buf_full(buf)) - return 0; - - subbuf_start_reserve(buf, sizeof(unsigned int)); - - return 1; -} - -If the current buffer is full, i.e. all sub-buffers remain unconsumed, -the callback returns 0 to indicate that the buffer switch should not -occur yet, i.e. until the consumer has had a chance to read the -current set of ready sub-buffers. For the relay_buf_full() function -to make sense, the consumer is responsible for notifying the relay -interface when sub-buffers have been consumed via -relay_subbufs_consumed(). Any subsequent attempts to write into the -buffer will again invoke the subbuf_start() callback with the same -parameters; only when the consumer has consumed one or more of the -ready sub-buffers will relay_buf_full() return 0, in which case the -buffer switch can continue. 
- -The implementation of the subbuf_start() callback for 'overwrite' mode -would be very similar: - -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - size_t prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; - - subbuf_start_reserve(buf, sizeof(unsigned int)); - - return 1; -} - -In this case, the relay_buf_full() check is meaningless and the -callback always returns 1, causing the buffer switch to occur -unconditionally. It's also meaningless for the client to use the -relay_subbufs_consumed() function in this mode, as it's never -consulted. - -The default subbuf_start() implementation, used if the client doesn't -define any callbacks, or doesn't define the subbuf_start() callback, -implements the simplest possible 'no-overwrite' mode, i.e. it does -nothing but return 0. - -Header information can be reserved at the beginning of each sub-buffer -by calling the subbuf_start_reserve() helper function from within the -subbuf_start() callback. This reserved area can be used to store -whatever information the client wants. In the example above, room is -reserved in each sub-buffer to store the padding count for that -sub-buffer. This is filled in for the previous sub-buffer in the -subbuf_start() implementation; the padding value for the previous -sub-buffer is passed into the subbuf_start() callback along with a -pointer to the previous sub-buffer, since the padding value isn't -known until a sub-buffer is filled. The subbuf_start() callback is -also called for the first sub-buffer when the channel is opened, to -give the client a chance to reserve space in it. In this case the -previous sub-buffer pointer passed into the callback will be NULL, so -the client should check the value of the prev_subbuf pointer before -writing into the previous sub-buffer. 
- -Writing to a channel --------------------- - -Kernel clients write data into the current cpu's channel buffer using -relay_write() or __relay_write(). relay_write() is the main logging -function - it uses local_irqsave() to protect the buffer and should be -used if you might be logging from interrupt context. If you know -you'll never be logging from interrupt context, you can use -__relay_write(), which only disables preemption. These functions -don't return a value, so you can't determine whether or not they -failed - the assumption is that you wouldn't want to check a return -value in the fast logging path anyway, and that they'll always succeed -unless the buffer is full and no-overwrite mode is being used, in -which case you can detect a failed write in the subbuf_start() -callback by calling the relay_buf_full() helper function. - -relay_reserve() is used to reserve a slot in a channel buffer which -can be written to later. This would typically be used in applications -that need to write directly into a channel buffer without having to -stage data in a temporary buffer beforehand. Because the actual write -may not happen immediately after the slot is reserved, applications -using relay_reserve() can keep a count of the number of bytes actually -written, either in space reserved in the sub-buffers themselves or as -a separate array. See the 'reserve' example in the relay-apps tarball -at http://relayfs.sourceforge.net for an example of how this can be -done. Because the write is under control of the client and is -separated from the reserve, relay_reserve() doesn't protect the buffer -at all - it's up to the client to provide the appropriate -synchronization when using relay_reserve(). - -Closing a channel ------------------ - -The client calls relay_close() when it's finished using the channel. -The channel and its associated buffers are destroyed when there are no -longer any references to any of the channel buffers. 
relay_flush() -forces a sub-buffer switch on all the channel buffers, and can be used -to finalize and process the last sub-buffers before the channel is -closed. - -Misc ----- - -Some applications may want to keep a channel around and re-use it -rather than open and close a new channel for each use. relay_reset() -can be used for this purpose - it resets a channel to its initial -state without reallocating channel buffer memory or destroying -existing mappings. It should however only be called when it's safe to -do so, i.e. when the channel isn't currently being written to. - -Finally, there are a couple of utility callbacks that can be used for -different purposes. buf_mapped() is called whenever a channel buffer -is mmapped from user space and buf_unmapped() is called when it's -unmapped. The client can use this notification to trigger actions -within the kernel application, such as enabling/disabling logging to -the channel. - - -Resources -========= - -For news, example code, mailing list, etc. see the relay interface homepage: - - http://relayfs.sourceforge.net - - -Credits -======= - -The ideas and specs for the relay interface came about as a result of -discussions on tracing involving the following: - -Michel Dagenais -Richard Moore -Bob Wisniewski -Karim Yaghmour -Tom Zanussi - -Also thanks to Hubertus Franke for a lot of useful suggestions and bug -reports. -- cgit From 6db0a480aa07ab65b6c7d34d095c714359af3e87 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:22 +0100 Subject: docs: filesystems: convert romfs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/d2cc83e7cd6de63c793ccd3f2588ea40f7f1e764.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/romfs.rst | 194 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/romfs.txt | 186 ---------------------------------- 3 files changed, 195 insertions(+), 186 deletions(-) create mode 100644 Documentation/filesystems/romfs.rst delete mode 100644 Documentation/filesystems/romfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 0aade8146d4d..3b26639517af 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -85,5 +85,6 @@ Documentation for filesystem implementations. qnx6 ramfs-rootfs-initramfs relay + romfs virtiofs vfat diff --git a/Documentation/filesystems/romfs.rst b/Documentation/filesystems/romfs.rst new file mode 100644 index 000000000000..465b11efa9be --- /dev/null +++ b/Documentation/filesystems/romfs.rst @@ -0,0 +1,194 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +ROMFS - ROM File System +======================= + +This is a quite dumb, read only filesystem, mainly for initial RAM +disks of installation disks. It has grown up by the need of having +modules linked at boot time. Using this filesystem, you get a very +similar feature, and even the possibility of a small kernel, with a +file system which doesn't take up useful memory from the router +functions in the basement of your office. + +For comparison, both the older minix and xiafs (the latter is now +defunct) filesystems, compiled as module need more than 20000 bytes, +while romfs is less than a page, about 4000 bytes (assuming i586 +code). Under the same conditions, the msdos filesystem would need +about 30K (and does not support device nodes or symlinks), while the +nfs module with nfsroot is about 57K. 
Furthermore, as a bit unfair +comparison, an actual rescue disk used up 3202 blocks with ext2, while +with romfs, it needed 3079 blocks. + +To create such a file system, you'll need a user program named +genromfs. It is available on http://romfs.sourceforge.net/ + +As the name suggests, romfs could be also used (space-efficiently) on +various read-only media, like (E)EPROM disks if someone will have the +motivation.. :) + +However, the main purpose of romfs is to have a very small kernel, +which has only this filesystem linked in, and then can load any module +later, with the current module utilities. It can also be used to run +some program to decide if you need SCSI devices, and even IDE or +floppy drives can be loaded later if you use the "initrd"--initial +RAM disk--feature of the kernel. This would not be really news +flash, but with romfs, you can even spare off your ext2 or minix or +maybe even affs filesystem until you really know that you need it. + +For example, a distribution boot disk can contain only the cd disk +drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem +module. The kernel can be small enough, since it doesn't have other +filesystems, like the quite large ext2fs module, which can then be +loaded off the CD at a later stage of the installation. Another use +would be for a recovery disk, when you are reinstalling a workstation +from the network, and you will have all the tools/modules available +from a nearby server, so you don't want to carry two disks for this +purpose, just because it won't fit into ext2. + +romfs operates on block devices as you can expect, and the underlying +structure is very simple. Every accessible structure begins on 16 +byte boundaries for fast access. The minimum space a file will take +is 32 bytes (this is an empty file, with a less than 16 character +name). The maximum overhead for any non-empty file is the header, and +the 16 byte padding for the name and the contents, also 16+14+15 = 45 +bytes. 
This is quite rare however, since most file names are longer +than 3 bytes, and shorter than 15 bytes. + +The layout of the filesystem is the following:: + + offset content + + +---+---+---+---+ + 0 | - | r | o | m | \ + +---+---+---+---+ The ASCII representation of those bytes + 4 | 1 | f | s | - | / (i.e. "-rom1fs-") + +---+---+---+---+ + 8 | full size | The number of accessible bytes in this fs. + +---+---+---+---+ + 12 | checksum | The checksum of the FIRST 512 BYTES. + +---+---+---+---+ + 16 | volume name | The zero terminated name of the volume, + : : padded to 16 byte boundary. + +---+---+---+---+ + xx | file | + : headers : + +Every multi byte value (32 bit words, I'll use the longwords term from +now on) must be in big endian order. + +The first eight bytes identify the filesystem, even for the casual +inspector. After that, in the 3rd longword, it contains the number of +bytes accessible from the start of this filesystem. The 4th longword +is the checksum of the first 512 bytes (or the number of bytes +accessible, whichever is smaller). The applied algorithm is the same +as in the AFFS filesystem, namely a simple sum of the longwords +(assuming bigendian quantities again). For details, please consult +the source. This algorithm was chosen because although it's not quite +reliable, it does not require any tables, and it is very simple. 
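The superblock layout and checksum rule described above can be sketched in a few lines of Python. This is only an illustration, not the kernel's code: the helper names (`longword_sum`, `build_superblock`, `parse_superblock`) are hypothetical, and it assumes the convention that the stored checksum field is chosen so that the big-endian longword sum over the covered bytes comes out to zero. Consult the source for the authoritative behaviour.

```python
import struct

ROMFS_MAGIC = b"-rom1fs-"  # the first eight bytes, per the layout above


def longword_sum(data: bytes) -> int:
    """Sum big-endian 32-bit longwords modulo 2**32."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    words = struct.unpack(f">{len(data) // 4}I", data)
    return sum(words) & 0xFFFFFFFF


def build_superblock(volume: str) -> bytes:
    """Build a header-only image: magic, full size, checksum, volume name."""
    name = volume.encode("ascii") + b"\0"
    name += b"\0" * (-len(name) % 16)          # pad name to a 16 byte boundary
    size = 16 + len(name)
    body = ROMFS_MAGIC + struct.pack(">II", size, 0) + name
    # Assumption: the checksum field makes the total sum zero.
    checksum = (-longword_sum(body)) & 0xFFFFFFFF
    return body[:12] + struct.pack(">I", checksum) + body[16:]


def parse_superblock(image: bytes) -> dict:
    """Decode the fixed fields and check the sum over the first 512 bytes."""
    if image[:8] != ROMFS_MAGIC:
        raise ValueError("not a romfs image")
    full_size, checksum = struct.unpack_from(">II", image, 8)
    end = image.index(b"\0", 16)               # volume name is zero terminated
    covered = min(full_size, 512, len(image))  # 512 bytes or the fs size
    covered -= covered % 4
    return {
        "full_size": full_size,
        "checksum": checksum,
        "volume_name": image[16:end].decode("ascii"),
        "first_header": (end + 1 + 15) & ~15,  # next 16 byte boundary
        "checksum_ok": longword_sum(image[:covered]) == 0,
    }


if __name__ == "__main__":
    print(parse_superblock(build_superblock("demo")))
```

Because the sum needs no lookup tables, verifying an image costs one pass over at most 128 longwords, which matches the "very simple" rationale given above.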
+ +The following bytes are now part of the file system; each file header +must begin on a 16 byte boundary:: + + offset content + + +---+---+---+---+ + 0 | next filehdr|X| The offset of the next file header + +---+---+---+---+ (zero if no more files) + 4 | spec.info | Info for directories/hard links/devices + +---+---+---+---+ + 8 | size | The size of this file in bytes + +---+---+---+---+ + 12 | checksum | Covering the meta data, including the file + +---+---+---+---+ name, and padding + 16 | file name | The zero terminated name of the file, + : : padded to 16 byte boundary + +---+---+---+---+ + xx | file data | + : : + +Since the file headers begin always at a 16 byte boundary, the lowest +4 bits would be always zero in the next filehdr pointer. These four +bits are used for the mode information. Bits 0..2 specify the type of +the file; while bit 4 shows if the file is executable or not. The +permissions are assumed to be world readable, if this bit is not set, +and world executable if it is; except the character and block devices, +they are never accessible for other than owner. The owner of every +file is user and group 0, this should never be a problem for the +intended use. The mapping of the 8 possible values to file types is +the following: + +== =============== ============================================ + mapping spec.info means +== =============== ============================================ + 0 hard link link destination [file header] + 1 directory first file's header + 2 regular file unused, must be zero [MBZ] + 3 symbolic link unused, MBZ (file data is the link content) + 4 block device 16/16 bits major/minor number + 5 char device - " - + 6 socket unused, MBZ + 7 fifo unused, MBZ +== =============== ============================================ + +Note that hard links are specifically marked in this filesystem, but +they will behave as you can expect (i.e. share the inode number). 
+Note also that it is your responsibility to not create hard link +loops, and creating all the . and .. links for directories. This is +normally done correctly by the genromfs program. Please refrain from +using the executable bits for special purposes on the socket and fifo +special files, they may have other uses in the future. Additionally, +please remember that only regular files, and symlinks are supposed to +have a nonzero size field; they contain the number of bytes available +directly after the (padded) file name. + +Another thing to note is that romfs works on file headers and data +aligned to 16 byte boundaries, but most hardware devices and the block +device drivers are unable to cope with smaller than block-sized data. +To overcome this limitation, the whole size of the file system must be +padded to an 1024 byte boundary. + +If you have any problems or suggestions concerning this file system, +please contact me. However, think twice before wanting me to add +features and code, because the primary and most important advantage of +this file system is the small code. On the other hand, don't be +alarmed, I'm not getting that much romfs related mail. Now I can +understand why Avery wrote poems in the ARCnet docs to get some more +feedback. :) + +romfs has also a mailing list, and to date, it hasn't received any +traffic, so you are welcome to join it to discuss your ideas. :) + +It's run by ezmlm, so you can subscribe to it by sending a message +to romfs-subscribe@shadow.banki.hu, the content is irrelevant. + +Pending issues: + +- Permissions and owner information are pretty essential features of a + Un*x like system, but romfs does not provide the full possibilities. + I have never found this limiting, but others might. + +- The file system is read only, so it can be very small, but in case + one would want to write _anything_ to a file system, he still needs + a writable file system, thus negating the size advantages. 
Possible + solutions: implement write access as a compile-time option, or a new, + similarly small writable filesystem for RAM disks. + +- Since the files are only required to have alignment on a 16 byte + boundary, it is currently possibly suboptimal to read or execute files + from the filesystem. It might be resolved by reordering file data to + have most of it (i.e. except the start and the end) laying at "natural" + boundaries, thus it would be possible to directly map a big portion of + the file contents to the mm subsystem. + +- Compression might be an useful feature, but memory is quite a + limiting factor in my eyes. + +- Where it is used? + +- Does it work on other architectures than intel and motorola? + + +Have fun, + +Janos Farkas diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt deleted file mode 100644 index e2b07cc9120a..000000000000 --- a/Documentation/filesystems/romfs.txt +++ /dev/null @@ -1,186 +0,0 @@ -ROMFS - ROM FILE SYSTEM - -This is a quite dumb, read only filesystem, mainly for initial RAM -disks of installation disks. It has grown up by the need of having -modules linked at boot time. Using this filesystem, you get a very -similar feature, and even the possibility of a small kernel, with a -file system which doesn't take up useful memory from the router -functions in the basement of your office. - -For comparison, both the older minix and xiafs (the latter is now -defunct) filesystems, compiled as module need more than 20000 bytes, -while romfs is less than a page, about 4000 bytes (assuming i586 -code). Under the same conditions, the msdos filesystem would need -about 30K (and does not support device nodes or symlinks), while the -nfs module with nfsroot is about 57K. Furthermore, as a bit unfair -comparison, an actual rescue disk used up 3202 blocks with ext2, while -with romfs, it needed 3079 blocks. - -To create such a file system, you'll need a user program named -genromfs. 
It is available on http://romfs.sourceforge.net/ - -As the name suggests, romfs could be also used (space-efficiently) on -various read-only media, like (E)EPROM disks if someone will have the -motivation.. :) - -However, the main purpose of romfs is to have a very small kernel, -which has only this filesystem linked in, and then can load any module -later, with the current module utilities. It can also be used to run -some program to decide if you need SCSI devices, and even IDE or -floppy drives can be loaded later if you use the "initrd"--initial -RAM disk--feature of the kernel. This would not be really news -flash, but with romfs, you can even spare off your ext2 or minix or -maybe even affs filesystem until you really know that you need it. - -For example, a distribution boot disk can contain only the cd disk -drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem -module. The kernel can be small enough, since it doesn't have other -filesystems, like the quite large ext2fs module, which can then be -loaded off the CD at a later stage of the installation. Another use -would be for a recovery disk, when you are reinstalling a workstation -from the network, and you will have all the tools/modules available -from a nearby server, so you don't want to carry two disks for this -purpose, just because it won't fit into ext2. - -romfs operates on block devices as you can expect, and the underlying -structure is very simple. Every accessible structure begins on 16 -byte boundaries for fast access. The minimum space a file will take -is 32 bytes (this is an empty file, with a less than 16 character -name). The maximum overhead for any non-empty file is the header, and -the 16 byte padding for the name and the contents, also 16+14+15 = 45 -bytes. This is quite rare however, since most file names are longer -than 3 bytes, and shorter than 15 bytes. 
- -The layout of the filesystem is the following: - -offset content - - +---+---+---+---+ - 0 | - | r | o | m | \ - +---+---+---+---+ The ASCII representation of those bytes - 4 | 1 | f | s | - | / (i.e. "-rom1fs-") - +---+---+---+---+ - 8 | full size | The number of accessible bytes in this fs. - +---+---+---+---+ - 12 | checksum | The checksum of the FIRST 512 BYTES. - +---+---+---+---+ - 16 | volume name | The zero terminated name of the volume, - : : padded to 16 byte boundary. - +---+---+---+---+ - xx | file | - : headers : - -Every multi byte value (32 bit words, I'll use the longwords term from -now on) must be in big endian order. - -The first eight bytes identify the filesystem, even for the casual -inspector. After that, in the 3rd longword, it contains the number of -bytes accessible from the start of this filesystem. The 4th longword -is the checksum of the first 512 bytes (or the number of bytes -accessible, whichever is smaller). The applied algorithm is the same -as in the AFFS filesystem, namely a simple sum of the longwords -(assuming bigendian quantities again). For details, please consult -the source. This algorithm was chosen because although it's not quite -reliable, it does not require any tables, and it is very simple. - -The following bytes are now part of the file system; each file header -must begin on a 16 byte boundary. 
- -offset content - - +---+---+---+---+ - 0 | next filehdr|X| The offset of the next file header - +---+---+---+---+ (zero if no more files) - 4 | spec.info | Info for directories/hard links/devices - +---+---+---+---+ - 8 | size | The size of this file in bytes - +---+---+---+---+ - 12 | checksum | Covering the meta data, including the file - +---+---+---+---+ name, and padding - 16 | file name | The zero terminated name of the file, - : : padded to 16 byte boundary - +---+---+---+---+ - xx | file data | - : : - -Since the file headers begin always at a 16 byte boundary, the lowest -4 bits would be always zero in the next filehdr pointer. These four -bits are used for the mode information. Bits 0..2 specify the type of -the file; while bit 4 shows if the file is executable or not. The -permissions are assumed to be world readable, if this bit is not set, -and world executable if it is; except the character and block devices, -they are never accessible for other than owner. The owner of every -file is user and group 0, this should never be a problem for the -intended use. The mapping of the 8 possible values to file types is -the following: - - mapping spec.info means - 0 hard link link destination [file header] - 1 directory first file's header - 2 regular file unused, must be zero [MBZ] - 3 symbolic link unused, MBZ (file data is the link content) - 4 block device 16/16 bits major/minor number - 5 char device - " - - 6 socket unused, MBZ - 7 fifo unused, MBZ - -Note that hard links are specifically marked in this filesystem, but -they will behave as you can expect (i.e. share the inode number). -Note also that it is your responsibility to not create hard link -loops, and creating all the . and .. links for directories. This is -normally done correctly by the genromfs program. Please refrain from -using the executable bits for special purposes on the socket and fifo -special files, they may have other uses in the future. 
Additionally, -please remember that only regular files, and symlinks are supposed to -have a nonzero size field; they contain the number of bytes available -directly after the (padded) file name. - -Another thing to note is that romfs works on file headers and data -aligned to 16 byte boundaries, but most hardware devices and the block -device drivers are unable to cope with smaller than block-sized data. -To overcome this limitation, the whole size of the file system must be -padded to an 1024 byte boundary. - -If you have any problems or suggestions concerning this file system, -please contact me. However, think twice before wanting me to add -features and code, because the primary and most important advantage of -this file system is the small code. On the other hand, don't be -alarmed, I'm not getting that much romfs related mail. Now I can -understand why Avery wrote poems in the ARCnet docs to get some more -feedback. :) - -romfs has also a mailing list, and to date, it hasn't received any -traffic, so you are welcome to join it to discuss your ideas. :) - -It's run by ezmlm, so you can subscribe to it by sending a message -to romfs-subscribe@shadow.banki.hu, the content is irrelevant. - -Pending issues: - -- Permissions and owner information are pretty essential features of a -Un*x like system, but romfs does not provide the full possibilities. -I have never found this limiting, but others might. - -- The file system is read only, so it can be very small, but in case -one would want to write _anything_ to a file system, he still needs -a writable file system, thus negating the size advantages. Possible -solutions: implement write access as a compile-time option, or a new, -similarly small writable filesystem for RAM disks. - -- Since the files are only required to have alignment on a 16 byte -boundary, it is currently possibly suboptimal to read or execute files -from the filesystem. It might be resolved by reordering file data to -have most of it (i.e. 
except the start and the end) laying at "natural" -boundaries, thus it would be possible to directly map a big portion of -the file contents to the mm subsystem. - -- Compression might be an useful feature, but memory is quite a -limiting factor in my eyes. - -- Where it is used? - -- Does it work on other architectures than intel and motorola? - - -Have fun, -Janos Farkas -- cgit From 31771f45c8e46d9356f1a58329f5cd40ab331e1a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:23 +0100 Subject: docs: filesystems: convert squashfs.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/cec30862c7ee7de7f9cd903e35e6c8bf74cc928a.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/squashfs.rst | 265 +++++++++++++++++++++++++++++++++ Documentation/filesystems/squashfs.txt | 259 -------------------------------- 3 files changed, 266 insertions(+), 259 deletions(-) create mode 100644 Documentation/filesystems/squashfs.rst delete mode 100644 Documentation/filesystems/squashfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 3b26639517af..97a5f65ae509 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -86,5 +86,6 @@ Documentation for filesystem implementations. ramfs-rootfs-initramfs relay romfs + squashfs virtiofs vfat diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst new file mode 100644 index 000000000000..df42106bae71 --- /dev/null +++ b/Documentation/filesystems/squashfs.rst @@ -0,0 +1,265 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +======================= +Squashfs 4.0 Filesystem +======================= + +Squashfs is a compressed read-only filesystem for Linux. + +It uses zlib, lz4, lzo, or xz compression to compress files, inodes and +directories. Inodes in the system are very small and all blocks are packed to +minimise data overhead. Block sizes greater than 4K are supported up to a +maximum of 1Mbytes (default block size 128K). + +Squashfs is intended for general read-only filesystem use, for archival +use (i.e. in cases where a .tar.gz file may be used), and in constrained +block device/memory systems (e.g. embedded systems) where low overhead is +needed. + +Mailing list: squashfs-devel@lists.sourceforge.net +Web site: www.squashfs.org + +1. Filesystem Features +---------------------- + +Squashfs filesystem features versus Cramfs: + +============================== ========= ========== + Squashfs Cramfs +============================== ========= ========== +Max filesystem size 2^64 256 MiB +Max file size ~ 2 TiB 16 MiB +Max files unlimited unlimited +Max directories unlimited unlimited +Max entries per directory unlimited unlimited +Max block size 1 MiB 4 KiB +Metadata compression yes no +Directory indexes yes no +Sparse file support yes no +Tail-end packing (fragments) yes no +Exportable (NFS etc.) yes no +Hard link support yes no +"." and ".." in readdir yes no +Real inode numbers yes no +32-bit uids/gids yes no +File creation time yes no +Xattr support yes no +ACL support no no +============================== ========= ========== + +Squashfs compresses data, inodes and directories. In addition, inode and +directory data are highly compacted, and packed on byte boundaries. Each +compressed inode is on average 8 bytes in length (the exact length varies on +file type, i.e. regular file, directory, symbolic link, and block/char device +inodes have different sizes). + +2. 
Using Squashfs +----------------- + +As squashfs is a read-only filesystem, the mksquashfs program must be used to +create populated squashfs filesystems. This and other squashfs utilities +can be obtained from http://www.squashfs.org. Usage instructions can be +obtained from this site also. + +The squashfs-tools development tree is now located on kernel.org + git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git + +3. Squashfs Filesystem Design +----------------------------- + +A squashfs filesystem consists of a maximum of nine parts, packed together on a +byte alignment:: + + --------------- + | superblock | + |---------------| + | compression | + | options | + |---------------| + | datablocks | + | & fragments | + |---------------| + | inode table | + |---------------| + | directory | + | table | + |---------------| + | fragment | + | table | + |---------------| + | export | + | table | + |---------------| + | uid/gid | + | lookup table | + |---------------| + | xattr | + | table | + --------------- + +Compressed data blocks are written to the filesystem as files are read from +the source directory, and checked for duplicates. Once all file data has been +written the completed inode, directory, fragment, export, uid/gid lookup and +xattr tables are written. + +3.1 Compression options +----------------------- + +Compressors can optionally support compression specific options (e.g. +dictionary size). If non-default compression options have been used, then +these are stored here. + +3.2 Inodes +---------- + +Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each +compressed block is prefixed by a two byte length, the top bit is set if the +block is uncompressed. A block will be uncompressed if the -noI option is set, +or if the compressed block was larger than the uncompressed block. + +Inodes are packed into the metadata blocks, and are not aligned to block +boundaries, therefore inodes overlap compressed blocks. 
Inodes are identified +by a 48-bit number which encodes the location of the compressed metadata block +containing the inode, and the byte offset into that block where the inode is +placed (<block, offset>). + +To maximise compression there are different inodes for each file type +(regular file, directory, device, etc.), the inode contents and length +varying with the type. + +To further maximise compression, two types of regular file inode and +directory inode are defined: inodes optimised for frequently occurring +regular files and directories, and extended types where extra +information has to be stored. + +3.3 Directories +--------------- + +Like inodes, directories are packed into compressed metadata blocks, stored +in a directory table. Directories are accessed using the start address of +the metablock containing the directory and the offset into the +decompressed block (<block, offset>). + +Directories are organised in a slightly complex way, and are not simply +a list of file names. The organisation takes advantage of the +fact that (in most cases) the inodes of the files will be in the same +compressed metadata block, and therefore, can share the start block. +Directories are therefore organised in a two level list, a directory +header containing the shared start block value, and a sequence of directory +entries, each of which share the shared start block. A new directory header +is written once/if the inode start block changes. The directory +header/directory entry list is repeated as many times as necessary. + +Directories are sorted, and can contain a directory index to speed up +file lookup. Directory indexes store one entry per metablock, each entry +storing the index/filename mapping to the first directory header +in each metadata block. Directories are sorted in alphabetical order, +and at lookup the index is scanned linearly looking for the first filename +alphabetically larger than the filename being looked up.
At this point the +location of the metadata block the filename is in has been found. +The general idea of the index is to ensure only one metadata block needs to be +decompressed to do a lookup irrespective of the length of the directory. +This scheme has the advantage that it doesn't require extra memory overhead +and doesn't require much extra storage on disk. + +3.4 File data +------------- + +Regular files consist of a sequence of contiguous compressed blocks, and/or a +compressed fragment block (tail-end packed block). The compressed size +of each datablock is stored in a block list contained within the +file inode. + +To speed up access to datablocks when reading 'large' files (256 Mbytes or +larger), the code implements an index cache that caches the mapping from +block index to datablock location on disk. + +The index cache allows Squashfs to handle large files (up to 1.75 TiB) while +retaining a simple and space-efficient block list on disk. The cache +is split into slots, caching up to eight 224 GiB files (128 KiB blocks). +Larger files use multiple slots, with 1.75 TiB files using all 8 slots. +The index cache is designed to be memory efficient, and by default uses +16 KiB. + +3.5 Fragment lookup table +------------------------- + +Regular files can contain a fragment index which is mapped to a fragment +location on disk and compressed size using a fragment lookup table. This +fragment lookup table is itself stored compressed into metadata blocks. +A second index table is used to locate these. This second index table for +speed of access (and because it is small) is read at mount time and cached +in memory. + +3.6 Uid/gid lookup table +------------------------ + +For space efficiency regular files store uid and gid indexes, which are +converted to 32-bit uids/gids using an id look up table. This table is +stored compressed into metadata blocks. A second index table is used to +locate these. 
This second index table for speed of access (and because it +is small) is read at mount time and cached in memory. + +3.7 Export table +---------------- + +To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems +can optionally (disabled with the -no-exports Mksquashfs option) contain +an inode number to inode disk location lookup table. This is required to +enable Squashfs to map inode numbers passed in filehandles to the inode +location on disk, which is necessary when the export code reinstantiates +expired/flushed inodes. + +This table is stored compressed into metadata blocks. A second index table is +used to locate these. This second index table for speed of access (and because +it is small) is read at mount time and cached in memory. + +3.8 Xattr table +--------------- + +The xattr table contains extended attributes for each inode. The xattrs +for each inode are stored in a list, each list entry containing a type, +name and value field. The type field encodes the xattr prefix +("user.", "trusted." etc) and it also encodes how the name/value fields +should be interpreted. Currently the type indicates whether the value +is stored inline (in which case the value field contains the xattr value), +or if it is stored out of line (in which case the value field stores a +reference to where the actual value is stored). This allows large values +to be stored out of line improving scanning and lookup performance and it +also allows values to be de-duplicated, the value being stored once, and +all other occurrences holding an out of line reference to that value. + +The xattr lists are packed into compressed 8K metadata blocks. +To reduce overhead in inodes, rather than storing the on-disk +location of the xattr list inside each inode, a 32-bit xattr id +is stored. This xattr id is mapped into the location of the xattr +list using a second xattr id lookup table. + +4. 
TODOs and Outstanding Issues +------------------------------- + +4.1 TODO list +------------- + +Implement ACL support. + +4.2 Squashfs Internal Cache +--------------------------- + +Blocks in Squashfs are compressed. To avoid repeatedly decompressing +recently accessed data Squashfs uses two small metadata and fragment caches. + +The cache is not used for file datablocks, these are decompressed and cached in +the page-cache in the normal way. The cache is used to temporarily cache +fragment and metadata blocks which have been read as a result of a metadata +(i.e. inode or directory) or fragment access. Because metadata and fragments +are packed together into blocks (to gain greater compression) the read of a +particular piece of metadata or fragment will retrieve other metadata/fragments +which have been packed with it, these because of locality-of-reference may be +read in the near future. Temporarily caching them ensures they are available +for near future access without requiring an additional read and decompress. + +In the future this internal cache may be replaced with an implementation which +uses the kernel page cache. Because the page cache operates on page sized +units this may introduce additional complexity in terms of locking and +associated race conditions. diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt deleted file mode 100644 index e5274f84dc56..000000000000 --- a/Documentation/filesystems/squashfs.txt +++ /dev/null @@ -1,259 +0,0 @@ -SQUASHFS 4.0 FILESYSTEM -======================= - -Squashfs is a compressed read-only filesystem for Linux. -It uses zlib, lz4, lzo, or xz compression to compress files, inodes and -directories. Inodes in the system are very small and all blocks are packed to -minimise data overhead. Block sizes greater than 4K are supported up to a -maximum of 1Mbytes (default block size 128K). - -Squashfs is intended for general read-only filesystem use, for archival -use (i.e. 
in cases where a .tar.gz file may be used), and in constrained -block device/memory systems (e.g. embedded systems) where low overhead is -needed. - -Mailing list: squashfs-devel@lists.sourceforge.net -Web site: www.squashfs.org - -1. FILESYSTEM FEATURES ----------------------- - -Squashfs filesystem features versus Cramfs: - - Squashfs Cramfs - -Max filesystem size: 2^64 256 MiB -Max file size: ~ 2 TiB 16 MiB -Max files: unlimited unlimited -Max directories: unlimited unlimited -Max entries per directory: unlimited unlimited -Max block size: 1 MiB 4 KiB -Metadata compression: yes no -Directory indexes: yes no -Sparse file support: yes no -Tail-end packing (fragments): yes no -Exportable (NFS etc.): yes no -Hard link support: yes no -"." and ".." in readdir: yes no -Real inode numbers: yes no -32-bit uids/gids: yes no -File creation time: yes no -Xattr support: yes no -ACL support: no no - -Squashfs compresses data, inodes and directories. In addition, inode and -directory data are highly compacted, and packed on byte boundaries. Each -compressed inode is on average 8 bytes in length (the exact length varies on -file type, i.e. regular file, directory, symbolic link, and block/char device -inodes have different sizes). - -2. USING SQUASHFS ------------------ - -As squashfs is a read-only filesystem, the mksquashfs program must be used to -create populated squashfs filesystems. This and other squashfs utilities -can be obtained from http://www.squashfs.org. Usage instructions can be -obtained from this site also. - -The squashfs-tools development tree is now located on kernel.org - git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git - -3. 
SQUASHFS FILESYSTEM DESIGN ------------------------------ - -A squashfs filesystem consists of a maximum of nine parts, packed together on a -byte alignment: - - --------------- - | superblock | - |---------------| - | compression | - | options | - |---------------| - | datablocks | - | & fragments | - |---------------| - | inode table | - |---------------| - | directory | - | table | - |---------------| - | fragment | - | table | - |---------------| - | export | - | table | - |---------------| - | uid/gid | - | lookup table | - |---------------| - | xattr | - | table | - --------------- - -Compressed data blocks are written to the filesystem as files are read from -the source directory, and checked for duplicates. Once all file data has been -written the completed inode, directory, fragment, export, uid/gid lookup and -xattr tables are written. - -3.1 Compression options ------------------------ - -Compressors can optionally support compression specific options (e.g. -dictionary size). If non-default compression options have been used, then -these are stored here. - -3.2 Inodes ----------- - -Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each -compressed block is prefixed by a two byte length, the top bit is set if the -block is uncompressed. A block will be uncompressed if the -noI option is set, -or if the compressed block was larger than the uncompressed block. - -Inodes are packed into the metadata blocks, and are not aligned to block -boundaries, therefore inodes overlap compressed blocks. Inodes are identified -by a 48-bit number which encodes the location of the compressed metadata block -containing the inode, and the byte offset into that block where the inode is -placed (). - -To maximise compression there are different inodes for each file type -(regular file, directory, device, etc.), the inode contents and length -varying with the type. 
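The two encodings described in section 3.2 (the two-byte metadata block length whose top bit flags an uncompressed block, and the 48-bit inode number that combines a metadata block location with a byte offset) can be sketched in C as below. This is a minimal illustration, not the kernel's code: the helper names are hypothetical, the length is assumed to have already been converted to host byte order, and the split of the 48-bit number into an upper 32-bit block location and a lower 16-bit offset is an assumption that the text above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Top bit of the two-byte length: set when the block is stored uncompressed. */
#define META_UNCOMPRESSED_BIT 0x8000u

/* Hypothetical decoded form of the two-byte metadata block prefix. */
struct meta_header {
	bool compressed;   /* false when the top bit was set */
	uint16_t size;     /* on-disk size of the block that follows */
};

static struct meta_header decode_meta_header(uint16_t raw)
{
	struct meta_header h;

	h.compressed = (raw & META_UNCOMPRESSED_BIT) == 0;
	h.size = (uint16_t)(raw & 0x7fffu);
	return h;
}

/* Hypothetical decoded form of a 48-bit inode number. */
struct inode_ref {
	uint32_t block;    /* location of the compressed metadata block */
	uint16_t offset;   /* byte offset into the decompressed block */
};

/* Assumed layout: block location in the upper 32 bits, offset in the low 16. */
static uint64_t encode_inode_ref(uint32_t block, uint16_t offset)
{
	return ((uint64_t)block << 16) | offset;
}

static struct inode_ref decode_inode_ref(uint64_t ref)
{
	struct inode_ref r;

	assert((ref >> 48) == 0);          /* only 48 bits are meaningful */
	r.block = (uint32_t)(ref >> 16);
	r.offset = (uint16_t)(ref & 0xffffu);
	return r;
}
```

Since metadata blocks hold at most 8 Kbytes of uncompressed data, a 16-bit offset field is more than wide enough under this assumed layout.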
- -To further maximise compression, two types of regular file inode and -directory inode are defined: inodes optimised for frequently occurring -regular files and directories, and extended types where extra -information has to be stored. - -3.3 Directories ---------------- - -Like inodes, directories are packed into compressed metadata blocks, stored -in a directory table. Directories are accessed using the start address of -the metablock containing the directory and the offset into the -decompressed block (). - -Directories are organised in a slightly complex way, and are not simply -a list of file names. The organisation takes advantage of the -fact that (in most cases) the inodes of the files will be in the same -compressed metadata block, and therefore, can share the start block. -Directories are therefore organised in a two level list, a directory -header containing the shared start block value, and a sequence of directory -entries, each of which share the shared start block. A new directory header -is written once/if the inode start block changes. The directory -header/directory entry list is repeated as many times as necessary. - -Directories are sorted, and can contain a directory index to speed up -file lookup. Directory indexes store one entry per metablock, each entry -storing the index/filename mapping to the first directory header -in each metadata block. Directories are sorted in alphabetical order, -and at lookup the index is scanned linearly looking for the first filename -alphabetically larger than the filename being looked up. At this point the -location of the metadata block the filename is in has been found. -The general idea of the index is to ensure only one metadata block needs to be -decompressed to do a lookup irrespective of the length of the directory. -This scheme has the advantage that it doesn't require extra memory overhead -and doesn't require much extra storage on disk. 
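The linear index scan described in section 3.3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the on-disk format or the kernel's lookup code: the structure and function names are hypothetical, each index entry is reduced to the first filename in a metadata block plus that block's location, and "alphabetical order" is approximated with strcmp() byte ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one directory index entry. */
struct dir_index_entry {
	const char *first_name;    /* first filename stored in this metadata block */
	unsigned int start_block;  /* location of that metadata block */
};

/*
 * Scan the sorted index linearly and stop at the first entry whose
 * filename is alphabetically larger than the one being looked up.
 * The entry just before it identifies the only metadata block that
 * needs to be decompressed for the lookup.
 */
static const struct dir_index_entry *
dir_index_find(const struct dir_index_entry *index, size_t count,
	       const char *name)
{
	const struct dir_index_entry *found;
	size_t i;

	assert(index != NULL && count > 0);

	found = &index[0];
	for (i = 0; i < count; i++) {
		if (strcmp(index[i].first_name, name) > 0)
			break;
		found = &index[i];
	}
	return found;
}
```

A name that sorts before every index entry falls through to the first block, which is where it would have to be if it existed at all; the caller still has to search that block to confirm the entry is present.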
- -3.4 File data -------------- - -Regular files consist of a sequence of contiguous compressed blocks, and/or a -compressed fragment block (tail-end packed block). The compressed size -of each datablock is stored in a block list contained within the -file inode. - -To speed up access to datablocks when reading 'large' files (256 Mbytes or -larger), the code implements an index cache that caches the mapping from -block index to datablock location on disk. - -The index cache allows Squashfs to handle large files (up to 1.75 TiB) while -retaining a simple and space-efficient block list on disk. The cache -is split into slots, caching up to eight 224 GiB files (128 KiB blocks). -Larger files use multiple slots, with 1.75 TiB files using all 8 slots. -The index cache is designed to be memory efficient, and by default uses -16 KiB. - -3.5 Fragment lookup table -------------------------- - -Regular files can contain a fragment index which is mapped to a fragment -location on disk and compressed size using a fragment lookup table. This -fragment lookup table is itself stored compressed into metadata blocks. -A second index table is used to locate these. This second index table for -speed of access (and because it is small) is read at mount time and cached -in memory. - -3.6 Uid/gid lookup table ------------------------- - -For space efficiency regular files store uid and gid indexes, which are -converted to 32-bit uids/gids using an id look up table. This table is -stored compressed into metadata blocks. A second index table is used to -locate these. This second index table for speed of access (and because it -is small) is read at mount time and cached in memory. - -3.7 Export table ----------------- - -To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems -can optionally (disabled with the -no-exports Mksquashfs option) contain -an inode number to inode disk location lookup table. 
This is required to -enable Squashfs to map inode numbers passed in filehandles to the inode -location on disk, which is necessary when the export code reinstantiates -expired/flushed inodes. - -This table is stored compressed into metadata blocks. A second index table is -used to locate these. This second index table for speed of access (and because -it is small) is read at mount time and cached in memory. - -3.8 Xattr table ---------------- - -The xattr table contains extended attributes for each inode. The xattrs -for each inode are stored in a list, each list entry containing a type, -name and value field. The type field encodes the xattr prefix -("user.", "trusted." etc) and it also encodes how the name/value fields -should be interpreted. Currently the type indicates whether the value -is stored inline (in which case the value field contains the xattr value), -or if it is stored out of line (in which case the value field stores a -reference to where the actual value is stored). This allows large values -to be stored out of line improving scanning and lookup performance and it -also allows values to be de-duplicated, the value being stored once, and -all other occurrences holding an out of line reference to that value. - -The xattr lists are packed into compressed 8K metadata blocks. -To reduce overhead in inodes, rather than storing the on-disk -location of the xattr list inside each inode, a 32-bit xattr id -is stored. This xattr id is mapped into the location of the xattr -list using a second xattr id lookup table. - -4. TODOS AND OUTSTANDING ISSUES -------------------------------- - -4.1 Todo list -------------- - -Implement ACL support. - -4.2 Squashfs internal cache ---------------------------- - -Blocks in Squashfs are compressed. To avoid repeatedly decompressing -recently accessed data Squashfs uses two small metadata and fragment caches. 
- -The cache is not used for file datablocks, these are decompressed and cached in -the page-cache in the normal way. The cache is used to temporarily cache -fragment and metadata blocks which have been read as a result of a metadata -(i.e. inode or directory) or fragment access. Because metadata and fragments -are packed together into blocks (to gain greater compression) the read of a -particular piece of metadata or fragment will retrieve other metadata/fragments -which have been packed with it, these because of locality-of-reference may be -read in the near future. Temporarily caching them ensures they are available -for near future access without requiring an additional read and decompress. - -In the future this internal cache may be replaced with an implementation which -uses the kernel page cache. Because the page cache operates on page sized -units this may introduce additional complexity in terms of locking and -associated race conditions. -- cgit From 86beb976700b26576fe522a94a0b3a4e3d5ce424 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:24 +0100 Subject: docs: filesystems: convert sysfs.txt to ReST - Add a SPDX header; - Add a document title; - Adjust document and section titles; - use :field: markup; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/5c480dcb467315b5df6e25372a65e473b585c36d.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/sysfs.rst | 418 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/sysfs.txt | 408 ----------------------------------- 3 files changed, 419 insertions(+), 408 deletions(-) create mode 100644 Documentation/filesystems/sysfs.rst delete mode 100644 Documentation/filesystems/sysfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 97a5f65ae509..bafe92c72433 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -87,5 +87,6 @@ Documentation for filesystem implementations. relay romfs squashfs + sysfs virtiofs vfat diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst new file mode 100644 index 000000000000..290891c3fecb --- /dev/null +++ b/Documentation/filesystems/sysfs.rst @@ -0,0 +1,418 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================================== +sysfs - _The_ filesystem for exporting kernel objects +===================================================== + +Patrick Mochel + +Mike Murphy + +:Revised: 16 August 2011 +:Original: 10 January 2003 + + +What it is: +~~~~~~~~~~~ + +sysfs is a ram-based filesystem initially based on ramfs. It provides +a means to export kernel data structures, their attributes, and the +linkages between them to userspace. + +sysfs is tied inherently to the kobject infrastructure. Please read +Documentation/kobject.txt for more information concerning the kobject +interface. + + +Using sysfs +~~~~~~~~~~~ + +sysfs is always compiled in if CONFIG_SYSFS is defined. 
You can access +it by doing:: + + mount -t sysfs sysfs /sys + + +Directory Creation +~~~~~~~~~~~~~~~~~~ + +For every kobject that is registered with the system, a directory is +created for it in sysfs. That directory is created as a subdirectory +of the kobject's parent, expressing internal object hierarchies to +userspace. Top-level directories in sysfs represent the common +ancestors of object hierarchies; i.e. the subsystems the objects +belong to. + +Sysfs internally stores a pointer to the kobject that implements a +directory in the kernfs_node object associated with the directory. In +the past this kobject pointer has been used by sysfs to do reference +counting directly on the kobject whenever the file is opened or closed. +With the current sysfs implementation the kobject reference count is +only modified directly by the function sysfs_schedule_callback(). + + +Attributes +~~~~~~~~~~ + +Attributes can be exported for kobjects in the form of regular files in +the filesystem. Sysfs forwards file I/O operations to methods defined +for the attributes, providing a means to read and write kernel +attributes. + +Attributes should be ASCII text files, preferably with only one value +per file. It is noted that it may not be efficient to contain only one +value per file, so it is socially acceptable to express an array of +values of the same type. + +Mixing types, expressing multiple lines of data, and doing fancy +formatting of data is heavily frowned upon. Doing these things may get +you publicly humiliated and your code rewritten without notice. + + +An attribute definition is simply:: + + struct attribute { + char * name; + struct module *owner; + umode_t mode; + }; + + + int sysfs_create_file(struct kobject * kobj, const struct attribute * attr); + void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr); + + +A bare attribute contains no means to read or write the value of the +attribute. 
Subsystems are encouraged to define their own attribute +structure and wrapper functions for adding and removing attributes for +a specific object type. + +For example, the driver model defines struct device_attribute like:: + + struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + }; + + int device_create_file(struct device *, const struct device_attribute *); + void device_remove_file(struct device *, const struct device_attribute *); + +It also defines this helper for defining device attributes:: + + #define DEVICE_ATTR(_name, _mode, _show, _store) \ + struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store) + +For example, declaring:: + + static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); + +is equivalent to doing:: + + static struct device_attribute dev_attr_foo = { + .attr = { + .name = "foo", + .mode = S_IWUSR | S_IRUGO, + }, + .show = show_foo, + .store = store_foo, + }; + +Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally +considered a bad idea." so trying to set a sysfs file writable for +everyone will fail reverting to RO mode for "Others". + +For the common cases sysfs.h provides convenience macros to make +defining attributes easier as well as making code more concise and +readable. The above case could be shortened to: + +static struct device_attribute dev_attr_foo = __ATTR_RW(foo); + +the list of helpers available to define your wrapper function is: + +__ATTR_RO(name): + assumes default name_show and mode 0444 +__ATTR_WO(name): + assumes a name_store only and is restricted to mode + 0200 that is root write access only. 
+__ATTR_RO_MODE(name, mode): + for more restrictive RO access; currently the + only use case is the EFI System Resource Table + (see drivers/firmware/efi/esrt.c) +__ATTR_RW(name): + assumes default name_show, name_store and setting + mode to 0644. +__ATTR_NULL: + which sets the name to NULL and is used as end of list + indicator (see: kernel/workqueue.c) + +Subsystem-Specific Callbacks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When a subsystem defines a new attribute type, it must implement a +set of sysfs operations for forwarding read and write calls to the +show and store methods of the attribute owners:: + + struct sysfs_ops { + ssize_t (*show)(struct kobject *, struct attribute *, char *); + ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t); + }; + +[ Subsystems should have already defined a struct kobj_type as a +descriptor for this type, which is where the sysfs_ops pointer is +stored. See the kobject documentation for more information. ] + +When a file is read or written, sysfs calls the appropriate method +for the type. The method then translates the generic struct kobject +and struct attribute pointers to the appropriate pointer types, and +calls the associated methods. + + +To illustrate:: + + #define to_dev(obj) container_of(obj, struct device, kobj) + #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) + + static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, + char *buf) + { + struct device_attribute *dev_attr = to_dev_attr(attr); + struct device *dev = to_dev(kobj); + ssize_t ret = -EIO; + + if (dev_attr->show) + ret = dev_attr->show(dev, dev_attr, buf); + if (ret >= (ssize_t)PAGE_SIZE) { + printk("dev_attr_show: %pS returned bad count\n", + dev_attr->show); + } + return ret; + } + + + +Reading/Writing Attribute Data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To read or write attributes, show() or store() methods must be +specified when declaring the attribute.
The method types should be as +simple as those defined for device attributes:: + + ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + +IOW, they should take only an object, an attribute, and a buffer as parameters. + + +sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the +method. Sysfs will call the method exactly once for each read or +write. This forces the following behavior on the method +implementations: + +- On read(2), the show() method should fill the entire buffer. + Recall that an attribute should only be exporting one value, or an + array of similar values, so this shouldn't be that expensive. + + This allows userspace to do partial reads and forward seeks + arbitrarily over the entire file at will. If userspace seeks back to + zero or does a pread(2) with an offset of '0' the show() method will + be called again, rearmed, to fill the buffer. + +- On write(2), sysfs expects the entire buffer to be passed during the + first write. Sysfs then passes the entire buffer to the store() method. + A terminating null is added after the data on stores. This makes + functions like sysfs_streq() safe to use. + + When writing sysfs files, userspace processes should first read the + entire file, modify the values it wishes to change, then write the + entire buffer back. + + Attribute method implementations should operate on an identical + buffer when reading and writing values. + +Other notes: + +- Writing causes the show() method to be rearmed regardless of current + file position. + +- The buffer will always be PAGE_SIZE bytes in length. On i386, this + is 4096. + +- show() methods should return the number of bytes printed into the + buffer. This is the return value of scnprintf(). + +- show() must not use snprintf() when formatting the value to be + returned to user space. 
If you can guarantee that an overflow + will never happen, you can use sprintf(); otherwise you must use + scnprintf(). + +- store() should return the number of bytes used from the buffer. If the + entire buffer has been used, just return the count argument. + +- show() or store() can always return errors. If a bad value comes + through, be sure to return an error. + +- The object passed to the methods will be pinned in memory via sysfs + reference counting its embedded object. However, the physical + entity (e.g. device) the object represents may not be present. Be + sure to have a way to check this, if necessary. + + +A very simple (and naive) implementation of a device attribute is:: + + static ssize_t show_name(struct device *dev, struct device_attribute *attr, + char *buf) + { + return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name); + } + + static ssize_t store_name(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) + { + snprintf(dev->name, sizeof(dev->name), "%.*s", + (int)min(count, sizeof(dev->name) - 1), buf); + return count; + } + + static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); + + +(Note that the real implementation doesn't allow userspace to set the +name for a device.) + + +Top Level Directory Layout +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The sysfs directory arrangement exposes the relationship of kernel +data structures. + +The top level sysfs directory looks like:: + + block/ + bus/ + class/ + dev/ + devices/ + firmware/ + net/ + fs/ + +devices/ contains a filesystem representation of the device tree. It maps +directly to the internal kernel device tree, which is a hierarchy of +struct device. + +bus/ contains a flat directory layout of the various bus types in the +kernel. Each bus's directory contains two subdirectories:: + + devices/ + drivers/ + +devices/ contains symlinks for each device discovered in the system +that point to the device's directory under root/.
+ +drivers/ contains a directory for each device driver that is loaded +for devices on that particular bus (this assumes that drivers do not +span multiple bus types). + +fs/ contains a directory for some filesystems. Currently each +filesystem wanting to export attributes must create its own hierarchy +below fs/ (see ./fuse.txt for an example). + +dev/ contains two directories char/ and block/. Inside these two +directories there are symlinks named <major>:<minor>. These symlinks +point to the sysfs directory for the given device. /sys/dev provides a +quick way to look up the sysfs interface for a device from the result of +a stat(2) operation. + +More information on driver-model specific features can be found in +Documentation/driver-api/driver-model/. + + +TODO: Finish this section. + + +Current Interfaces +~~~~~~~~~~~~~~~~~~ + +The following interface layers currently exist in sysfs: + + +devices (include/linux/device.h) +-------------------------------- +Structure:: + + struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + }; + +Declaring:: + + DEVICE_ATTR(_name, _mode, _show, _store); + +Creation/Removal:: + + int device_create_file(struct device *dev, const struct device_attribute * attr); + void device_remove_file(struct device *dev, const struct device_attribute * attr); + + +bus drivers (include/linux/device.h) +------------------------------------ +Structure:: + + struct bus_attribute { + struct attribute attr; + ssize_t (*show)(struct bus_type *, char * buf); + ssize_t (*store)(struct bus_type *, const char * buf, size_t count); + }; + +Declaring:: + + static BUS_ATTR_RW(name); + static BUS_ATTR_RO(name); + static BUS_ATTR_WO(name); + +Creation/Removal:: + + int bus_create_file(struct bus_type *, struct bus_attribute *); + void bus_remove_file(struct bus_type *, struct bus_attribute
*); + + +device drivers (include/linux/device.h) +--------------------------------------- + +Structure:: + + struct driver_attribute { + struct attribute attr; + ssize_t (*show)(struct device_driver *, char * buf); + ssize_t (*store)(struct device_driver *, const char * buf, + size_t count); + }; + +Declaring:: + + DRIVER_ATTR_RO(_name) + DRIVER_ATTR_RW(_name) + +Creation/Removal:: + + int driver_create_file(struct device_driver *, const struct driver_attribute *); + void driver_remove_file(struct device_driver *, const struct driver_attribute *); + + +Documentation +~~~~~~~~~~~~~ + +The sysfs directory structure and the attributes in each directory define an +ABI between the kernel and user space. As for any ABI, it is important that +this ABI is stable and properly documented. All new sysfs attributes must be +documented in Documentation/ABI. See also Documentation/ABI/README for more +information. diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt deleted file mode 100644 index ddf15b1b0d5a..000000000000 --- a/Documentation/filesystems/sysfs.txt +++ /dev/null @@ -1,408 +0,0 @@ - -sysfs - _The_ filesystem for exporting kernel objects. - -Patrick Mochel -Mike Murphy - -Revised: 16 August 2011 -Original: 10 January 2003 - - -What it is: -~~~~~~~~~~~ - -sysfs is a ram-based filesystem initially based on ramfs. It provides -a means to export kernel data structures, their attributes, and the -linkages between them to userspace. - -sysfs is tied inherently to the kobject infrastructure. Please read -Documentation/kobject.txt for more information concerning the kobject -interface. - - -Using sysfs -~~~~~~~~~~~ - -sysfs is always compiled in if CONFIG_SYSFS is defined. You can access -it by doing: - - mount -t sysfs sysfs /sys - - -Directory Creation -~~~~~~~~~~~~~~~~~~ - -For every kobject that is registered with the system, a directory is -created for it in sysfs. 
That directory is created as a subdirectory -of the kobject's parent, expressing internal object hierarchies to -userspace. Top-level directories in sysfs represent the common -ancestors of object hierarchies; i.e. the subsystems the objects -belong to. - -Sysfs internally stores a pointer to the kobject that implements a -directory in the kernfs_node object associated with the directory. In -the past this kobject pointer has been used by sysfs to do reference -counting directly on the kobject whenever the file is opened or closed. -With the current sysfs implementation the kobject reference count is -only modified directly by the function sysfs_schedule_callback(). - - -Attributes -~~~~~~~~~~ - -Attributes can be exported for kobjects in the form of regular files in -the filesystem. Sysfs forwards file I/O operations to methods defined -for the attributes, providing a means to read and write kernel -attributes. - -Attributes should be ASCII text files, preferably with only one value -per file. It is noted that it may not be efficient to contain only one -value per file, so it is socially acceptable to express an array of -values of the same type. - -Mixing types, expressing multiple lines of data, and doing fancy -formatting of data is heavily frowned upon. Doing these things may get -you publicly humiliated and your code rewritten without notice. - - -An attribute definition is simply: - -struct attribute { - char * name; - struct module *owner; - umode_t mode; -}; - - -int sysfs_create_file(struct kobject * kobj, const struct attribute * attr); -void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr); - - -A bare attribute contains no means to read or write the value of the -attribute. Subsystems are encouraged to define their own attribute -structure and wrapper functions for adding and removing attributes for -a specific object type. 
- -For example, the driver model defines struct device_attribute like: - -struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device *dev, struct device_attribute *attr, - char *buf); - ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); -}; - -int device_create_file(struct device *, const struct device_attribute *); -void device_remove_file(struct device *, const struct device_attribute *); - -It also defines this helper for defining device attributes: - -#define DEVICE_ATTR(_name, _mode, _show, _store) \ -struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store) - -For example, declaring - -static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); - -is equivalent to doing: - -static struct device_attribute dev_attr_foo = { - .attr = { - .name = "foo", - .mode = S_IWUSR | S_IRUGO, - }, - .show = show_foo, - .store = store_foo, -}; - -Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally -considered a bad idea." so trying to set a sysfs file writable for -everyone will fail reverting to RO mode for "Others". - -For the common cases sysfs.h provides convenience macros to make -defining attributes easier as well as making code more concise and -readable. The above case could be shortened to: - -static struct device_attribute dev_attr_foo = __ATTR_RW(foo); - -the list of helpers available to define your wrapper function is: -__ATTR_RO(name): assumes default name_show and mode 0444 -__ATTR_WO(name): assumes a name_store only and is restricted to mode - 0200 that is root write access only. -__ATTR_RO_MODE(name, mode): fore more restrictive RO access currently - only use case is the EFI System Resource Table - (see drivers/firmware/efi/esrt.c) -__ATTR_RW(name): assumes default name_show, name_store and setting - mode to 0644. 
-__ATTR_NULL: which sets the name to NULL and is used as end of list - indicator (see: kernel/workqueue.c) - -Subsystem-Specific Callbacks -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When a subsystem defines a new attribute type, it must implement a -set of sysfs operations for forwarding read and write calls to the -show and store methods of the attribute owners. - -struct sysfs_ops { - ssize_t (*show)(struct kobject *, struct attribute *, char *); - ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t); -}; - -[ Subsystems should have already defined a struct kobj_type as a -descriptor for this type, which is where the sysfs_ops pointer is -stored. See the kobject documentation for more information. ] - -When a file is read or written, sysfs calls the appropriate method -for the type. The method then translates the generic struct kobject -and struct attribute pointers to the appropriate pointer types, and -calls the associated methods. - - -To illustrate: - -#define to_dev(obj) container_of(obj, struct device, kobj) -#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) - -static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, - char *buf) -{ - struct device_attribute *dev_attr = to_dev_attr(attr); - struct device *dev = to_dev(kobj); - ssize_t ret = -EIO; - - if (dev_attr->show) - ret = dev_attr->show(dev, dev_attr, buf); - if (ret >= (ssize_t)PAGE_SIZE) { - printk("dev_attr_show: %pS returned bad count\n", - dev_attr->show); - } - return ret; -} - - - -Reading/Writing Attribute Data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To read or write attributes, show() or store() methods must be -specified when declaring the attribute. 
The method types should be as -simple as those defined for device attributes: - -ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf); -ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); - -IOW, they should take only an object, an attribute, and a buffer as parameters. - - -sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the -method. Sysfs will call the method exactly once for each read or -write. This forces the following behavior on the method -implementations: - -- On read(2), the show() method should fill the entire buffer. - Recall that an attribute should only be exporting one value, or an - array of similar values, so this shouldn't be that expensive. - - This allows userspace to do partial reads and forward seeks - arbitrarily over the entire file at will. If userspace seeks back to - zero or does a pread(2) with an offset of '0' the show() method will - be called again, rearmed, to fill the buffer. - -- On write(2), sysfs expects the entire buffer to be passed during the - first write. Sysfs then passes the entire buffer to the store() method. - A terminating null is added after the data on stores. This makes - functions like sysfs_streq() safe to use. - - When writing sysfs files, userspace processes should first read the - entire file, modify the values it wishes to change, then write the - entire buffer back. - - Attribute method implementations should operate on an identical - buffer when reading and writing values. - -Other notes: - -- Writing causes the show() method to be rearmed regardless of current - file position. - -- The buffer will always be PAGE_SIZE bytes in length. On i386, this - is 4096. - -- show() methods should return the number of bytes printed into the - buffer. This is the return value of scnprintf(). - -- show() must not use snprintf() when formatting the value to be - returned to user space. 
If you can guarantee that an overflow - will never happen you can use sprintf() otherwise you must use - scnprintf(). - -- store() should return the number of bytes used from the buffer. If the - entire buffer has been used, just return the count argument. - -- show() or store() can always return errors. If a bad value comes - through, be sure to return an error. - -- The object passed to the methods will be pinned in memory via sysfs - referencing counting its embedded object. However, the physical - entity (e.g. device) the object represents may not be present. Be - sure to have a way to check this, if necessary. - - -A very simple (and naive) implementation of a device attribute is: - -static ssize_t show_name(struct device *dev, struct device_attribute *attr, - char *buf) -{ - return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name); -} - -static ssize_t store_name(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count) -{ - snprintf(dev->name, sizeof(dev->name), "%.*s", - (int)min(count, sizeof(dev->name) - 1), buf); - return count; -} - -static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); - - -(Note that the real implementation doesn't allow userspace to set the -name for a device.) - - -Top Level Directory Layout -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The sysfs directory arrangement exposes the relationship of kernel -data structures. - -The top level sysfs directory looks like: - -block/ -bus/ -class/ -dev/ -devices/ -firmware/ -net/ -fs/ - -devices/ contains a filesystem representation of the device tree. It maps -directly to the internal kernel device tree, which is a hierarchy of -struct device. - -bus/ contains flat directory layout of the various bus types in the -kernel. Each bus's directory contains two subdirectories: - - devices/ - drivers/ - -devices/ contains symlinks for each device discovered in the system -that point to the device's directory under root/. 
- -drivers/ contains a directory for each device driver that is loaded -for devices on that particular bus (this assumes that drivers do not -span multiple bus types). - -fs/ contains a directory for some filesystems. Currently each -filesystem wanting to export attributes must create its own hierarchy -below fs/ (see ./fuse.txt for an example). - -dev/ contains two directories char/ and block/. Inside these two -directories there are symlinks named :. These symlinks -point to the sysfs directory for the given device. /sys/dev provides a -quick way to lookup the sysfs interface for a device from the result of -a stat(2) operation. - -More information can driver-model specific features can be found in -Documentation/driver-api/driver-model/. - - -TODO: Finish this section. - - -Current Interfaces -~~~~~~~~~~~~~~~~~~ - -The following interface layers currently exist in sysfs: - - -- devices (include/linux/device.h) ----------------------------------- -Structure: - -struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device *dev, struct device_attribute *attr, - char *buf); - ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); -}; - -Declaring: - -DEVICE_ATTR(_name, _mode, _show, _store); - -Creation/Removal: - -int device_create_file(struct device *dev, const struct device_attribute * attr); -void device_remove_file(struct device *dev, const struct device_attribute * attr); - - -- bus drivers (include/linux/device.h) --------------------------------------- -Structure: - -struct bus_attribute { - struct attribute attr; - ssize_t (*show)(struct bus_type *, char * buf); - ssize_t (*store)(struct bus_type *, const char * buf, size_t count); -}; - -Declaring: - -static BUS_ATTR_RW(name); -static BUS_ATTR_RO(name); -static BUS_ATTR_WO(name); - -Creation/Removal: - -int bus_create_file(struct bus_type *, struct bus_attribute *); -void bus_remove_file(struct bus_type *, struct bus_attribute *); - - -- 
device drivers (include/linux/device.h) ------------------------------------------ - -Structure: - -struct driver_attribute { - struct attribute attr; - ssize_t (*show)(struct device_driver *, char * buf); - ssize_t (*store)(struct device_driver *, const char * buf, - size_t count); -}; - -Declaring: - -DRIVER_ATTR_RO(_name) -DRIVER_ATTR_RW(_name) - -Creation/Removal: - -int driver_create_file(struct device_driver *, const struct driver_attribute *); -void driver_remove_file(struct device_driver *, const struct driver_attribute *); - - -Documentation -~~~~~~~~~~~~~ - -The sysfs directory structure and the attributes in each directory define an -ABI between the kernel and user space. As for any ABI, it is important that -this ABI is stable and properly documented. All new sysfs attributes must be -documented in Documentation/ABI. See also Documentation/ABI/README for more -information. -- cgit From 826a613d3f81695022f324a5cb84fe73ec09e51d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:25 +0100 Subject: docs: filesystems: convert sysv-fs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/5b96a6efba95773af439ab25a7dbe4d0edf8c867.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/sysv-fs.rst | 264 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/sysv-fs.txt | 197 ------------------------- 3 files changed, 265 insertions(+), 197 deletions(-) create mode 100644 Documentation/filesystems/sysv-fs.rst delete mode 100644 Documentation/filesystems/sysv-fs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index bafe92c72433..d583b8b35196 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -88,5 +88,6 @@ Documentation for filesystem implementations. romfs squashfs sysfs + sysv-fs virtiofs vfat diff --git a/Documentation/filesystems/sysv-fs.rst b/Documentation/filesystems/sysv-fs.rst new file mode 100644 index 000000000000..89e40911ad7c --- /dev/null +++ b/Documentation/filesystems/sysv-fs.rst @@ -0,0 +1,264 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +SystemV Filesystem +================== + +It implements all of + - Xenix FS, + - SystemV/386 FS, + - Coherent FS. + +To install: + +* Answer the 'System V and Coherent filesystem support' question with 'y' + when configuring the kernel. +* To mount a disk or a partition, use:: + + mount [-r] -t sysv device mountpoint + + The file system type names:: + + -t sysv + -t xenix + -t coherent + + may be used interchangeably, but the last two will eventually disappear. + +Bugs in the present implementation: + +- Coherent FS: + + - The "free list interleave" n:m is currently ignored. + - Only file systems with no filesystem name and no pack name are recognized. + (See Coherent "man mkfs" for a description of these features.) 
+ +- SystemV Release 2 FS: + + The superblock is only searched in the blocks 9, 15, 18, which + corresponds to the beginning of track 1 on floppy disks. No support + for this FS on hard disk yet. + + +These filesystems are rather similar. Here is a comparison with Minix FS: + +* Linux fdisk reports on partitions + + - Minix FS 0x81 Linux/Minix + - Xenix FS ?? + - SystemV FS ?? + - Coherent FS 0x08 AIX bootable + +* Size of a block or zone (data allocation unit on disk) + + - Minix FS 1024 + - Xenix FS 1024 (also 512 ??) + - SystemV FS 1024 (also 512 and 2048) + - Coherent FS 512 + +* General layout: all have one boot block, one super block and + separate areas for inodes and for directories/data. + On SystemV Release 2 FS (e.g. Microport) the first track is reserved and + all the block numbers (including the super block) are offset by one track. + +* Byte ordering of "short" (16 bit entities) on disk: + + - Minix FS little endian 0 1 + - Xenix FS little endian 0 1 + - SystemV FS little endian 0 1 + - Coherent FS little endian 0 1 + + Of course, this affects only the file system, not the data of files on it! + +* Byte ordering of "long" (32 bit entities) on disk: + + - Minix FS little endian 0 1 2 3 + - Xenix FS little endian 0 1 2 3 + - SystemV FS little endian 0 1 2 3 + - Coherent FS PDP-11 2 3 0 1 + + Of course, this affects only the file system, not the data of files on it! + +* Inode on disk: "short", 0 means non-existent, the root dir ino is: + + ================================= == + Minix FS 1 + Xenix FS, SystemV FS, Coherent FS 2 + ================================= == + +* Maximum number of hard links to a file: + + =========== ========= + Minix FS 250 + Xenix FS ?? + SystemV FS ?? + Coherent FS >=10000 + =========== ========= + +* Free inode management: + + - Minix FS + a bitmap + - Xenix FS, SystemV FS, Coherent FS + There is a cache of a certain number of free inodes in the super-block. 
+ When it is exhausted, new free inodes are found using a linear search. + +* Free block management: + + - Minix FS + a bitmap + - Xenix FS, SystemV FS, Coherent FS + Free blocks are organized in a "free list". Maybe a misleading term, + since it is not true that every free block contains a pointer to + the next free block. Rather, the free blocks are organized in chunks + of limited size, and every now and then a free block contains pointers + to the free blocks pertaining to the next chunk; the first of these + contains pointers and so on. The list terminates with a "block number" + 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS. + +* Super-block location: + + =========== ========================== + Minix FS block 1 = bytes 1024..2047 + Xenix FS block 1 = bytes 1024..2047 + SystemV FS bytes 512..1023 + Coherent FS block 1 = bytes 512..1023 + =========== ========================== + +* Super-block layout: + + - Minix FS:: + + unsigned short s_ninodes; + unsigned short s_nzones; + unsigned short s_imap_blocks; + unsigned short s_zmap_blocks; + unsigned short s_firstdatazone; + unsigned short s_log_zone_size; + unsigned long s_max_size; + unsigned short s_magic; + + - Xenix FS, SystemV FS, Coherent FS:: + + unsigned short s_firstdatazone; + unsigned long s_nzones; + unsigned short s_fzone_count; + unsigned long s_fzones[NICFREE]; + unsigned short s_finode_count; + unsigned short s_finodes[NICINOD]; + char s_flock; + char s_ilock; + char s_modified; + char s_rdonly; + unsigned long s_time; + short s_dinfo[4]; -- SystemV FS only + unsigned long s_free_zones; + unsigned short s_free_inodes; + short s_dinfo[4]; -- Xenix FS only + unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only + char s_fname[6]; + char s_fpack[6]; + + then they differ considerably: + + Xenix FS:: + + char s_clean; + char s_fill[371]; + long s_magic; + long s_type; + + SystemV FS:: + + long s_fill[12 or 14]; + long s_state; + long s_magic; + long s_type; + + 
Coherent FS:: + + unsigned long s_unique; + + Note that Coherent FS has no magic. + +* Inode layout: + + - Minix FS:: + + unsigned short i_mode; + unsigned short i_uid; + unsigned long i_size; + unsigned long i_time; + unsigned char i_gid; + unsigned char i_nlinks; + unsigned short i_zone[7+1+1]; + + - Xenix FS, SystemV FS, Coherent FS:: + + unsigned short i_mode; + unsigned short i_nlink; + unsigned short i_uid; + unsigned short i_gid; + unsigned long i_size; + unsigned char i_zone[3*(10+1+1+1)]; + unsigned long i_atime; + unsigned long i_mtime; + unsigned long i_ctime; + + +* Regular file data blocks are organized as + + - Minix FS: + + - 7 direct blocks + - 1 indirect block (pointers to blocks) + - 1 double-indirect block (pointer to pointers to blocks) + + - Xenix FS, SystemV FS, Coherent FS: + + - 10 direct blocks + - 1 indirect block (pointers to blocks) + - 1 double-indirect block (pointer to pointers to blocks) + - 1 triple-indirect block (pointer to pointers to pointers to blocks) + + + =========== ========== ================ + Inode size inodes per block + =========== ========== ================ + Minix FS 32 32 + Xenix FS 64 16 + SystemV FS 64 16 + Coherent FS 64 8 + =========== ========== ================ + +* Directory entry on disk + + - Minix FS:: + + unsigned short inode; + char name[14/30]; + + - Xenix FS, SystemV FS, Coherent FS:: + + unsigned short inode; + char name[14]; + + =========== ============== ===================== + Dir entry size dir entries per block + =========== ============== ===================== + Minix FS 16/32 64/32 + Xenix FS 16 64 + SystemV FS 16 64 + Coherent FS 16 32 + =========== ============== ===================== + +* How to implement symbolic links such that the host fsck doesn't scream: + + - Minix FS normal + - Xenix FS kludge: as regular files with chmod 1000 + - SystemV FS ?? 
+ - Coherent FS kludge: as regular files with chmod 1000 + + +Notation: We often speak of a "block" but mean a zone (the allocation unit) +and not the disk driver's notion of "block". diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt deleted file mode 100644 index 253b50d1328e..000000000000 --- a/Documentation/filesystems/sysv-fs.txt +++ /dev/null @@ -1,197 +0,0 @@ -It implements all of - - Xenix FS, - - SystemV/386 FS, - - Coherent FS. - -To install: -* Answer the 'System V and Coherent filesystem support' question with 'y' - when configuring the kernel. -* To mount a disk or a partition, use - mount [-r] -t sysv device mountpoint - The file system type names - -t sysv - -t xenix - -t coherent - may be used interchangeably, but the last two will eventually disappear. - -Bugs in the present implementation: -- Coherent FS: - - The "free list interleave" n:m is currently ignored. - - Only file systems with no filesystem name and no pack name are recognized. - (See Coherent "man mkfs" for a description of these features.) -- SystemV Release 2 FS: - The superblock is only searched in the blocks 9, 15, 18, which - corresponds to the beginning of track 1 on floppy disks. No support - for this FS on hard disk yet. - - -These filesystems are rather similar. Here is a comparison with Minix FS: - -* Linux fdisk reports on partitions - - Minix FS 0x81 Linux/Minix - - Xenix FS ?? - - SystemV FS ?? - - Coherent FS 0x08 AIX bootable - -* Size of a block or zone (data allocation unit on disk) - - Minix FS 1024 - - Xenix FS 1024 (also 512 ??) - - SystemV FS 1024 (also 512 and 2048) - - Coherent FS 512 - -* General layout: all have one boot block, one super block and - separate areas for inodes and for directories/data. - On SystemV Release 2 FS (e.g. Microport) the first track is reserved and - all the block numbers (including the super block) are offset by one track. 
- -* Byte ordering of "short" (16 bit entities) on disk: - - Minix FS little endian 0 1 - - Xenix FS little endian 0 1 - - SystemV FS little endian 0 1 - - Coherent FS little endian 0 1 - Of course, this affects only the file system, not the data of files on it! - -* Byte ordering of "long" (32 bit entities) on disk: - - Minix FS little endian 0 1 2 3 - - Xenix FS little endian 0 1 2 3 - - SystemV FS little endian 0 1 2 3 - - Coherent FS PDP-11 2 3 0 1 - Of course, this affects only the file system, not the data of files on it! - -* Inode on disk: "short", 0 means non-existent, the root dir ino is: - - Minix FS 1 - - Xenix FS, SystemV FS, Coherent FS 2 - -* Maximum number of hard links to a file: - - Minix FS 250 - - Xenix FS ?? - - SystemV FS ?? - - Coherent FS >=10000 - -* Free inode management: - - Minix FS a bitmap - - Xenix FS, SystemV FS, Coherent FS - There is a cache of a certain number of free inodes in the super-block. - When it is exhausted, new free inodes are found using a linear search. - -* Free block management: - - Minix FS a bitmap - - Xenix FS, SystemV FS, Coherent FS - Free blocks are organized in a "free list". Maybe a misleading term, - since it is not true that every free block contains a pointer to - the next free block. Rather, the free blocks are organized in chunks - of limited size, and every now and then a free block contains pointers - to the free blocks pertaining to the next chunk; the first of these - contains pointers and so on. The list terminates with a "block number" - 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS. 
- -* Super-block location: - - Minix FS block 1 = bytes 1024..2047 - - Xenix FS block 1 = bytes 1024..2047 - - SystemV FS bytes 512..1023 - - Coherent FS block 1 = bytes 512..1023 - -* Super-block layout: - - Minix FS - unsigned short s_ninodes; - unsigned short s_nzones; - unsigned short s_imap_blocks; - unsigned short s_zmap_blocks; - unsigned short s_firstdatazone; - unsigned short s_log_zone_size; - unsigned long s_max_size; - unsigned short s_magic; - - Xenix FS, SystemV FS, Coherent FS - unsigned short s_firstdatazone; - unsigned long s_nzones; - unsigned short s_fzone_count; - unsigned long s_fzones[NICFREE]; - unsigned short s_finode_count; - unsigned short s_finodes[NICINOD]; - char s_flock; - char s_ilock; - char s_modified; - char s_rdonly; - unsigned long s_time; - short s_dinfo[4]; -- SystemV FS only - unsigned long s_free_zones; - unsigned short s_free_inodes; - short s_dinfo[4]; -- Xenix FS only - unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only - char s_fname[6]; - char s_fpack[6]; - then they differ considerably: - Xenix FS - char s_clean; - char s_fill[371]; - long s_magic; - long s_type; - SystemV FS - long s_fill[12 or 14]; - long s_state; - long s_magic; - long s_type; - Coherent FS - unsigned long s_unique; - Note that Coherent FS has no magic. 
- -* Inode layout: - - Minix FS - unsigned short i_mode; - unsigned short i_uid; - unsigned long i_size; - unsigned long i_time; - unsigned char i_gid; - unsigned char i_nlinks; - unsigned short i_zone[7+1+1]; - - Xenix FS, SystemV FS, Coherent FS - unsigned short i_mode; - unsigned short i_nlink; - unsigned short i_uid; - unsigned short i_gid; - unsigned long i_size; - unsigned char i_zone[3*(10+1+1+1)]; - unsigned long i_atime; - unsigned long i_mtime; - unsigned long i_ctime; - -* Regular file data blocks are organized as - - Minix FS - 7 direct blocks - 1 indirect block (pointers to blocks) - 1 double-indirect block (pointer to pointers to blocks) - - Xenix FS, SystemV FS, Coherent FS - 10 direct blocks - 1 indirect block (pointers to blocks) - 1 double-indirect block (pointer to pointers to blocks) - 1 triple-indirect block (pointer to pointers to pointers to blocks) - -* Inode size, inodes per block - - Minix FS 32 32 - - Xenix FS 64 16 - - SystemV FS 64 16 - - Coherent FS 64 8 - -* Directory entry on disk - - Minix FS - unsigned short inode; - char name[14/30]; - - Xenix FS, SystemV FS, Coherent FS - unsigned short inode; - char name[14]; - -* Dir entry size, dir entries per block - - Minix FS 16/32 64/32 - - Xenix FS 16 64 - - SystemV FS 16 64 - - Coherent FS 16 32 - -* How to implement symbolic links such that the host fsck doesn't scream: - - Minix FS normal - - Xenix FS kludge: as regular files with chmod 1000 - - SystemV FS ?? - - Coherent FS kludge: as regular files with chmod 1000 - - -Notation: We often speak of a "block" but mean a zone (the allocation unit) -and not the disk driver's notion of "block". 
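Note on the byte-ordering comparison in the sysv-fs patch above: the document lists 32-bit "long" values as little endian (0 1 2 3) on Minix, Xenix and SystemV FS, but as PDP-11 order (2 3 0 1) on Coherent FS. The following is a minimal sketch of what that difference means when decoding four on-disk bytes. The helper names get_le32() and get_pdp32() are hypothetical and chosen only for illustration; this is not the kernel's implementation, and it assumes the digits in the document denote byte significance (0 = least significant) listed in on-disk order.

```c
#include <stdint.h>

/*
 * Decode a 32-bit value whose on-disk byte order is "0 1 2 3"
 * (plain little endian): the least significant byte comes first.
 */
static uint32_t get_le32(const unsigned char d[4])
{
	return (uint32_t)d[0] |
	       (uint32_t)d[1] << 8 |
	       (uint32_t)d[2] << 16 |
	       (uint32_t)d[3] << 24;
}

/*
 * Decode a 32-bit value whose on-disk byte order is "2 3 0 1"
 * (PDP-11 order): the high 16-bit word is stored first, and each
 * 16-bit word is itself little endian.
 */
static uint32_t get_pdp32(const unsigned char d[4])
{
	return (uint32_t)d[2] |
	       (uint32_t)d[3] << 8 |
	       (uint32_t)d[0] << 16 |
	       (uint32_t)d[1] << 24;
}
```

Under these assumptions, the on-disk bytes 0x0B 0x0A 0x0D 0x0C decode to 0x0A0B0C0D with get_pdp32() and to 0x0C0D0A0B with get_le32(). 16-bit "short" values need no such distinction, since the document lists them as little endian on all four filesystems.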
-- cgit From 7e7cd458b8105b02e69e3af2ef4cd186326d7f84 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:26 +0100 Subject: docs: filesystems: convert tmpfs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Use :field: markup; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/30397a47a78ca59760fbc0fc5f50c5f1002d487a.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/tmpfs.rst | 163 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/tmpfs.txt | 149 -------------------------------- 3 files changed, 164 insertions(+), 149 deletions(-) create mode 100644 Documentation/filesystems/tmpfs.rst delete mode 100644 Documentation/filesystems/tmpfs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index d583b8b35196..27d37e7712da 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -89,5 +89,6 @@ Documentation for filesystem implementations. squashfs sysfs sysv-fs + tmpfs virtiofs vfat diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst new file mode 100644 index 000000000000..4e95929301a5 --- /dev/null +++ b/Documentation/filesystems/tmpfs.rst @@ -0,0 +1,163 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===== +Tmpfs +===== + +Tmpfs is a file system which keeps all files in virtual memory. + + +Everything in tmpfs is temporary in the sense that no files will be +created on your hard drive. If you unmount a tmpfs instance, +everything stored therein is lost. + +tmpfs puts everything into the kernel internal caches and grows and +shrinks to accommodate the files it contains and is able to swap +unneeded pages out to swap space. 
It has maximum size limits which can +be adjusted on the fly via 'mount -o remount ...' + +If you compare it to ramfs (which was the template to create tmpfs) +you gain swapping and limit checking. Another similar thing is the RAM +disk (/dev/ram*), which simulates a fixed size hard disk in physical +RAM, where you have to create an ordinary filesystem on top. Ramdisks +cannot swap and you do not have the possibility to resize them. + +Since tmpfs lives completely in the page cache and on swap, all tmpfs +pages will be shown as "Shmem" in /proc/meminfo and "Shared" in +free(1). Notice that these counters also include shared memory +(shmem, see ipcs(1)). The most reliable way to get the count is +using df(1) and du(1). + +tmpfs has the following uses: + +1) There is always a kernel internal mount which you will not see at + all. This is used for shared anonymous mappings and SYSV shared + memory. + + This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not + set, the user visible part of tmpfs is not build. But the internal + mechanisms are always present. + +2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for + POSIX shared memory (shm_open, shm_unlink). Adding the following + line to /etc/fstab should take care of this:: + + tmpfs /dev/shm tmpfs defaults 0 0 + + Remember to create the directory that you intend to mount tmpfs on + if necessary. + + This mount is _not_ needed for SYSV shared memory. The internal + mount is used for that. (In the 2.3 kernel versions it was + necessary to mount the predecessor of tmpfs (shm fs) to use SYSV + shared memory) + +3) Some people (including me) find it very convenient to mount it + e.g. on /tmp and /var/tmp and have a big swap partition. And now + loop mounts of tmpfs files do work, so mkinitrd shipped by most + distributions should succeed with a tmpfs /tmp. 
+ +4) And probably a lot more I do not know about :-) + + +tmpfs has three mount options for sizing: + +========= ============================================================ +size The limit of allocated bytes for this tmpfs instance. The + default is half of your physical RAM without swap. If you + oversize your tmpfs instances the machine will deadlock + since the OOM handler will not be able to free that memory. +nr_blocks The same as size, but in blocks of PAGE_SIZE. +nr_inodes The maximum number of inodes for this instance. The default + is half of the number of your physical RAM pages, or (on a + machine with highmem) the number of lowmem RAM pages, + whichever is the lower. +========= ============================================================ + +These parameters accept a suffix k, m or g for kilo, mega and giga and +can be changed on remount. The size parameter also accepts a suffix % +to limit this tmpfs instance to that percentage of your physical RAM: +the default, when neither size nor nr_blocks is specified, is size=50% + +If nr_blocks=0 (or size=0), blocks will not be limited in that instance; +if nr_inodes=0, inodes will not be limited. It is generally unwise to +mount with such options, since it allows any user with write access to +use up all the memory on the machine; but enhances the scalability of +that instance in a system with many cpus making intensive use of it. + + +tmpfs has a mount option to set the NUMA memory allocation policy for +all files in that instance (if CONFIG_NUMA is enabled) - which can be +adjusted on the fly via 'mount -o remount ...' 
+ +======================== ============================================== +mpol=default use the process allocation policy + (see set_mempolicy(2)) +mpol=prefer:Node prefers to allocate memory from the given Node +mpol=bind:NodeList allocates memory only from nodes in NodeList +mpol=interleave prefers to allocate from each node in turn +mpol=interleave:NodeList allocates from each node of NodeList in turn +mpol=local prefers to allocate memory from the local node +======================== ============================================== + +NodeList format is a comma-separated list of decimal numbers and ranges, +a range being two hyphen-separated decimal numbers, the smallest and +largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 + +A memory policy with a valid NodeList will be saved, as specified, for +use at file creation time. When a task allocates a file in the file +system, the mount option memory policy will be applied with a NodeList, +if any, modified by the calling task's cpuset constraints +[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, +listed below. If the resulting NodeLists is the empty set, the effective +memory policy for the file will revert to "default" policy. + +NUMA memory allocation policies have optional flags that can be used in +conjunction with their modes. These optional flags can be specified +when tmpfs is mounted by appending them to the mode before the NodeList. +See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of +all available memory allocation policy mode flags and their effect on +memory policy. + +:: + + =static is equivalent to MPOL_F_STATIC_NODES + =relative is equivalent to MPOL_F_RELATIVE_NODES + +For example, mpol=bind=static:NodeList, is the equivalent of an +allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES. 
+ +Note that trying to mount a tmpfs with an mpol option will fail if the +running kernel does not support NUMA; and will fail if its nodelist +specifies a node which is not online. If your system relies on that +tmpfs being mounted, but from time to time runs a kernel built without +NUMA capability (perhaps a safe recovery kernel), or with fewer nodes +online, then it is advisable to omit the mpol option from automatic +mount options. It can be added later, when the tmpfs is already mounted +on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'. + + +To specify the initial root directory you can use the following mount +options: + +==== ================================== +mode The permissions as an octal number +uid The user id +gid The group id +==== ================================== + +These options do not have any effect on remount. You can change these +parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. + + +So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' +will give you tmpfs instance on /mytmpfs which can allocate 10GB +RAM/SWAP in 10240 inodes and it is only accessible by root. + + +:Author: + Christoph Rohland , 1.12.01 +:Updated: + Hugh Dickins, 4 June 2007 +:Updated: + KOSAKI Motohiro, 16 Mar 2010 diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt deleted file mode 100644 index 5ecbc03e6b2f..000000000000 --- a/Documentation/filesystems/tmpfs.txt +++ /dev/null @@ -1,149 +0,0 @@ -Tmpfs is a file system which keeps all files in virtual memory. - - -Everything in tmpfs is temporary in the sense that no files will be -created on your hard drive. If you unmount a tmpfs instance, -everything stored therein is lost. - -tmpfs puts everything into the kernel internal caches and grows and -shrinks to accommodate the files it contains and is able to swap -unneeded pages out to swap space. 
It has maximum size limits which can -be adjusted on the fly via 'mount -o remount ...' - -If you compare it to ramfs (which was the template to create tmpfs) -you gain swapping and limit checking. Another similar thing is the RAM -disk (/dev/ram*), which simulates a fixed size hard disk in physical -RAM, where you have to create an ordinary filesystem on top. Ramdisks -cannot swap and you do not have the possibility to resize them. - -Since tmpfs lives completely in the page cache and on swap, all tmpfs -pages will be shown as "Shmem" in /proc/meminfo and "Shared" in -free(1). Notice that these counters also include shared memory -(shmem, see ipcs(1)). The most reliable way to get the count is -using df(1) and du(1). - -tmpfs has the following uses: - -1) There is always a kernel internal mount which you will not see at - all. This is used for shared anonymous mappings and SYSV shared - memory. - - This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not - set, the user visible part of tmpfs is not build. But the internal - mechanisms are always present. - -2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for - POSIX shared memory (shm_open, shm_unlink). Adding the following - line to /etc/fstab should take care of this: - - tmpfs /dev/shm tmpfs defaults 0 0 - - Remember to create the directory that you intend to mount tmpfs on - if necessary. - - This mount is _not_ needed for SYSV shared memory. The internal - mount is used for that. (In the 2.3 kernel versions it was - necessary to mount the predecessor of tmpfs (shm fs) to use SYSV - shared memory) - -3) Some people (including me) find it very convenient to mount it - e.g. on /tmp and /var/tmp and have a big swap partition. And now - loop mounts of tmpfs files do work, so mkinitrd shipped by most - distributions should succeed with a tmpfs /tmp. 
- -4) And probably a lot more I do not know about :-) - - -tmpfs has three mount options for sizing: - -size: The limit of allocated bytes for this tmpfs instance. The - default is half of your physical RAM without swap. If you - oversize your tmpfs instances the machine will deadlock - since the OOM handler will not be able to free that memory. -nr_blocks: The same as size, but in blocks of PAGE_SIZE. -nr_inodes: The maximum number of inodes for this instance. The default - is half of the number of your physical RAM pages, or (on a - machine with highmem) the number of lowmem RAM pages, - whichever is the lower. - -These parameters accept a suffix k, m or g for kilo, mega and giga and -can be changed on remount. The size parameter also accepts a suffix % -to limit this tmpfs instance to that percentage of your physical RAM: -the default, when neither size nor nr_blocks is specified, is size=50% - -If nr_blocks=0 (or size=0), blocks will not be limited in that instance; -if nr_inodes=0, inodes will not be limited. It is generally unwise to -mount with such options, since it allows any user with write access to -use up all the memory on the machine; but enhances the scalability of -that instance in a system with many cpus making intensive use of it. - - -tmpfs has a mount option to set the NUMA memory allocation policy for -all files in that instance (if CONFIG_NUMA is enabled) - which can be -adjusted on the fly via 'mount -o remount ...' 
- -mpol=default use the process allocation policy - (see set_mempolicy(2)) -mpol=prefer:Node prefers to allocate memory from the given Node -mpol=bind:NodeList allocates memory only from nodes in NodeList -mpol=interleave prefers to allocate from each node in turn -mpol=interleave:NodeList allocates from each node of NodeList in turn -mpol=local prefers to allocate memory from the local node - -NodeList format is a comma-separated list of decimal numbers and ranges, -a range being two hyphen-separated decimal numbers, the smallest and -largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 - -A memory policy with a valid NodeList will be saved, as specified, for -use at file creation time. When a task allocates a file in the file -system, the mount option memory policy will be applied with a NodeList, -if any, modified by the calling task's cpuset constraints -[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, listed -below. If the resulting NodeLists is the empty set, the effective memory -policy for the file will revert to "default" policy. - -NUMA memory allocation policies have optional flags that can be used in -conjunction with their modes. These optional flags can be specified -when tmpfs is mounted by appending them to the mode before the NodeList. -See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of -all available memory allocation policy mode flags and their effect on -memory policy. - - =static is equivalent to MPOL_F_STATIC_NODES - =relative is equivalent to MPOL_F_RELATIVE_NODES - -For example, mpol=bind=static:NodeList, is the equivalent of an -allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES. - -Note that trying to mount a tmpfs with an mpol option will fail if the -running kernel does not support NUMA; and will fail if its nodelist -specifies a node which is not online. 
If your system relies on that -tmpfs being mounted, but from time to time runs a kernel built without -NUMA capability (perhaps a safe recovery kernel), or with fewer nodes -online, then it is advisable to omit the mpol option from automatic -mount options. It can be added later, when the tmpfs is already mounted -on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'. - - -To specify the initial root directory you can use the following mount -options: - -mode: The permissions as an octal number -uid: The user id -gid: The group id - -These options do not have any effect on remount. You can change these -parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. - - -So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' -will give you tmpfs instance on /mytmpfs which can allocate 10GB -RAM/SWAP in 10240 inodes and it is only accessible by root. - - -Author: - Christoph Rohland , 1.12.01 -Updated: - Hugh Dickins, 4 June 2007 -Updated: - KOSAKI Motohiro, 16 Mar 2010 -- cgit From 688f118e3139f81f813ba1896931cf8fad93430d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:27 +0100 Subject: docs: filesystems: convert ubifs-authentication.rst.txt to ReST - Add a SPDX header; - Mark some literals as such; - Add it to filesystems/index.rst. Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/0c36091b6660cd372f994bd98e1264491d766c22.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/ubifs-authentication.rst | 10 ++++++---- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 27d37e7712da..bb14738df358 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -90,5 +90,6 @@ Documentation for filesystem implementations. 
sysfs sysv-fs tmpfs + ubifs-authentication.rst virtiofs vfat diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst index 6a9584f6ff46..16efd729bf7c 100644 --- a/Documentation/filesystems/ubifs-authentication.rst +++ b/Documentation/filesystems/ubifs-authentication.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + :orphan: .. UBIFS Authentication @@ -92,11 +94,11 @@ UBIFS Index & Tree Node Cache ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Basic on-flash UBIFS entities are called *nodes*. UBIFS knows different types -of nodes. Eg. data nodes (`struct ubifs_data_node`) which store chunks of file -contents or inode nodes (`struct ubifs_ino_node`) which represent VFS inodes. -Almost all types of nodes share a common header (`ubifs_ch`) containing basic +of nodes. Eg. data nodes (``struct ubifs_data_node``) which store chunks of file +contents or inode nodes (``struct ubifs_ino_node``) which represent VFS inodes. +Almost all types of nodes share a common header (``ubifs_ch``) containing basic information like node type, node length, a sequence number, etc. (see -`fs/ubifs/ubifs-media.h`in kernel source). Exceptions are entries of the LPT +``fs/ubifs/ubifs-media.h`` in kernel source). Exceptions are entries of the LPT and some less important node types like padding nodes which are used to pad unusable content at the end of LEBs. -- cgit From 38e56b4ec44139b5781d6ff13f1b422e4b38f0d4 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:28 +0100 Subject: docs: filesystems: convert ubifs.txt to ReST - Add a SPDX header; - Add a document title; - Adjust section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add table markups; - Add lists markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Link: https://lore.kernel.org/r/9043dc2965cafc64e6a521e2317c00ecc8303bf6.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/ubifs.rst | 137 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/ubifs.txt | 126 --------------------------------- 3 files changed, 138 insertions(+), 126 deletions(-) create mode 100644 Documentation/filesystems/ubifs.rst delete mode 100644 Documentation/filesystems/ubifs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index bb14738df358..58d57c9bf922 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -90,6 +90,7 @@ Documentation for filesystem implementations. sysfs sysv-fs tmpfs + ubifs ubifs-authentication.rst virtiofs vfat diff --git a/Documentation/filesystems/ubifs.rst b/Documentation/filesystems/ubifs.rst new file mode 100644 index 000000000000..e6ee99762534 --- /dev/null +++ b/Documentation/filesystems/ubifs.rst @@ -0,0 +1,137 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +UBI File System +=============== + +Introduction +============ + +UBIFS file-system stands for UBI File System. UBI stands for "Unsorted +Block Images". UBIFS is a flash file system, which means it is designed +to work with flash devices. It is important to understand, that UBIFS +is completely different to any traditional file-system in Linux, like +Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems +which work with MTD devices, not block devices. The other Linux +file-system of this class is JFFS2. + +To make it more clear, here is a small comparison of MTD devices and +block devices. + +1 MTD devices represent flash devices and they consist of eraseblocks of + rather large size, typically about 128KiB. Block devices consist of + small blocks, typically 512 bytes. 
+2 MTD devices support 3 main operations - read from some offset within an + eraseblock, write to some offset within an eraseblock, and erase a whole + eraseblock. Block devices support 2 main operations - read a whole + block and write a whole block. +3 The whole eraseblock has to be erased before it becomes possible to + re-write its contents. Blocks may be just re-written. +4 Eraseblocks become worn out after some number of erase cycles - + typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC + NAND flashes. Blocks do not have the wear-out property. +5 Eraseblocks may become bad (only on NAND flashes) and software should + deal with this. Blocks on hard drives typically do not become bad, + because hardware has mechanisms to substitute bad blocks, at least in + modern LBA disks. + +It should be quite obvious why UBIFS is very different to traditional +file-systems. + +UBIFS works on top of UBI. UBI is a separate software layer which may be +found in drivers/mtd/ubi. UBI is basically a volume management and +wear-leveling layer. It provides so called UBI volumes which is a higher +level abstraction than a MTD device. The programming model of UBI devices +is very similar to MTD devices - they still consist of large eraseblocks, +they have read/write/erase operations, but UBI devices are devoid of +limitations like wear and bad blocks (items 4 and 5 in the above list). + +In a sense, UBIFS is a next generation of JFFS2 file-system, but it is +very different and incompatible to JFFS2. The following are the main +differences. + +* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on + top of UBI volumes. +* JFFS2 does not have on-media index and has to build it while mounting, + which requires full media scan. UBIFS maintains the FS indexing + information on the flash media and does not require full media scan, + so it mounts many times faster than JFFS2. 
+* JFFS2 is a write-through file-system, while UBIFS supports write-back, + which makes UBIFS much faster on writes. + +Similarly to JFFS2, UBIFS supports on-the-flight compression which makes +it possible to fit quite a lot of data to the flash. + +Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts. +It does not need stuff like fsck.ext2. UBIFS automatically replays its +journal and recovers from crashes, ensuring that the on-flash data +structures are consistent. + +UBIFS scales logarithmically (most of the data structures it uses are +trees), so the mount time and memory consumption do not linearly depend +on the flash size, like in case of JFFS2. This is because UBIFS +maintains the FS index on the flash media. However, UBIFS depends on +UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly. +Nevertheless, UBI/UBIFS scales considerably better than JFFS2. + +The authors of UBIFS believe, that it is possible to develop UBI2 which +would scale logarithmically as well. UBI2 would support the same API as UBI, +but it would be binary incompatible to UBI. So UBIFS would not need to be +changed to use UBI2 + + +Mount options +============= + +(*) == default. + +==================== ======================================================= +bulk_read read more in one go to take advantage of flash + media that read faster sequentially +no_bulk_read (*) do not bulk-read +no_chk_data_crc (*) skip checking of CRCs on data nodes in order to + improve read performance. Use this option only + if the flash media is highly reliable. The effect + of this option is that corruption of the contents + of a file can go unnoticed. +chk_data_crc do not skip checking CRCs on data nodes +compr=none override default compressor and set it to "none" +compr=lzo override default compressor and set it to "lzo" +compr=zlib override default compressor and set it to "zlib" +auth_key= specify the key used for authenticating the filesystem. 
+ Passing this option makes authentication mandatory. + The passed key must be present in the kernel keyring + and must be of type 'logon' +auth_hash_name= The hash algorithm used for authentication. Used for + both hashing and for creating HMACs. Typical values + include "sha256" or "sha512" +==================== ======================================================= + + +Quick usage instructions +======================== + +The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax, +where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is +UBI volume name. + +Mount volume 0 on UBI device 0 to /mnt/ubifs:: + + $ mount -t ubifs ubi0_0 /mnt/ubifs + +Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume +name):: + + $ mount -t ubifs ubi0:rootfs /mnt/ubifs + +The following is an example of the kernel boot arguments to attach mtd0 +to UBI and mount volume "rootfs": +ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs + +References +========== + +UBIFS documentation and FAQ/HOWTO at the MTD web site: + +- http://www.linux-mtd.infradead.org/doc/ubifs.html +- http://www.linux-mtd.infradead.org/faq/ubifs.html diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt deleted file mode 100644 index acc80442a3bb..000000000000 --- a/Documentation/filesystems/ubifs.txt +++ /dev/null @@ -1,126 +0,0 @@ -Introduction -============= - -UBIFS file-system stands for UBI File System. UBI stands for "Unsorted -Block Images". UBIFS is a flash file system, which means it is designed -to work with flash devices. It is important to understand, that UBIFS -is completely different to any traditional file-system in Linux, like -Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems -which work with MTD devices, not block devices. The other Linux -file-system of this class is JFFS2. - -To make it more clear, here is a small comparison of MTD devices and -block devices. 
- -1 MTD devices represent flash devices and they consist of eraseblocks of - rather large size, typically about 128KiB. Block devices consist of - small blocks, typically 512 bytes. -2 MTD devices support 3 main operations - read from some offset within an - eraseblock, write to some offset within an eraseblock, and erase a whole - eraseblock. Block devices support 2 main operations - read a whole - block and write a whole block. -3 The whole eraseblock has to be erased before it becomes possible to - re-write its contents. Blocks may be just re-written. -4 Eraseblocks become worn out after some number of erase cycles - - typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC - NAND flashes. Blocks do not have the wear-out property. -5 Eraseblocks may become bad (only on NAND flashes) and software should - deal with this. Blocks on hard drives typically do not become bad, - because hardware has mechanisms to substitute bad blocks, at least in - modern LBA disks. - -It should be quite obvious why UBIFS is very different to traditional -file-systems. - -UBIFS works on top of UBI. UBI is a separate software layer which may be -found in drivers/mtd/ubi. UBI is basically a volume management and -wear-leveling layer. It provides so called UBI volumes which is a higher -level abstraction than a MTD device. The programming model of UBI devices -is very similar to MTD devices - they still consist of large eraseblocks, -they have read/write/erase operations, but UBI devices are devoid of -limitations like wear and bad blocks (items 4 and 5 in the above list). - -In a sense, UBIFS is a next generation of JFFS2 file-system, but it is -very different and incompatible to JFFS2. The following are the main -differences. - -* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on - top of UBI volumes. -* JFFS2 does not have on-media index and has to build it while mounting, - which requires full media scan. 
UBIFS maintains the FS indexing - information on the flash media and does not require full media scan, - so it mounts many times faster than JFFS2. -* JFFS2 is a write-through file-system, while UBIFS supports write-back, - which makes UBIFS much faster on writes. - -Similarly to JFFS2, UBIFS supports on-the-flight compression which makes -it possible to fit quite a lot of data to the flash. - -Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts. -It does not need stuff like fsck.ext2. UBIFS automatically replays its -journal and recovers from crashes, ensuring that the on-flash data -structures are consistent. - -UBIFS scales logarithmically (most of the data structures it uses are -trees), so the mount time and memory consumption do not linearly depend -on the flash size, like in case of JFFS2. This is because UBIFS -maintains the FS index on the flash media. However, UBIFS depends on -UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly. -Nevertheless, UBI/UBIFS scales considerably better than JFFS2. - -The authors of UBIFS believe, that it is possible to develop UBI2 which -would scale logarithmically as well. UBI2 would support the same API as UBI, -but it would be binary incompatible to UBI. So UBIFS would not need to be -changed to use UBI2 - - -Mount options -============= - -(*) == default. - -bulk_read read more in one go to take advantage of flash - media that read faster sequentially -no_bulk_read (*) do not bulk-read -no_chk_data_crc (*) skip checking of CRCs on data nodes in order to - improve read performance. Use this option only - if the flash media is highly reliable. The effect - of this option is that corruption of the contents - of a file can go unnoticed. 
-chk_data_crc do not skip checking CRCs on data nodes -compr=none override default compressor and set it to "none" -compr=lzo override default compressor and set it to "lzo" -compr=zlib override default compressor and set it to "zlib" -auth_key= specify the key used for authenticating the filesystem. - Passing this option makes authentication mandatory. - The passed key must be present in the kernel keyring - and must be of type 'logon' -auth_hash_name= The hash algorithm used for authentication. Used for - both hashing and for creating HMACs. Typical values - include "sha256" or "sha512" - - -Quick usage instructions -======================== - -The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax, -where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is -UBI volume name. - -Mount volume 0 on UBI device 0 to /mnt/ubifs: -$ mount -t ubifs ubi0_0 /mnt/ubifs - -Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume -name): -$ mount -t ubifs ubi0:rootfs /mnt/ubifs - -The following is an example of the kernel boot arguments to attach mtd0 -to UBI and mount volume "rootfs": -ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs - -References -========== - -UBIFS documentation and FAQ/HOWTO at the MTD web site: -http://www.linux-mtd.infradead.org/doc/ubifs.html -http://www.linux-mtd.infradead.org/faq/ubifs.html -- cgit From c9817ad5d82f04fbc66278eda27bff094dcb3119 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:29 +0100 Subject: docs: filesystems: convert udf.txt to ReST - Add a SPDX header; - Add a document title; - Add table markups; - Add lists markups; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Acked-by: Jan Kara Link: https://lore.kernel.org/r/2887f8a3a813a31170389eab687e9f199327dc7d.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/udf.rst | 75 +++++++++++++++++++++++++++++++++++++ Documentation/filesystems/udf.txt | 66 -------------------------------- 3 files changed, 76 insertions(+), 66 deletions(-) create mode 100644 Documentation/filesystems/udf.rst delete mode 100644 Documentation/filesystems/udf.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 58d57c9bf922..ec03cb4d7353 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -92,5 +92,6 @@ Documentation for filesystem implementations. tmpfs ubifs ubifs-authentication.rst + udf virtiofs vfat diff --git a/Documentation/filesystems/udf.rst b/Documentation/filesystems/udf.rst new file mode 100644 index 000000000000..d9badbf285b2 --- /dev/null +++ b/Documentation/filesystems/udf.rst @@ -0,0 +1,75 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +UDF file system +=============== + +If you encounter problems with reading UDF discs using this driver, +please report them according to MAINTAINERS file. + +Write support requires a block driver which supports writing. Currently +dvd+rw drives and media support true random sector writes, and so a udf +filesystem on such devices can be directly mounted read/write. CD-RW +media however, does not support this. Instead the media can be formatted +for packet mode using the utility cdrwtool, then the pktcdvd driver can +be bound to the underlying cd device to provide the required buffering +and read-modify-write cycles to allow the filesystem random sector writes +while providing the hardware with only full packet writes. 
While not +required for dvd+rw media, use of the pktcdvd driver often enhances +performance due to very poor read-modify-write support supplied internally +by drive firmware. + +------------------------------------------------------------------------------- + +The following mount options are supported: + + =========== ====================================== + gid= Set the default group. + umask= Set the default umask. + mode= Set the default file permissions. + dmode= Set the default directory permissions. + uid= Set the default user. + bs= Set the block size. + unhide Show otherwise hidden files. + undelete Show deleted files in lists. + adinicb Embed data in the inode (default) + noadinicb Don't embed data in the inode + shortad Use short ad's + longad Use long ad's (default) + nostrict Unset strict conformance + iocharset= Set the NLS character set + =========== ====================================== + +The uid= and gid= options need a bit more explaining. They will accept a +decimal numeric value and all inodes on that mount will then appear as +belonging to that uid and gid. Mount options also accept the string "forget". +The forget option causes all IDs to be written to disk as -1 which is a way +of UDF standard to indicate that IDs are not supported for these files . + +For typical desktop use of removable media, you should set the ID to that of +the interactively logged on user, and also specify the forget option. This way +the interactive user will always see the files on the disk as belonging to him. + +The remaining are for debugging and disaster recovery: + + ===== ================================ + novrs Skip volume sequence recognition + ===== ================================ + +The following expect a offset from 0. + + ========== ================================================= + session= Set the CDROM session (default= last session) + anchor= Override standard anchor location. 
(default= 256) + lastblock= Set the last block of the filesystem/ + ========== ================================================= + +------------------------------------------------------------------------------- + + +For the latest version and toolset see: + https://github.com/pali/udftools + +Documentation on UDF and ECMA 167 is available FREE from: + - http://www.osta.org/ + - http://www.ecma-international.org/ diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt deleted file mode 100644 index e2f2faf32f18..000000000000 --- a/Documentation/filesystems/udf.txt +++ /dev/null @@ -1,66 +0,0 @@ -* -* Documentation/filesystems/udf.txt -* - -If you encounter problems with reading UDF discs using this driver, -please report them according to MAINTAINERS file. - -Write support requires a block driver which supports writing. Currently -dvd+rw drives and media support true random sector writes, and so a udf -filesystem on such devices can be directly mounted read/write. CD-RW -media however, does not support this. Instead the media can be formatted -for packet mode using the utility cdrwtool, then the pktcdvd driver can -be bound to the underlying cd device to provide the required buffering -and read-modify-write cycles to allow the filesystem random sector writes -while providing the hardware with only full packet writes. While not -required for dvd+rw media, use of the pktcdvd driver often enhances -performance due to very poor read-modify-write support supplied internally -by drive firmware. - -------------------------------------------------------------------------------- -The following mount options are supported: - - gid= Set the default group. - umask= Set the default umask. - mode= Set the default file permissions. - dmode= Set the default directory permissions. - uid= Set the default user. - bs= Set the block size. - unhide Show otherwise hidden files. - undelete Show deleted files in lists. 
- adinicb Embed data in the inode (default) - noadinicb Don't embed data in the inode - shortad Use short ad's - longad Use long ad's (default) - nostrict Unset strict conformance - iocharset= Set the NLS character set - -The uid= and gid= options need a bit more explaining. They will accept a -decimal numeric value and all inodes on that mount will then appear as -belonging to that uid and gid. Mount options also accept the string "forget". -The forget option causes all IDs to be written to disk as -1 which is a way -of UDF standard to indicate that IDs are not supported for these files . - -For typical desktop use of removable media, you should set the ID to that of -the interactively logged on user, and also specify the forget option. This way -the interactive user will always see the files on the disk as belonging to him. - -The remaining are for debugging and disaster recovery: - - novrs Skip volume sequence recognition - -The following expect a offset from 0. - - session= Set the CDROM session (default= last session) - anchor= Override standard anchor location. (default= 256) - lastblock= Set the last block of the filesystem/ - -------------------------------------------------------------------------------- - - -For the latest version and toolset see: - https://github.com/pali/udftools - -Documentation on UDF and ECMA 167 is available FREE from: - http://www.osta.org/ - http://www.ecma-international.org/ -- cgit From 9a6108124c1d27192fee6f058b5de84f51ab62a0 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 17 Feb 2020 17:12:30 +0100 Subject: docs: filesystems: convert zonefs.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to filesystems/index.rst. 
Signed-off-by: Mauro Carvalho Chehab Acked-by: Damien Le Moal Link: https://lore.kernel.org/r/42a7cfcd19f6b904a9a3188fd4af71bed5050052.1581955849.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/index.rst | 1 + Documentation/filesystems/zonefs.rst | 412 +++++++++++++++++++++++++++++++++++ Documentation/filesystems/zonefs.txt | 404 ---------------------------------- 3 files changed, 413 insertions(+), 404 deletions(-) create mode 100644 Documentation/filesystems/zonefs.rst delete mode 100644 Documentation/filesystems/zonefs.txt diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index ec03cb4d7353..53f46a88e6ec 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -95,3 +95,4 @@ Documentation for filesystem implementations. udf virtiofs vfat + zonefs diff --git a/Documentation/filesystems/zonefs.rst b/Documentation/filesystems/zonefs.rst new file mode 100644 index 000000000000..7e733e751e98 --- /dev/null +++ b/Documentation/filesystems/zonefs.rst @@ -0,0 +1,412 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================ +ZoneFS - Zone filesystem for Zoned block devices +================================================ + +Introduction +============ + +zonefs is a very simple file system exposing each zone of a zoned block device +as a file. Unlike a regular POSIX-compliant file system with native zoned block +device support (e.g. f2fs), zonefs does not hide the sequential write +constraint of zoned block devices to the user. Files representing sequential +write zones of the device must be written sequentially starting from the end +of the file (append only writes). + +As such, zonefs is in essence closer to a raw block device access interface +than to a full-featured POSIX file system. 
The goal of zonefs is to simplify +the implementation of zoned block device support in applications by replacing +raw block device file accesses with a richer file API, avoiding relying on +direct block device file ioctls which may be more obscure to developers. One +example of this approach is the implementation of LSM (log-structured merge) +tree structures (such as used in RocksDB and LevelDB) on zoned block devices +by allowing SSTables to be stored in a zone file similarly to a regular file +system rather than as a range of sectors of the entire disk. The introduction +of the higher level construct "one file is one zone" can help reducing the +amount of changes needed in the application as well as introducing support for +different application programming languages. + +Zoned block devices +------------------- + +Zoned storage devices belong to a class of storage devices with an address +space that is divided into zones. A zone is a group of consecutive LBAs and all +zones are contiguous (there are no LBA gaps). Zones may have different types. + +* Conventional zones: there are no access constraints to LBAs belonging to + conventional zones. Any read or write access can be executed, similarly to a + regular block device. +* Sequential zones: these zones accept random reads but must be written + sequentially. Each sequential zone has a write pointer maintained by the + device that keeps track of the mandatory start LBA position of the next write + to the device. As a result of this write constraint, LBAs in a sequential zone + cannot be overwritten. Sequential zones must first be erased using a special + command (zone reset) before rewriting. + +Zoned storage devices can be implemented using various recording and media +technologies. The most common form of zoned storage today uses the SCSI Zoned +Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled +Magnetic Recording (SMR) HDDs. 
+ +Solid State Disks (SSD) storage devices can also implement a zoned interface +to, for instance, reduce internal write amplification due to garbage collection. +The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard +committee aiming at adding a zoned storage interface to the NVMe protocol. + +Zonefs Overview +=============== + +Zonefs exposes the zones of a zoned block device as files. The files +representing zones are grouped by zone type, which are themselves represented +by sub-directories. This file structure is built entirely using zone information +provided by the device and so does not require any complex on-disk metadata +structure. + +On-disk metadata +---------------- + +zonefs on-disk metadata is reduced to an immutable super block which +persistently stores a magic number and optional feature flags and values. On +mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration +and populates the mount point with a static file tree solely based on this +information. File sizes come from the device zone type and write pointer +position managed by the device itself. + +The super block is always written on disk at sector 0. The first zone of the +device storing the super block is never exposed as a zone file by zonefs. If +the zone containing the super block is a sequential zone, the mkzonefs format +tool always "finishes" the zone, that is, it transitions the zone to a full +state to make it read-only, preventing any data write. + +Zone type sub-directories +------------------------- + +Files representing zones of the same type are grouped together under the same +sub-directory automatically created on mount. + +For conventional zones, the sub-directory "cnv" is used. This directory is +however created if and only if the device has usable conventional zones. If +the device only has a single conventional zone at sector 0, the zone will not +be exposed as a file as it will be used to store the zonefs super block. 
For +such devices, the "cnv" sub-directory will not be created. + +For sequential write zones, the sub-directory "seq" is used. + +These two directories are the only directories that exist in zonefs. Users +cannot create other directories and cannot rename nor delete the "cnv" and +"seq" sub-directories. + +The size of the directories indicated by the st_size field of struct stat, +obtained with the stat() or fstat() system calls, indicates the number of files +existing under the directory. + +Zone files +---------- + +Zone files are named using the number of the zone they represent within the set +of zones of a particular type. That is, both the "cnv" and "seq" directories +contain files named "0", "1", "2", ... The file numbers also represent +increasing zone start sector on the device. + +All read and write operations to zone files are not allowed beyond the file +maximum size, that is, beyond the zone size. Any access exceeding the zone +size is failed with the -EFBIG error. + +Creating, deleting, renaming or modifying any attribute of files and +sub-directories is not allowed. + +The number of blocks of a file as reported by stat() and fstat() indicates the +size of the file zone, or in other words, the maximum file size. + +Conventional zone files +----------------------- + +The size of conventional zone files is fixed to the size of the zone they +represent. Conventional zone files cannot be truncated. + +These files can be randomly read and written using any type of I/O operation: +buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O +constraint for these files beyond the file size limit mentioned above. + +Sequential zone files +--------------------- + +The size of sequential zone files grouped in the "seq" sub-directory represents +the file's zone write pointer position relative to the zone start sector. 
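As a rough illustration of the stat() semantics described above, the following Python sketch reads the current size and the maximum size of a zone file, and the file count of a zone type sub-directory. It is only a sketch under stated assumptions: it assumes a Linux system where st_blocks counts 512-byte units, and the helper names `zone_file_info` and `count_zone_files` are hypothetical, not part of zonefs.

```python
import os
from typing import NamedTuple

# Assumption: st_blocks is reported in 512-byte units, as is the case on Linux.
STAT_BLOCK_SIZE = 512


class ZoneFileInfo(NamedTuple):
    size: int      # st_size: for a "seq" file, the write pointer offset in bytes
    max_size: int  # st_blocks * 512: the zone size, i.e. the maximum file size


def zone_file_info(path: str) -> ZoneFileInfo:
    """Return the current and maximum size of a zone file (hypothetical helper)."""
    st = os.stat(path)
    return ZoneFileInfo(size=st.st_size, max_size=st.st_blocks * STAT_BLOCK_SIZE)


def count_zone_files(directory: str) -> int:
    """On zonefs, st_size of the "cnv" or "seq" directory is its number of files."""
    return os.stat(directory).st_size
```

On a mounted zonefs (the mount point /mnt/zonefs is assumed here), `zone_file_info("/mnt/zonefs/seq/0")` would report how far the zone has been written, and `max_size - size` would give the space still available for append writes.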
+ +Sequential zone files can only be written sequentially, starting from the file +end, that is, write operations can only be append writes. Zonefs makes no +attempt at accepting random writes and will fail any write request that has a +start offset not corresponding to the end of the file, or to the end of the last +write issued and still in-flight (for asynchronous I/O operations). + +Since dirty page writeback by the page cache does not guarantee a sequential +write pattern, zonefs prevents buffered writes and writeable shared mappings +on sequential files. Only direct I/O writes are accepted for these files. +zonefs relies on the sequential delivery of write I/O requests to the device +implemented by the block layer elevator. An elevator implementing the sequential +write feature for zoned block devices (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature) +must be used. This type of elevator (e.g. mq-deadline) is set by default +for zoned block devices on device initialization. + +There are no restrictions on the type of I/O used for read operations in +sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are +all accepted. + +Truncating sequential zone files is allowed only down to 0, in which case, the +zone is reset to rewind the file zone write pointer position to the start of +the zone, or up to the zone size, in which case the file's zone is transitioned +to the FULL state (finish zone operation). + +Format options +-------------- + +Several optional features of zonefs can be enabled at format time. + +* Conventional zone aggregation: ranges of contiguous conventional zones can be + aggregated into a single larger file instead of the default one file per zone. +* File ownership: The owner UID and GID of zone files is by default 0 (root) + but can be changed to any valid UID/GID. +* File access permissions: the default 640 access permissions can be changed.
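Returning to the sequential zone file rules above (append-only writes, and truncation only to 0 or to the zone size), the small Python model below may help make them concrete. It is a sketch of the documented rules, not the kernel implementation: the function names are hypothetical, the use of ValueError for a non-append write is a choice made for this example (the text does not name the error code), and in-flight asynchronous writes are ignored.

```python
import errno


def check_seq_write(file_size: int, offset: int, length: int, zone_size: int) -> int:
    """Model the write rules for a sequential zone file; return the new file size.

    Hypothetical helper: it mirrors the documented rules, not the kernel code.
    """
    if offset != file_size:
        # Only append writes are accepted: the write must start at the file end.
        raise ValueError(f"write at {offset} is not at the file end ({file_size})")
    if offset + length > zone_size:
        # Accesses beyond the zone size are failed with the EFBIG error.
        raise OSError(errno.EFBIG, "write exceeds the zone size")
    return file_size + length


def check_seq_truncate(new_size: int, zone_size: int) -> str:
    """Return the zone operation implied by truncating a sequential zone file."""
    if new_size == 0:
        return "reset"   # write pointer rewinds to the start of the zone
    if new_size == zone_size:
        return "finish"  # zone transitions to the FULL state
    raise ValueError("only truncation to 0 or to the zone size is allowed")
```

For instance, with an 8192-byte zone, a 4096-byte write at offset 0 of an empty file is accepted, while repeating that write at offset 0 afterwards is rejected because the file end has moved to 4096.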
+ +IO error handling +----------------- + +Zoned block devices may fail I/O requests for reasons similar to regular block +devices, e.g. due to bad sectors. However, in addition to such known I/O +failure pattern, the standards governing zoned block devices behavior define +additional conditions that result in I/O errors. + +* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY): + While the data already written in the zone is still readable, the zone can + no longer be written. No user action on the zone (zone management command or + read/write access) can change the zone condition back to a normal read/write + state. While the reasons for the device to transition a zone to read-only + state are not defined by the standards, a typical cause for such transition + would be a defective write head on an HDD (all zones under this head are + changed to read-only). + +* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE): + An offline zone cannot be read nor written. No user action can transition an + offline zone back to an operational good state. Similarly to zone read-only + transitions, the reasons for a drive to transition a zone to the offline + condition are undefined. A typical cause would be a defective read-write head + on an HDD causing all zones on the platter under the broken head to be + inaccessible. + +* Unaligned write errors: These errors result from the host issuing write + requests with a start sector that does not correspond to a zone write pointer + position when the write request is executed by the device. Even though zonefs + enforces sequential file write for sequential zones, unaligned write errors + may still happen in the case of a partial failure of a very large direct I/O + operation split into multiple BIOs/requests or asynchronous I/O operations. 
+ If one of the write requests within the set of sequential write requests + issued to the device fails, all write requests queued after it will + become unaligned and fail. + +* Delayed write errors: similarly to regular block devices, if the device side + write cache is enabled, write errors may occur in ranges of previously + completed writes when the device write cache is flushed, e.g. on fsync(). + Similarly to the previous immediate unaligned write error case, delayed write + errors can propagate through a stream of cached sequential data for a zone + causing all data to be dropped after the sector that caused the error. + +All I/O errors detected by zonefs are notified to the user with an error code +return for the system call that triggered or detected the error. The recovery +actions taken by zonefs in response to I/O errors depend on the I/O type (read +vs write) and on the reason for the error (bad sector, unaligned writes or zone +condition change). + +* For read I/O errors, zonefs does not execute any particular recovery action, + but only if the file zone is still in a good condition and there is no + inconsistency between the file inode size and its zone write pointer position. + If a problem is detected, I/O error recovery is executed (see the table below). + +* For write I/O errors, zonefs I/O error recovery is always executed. + +* A zone condition change to read-only or offline also always triggers zonefs + I/O error recovery. + +Zonefs minimal I/O error recovery may change a file size and file access +permissions. + +* File size changes: + Immediate or delayed write errors in a sequential zone file may cause the file + inode size to be inconsistent with the amount of data successfully written in + the file zone. For instance, the partial failure of a multi-BIO large write + operation will cause the zone write pointer to advance partially, even though + the entire write operation will be reported as failed to the user.
In such + case, the file inode size must be advanced to reflect the zone write pointer + change and eventually allow the user to restart writing at the end of the + file. + A file size may also be reduced to reflect a delayed write error detected on + fsync(): in this case, the amount of data effectively written in the zone may + be less than originally indicated by the file inode size. After such I/O + error, zonefs always fixes a file inode size to reflect the amount of data + persistently stored in the file zone. + +* Access permission changes: + A zone condition change to read-only is indicated with a change in the file + access permissions to render the file read-only. This disables changes to the + file attributes and data modification. For offline zones, all permissions + (read and write) to the file are disabled. + +Further action taken by zonefs I/O error recovery can be controlled by the user +with the "errors=xxx" mount option. The table below summarizes the result of +zonefs I/O error processing depending on the mount option and on the zone +conditions:: + + +--------------+-----------+-----------------------------------------+ + | | | Post error state | + | "errors=xxx" | device | access permissions | + | mount | zone | file file device zone | + | option | condition | size read write read write | + +--------------+-----------+-----------------------------------------+ + | | good | fixed yes no yes yes | + | remount-ro | read-only | fixed yes no yes no | + | (default) | offline | 0 no no no no | + +--------------+-----------+-----------------------------------------+ + | | good | fixed yes no yes yes | + | zone-ro | read-only | fixed yes no yes no | + | | offline | 0 no no no no | + +--------------+-----------+-----------------------------------------+ + | | good | 0 no no yes yes | + | zone-offline | read-only | 0 no no yes no | + | | offline | 0 no no no no | + +--------------+-----------+-----------------------------------------+ + | | good | fixed 
yes yes yes yes | + | repair | read-only | fixed yes no yes no | + | | offline | 0 no no no no | + +--------------+-----------+-----------------------------------------+ + +Further notes: + +* The "errors=remount-ro" mount option is the default behavior of zonefs I/O + error processing if no errors mount option is specified. +* With the "errors=remount-ro" mount option, the change of the file access + permissions to read-only applies to all files. The file system is remounted + read-only. +* Access permission and file size changes due to the device transitioning zones + to the offline condition are permanent. Remounting or reformatting the device + with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good + state. +* File access permission changes to read-only due to the device transitioning + zones to the read-only condition are permanent. Remounting or reformatting + the device will not re-enable file write access. +* File access permission changes implied by the remount-ro, zone-ro and + zone-offline mount options are temporary for zones in a good condition. + Unmounting and remounting the file system will restore the previous default + (format time values) access rights to the files affected. +* The repair mount option triggers only the minimal set of I/O error recovery + actions, that is, file size fixes for zones in a good condition. Zones + indicated as being read-only or offline by the device still imply changes to + the zone file access permissions as noted in the table above. + +Mount options +------------- + +zonefs defines the "errors=" mount option to allow the user to specify +zonefs behavior in response to I/O errors, inode size inconsistencies or zone +condition changes. The defined behaviors are as follows: + +* remount-ro (default) +* zone-ro +* zone-offline +* repair + +The I/O error actions defined for each behavior are detailed in the previous +section.
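The table in the previous section can also be read as a lookup keyed by mount option and zone condition. The sketch below encodes it as a Python dictionary. The names `PostErrorState`, `POST_ERROR_STATE` and `post_error_state` are hypothetical and exist only for this illustration; the values are transcribed from the table and are not queried from the kernel.

```python
from collections import namedtuple

# One row of the "Post error state" table: resulting file size, then the
# file and device zone access permissions.
PostErrorState = namedtuple(
    "PostErrorState",
    ["file_size", "file_read", "file_write", "zone_read", "zone_write"],
)

# Rows shared by several mount options.
_READ_ONLY = PostErrorState("fixed", True, False, True, False)
_OFFLINE = PostErrorState("0", False, False, False, False)

# (errors= mount option, device zone condition) -> post error state
POST_ERROR_STATE = {
    ("remount-ro", "good"): PostErrorState("fixed", True, False, True, True),
    ("remount-ro", "read-only"): _READ_ONLY,
    ("remount-ro", "offline"): _OFFLINE,
    ("zone-ro", "good"): PostErrorState("fixed", True, False, True, True),
    ("zone-ro", "read-only"): _READ_ONLY,
    ("zone-ro", "offline"): _OFFLINE,
    ("zone-offline", "good"): PostErrorState("0", False, False, True, True),
    ("zone-offline", "read-only"): PostErrorState("0", False, False, True, False),
    ("zone-offline", "offline"): _OFFLINE,
    ("repair", "good"): PostErrorState("fixed", True, True, True, True),
    ("repair", "read-only"): _READ_ONLY,
    ("repair", "offline"): _OFFLINE,
}


def post_error_state(errors_option="remount-ro", zone_condition="good"):
    """Look up the table row; remount-ro is the default mount option."""
    return POST_ERROR_STATE[(errors_option, zone_condition)]


if __name__ == "__main__":
    for (option, condition), state in POST_ERROR_STATE.items():
        print(option, condition, state)
```

Read this way, the table shows that only "repair" on a zone in good condition leaves the file writable after an error, and that an offline zone gives the same result under every mount option.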
+ +Zonefs User Space Tools +======================= + +The mkzonefs tool is used to format zoned block devices for use with zonefs. +This tool is available on GitHub at: + +https://github.com/damien-lemoal/zonefs-tools + +zonefs-tools also includes a test suite which can be run against any zoned +block device, including a null_blk block device created with zoned mode. + +Examples +-------- + +The following formats a 15TB host-managed SMR HDD with 256 MB zones +with the conventional zones aggregation feature enabled:: + + # mkzonefs -o aggr_cnv /dev/sdX + # mount -t zonefs /dev/sdX /mnt + # ls -l /mnt/ + total 0 + dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv + dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq + +The size of the zone files sub-directories indicates the number of files +existing for each type of zone. In this example, there is only one +conventional zone file (all conventional zones are aggregated under a single +file):: + + # ls -l /mnt/cnv + total 137101312 + -rw-r----- 1 root root 140391743488 Nov 25 13:23 0 + +This aggregated conventional zone file can be used as a regular file:: + + # mkfs.ext4 /mnt/cnv/0 + # mount -o loop /mnt/cnv/0 /data + +The "seq" sub-directory grouping files for sequential write zones has in this +example 55356 zones:: + + # ls -lv /mnt/seq + total 14511243264 + -rw-r----- 1 root root 0 Nov 25 13:23 0 + -rw-r----- 1 root root 0 Nov 25 13:23 1 + -rw-r----- 1 root root 0 Nov 25 13:23 2 + ...
+ -rw-r----- 1 root root 0 Nov 25 13:23 55354 + -rw-r----- 1 root root 0 Nov 25 13:23 55355 + +For sequential write zone files, the file size changes as data is appended at +the end of the file, similarly to any regular file system:: + + # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct + 1+0 records in + 1+0 records out + 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s + + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 + +The written file can be truncated to the zone size, preventing any further +write operation:: + + # truncate -s 268435456 /mnt/seq/0 + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 + +Truncation to 0 size allows freeing the file zone storage space and restart +append-writes to the file:: + + # truncate -s 0 /mnt/seq/0 + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 + +Since files are statically mapped to zones on the disk, the number of blocks of +a file as reported by stat() and fstat() indicates the size of the file zone:: + + # stat /mnt/seq/0 + File: /mnt/seq/0 + Size: 0 Blocks: 524288 IO Block: 4096 regular empty file + Device: 870h/2160d Inode: 50431 Links: 1 + Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) + Access: 2019-11-25 13:23:57.048971997 +0900 + Modify: 2019-11-25 13:52:25.553805765 +0900 + Change: 2019-11-25 13:52:25.553805765 +0900 + Birth: - + +The number of blocks of the file ("Blocks") in units of 512B blocks gives the +maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone +size in this example. Of note is that the "IO block" field always indicates the +minimum I/O size for writes and corresponds to the device physical sector size. 
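The relation between the "Blocks" value and the zone size can be checked with a few lines of Python. This is a minimal sketch: the helper names `max_file_size` and `zone_file_usage` are hypothetical, and calling `zone_file_usage("/mnt/seq/0")` assumes the mount point used in the examples above.

```python
import os

STAT_BLOCK_SIZE = 512  # st_blocks is reported in units of 512-byte blocks


def max_file_size(st_blocks):
    """Return the maximum file size (the zone size) in bytes."""
    return st_blocks * STAT_BLOCK_SIZE


def zone_file_usage(path):
    """Return (bytes written, zone size in bytes) for a zone file."""
    st = os.stat(path)
    return st.st_size, max_file_size(st.st_blocks)


if __name__ == "__main__":
    # "Blocks" value taken from the stat output above.
    print(max_file_size(524288))  # 268435456 bytes, i.e. 256 MB zones
```

For a sequential zone file, the first value returned by `zone_file_usage()` is the write pointer offset and the second is the limit beyond which writes fail, so their difference is the space still available for append writes.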
diff --git a/Documentation/filesystems/zonefs.txt b/Documentation/filesystems/zonefs.txt deleted file mode 100644 index 935bf22031ca..000000000000 --- a/Documentation/filesystems/zonefs.txt +++ /dev/null @@ -1,404 +0,0 @@ -ZoneFS - Zone filesystem for Zoned block devices - -Introduction -============ - -zonefs is a very simple file system exposing each zone of a zoned block device -as a file. Unlike a regular POSIX-compliant file system with native zoned block -device support (e.g. f2fs), zonefs does not hide the sequential write -constraint of zoned block devices to the user. Files representing sequential -write zones of the device must be written sequentially starting from the end -of the file (append only writes). - -As such, zonefs is in essence closer to a raw block device access interface -than to a full-featured POSIX file system. The goal of zonefs is to simplify -the implementation of zoned block device support in applications by replacing -raw block device file accesses with a richer file API, avoiding relying on -direct block device file ioctls which may be more obscure to developers. One -example of this approach is the implementation of LSM (log-structured merge) -tree structures (such as used in RocksDB and LevelDB) on zoned block devices -by allowing SSTables to be stored in a zone file similarly to a regular file -system rather than as a range of sectors of the entire disk. The introduction -of the higher level construct "one file is one zone" can help reducing the -amount of changes needed in the application as well as introducing support for -different application programming languages. - -Zoned block devices -------------------- - -Zoned storage devices belong to a class of storage devices with an address -space that is divided into zones. A zone is a group of consecutive LBAs and all -zones are contiguous (there are no LBA gaps). Zones may have different types. 
-* Conventional zones: there are no access constraints to LBAs belonging to - conventional zones. Any read or write access can be executed, similarly to a - regular block device. -* Sequential zones: these zones accept random reads but must be written - sequentially. Each sequential zone has a write pointer maintained by the - device that keeps track of the mandatory start LBA position of the next write - to the device. As a result of this write constraint, LBAs in a sequential zone - cannot be overwritten. Sequential zones must first be erased using a special - command (zone reset) before rewriting. - -Zoned storage devices can be implemented using various recording and media -technologies. The most common form of zoned storage today uses the SCSI Zoned -Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled -Magnetic Recording (SMR) HDDs. - -Solid State Disks (SSD) storage devices can also implement a zoned interface -to, for instance, reduce internal write amplification due to garbage collection. -The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard -committee aiming at adding a zoned storage interface to the NVMe protocol. - -Zonefs Overview -=============== - -Zonefs exposes the zones of a zoned block device as files. The files -representing zones are grouped by zone type, which are themselves represented -by sub-directories. This file structure is built entirely using zone information -provided by the device and so does not require any complex on-disk metadata -structure. - -On-disk metadata ----------------- - -zonefs on-disk metadata is reduced to an immutable super block which -persistently stores a magic number and optional feature flags and values. On -mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration -and populates the mount point with a static file tree solely based on this -information. 
File sizes come from the device zone type and write pointer -position managed by the device itself. - -The super block is always written on disk at sector 0. The first zone of the -device storing the super block is never exposed as a zone file by zonefs. If -the zone containing the super block is a sequential zone, the mkzonefs format -tool always "finishes" the zone, that is, it transitions the zone to a full -state to make it read-only, preventing any data write. - -Zone type sub-directories -------------------------- - -Files representing zones of the same type are grouped together under the same -sub-directory automatically created on mount. - -For conventional zones, the sub-directory "cnv" is used. This directory is -however created if and only if the device has usable conventional zones. If -the device only has a single conventional zone at sector 0, the zone will not -be exposed as a file as it will be used to store the zonefs super block. For -such devices, the "cnv" sub-directory will not be created. - -For sequential write zones, the sub-directory "seq" is used. - -These two directories are the only directories that exist in zonefs. Users -cannot create other directories and cannot rename nor delete the "cnv" and -"seq" sub-directories. - -The size of the directories indicated by the st_size field of struct stat, -obtained with the stat() or fstat() system calls, indicates the number of files -existing under the directory. - -Zone files ----------- - -Zone files are named using the number of the zone they represent within the set -of zones of a particular type. That is, both the "cnv" and "seq" directories -contain files named "0", "1", "2", ... The file numbers also represent -increasing zone start sector on the device. - -All read and write operations to zone files are not allowed beyond the file -maximum size, that is, beyond the zone size. Any access exceeding the zone -size is failed with the -EFBIG error. 
- -Creating, deleting, renaming or modifying any attribute of files and -sub-directories is not allowed. - -The number of blocks of a file as reported by stat() and fstat() indicates the -size of the file zone, or in other words, the maximum file size. - -Conventional zone files ------------------------ - -The size of conventional zone files is fixed to the size of the zone they -represent. Conventional zone files cannot be truncated. - -These files can be randomly read and written using any type of I/O operation: -buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O -constraint for these files beyond the file size limit mentioned above. - -Sequential zone files ---------------------- - -The size of sequential zone files grouped in the "seq" sub-directory represents -the file's zone write pointer position relative to the zone start sector. - -Sequential zone files can only be written sequentially, starting from the file -end, that is, write operations can only be append writes. Zonefs makes no -attempt at accepting random writes and will fail any write request that has a -start offset not corresponding to the end of the file, or to the end of the last -write issued and still in-flight (for asynchrnous I/O operations). - -Since dirty page writeback by the page cache does not guarantee a sequential -write pattern, zonefs prevents buffered writes and writeable shared mappings -on sequential files. Only direct I/O writes are accepted for these files. -zonefs relies on the sequential delivery of write I/O requests to the device -implemented by the block layer elevator. An elevator implementing the sequential -write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature) -must be used. This type of elevator (e.g. mq-deadline) is the set by default -for zoned block devices on device initialization. - -There are no restrictions on the type of I/O used for read operations in -sequential zone files. 
Buffered I/Os, direct I/Os and shared read mappings are -all accepted. - -Truncating sequential zone files is allowed only down to 0, in which case, the -zone is reset to rewind the file zone write pointer position to the start of -the zone, or up to the zone size, in which case the file's zone is transitioned -to the FULL state (finish zone operation). - -Format options --------------- - -Several optional features of zonefs can be enabled at format time. -* Conventional zone aggregation: ranges of contiguous conventional zones can be - aggregated into a single larger file instead of the default one file per zone. -* File ownership: The owner UID and GID of zone files is by default 0 (root) - but can be changed to any valid UID/GID. -* File access permissions: the default 640 access permissions can be changed. - -IO error handling ------------------ - -Zoned block devices may fail I/O requests for reasons similar to regular block -devices, e.g. due to bad sectors. However, in addition to such known I/O -failure pattern, the standards governing zoned block devices behavior define -additional conditions that result in I/O errors. - -* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY): - While the data already written in the zone is still readable, the zone can - no longer be written. No user action on the zone (zone management command or - read/write access) can change the zone condition back to a normal read/write - state. While the reasons for the device to transition a zone to read-only - state are not defined by the standards, a typical cause for such transition - would be a defective write head on an HDD (all zones under this head are - changed to read-only). - -* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE): - An offline zone cannot be read nor written. No user action can transition an - offline zone back to an operational good state. 
Similarly to zone read-only - transitions, the reasons for a drive to transition a zone to the offline - condition are undefined. A typical cause would be a defective read-write head - on an HDD causing all zones on the platter under the broken head to be - inaccessible. - -* Unaligned write errors: These errors result from the host issuing write - requests with a start sector that does not correspond to a zone write pointer - position when the write request is executed by the device. Even though zonefs - enforces sequential file write for sequential zones, unaligned write errors - may still happen in the case of a partial failure of a very large direct I/O - operation split into multiple BIOs/requests or asynchronous I/O operations. - If one of the write request within the set of sequential write requests - issued to the device fails, all write requests after queued after it will - become unaligned and fail. - -* Delayed write errors: similarly to regular block devices, if the device side - write cache is enabled, write errors may occur in ranges of previously - completed writes when the device write cache is flushed, e.g. on fsync(). - Similarly to the previous immediate unaligned write error case, delayed write - errors can propagate through a stream of cached sequential data for a zone - causing all data to be dropped after the sector that caused the error. - -All I/O errors detected by zonefs are notified to the user with an error code -return for the system call that trigered or detected the error. The recovery -actions taken by zonefs in response to I/O errors depend on the I/O type (read -vs write) and on the reason for the error (bad sector, unaligned writes or zone -condition change). - -* For read I/O errors, zonefs does not execute any particular recovery action, - but only if the file zone is still in a good condition and there is no - inconsistency between the file inode size and its zone write pointer position. 
- If a problem is detected, I/O error recovery is executed (see below table). - -* For write I/O errors, zonefs I/O error recovery is always executed. - -* A zone condition change to read-only or offline also always triggers zonefs - I/O error recovery. - -Zonefs minimal I/O error recovery may change a file size and a file access -permissions. - -* File size changes: - Immediate or delayed write errors in a sequential zone file may cause the file - inode size to be inconsistent with the amount of data successfully written in - the file zone. For instance, the partial failure of a multi-BIO large write - operation will cause the zone write pointer to advance partially, even though - the entire write operation will be reported as failed to the user. In such - case, the file inode size must be advanced to reflect the zone write pointer - change and eventually allow the user to restart writing at the end of the - file. - A file size may also be reduced to reflect a delayed write error detected on - fsync(): in this case, the amount of data effectively written in the zone may - be less than originally indicated by the file inode size. After such I/O - error, zonefs always fixes a file inode size to reflect the amount of data - persistently stored in the file zone. - -* Access permission changes: - A zone condition change to read-only is indicated with a change in the file - access permissions to render the file read-only. This disables changes to the - file attributes and data modification. For offline zones, all permissions - (read and write) to the file are disabled. - -Further action taken by zonefs I/O error recovery can be controlled by the user -with the "errors=xxx" mount option. The table below summarizes the result of -zonefs I/O error processing depending on the mount option and on the zone -conditions. 
- - +--------------+-----------+-----------------------------------------+ - | | | Post error state | - | "errors=xxx" | device | access permissions | - | mount | zone | file file device zone | - | option | condition | size read write read write | - +--------------+-----------+-----------------------------------------+ - | | good | fixed yes no yes yes | - | remount-ro | read-only | fixed yes no yes no | - | (default) | offline | 0 no no no no | - +--------------+-----------+-----------------------------------------+ - | | good | fixed yes no yes yes | - | zone-ro | read-only | fixed yes no yes no | - | | offline | 0 no no no no | - +--------------+-----------+-----------------------------------------+ - | | good | 0 no no yes yes | - | zone-offline | read-only | 0 no no yes no | - | | offline | 0 no no no no | - +--------------+-----------+-----------------------------------------+ - | | good | fixed yes yes yes yes | - | repair | read-only | fixed yes no yes no | - | | offline | 0 no no no no | - +--------------+-----------+-----------------------------------------+ - -Further notes: -* The "errors=remount-ro" mount option is the default behavior of zonefs I/O - error processing if no errors mount option is specified. -* With the "errors=remount-ro" mount option, the change of the file access - permissions to read-only applies to all files. The file system is remounted - read-only. -* Access permission and file size changes due to the device transitioning zones - to the offline condition are permanent. Remounting or reformating the device - with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good - state. -* File access permission changes to read-only due to the device transitioning - zones to the read-only condition are permanent. Remounting or reformating - the device will not re-enable file write access. 
-* File access permission changes implied by the remount-ro, zone-ro and - zone-offline mount options are temporary for zones in a good condition. - Unmounting and remounting the file system will restore the previous default - (format time values) access rights to the files affected. -* The repair mount option triggers only the minimal set of I/O error recovery - actions, that is, file size fixes for zones in a good condition. Zones - indicated as being read-only or offline by the device still imply changes to - the zone file access permissions as noted in the table above. - -Mount options -------------- - -zonefs define the "errors=" mount option to allow the user to specify -zonefs behavior in response to I/O errors, inode size inconsistencies or zone -condition chages. The defined behaviors are as follow: -* remount-ro (default) -* zone-ro -* zone-offline -* repair - -The I/O error actions defined for each behavior is detailed in the previous -section. - -Zonefs User Space Tools -======================= - -The mkzonefs tool is used to format zoned block devices for use with zonefs. -This tool is available on Github at: - -https://github.com/damien-lemoal/zonefs-tools - -zonefs-tools also includes a test suite which can be run against any zoned -block device, including null_blk block device created with zoned mode. - -Examples --------- - -The following formats a 15TB host-managed SMR HDD with 256 MB zones -with the conventional zones aggregation feature enabled. - -# mkzonefs -o aggr_cnv /dev/sdX -# mount -t zonefs /dev/sdX /mnt -# ls -l /mnt/ -total 0 -dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv -dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq - -The size of the zone files sub-directories indicate the number of files -existing for each type of zones. In this example, there is only one -conventional zone file (all conventional zones are aggregated under a single -file). 
- -# ls -l /mnt/cnv -total 137101312 --rw-r----- 1 root root 140391743488 Nov 25 13:23 0 - -This aggregated conventional zone file can be used as a regular file. - -# mkfs.ext4 /mnt/cnv/0 -# mount -o loop /mnt/cnv/0 /data - -The "seq" sub-directory grouping files for sequential write zones has in this -example 55356 zones. - -# ls -lv /mnt/seq -total 14511243264 --rw-r----- 1 root root 0 Nov 25 13:23 0 --rw-r----- 1 root root 0 Nov 25 13:23 1 --rw-r----- 1 root root 0 Nov 25 13:23 2 -... --rw-r----- 1 root root 0 Nov 25 13:23 55354 --rw-r----- 1 root root 0 Nov 25 13:23 55355 - -For sequential write zone files, the file size changes as data is appended at -the end of the file, similarly to any regular file system. - -# dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct -1+0 records in -1+0 records out -4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s - -# ls -l /mnt/seq/0 --rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 - -The written file can be truncated to the zone size, preventing any further -write operation. - -# truncate -s 268435456 /mnt/seq/0 -# ls -l /mnt/seq/0 --rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 - -Truncation to 0 size allows freeing the file zone storage space and restart -append-writes to the file. - -# truncate -s 0 /mnt/seq/0 -# ls -l /mnt/seq/0 --rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 - -Since files are statically mapped to zones on the disk, the number of blocks of -a file as reported by stat() and fstat() indicates the size of the file zone. 
- -# stat /mnt/seq/0 - File: /mnt/seq/0 - Size: 0 Blocks: 524288 IO Block: 4096 regular empty file -Device: 870h/2160d Inode: 50431 Links: 1 -Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2019-11-25 13:23:57.048971997 +0900 -Modify: 2019-11-25 13:52:25.553805765 +0900 -Change: 2019-11-25 13:52:25.553805765 +0900 - Birth: - - -The number of blocks of the file ("Blocks") in units of 512B blocks gives the -maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone -size in this example. Of note is that the "IO block" field always indicates the -minimum I/O size for writes and corresponds to the device physical sector size. -- cgit From 19796c348ab62287fc6434f296ae279d7b97e39f Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Sun, 8 Mar 2020 22:14:43 +0100 Subject: docs: Move Intel Many Integrated Core documentation (mic) under misc-devices MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit It doesn't need to be a top-level chapter. This patch also updates MAINTAINERS and makes sure the F: lines are properly sorted. 
Signed-off-by: Jonathan Neuschäfer Reviewed-by: Andy Shevchenko Link: https://lore.kernel.org/r/20200308211519.8414-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Corbet --- Documentation/index.rst | 1 - Documentation/mic/index.rst | 16 ---- Documentation/mic/mic_overview.rst | 85 ------------------ Documentation/mic/scif_overview.rst | 108 ----------------------- Documentation/misc-devices/index.rst | 1 + Documentation/misc-devices/mic/index.rst | 16 ++++ Documentation/misc-devices/mic/mic_overview.rst | 85 ++++++++++++++++++ Documentation/misc-devices/mic/scif_overview.rst | 108 +++++++++++++++++++++++ MAINTAINERS | 8 +- 9 files changed, 214 insertions(+), 214 deletions(-) delete mode 100644 Documentation/mic/index.rst delete mode 100644 Documentation/mic/mic_overview.rst delete mode 100644 Documentation/mic/scif_overview.rst create mode 100644 Documentation/misc-devices/mic/index.rst create mode 100644 Documentation/misc-devices/mic/mic_overview.rst create mode 100644 Documentation/misc-devices/mic/scif_overview.rst diff --git a/Documentation/index.rst b/Documentation/index.rst index e99d0bd2589d..6fdad61ee443 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -131,7 +131,6 @@ needed). usb/index PCI/index misc-devices/index - mic/index scheduler/index Architecture-agnostic documentation diff --git a/Documentation/mic/index.rst b/Documentation/mic/index.rst deleted file mode 100644 index 3a8d06367ef1..000000000000 --- a/Documentation/mic/index.rst +++ /dev/null @@ -1,16 +0,0 @@ -============================================= -Intel Many Integrated Core (MIC) architecture -============================================= - -.. toctree:: - :maxdepth: 1 - - mic_overview - scif_overview - -.. 
only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/mic/mic_overview.rst b/Documentation/mic/mic_overview.rst deleted file mode 100644 index 17d956bdaf7c..000000000000 --- a/Documentation/mic/mic_overview.rst +++ /dev/null @@ -1,85 +0,0 @@ -====================================================== -Intel Many Integrated Core (MIC) architecture overview -====================================================== - -An Intel MIC X100 device is a PCIe form factor add-in coprocessor -card based on the Intel Many Integrated Core (MIC) architecture -that runs a Linux OS. It is a PCIe endpoint in a platform and therefore -implements the three required standard address spaces i.e. configuration, -memory and I/O. The host OS loads a device driver as is typical for -PCIe devices. The card itself runs a bootstrap after reset that -transfers control to the card OS downloaded from the host driver. The -host driver supports OSPM suspend and resume operations. It shuts down -the card during suspend and reboots the card OS during resume. -The card OS as shipped by Intel is a Linux kernel with modifications -for the X100 devices. - -Since it is a PCIe card, it does not have the ability to host hardware -devices for networking, storage and console. We provide these devices -on X100 coprocessors thus enabling a self-bootable equivalent -environment for applications. A key benefit of our solution is that it -leverages the standard virtio framework for network, disk and console -devices, though in our case the virtio framework is used across a PCIe -bus. A Virtio Over PCIe (VOP) driver allows creating user space -backends or devices on the host which are used to probe virtio drivers -for these devices on the MIC card. The existing VRINGH infrastructure -in the kernel is used to access virtio rings from the host. The card -VOP driver allows card virtio drivers to communicate with their user -space backends on the host via a device page. 
Ring 3 apps on the host -can add, remove and configure virtio devices. A thin MIC specific -virtio_config_ops is implemented which is borrowed heavily from -previous similar implementations in lguest and s390. - -MIC PCIe card has a dma controller with 8 channels. These channels are -shared between the host s/w and the card s/w. 0 to 3 are used by host -and 4 to 7 by card. As the dma device doesn't show up as PCIe device, -a virtual bus called mic bus is created and virtual dma devices are -created on it by the host/card drivers. On host the channels are private -and used only by the host driver to transfer data for the virtio devices. - -The Symmetric Communication Interface (SCIF (pronounced as skiff)) is a -low level communications API across PCIe currently implemented for MIC. -More details are available at scif_overview.txt. - -The Coprocessor State Management (COSM) driver on the host allows for -boot, shutdown and reset of Intel MIC devices. It communicates with a COSM -"client" driver on the MIC cards over SCIF to perform these functions. - -Here is a block diagram of the various components described above. 
The -virtio backends are situated on the host rather than the card given better -single threaded performance for the host compared to MIC, the ability of -the host to initiate DMA's to/from the card using the MIC DMA engine and -the fact that the virtio block storage backend can only be on the host:: - - +----------+ | +----------+ - | Card OS | | | Host OS | - +----------+ | +----------+ - | - +-------+ +--------+ +------+ | +---------+ +--------+ +--------+ - | Virtio| |Virtio | |Virtio| | |Virtio | |Virtio | |Virtio | - | Net | |Console | |Block | | |Net | |Console | |Block | - | Driver| |Driver | |Driver| | |backend | |backend | |backend | - +---+---+ +---+----+ +--+---+ | +---------+ +----+---+ +--------+ - | | | | | | | - | | | |User | | | - | | | |------|------------|--+------|------- - +---------+---------+ |Kernel | - | | | - +---------+ +---+----+ +------+ | +------+ +------+ +--+---+ +-------+ - |MIC DMA | | VOP | | SCIF | | | SCIF | | COSM | | VOP | |MIC DMA| - +---+-----+ +---+----+ +--+---+ | +--+---+ +--+---+ +------+ +----+--+ - | | | | | | | - +---+-----+ +---+----+ +--+---+ | +--+---+ +--+---+ +------+ +----+--+ - |MIC | | VOP | |SCIF | | |SCIF | | COSM | | VOP | | MIC | - |HW Bus | | HW Bus| |HW Bus| | |HW Bus| | Bus | |HW Bus| |HW Bus | - +---------+ +--------+ +--+---+ | +--+---+ +------+ +------+ +-------+ - | | | | | | | - | +-----------+--+ | | | +---------------+ | - | |Intel MIC | | | | |Intel MIC | | - | |Card Driver | | | | |Host Driver | | - +---+--------------+------+ | +----+---------------+-----+ - | | | - +-------------------------------------------------------------+ - | | - | PCIe Bus | - +-------------------------------------------------------------+ diff --git a/Documentation/mic/scif_overview.rst b/Documentation/mic/scif_overview.rst deleted file mode 100644 index 4c8ad9e43706..000000000000 --- a/Documentation/mic/scif_overview.rst +++ /dev/null @@ -1,108 +0,0 @@ -======================================== -Symmetric 
Communication Interface (SCIF) -======================================== - -The Symmetric Communication Interface (SCIF (pronounced as skiff)) is a low -level communications API across PCIe currently implemented for MIC. Currently -SCIF provides inter-node communication within a single host platform, where a -node is a MIC Coprocessor or Xeon based host. SCIF abstracts the details of -communicating over the PCIe bus while providing an API that is symmetric -across all the nodes in the PCIe network. An important design objective for SCIF -is to deliver the maximum possible performance given the communication -abilities of the hardware. SCIF has been used to implement an offload compiler -runtime and OFED support for MPI implementations for MIC coprocessors. - -SCIF API Components -=================== - -The SCIF API has the following parts: - -1. Connection establishment using a client server model -2. Byte stream messaging intended for short messages -3. Node enumeration to determine online nodes -4. Poll semantics for detection of incoming connections and messages -5. Memory registration to pin down pages -6. Remote memory mapping for low latency CPU accesses via mmap -7. Remote DMA (RDMA) for high bandwidth DMA transfers -8. Fence APIs for RDMA synchronization - -SCIF exposes the notion of a connection which can be used by peer processes on -nodes in a SCIF PCIe "network" to share memory "windows" and to communicate. A -process in a SCIF node initiates a SCIF connection to a peer process on a -different node via a SCIF "endpoint". SCIF endpoints support messaging APIs -which are similar to connection oriented socket APIs. Connected SCIF endpoints -can also register local memory which is followed by data transfer using either -DMA, CPU copies or remote memory mapping via mmap. SCIF supports both user and -kernel mode clients which are functionally equivalent. 
- -SCIF Performance for MIC -======================== - -DMA bandwidth comparison between the TCP (over ethernet over PCIe) stack versus -SCIF shows the performance advantages of SCIF for HPC applications and -runtimes:: - - Comparison of TCP and SCIF based BW - - Throughput (GB/sec) - 8 + PCIe Bandwidth ****** - + TCP ###### - 7 + ************************************** SCIF %%%%%% - | %%%%%%%%%%%%%%%%%%% - 6 + %%%% - | %% - | %%% - 5 + %% - | %% - 4 + %% - | %% - 3 + %% - | % - 2 + %% - | %% - | % - 1 + - + ###################################### - 0 +++---+++--+--+-+--+--+-++-+--+-++-+--+-++-+- - 1 10 100 1000 10000 100000 - Transfer Size (KBytes) - -SCIF allows memory sharing via mmap(..) between processes on different PCIe -nodes and thus provides bare-metal PCIe latency. The round trip SCIF mmap -latency from the host to an x100 MIC for an 8 byte message is 0.44 usecs. - -SCIF has a user space library which is a thin IOCTL wrapper providing a user -space API similar to the kernel API in scif.h. The SCIF user space library -is distributed @ https://software.intel.com/en-us/mic-developer - -Here is some pseudo code for an example of how two applications on two PCIe -nodes would typically use the SCIF API:: - - Process A (on node A) Process B (on node B) - - /* get online node information */ - scif_get_node_ids(..) scif_get_node_ids(..) - scif_open(..) scif_open(..) - scif_bind(..) scif_bind(..) - scif_listen(..) - scif_accept(..) scif_connect(..) - /* SCIF connection established */ - - /* Send and receive short messages */ - scif_send(..)/scif_recv(..) scif_send(..)/scif_recv(..) - - /* Register memory */ - scif_register(..) scif_register(..) - - /* RDMA */ - scif_readfrom(..)/scif_writeto(..) scif_readfrom(..)/scif_writeto(..) - - /* Fence DMAs */ - scif_fence_signal(..) scif_fence_signal(..) - - mmap(..) mmap(..) - - /* Access remote registered memory */ - - /* Close the endpoints */ - scif_close(..) scif_close(..) 
diff --git a/Documentation/misc-devices/index.rst b/Documentation/misc-devices/index.rst index f11c5daeada5..c1dcd2628911 100644 --- a/Documentation/misc-devices/index.rst +++ b/Documentation/misc-devices/index.rst @@ -20,4 +20,5 @@ fit into other categories. isl29003 lis3lv02d max6875 + mic/index xilinx_sdfec diff --git a/Documentation/misc-devices/mic/index.rst b/Documentation/misc-devices/mic/index.rst new file mode 100644 index 000000000000..3a8d06367ef1 --- /dev/null +++ b/Documentation/misc-devices/mic/index.rst @@ -0,0 +1,16 @@ +============================================= +Intel Many Integrated Core (MIC) architecture +============================================= + +.. toctree:: + :maxdepth: 1 + + mic_overview + scif_overview + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/misc-devices/mic/mic_overview.rst b/Documentation/misc-devices/mic/mic_overview.rst new file mode 100644 index 000000000000..17d956bdaf7c --- /dev/null +++ b/Documentation/misc-devices/mic/mic_overview.rst @@ -0,0 +1,85 @@ +====================================================== +Intel Many Integrated Core (MIC) architecture overview +====================================================== + +An Intel MIC X100 device is a PCIe form factor add-in coprocessor +card based on the Intel Many Integrated Core (MIC) architecture +that runs a Linux OS. It is a PCIe endpoint in a platform and therefore +implements the three required standard address spaces i.e. configuration, +memory and I/O. The host OS loads a device driver as is typical for +PCIe devices. The card itself runs a bootstrap after reset that +transfers control to the card OS downloaded from the host driver. The +host driver supports OSPM suspend and resume operations. It shuts down +the card during suspend and reboots the card OS during resume. +The card OS as shipped by Intel is a Linux kernel with modifications +for the X100 devices. 
+ +Since it is a PCIe card, it does not have the ability to host hardware +devices for networking, storage and console. We provide these devices +on X100 coprocessors thus enabling a self-bootable equivalent +environment for applications. A key benefit of our solution is that it +leverages the standard virtio framework for network, disk and console +devices, though in our case the virtio framework is used across a PCIe +bus. A Virtio Over PCIe (VOP) driver allows creating user space +backends or devices on the host which are used to probe virtio drivers +for these devices on the MIC card. The existing VRINGH infrastructure +in the kernel is used to access virtio rings from the host. The card +VOP driver allows card virtio drivers to communicate with their user +space backends on the host via a device page. Ring 3 apps on the host +can add, remove and configure virtio devices. A thin MIC specific +virtio_config_ops is implemented which is borrowed heavily from +previous similar implementations in lguest and s390. + +MIC PCIe card has a dma controller with 8 channels. These channels are +shared between the host s/w and the card s/w. 0 to 3 are used by host +and 4 to 7 by card. As the dma device doesn't show up as PCIe device, +a virtual bus called mic bus is created and virtual dma devices are +created on it by the host/card drivers. On host the channels are private +and used only by the host driver to transfer data for the virtio devices. + +The Symmetric Communication Interface (SCIF (pronounced as skiff)) is a +low level communications API across PCIe currently implemented for MIC. +More details are available at scif_overview.txt. + +The Coprocessor State Management (COSM) driver on the host allows for +boot, shutdown and reset of Intel MIC devices. It communicates with a COSM +"client" driver on the MIC cards over SCIF to perform these functions. + +Here is a block diagram of the various components described above. 
The +virtio backends are situated on the host rather than the card given better +single threaded performance for the host compared to MIC, the ability of +the host to initiate DMA's to/from the card using the MIC DMA engine and +the fact that the virtio block storage backend can only be on the host:: + + +----------+ | +----------+ + | Card OS | | | Host OS | + +----------+ | +----------+ + | + +-------+ +--------+ +------+ | +---------+ +--------+ +--------+ + | Virtio| |Virtio | |Virtio| | |Virtio | |Virtio | |Virtio | + | Net | |Console | |Block | | |Net | |Console | |Block | + | Driver| |Driver | |Driver| | |backend | |backend | |backend | + +---+---+ +---+----+ +--+---+ | +---------+ +----+---+ +--------+ + | | | | | | | + | | | |User | | | + | | | |------|------------|--+------|------- + +---------+---------+ |Kernel | + | | | + +---------+ +---+----+ +------+ | +------+ +------+ +--+---+ +-------+ + |MIC DMA | | VOP | | SCIF | | | SCIF | | COSM | | VOP | |MIC DMA| + +---+-----+ +---+----+ +--+---+ | +--+---+ +--+---+ +------+ +----+--+ + | | | | | | | + +---+-----+ +---+----+ +--+---+ | +--+---+ +--+---+ +------+ +----+--+ + |MIC | | VOP | |SCIF | | |SCIF | | COSM | | VOP | | MIC | + |HW Bus | | HW Bus| |HW Bus| | |HW Bus| | Bus | |HW Bus| |HW Bus | + +---------+ +--------+ +--+---+ | +--+---+ +------+ +------+ +-------+ + | | | | | | | + | +-----------+--+ | | | +---------------+ | + | |Intel MIC | | | | |Intel MIC | | + | |Card Driver | | | | |Host Driver | | + +---+--------------+------+ | +----+---------------+-----+ + | | | + +-------------------------------------------------------------+ + | | + | PCIe Bus | + +-------------------------------------------------------------+ diff --git a/Documentation/misc-devices/mic/scif_overview.rst b/Documentation/misc-devices/mic/scif_overview.rst new file mode 100644 index 000000000000..4c8ad9e43706 --- /dev/null +++ b/Documentation/misc-devices/mic/scif_overview.rst @@ -0,0 +1,108 @@ 
+======================================== +Symmetric Communication Interface (SCIF) +======================================== + +The Symmetric Communication Interface (SCIF (pronounced as skiff)) is a low +level communications API across PCIe currently implemented for MIC. Currently +SCIF provides inter-node communication within a single host platform, where a +node is a MIC Coprocessor or Xeon based host. SCIF abstracts the details of +communicating over the PCIe bus while providing an API that is symmetric +across all the nodes in the PCIe network. An important design objective for SCIF +is to deliver the maximum possible performance given the communication +abilities of the hardware. SCIF has been used to implement an offload compiler +runtime and OFED support for MPI implementations for MIC coprocessors. + +SCIF API Components +=================== + +The SCIF API has the following parts: + +1. Connection establishment using a client server model +2. Byte stream messaging intended for short messages +3. Node enumeration to determine online nodes +4. Poll semantics for detection of incoming connections and messages +5. Memory registration to pin down pages +6. Remote memory mapping for low latency CPU accesses via mmap +7. Remote DMA (RDMA) for high bandwidth DMA transfers +8. Fence APIs for RDMA synchronization + +SCIF exposes the notion of a connection which can be used by peer processes on +nodes in a SCIF PCIe "network" to share memory "windows" and to communicate. A +process in a SCIF node initiates a SCIF connection to a peer process on a +different node via a SCIF "endpoint". SCIF endpoints support messaging APIs +which are similar to connection oriented socket APIs. Connected SCIF endpoints +can also register local memory which is followed by data transfer using either +DMA, CPU copies or remote memory mapping via mmap. SCIF supports both user and +kernel mode clients which are functionally equivalent. 
+ +SCIF Performance for MIC +======================== + +DMA bandwidth comparison between the TCP (over ethernet over PCIe) stack versus +SCIF shows the performance advantages of SCIF for HPC applications and +runtimes:: + + Comparison of TCP and SCIF based BW + + Throughput (GB/sec) + 8 + PCIe Bandwidth ****** + + TCP ###### + 7 + ************************************** SCIF %%%%%% + | %%%%%%%%%%%%%%%%%%% + 6 + %%%% + | %% + | %%% + 5 + %% + | %% + 4 + %% + | %% + 3 + %% + | % + 2 + %% + | %% + | % + 1 + + + ###################################### + 0 +++---+++--+--+-+--+--+-++-+--+-++-+--+-++-+- + 1 10 100 1000 10000 100000 + Transfer Size (KBytes) + +SCIF allows memory sharing via mmap(..) between processes on different PCIe +nodes and thus provides bare-metal PCIe latency. The round trip SCIF mmap +latency from the host to an x100 MIC for an 8 byte message is 0.44 usecs. + +SCIF has a user space library which is a thin IOCTL wrapper providing a user +space API similar to the kernel API in scif.h. The SCIF user space library +is distributed @ https://software.intel.com/en-us/mic-developer + +Here is some pseudo code for an example of how two applications on two PCIe +nodes would typically use the SCIF API:: + + Process A (on node A) Process B (on node B) + + /* get online node information */ + scif_get_node_ids(..) scif_get_node_ids(..) + scif_open(..) scif_open(..) + scif_bind(..) scif_bind(..) + scif_listen(..) + scif_accept(..) scif_connect(..) + /* SCIF connection established */ + + /* Send and receive short messages */ + scif_send(..)/scif_recv(..) scif_send(..)/scif_recv(..) + + /* Register memory */ + scif_register(..) scif_register(..) + + /* RDMA */ + scif_readfrom(..)/scif_writeto(..) scif_readfrom(..)/scif_writeto(..) + + /* Fence DMAs */ + scif_fence_signal(..) scif_fence_signal(..) + + mmap(..) mmap(..) + + /* Access remote registered memory */ + + /* Close the endpoints */ + scif_close(..) scif_close(..) 
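A note on the pseudo code above: the listen/accept and connect pairing follows the connection-oriented socket pattern that the overview itself compares SCIF messaging to. The sketch below is an analogy only. It uses Python's standard `socket` module over loopback TCP and does not call the SCIF API; the helper names `recv_exact` and `run_pair` are made up for illustration.

```python
import socket
import threading


def recv_exact(sock, n):
    """Read exactly n bytes, since a stream recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before sending all bytes")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def run_pair(payload):
    """Exchange one short message each way; return (seen_by_a, seen_by_b)."""
    # Side A: open, bind, listen (compare scif_open/scif_bind/scif_listen).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    seen_by_a = []

    def side_a():
        # Compare scif_accept, then scif_recv/scif_send on the new endpoint.
        conn, _peer = listener.accept()
        with conn:
            conn.settimeout(5)
            data = recv_exact(conn, len(payload))
            seen_by_a.append(data)
            conn.sendall(data[::-1])  # reply with the bytes reversed

    worker = threading.Thread(target=side_a)
    worker.start()

    # Side B: open and connect (compare scif_open/scif_connect).
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(payload)
        seen_by_b = recv_exact(client, len(payload))

    worker.join()
    listener.close()
    return seen_by_a[0], seen_by_b
```

Calling `run_pair(b"hello")` hands side A the original bytes and side B the reversed reply. The SCIF-specific steps in the pseudo code (memory registration, RDMA, fences, mmap of remote windows) have no counterpart in this sketch.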
diff --git a/MAINTAINERS b/MAINTAINERS index 38fe2f3f7b6f..083fcf1a151c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8569,15 +8569,15 @@ M: Ashutosh Dixit S: Supported W: https://github.com/sudeepdutt/mic W: http://software.intel.com/en-us/mic-developer +F: Documentation/misc-devices/mic/ +F: drivers/dma/mic_x100_dma.c +F: drivers/dma/mic_x100_dma.h +F: drivers/misc/mic/ F: include/linux/mic_bus.h F: include/linux/scif.h F: include/uapi/linux/mic_common.h F: include/uapi/linux/mic_ioctl.h F: include/uapi/linux/scif_ioctl.h -F: drivers/misc/mic/ -F: drivers/dma/mic_x100_dma.c -F: drivers/dma/mic_x100_dma.h -F: Documentation/mic/ INTEL PMC CORE DRIVER M: Rajneesh Bhardwaj -- cgit From ea6b5370836f995f1cdee45ae03a992e572efa45 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Sun, 8 Mar 2020 22:09:34 +0100 Subject: docs: admin-guide: binfmt-misc: Improve the title MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Trim the title a bit, since it's relatively long. Add `binfmt_misc` to make it easier to search for the feature by its common name. Signed-off-by: Jonathan Neuschäfer Link: https://lore.kernel.org/r/20200308210935.7273-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/binfmt-misc.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst index 97b0d7927078..95c93bbe408a 100644 --- a/Documentation/admin-guide/binfmt-misc.rst +++ b/Documentation/admin-guide/binfmt-misc.rst @@ -1,5 +1,5 @@ -Kernel Support for miscellaneous (your favourite) Binary Formats v1.1 -===================================================================== +Kernel Support for miscellaneous Binary Formats (binfmt_misc) +============================================================= This Kernel feature allows you to invoke almost (for restrictions see below) every program by simply typing its name in the shell. 
-- cgit From d442bbca36751a4c791a7559cd249f5306f5a23f Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Thu, 5 Mar 2020 21:51:21 +0100 Subject: docs: it_IT: netdev-FAQ: Fix link to original document MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Jonathan Neuschäfer Reviewed-by: Federico Vaga Link: https://lore.kernel.org/r/20200305205123.8569-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Corbet --- Documentation/translations/it_IT/networking/netdev-FAQ.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/translations/it_IT/networking/netdev-FAQ.rst b/Documentation/translations/it_IT/networking/netdev-FAQ.rst index 8489ead7cff1..7e2456bb7d92 100644 --- a/Documentation/translations/it_IT/networking/netdev-FAQ.rst +++ b/Documentation/translations/it_IT/networking/netdev-FAQ.rst @@ -1,6 +1,6 @@ .. include:: ../disclaimer-ita.rst -:Original: :ref:`Documentation/process/stable-kernel-rules.rst ` +:Original: :ref:`Documentation/networking/netdev-FAQ.rst ` .. _it_netdev-FAQ: -- cgit From d8401f504b49c71280e504e41b3b56876094f081 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 4 Mar 2020 23:03:47 -0800 Subject: docs: deprecated.rst: Add %p to the list Once in a while %p usage comes up, and I've needed to have a reference to point people to. Add %p details to deprecated.rst. Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/202003042301.F844A8C0EC@keescook Signed-off-by: Jonathan Corbet --- Documentation/process/deprecated.rst | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst index 179f2a5625a0..7160a449e6c6 100644 --- a/Documentation/process/deprecated.rst +++ b/Documentation/process/deprecated.rst @@ -109,6 +109,28 @@ the given limit of bytes to copy. This is inefficient and can lead to linear read overflows if a source string is not NUL-terminated. 
The safe replacement is :c:func:`strscpy`. +%p format specifier +------------------- +Traditionally, using "%p" in format strings would lead to regular address +exposure flaws in dmesg, proc, sysfs, etc. Instead of leaving these to +be exploitable, all "%p" uses in the kernel are being printed as a hashed +value, rendering them unusable for addressing. New uses of "%p" should not +be added to the kernel. For text addresses, using "%pS" is likely better, +as it produces the more useful symbol name instead. For nearly everything +else, just do not add "%p" at all. + +Paraphrasing Linus's current `guidance `_: + +- If the hashed "%p" value is pointless, ask yourself whether the pointer + itself is important. Maybe it should be removed entirely? +- If you really think the true pointer value is important, why is some + system state or user privilege level considered "special"? If you think + you can justify it (in comments and commit log) well enough to stand + up to Linus's scrutiny, maybe you can use "%px", along with making sure + you have sensible permissions. + +And finally, know that a toggle for "%p" hashing will `not be accepted `_. + Variable Length Arrays (VLAs) ----------------------------- Using stack VLAs produces much worse machine code than statically -- cgit From 5e72017279957b764c225f143c16391b3c51f225 Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Mon, 2 Mar 2020 15:17:17 -0700 Subject: docs: Organize core-api/index.rst The core-api manual has become a big, disorganized mess. Try to bring a small amount of order to it by organizing the documents into subcategories. 
Signed-off-by: Jonathan Corbet --- Documentation/core-api/index.rst | 95 ++++++++++++++++++++++++++++++---------- 1 file changed, 73 insertions(+), 22 deletions(-) diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index d02b26917931..b39dae276b57 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -8,42 +8,81 @@ This is the beginning of a manual for core kernel APIs. The conversion Core utilities ============== +This section has general and "core core" documentation. The first is a +massive grab-bag of kerneldoc info left over from the docbook days; it +should really be broken up someday when somebody finds the energy to do +it. + .. toctree:: :maxdepth: 1 kernel-api + workqueue + printk-formats + symbol-namespaces + +Data structures and low-level utilities +======================================= + +Library functionality that is used throughout the kernel. + +.. toctree:: + :maxdepth: 1 + kobject assoc_array + xarray + idr + circular-buffers + generic-radix-tree + packing + timekeeping + errseq + +Concurrency primitives +====================== + +How Linux keeps everything from happening at the same time. See +:doc:`/locking/index` for more related documentation. + +.. toctree:: + :maxdepth: 1 + atomic_ops - cachetlb refcount-vs-atomic - cpu_hotplug - idr local_ops - workqueue + padata + ../RCU/index + +Low-level hardware management +============================= + +Cache management, managing CPU hotplug, etc. + +.. toctree:: + :maxdepth: 1 + + cachetlb + cpu_hotplug + memory-hotplug genericirq - xarray - librs - genalloc - errseq - packing - printk-formats - circular-buffers - generic-radix-tree + protection-keys + +Memory management +================= + +How to allocate and use memory in the kernel. Note that there is a lot +more memory-management documentation in :doc:`/vm/index`. + +.. 
toctree:: + :maxdepth: 1 + memory-allocation mm-api + genalloc pin_user_pages - gfp_mask-from-fs-io - timekeeping boot-time-mm - memory-hotplug - protection-keys - ../RCU/index - gcc-plugins - symbol-namespaces - padata - ioctl - + gfp_mask-from-fs-io Interfaces for kernel debugging =============================== @@ -54,6 +93,18 @@ Interfaces for kernel debugging debug-objects tracepoint +Everything else +=============== + +Documents that don't fit elsewhere or which have yet to be categorized. + +.. toctree:: + :maxdepth: 1 + + librs + gcc-plugins + ioctl + .. only:: subproject and html Indices -- cgit From 2b4cbd5c950525b6d4d2cd384dcefdd95fedabe3 Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Mon, 2 Mar 2020 15:24:04 -0700 Subject: docs: move gcc-plugins to the kbuild manual Information about GCC plugins is relevant to kernel building, so move this document to the kbuild manual. Acked-by: Masahiro Yamada Signed-off-by: Jonathan Corbet --- Documentation/core-api/gcc-plugins.rst | 97 ---------------------------------- Documentation/core-api/index.rst | 1 - Documentation/kbuild/gcc-plugins.rst | 97 ++++++++++++++++++++++++++++++++++ Documentation/kbuild/index.rst | 1 + MAINTAINERS | 2 +- scripts/gcc-plugins/Kconfig | 2 +- 6 files changed, 100 insertions(+), 100 deletions(-) delete mode 100644 Documentation/core-api/gcc-plugins.rst create mode 100644 Documentation/kbuild/gcc-plugins.rst diff --git a/Documentation/core-api/gcc-plugins.rst b/Documentation/core-api/gcc-plugins.rst deleted file mode 100644 index 4b1c10f88e30..000000000000 --- a/Documentation/core-api/gcc-plugins.rst +++ /dev/null @@ -1,97 +0,0 @@ -========================= -GCC plugin infrastructure -========================= - - -Introduction -============ - -GCC plugins are loadable modules that provide extra features to the -compiler [1]_. They are useful for runtime instrumentation and static analysis. 
-We can analyse, change and add further code during compilation via -callbacks [2]_, GIMPLE [3]_, IPA [4]_ and RTL passes [5]_. - -The GCC plugin infrastructure of the kernel supports all gcc versions from -4.5 to 6.0, building out-of-tree modules, cross-compilation and building in a -separate directory. -Plugin source files have to be compilable by both a C and a C++ compiler as well -because gcc versions 4.5 and 4.6 are compiled by a C compiler, -gcc-4.7 can be compiled by a C or a C++ compiler, -and versions 4.8+ can only be compiled by a C++ compiler. - -Currently the GCC plugin infrastructure supports only the x86, arm, arm64 and -powerpc architectures. - -This infrastructure was ported from grsecurity [6]_ and PaX [7]_. - --- - -.. [1] https://gcc.gnu.org/onlinedocs/gccint/Plugins.html -.. [2] https://gcc.gnu.org/onlinedocs/gccint/Plugin-API.html#Plugin-API -.. [3] https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html -.. [4] https://gcc.gnu.org/onlinedocs/gccint/IPA.html -.. [5] https://gcc.gnu.org/onlinedocs/gccint/RTL.html -.. [6] https://grsecurity.net/ -.. [7] https://pax.grsecurity.net/ - - -Files -===== - -**$(src)/scripts/gcc-plugins** - - This is the directory of the GCC plugins. - -**$(src)/scripts/gcc-plugins/gcc-common.h** - - This is a compatibility header for GCC plugins. - It should be always included instead of individual gcc headers. - -**$(src)/scripts/gcc-plugin.sh** - - This script checks the availability of the included headers in - gcc-common.h and chooses the proper host compiler to build the plugins - (gcc-4.7 can be built by either gcc or g++). - -**$(src)/scripts/gcc-plugins/gcc-generate-gimple-pass.h, -$(src)/scripts/gcc-plugins/gcc-generate-ipa-pass.h, -$(src)/scripts/gcc-plugins/gcc-generate-simple_ipa-pass.h, -$(src)/scripts/gcc-plugins/gcc-generate-rtl-pass.h** - - These headers automatically generate the registration structures for - GIMPLE, SIMPLE_IPA, IPA and RTL passes. They support all gcc versions - from 4.5 to 6.0. 
- They should be preferred to creating the structures by hand. - - -Usage -===== - -You must install the gcc plugin headers for your gcc version, -e.g., on Ubuntu for gcc-4.9:: - - apt-get install gcc-4.9-plugin-dev - -Or on Fedora:: - - dnf install gcc-plugin-devel - -Enable a GCC plugin based feature in the kernel config:: - - CONFIG_GCC_PLUGIN_CYC_COMPLEXITY = y - -To compile only the plugin(s):: - - make gcc-plugins - -or just run the kernel make and compile the whole kernel with -the cyclomatic complexity GCC plugin. - - -4. How to add a new GCC plugin -============================== - -The GCC plugins are in $(src)/scripts/gcc-plugins/. You can use a file or a directory -here. It must be added to $(src)/scripts/gcc-plugins/Makefile, -$(src)/scripts/Makefile.gcc-plugins and $(src)/arch/Kconfig. -See the cyc_complexity_plugin.c (CONFIG_GCC_PLUGIN_CYC_COMPLEXITY) GCC plugin. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index b39dae276b57..9836a0ac09a3 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -102,7 +102,6 @@ Documents that don't fit elsewhere or which have yet to be categorized. :maxdepth: 1 librs - gcc-plugins ioctl .. only:: subproject and html diff --git a/Documentation/kbuild/gcc-plugins.rst b/Documentation/kbuild/gcc-plugins.rst new file mode 100644 index 000000000000..4b1c10f88e30 --- /dev/null +++ b/Documentation/kbuild/gcc-plugins.rst @@ -0,0 +1,97 @@ +========================= +GCC plugin infrastructure +========================= + + +Introduction +============ + +GCC plugins are loadable modules that provide extra features to the +compiler [1]_. They are useful for runtime instrumentation and static analysis. +We can analyse, change and add further code during compilation via +callbacks [2]_, GIMPLE [3]_, IPA [4]_ and RTL passes [5]_. 
+ +The GCC plugin infrastructure of the kernel supports all gcc versions from +4.5 to 6.0, building out-of-tree modules, cross-compilation and building in a +separate directory. +Plugin source files have to be compilable by both a C and a C++ compiler as well +because gcc versions 4.5 and 4.6 are compiled by a C compiler, +gcc-4.7 can be compiled by a C or a C++ compiler, +and versions 4.8+ can only be compiled by a C++ compiler. + +Currently the GCC plugin infrastructure supports only the x86, arm, arm64 and +powerpc architectures. + +This infrastructure was ported from grsecurity [6]_ and PaX [7]_. + +-- + +.. [1] https://gcc.gnu.org/onlinedocs/gccint/Plugins.html +.. [2] https://gcc.gnu.org/onlinedocs/gccint/Plugin-API.html#Plugin-API +.. [3] https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html +.. [4] https://gcc.gnu.org/onlinedocs/gccint/IPA.html +.. [5] https://gcc.gnu.org/onlinedocs/gccint/RTL.html +.. [6] https://grsecurity.net/ +.. [7] https://pax.grsecurity.net/ + + +Files +===== + +**$(src)/scripts/gcc-plugins** + + This is the directory of the GCC plugins. + +**$(src)/scripts/gcc-plugins/gcc-common.h** + + This is a compatibility header for GCC plugins. + It should be always included instead of individual gcc headers. + +**$(src)/scripts/gcc-plugin.sh** + + This script checks the availability of the included headers in + gcc-common.h and chooses the proper host compiler to build the plugins + (gcc-4.7 can be built by either gcc or g++). + +**$(src)/scripts/gcc-plugins/gcc-generate-gimple-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-ipa-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-simple_ipa-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-rtl-pass.h** + + These headers automatically generate the registration structures for + GIMPLE, SIMPLE_IPA, IPA and RTL passes. They support all gcc versions + from 4.5 to 6.0. + They should be preferred to creating the structures by hand. 
+ + +Usage +===== + +You must install the gcc plugin headers for your gcc version, +e.g., on Ubuntu for gcc-4.9:: + + apt-get install gcc-4.9-plugin-dev + +Or on Fedora:: + + dnf install gcc-plugin-devel + +Enable a GCC plugin based feature in the kernel config:: + + CONFIG_GCC_PLUGIN_CYC_COMPLEXITY = y + +To compile only the plugin(s):: + + make gcc-plugins + +or just run the kernel make and compile the whole kernel with +the cyclomatic complexity GCC plugin. + + +4. How to add a new GCC plugin +============================== + +The GCC plugins are in $(src)/scripts/gcc-plugins/. You can use a file or a directory +here. It must be added to $(src)/scripts/gcc-plugins/Makefile, +$(src)/scripts/Makefile.gcc-plugins and $(src)/arch/Kconfig. +See the cyc_complexity_plugin.c (CONFIG_GCC_PLUGIN_CYC_COMPLEXITY) GCC plugin. diff --git a/Documentation/kbuild/index.rst b/Documentation/kbuild/index.rst index 0f144fad99a6..82daf2efcb73 100644 --- a/Documentation/kbuild/index.rst +++ b/Documentation/kbuild/index.rst @@ -19,6 +19,7 @@ Kernel Build System issues reproducible-builds + gcc-plugins .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 083fcf1a151c..8c5712079412 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6934,7 +6934,7 @@ S: Maintained F: scripts/gcc-plugins/ F: scripts/gcc-plugin.sh F: scripts/Makefile.gcc-plugins -F: Documentation/core-api/gcc-plugins.rst +F: Documentation/kbuild/gcc-plugins.rst GASKET DRIVER FRAMEWORK M: Rob Springer diff --git a/scripts/gcc-plugins/Kconfig b/scripts/gcc-plugins/Kconfig index e3569543bdac..f8ca236d6165 100644 --- a/scripts/gcc-plugins/Kconfig +++ b/scripts/gcc-plugins/Kconfig @@ -23,7 +23,7 @@ menuconfig GCC_PLUGINS GCC plugins are loadable modules that provide extra features to the compiler. They are useful for runtime instrumentation and static analysis. - See Documentation/core-api/gcc-plugins.rst for details. + See Documentation/kbuild/gcc-plugins.rst for details. 
if GCC_PLUGINS -- cgit From 6505a18e66876e0f502dcba5a563bd3048094048 Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Mon, 2 Mar 2020 15:26:38 -0700 Subject: docs: move core-api/ioctl.rst to driver-api/ The ioctl() documentation belongs with the rest of the driver-oriented info, so move it there. Signed-off-by: Jonathan Corbet --- Documentation/core-api/index.rst | 1 - Documentation/core-api/ioctl.rst | 253 ------------------------------------- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/ioctl.rst | 253 +++++++++++++++++++++++++++++++++++++ 4 files changed, 254 insertions(+), 254 deletions(-) delete mode 100644 Documentation/core-api/ioctl.rst create mode 100644 Documentation/driver-api/ioctl.rst diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 9836a0ac09a3..0897ad12c119 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -102,7 +102,6 @@ Documents that don't fit elsewhere or which have yet to be categorized. :maxdepth: 1 librs - ioctl .. only:: subproject and html diff --git a/Documentation/core-api/ioctl.rst b/Documentation/core-api/ioctl.rst deleted file mode 100644 index c455db0e1627..000000000000 --- a/Documentation/core-api/ioctl.rst +++ /dev/null @@ -1,253 +0,0 @@ -====================== -ioctl based interfaces -====================== - -ioctl() is the most common way for applications to interface -with device drivers. It is flexible and easily extended by adding new -commands and can be passed through character devices, block devices as -well as sockets and other special file descriptors. - -However, it is also very easy to get ioctl command definitions wrong, -and hard to fix them later without breaking existing applications, -so this documentation tries to help developers get it right. - -Command number definitions -========================== - -The command number, or request number, is the second argument passed to -the ioctl system call. 
While this can be any 32-bit number that uniquely -identifies an action for a particular driver, there are a number of -conventions around defining them. - -``include/uapi/asm-generic/ioctl.h`` provides four macros for defining -ioctl commands that follow modern conventions: ``_IO``, ``_IOR``, -``_IOW``, and ``_IOWR``. These should be used for all new commands, -with the correct parameters: - -_IO/_IOR/_IOW/_IOWR - The macro name specifies how the argument will be used.  It may be a - pointer to data to be passed into the kernel (_IOW), out of the kernel - (_IOR), or both (_IOWR).  _IO can indicate either commands with no - argument or those passing an integer value instead of a pointer. - It is recommended to only use _IO for commands without arguments, - and use pointers for passing data. - -type - An 8-bit number, often a character literal, specific to a subsystem - or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number` - -nr - An 8-bit number identifying the specific command, unique for a give - value of 'type' - -data_type - The name of the data type pointed to by the argument, the command number - encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer, - leading to a limit of 8191 bytes for the maximum size of the argument. - Note: do not pass sizeof(data_type) type into _IOR/_IOW/IOWR, as that - will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t). - _IO does not have a data_type parameter. - - -Interface versions -================== - -Some subsystems use version numbers in data structures to overload -commands with different interpretations of the argument. - -This is generally a bad idea, since changes to existing commands tend -to break existing applications. - -A better approach is to add a new ioctl command with a new number. The -old command still needs to be implemented in the kernel for compatibility, -but this can be a wrapper around the new implementation. 
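The command-number and versioning conventions described above can be sketched as follows. This is a minimal userspace illustration, not code from any real driver: the ``foo`` structures, the ``'F'`` type character and the command numbers are all hypothetical, and a real driver would pick a type listed in the ioctl-number document. The second SET command gets its own number instead of overloading the first with a version field.

```c
#include <assert.h>      /* static_assert */
#include <linux/ioctl.h> /* _IO, _IOR, _IOW, _IOC_SIZE, _IOC_NR, ... */
#include <linux/types.h> /* __u32, __u64 */

/* Hypothetical argument structure for an imaginary "foo" driver. */
struct foo_config {
    __u32 mode;
    __u32 flags;
};

/* Hypothetical extended argument, introduced later. */
struct foo_config_v2 {
    __u32 mode;
    __u32 flags;
    __u64 timeout_ns;
};

/* Assumption: 'F' is only an example "type" value. */
#define FOO_IOC_MAGIC 'F'

/* No argument: _IO. Pointer arguments: _IOR/_IOW with the data type itself. */
#define FOO_IOC_RESET         _IO(FOO_IOC_MAGIC, 0)
#define FOO_IOC_GET_CONFIG    _IOR(FOO_IOC_MAGIC, 1, struct foo_config)
#define FOO_IOC_SET_CONFIG    _IOW(FOO_IOC_MAGIC, 2, struct foo_config)

/* New behaviour gets a new command number; the old one stays valid. */
#define FOO_IOC_SET_CONFIG_V2 _IOW(FOO_IOC_MAGIC, 3, struct foo_config_v2)

/* The macros encode sizeof(data_type), so pass the type, not sizeof(type). */
static_assert(_IOC_SIZE(FOO_IOC_SET_CONFIG) == sizeof(struct foo_config),
              "command number must encode the argument size");
```

Because the size is part of the encoded number, the two SET commands would differ even if they shared the same ``nr``; giving them distinct ``nr`` values simply keeps the intent obvious to readers.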
- -Return code -=========== - -ioctl commands can return negative error codes as documented in errno(3); -these get turned into errno values in user space. On success, the return -code should be zero. It is also possible but not recommended to return -a positive 'long' value. - -When the ioctl callback is called with an unknown command number, the -handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in --ENOTTY being returned from the system call. Some subsystems return --ENOSYS or -EINVAL here for historic reasons, but this is wrong. - -Prior to Linux 5.5, compat_ioctl handlers were required to return --ENOIOCTLCMD in order to use the fallback conversion into native -commands. As all subsystems are now responsible for handling compat -mode themselves, this is no longer needed, but it may be important to -consider when backporting bug fixes to older kernels. - -Timestamps -========== - -Traditionally, timestamps and timeout values are passed as ``struct -timespec`` or ``struct timeval``, but these are problematic because of -incompatible definitions of these structures in user space after the -move to 64-bit time_t. - -The ``struct __kernel_timespec`` type can be used instead to be embedded -in other data structures when separate second/nanosecond values are -desired, or passed to user space directly. This is still not ideal though, -as the structure matches neither the kernel's timespec64 nor the user -space timespec exactly. The get_timespec64() and put_timespec64() helper -functions can be used to ensure that the layout remains compatible with -user space and the padding is treated correctly. - -As it is cheap to convert seconds to nanoseconds, but the opposite -requires an expensive 64-bit division, a simple __u64 nanosecond value -can be simpler and more efficient. - -Timeout values and timestamps should ideally use CLOCK_MONOTONIC time, -as returned by ktime_get_ns() or ktime_get_ts64(). 
Unlike -CLOCK_REALTIME, this makes the timestamps immune from jumping backwards -or forwards due to leap second adjustments and clock_settime() calls. - -ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that -need to be persistent across a reboot or between multiple machines. - -32-bit compat mode -================== - -In order to support 32-bit user space running on a 64-bit machine, each -subsystem or driver that implements an ioctl callback handler must also -implement the corresponding compat_ioctl handler. - -As long as all the rules for data structures are followed, this is as -easy as setting the .compat_ioctl pointer to a helper function such as -compat_ptr_ioctl() or blkdev_compat_ptr_ioctl(). - -compat_ptr() ------------- - -On the s390 architecture, 31-bit user space has ambiguous representations -for data pointers, with the upper bit being ignored. When running such -a process in compat mode, the compat_ptr() helper must be used to -clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit -pointer. On other architectures, this macro only performs a cast to a -``void __user *`` pointer. - -In an compat_ioctl() callback, the last argument is an unsigned long, -which can be interpreted as either a pointer or a scalar depending on -the command. If it is a scalar, then compat_ptr() must not be used, to -ensure that the 64-bit kernel behaves the same way as a 32-bit kernel -for arguments with the upper bit set. - -The compat_ptr_ioctl() helper can be used in place of a custom -compat_ioctl file operation for drivers that only take arguments that -are pointers to compatible data structures. - -Structure layout ----------------- - -Compatible data structures have the same layout on all architectures, -avoiding all problematic members: - -* ``long`` and ``unsigned long`` are the size of a register, so - they can be either 32-bit or 64-bit wide and cannot be used in portable - data structures. 
Fixed-length replacements are ``__s32``, ``__u32``, - ``__s64`` and ``__u64``. - -* Pointers have the same problem, in addition to requiring the - use of compat_ptr(). The best workaround is to use ``__u64`` - in place of pointers, which requires a cast to ``uintptr_t`` in user - space, and the use of u64_to_user_ptr() in the kernel to convert - it back into a user pointer. - -* On the x86-32 (i386) architecture, the alignment of 64-bit variables - is only 32-bit, but they are naturally aligned on most other - architectures including x86-64. This means a structure like:: - - struct foo { - __u32 a; - __u64 b; - __u32 c; - }; - - has four bytes of padding between a and b on x86-64, plus another four - bytes of padding at the end, but no padding on i386, and it needs a - compat_ioctl conversion handler to translate between the two formats. - - To avoid this problem, all structures should have their members - naturally aligned, or explicit reserved fields added in place of the - implicit padding. The ``pahole`` tool can be used for checking the - alignment. - -* On ARM OABI user space, structures are padded to multiples of 32-bit, - making some structs incompatible with modern EABI kernels if they - do not end on a 32-bit boundary. - -* On the m68k architecture, struct members are not guaranteed to have an - alignment greater than 16-bit, which is a problem when relying on - implicit padding. - -* Bitfields and enums generally work as one would expect them to, - but some properties of them are implementation-defined, so it is better - to avoid them completely in ioctl interfaces. - -* ``char`` members can be either signed or unsigned, depending on - the architecture, so the __u8 and __s8 types should be used for 8-bit - integer values, though char arrays are clearer for fixed-length strings. 
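The padding rule from the list above can be checked at compile time. The sketch below is a userspace illustration under stated assumptions: it uses the ``<stdint.h>`` types as stand-ins for ``__u32``/``__u64``, and ``struct foo_fixed`` with its ``reserved`` members is a hypothetical rework of the ``struct foo`` example, not a structure from the kernel.

```c
#include <assert.h> /* static_assert */
#include <stddef.h> /* offsetof */
#include <stdint.h> /* stand-ins for __u32 and __u64 */

/* Layout from the text: implicit padding differs between i386 and x86-64. */
struct foo {
    uint32_t a;
    uint64_t b;
    uint32_t c;
};

/* Hypothetical rework: every implicit hole becomes an explicit field. */
struct foo_fixed {
    uint32_t a;
    uint32_t reserved1; /* replaces the padding between a and b */
    uint64_t b;
    uint32_t c;
    uint32_t reserved2; /* replaces the padding at the end */
};

/* These hold whether 64-bit members are 4-byte or 8-byte aligned. */
static_assert(offsetof(struct foo_fixed, b) == 8, "b is naturally aligned");
static_assert(offsetof(struct foo_fixed, c) == 16, "c follows b directly");
static_assert(sizeof(struct foo_fixed) == 24, "no implicit padding left");
```

Reordering the members so that ``b`` comes first would also remove the holes; explicit reserved fields are shown here because they keep the member order of the original structure.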
- -Information leaks -================= - -Uninitialized data must not be copied back to user space, as this can -cause an information leak, which can be used to defeat kernel address -space layout randomization (KASLR), helping in an attack. - -For this reason (and for compat support) it is best to avoid any -implicit padding in data structures.  Where there is implicit padding -in an existing structure, kernel drivers must be careful to fully -initialize an instance of the structure before copying it to user -space.  This is usually done by calling memset() before assigning to -individual members. - -Subsystem abstractions -====================== - -While some device drivers implement their own ioctl function, most -subsystems implement the same command for multiple drivers. Ideally the -subsystem has an .ioctl() handler that copies the arguments from and -to user space, passing them into subsystem specific callback functions -through normal kernel pointers. - -This helps in various ways: - -* Applications written for one driver are more likely to work for - another one in the same subsystem if there are no subtle differences - in the user space ABI. - -* The complexity of user space access and data structure layout is done - in one place, reducing the potential for implementation bugs. - -* It is more likely to be reviewed by experienced developers - that can spot problems in the interface when the ioctl is shared - between multiple drivers than when it is only used in a single driver. - -Alternatives to ioctl -===================== - -There are many cases in which ioctl is not the best solution for a -problem. Alternatives include: - -* System calls are a better choice for a system-wide feature that - is not tied to a physical device or constrained by the file system - permissions of a character device node - -* netlink is the preferred way of configuring any network related - objects through sockets. 
- -* debugfs is used for ad-hoc interfaces for debugging functionality - that does not need to be exposed as a stable interface to applications. - -* sysfs is a good way to expose the state of an in-kernel object - that is not tied to a file descriptor. - -* configfs can be used for more complex configuration than sysfs - -* A custom file system can provide extra flexibility with a simple - user interface but adds a lot of complexity to the implementation. diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index ea3003b3c5e5..1d8c5599149b 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -17,6 +17,7 @@ available subsections can be seen below. driver-model/index basics infrastructure + ioctl early-userspace/index pm/index clk diff --git a/Documentation/driver-api/ioctl.rst b/Documentation/driver-api/ioctl.rst new file mode 100644 index 000000000000..c455db0e1627 --- /dev/null +++ b/Documentation/driver-api/ioctl.rst @@ -0,0 +1,253 @@ +====================== +ioctl based interfaces +====================== + +ioctl() is the most common way for applications to interface +with device drivers. It is flexible and easily extended by adding new +commands and can be passed through character devices, block devices as +well as sockets and other special file descriptors. + +However, it is also very easy to get ioctl command definitions wrong, +and hard to fix them later without breaking existing applications, +so this documentation tries to help developers get it right. + +Command number definitions +========================== + +The command number, or request number, is the second argument passed to +the ioctl system call. While this can be any 32-bit number that uniquely +identifies an action for a particular driver, there are a number of +conventions around defining them. 
+ +``include/uapi/asm-generic/ioctl.h`` provides four macros for defining +ioctl commands that follow modern conventions: ``_IO``, ``_IOR``, +``_IOW``, and ``_IOWR``. These should be used for all new commands, +with the correct parameters: + +_IO/_IOR/_IOW/_IOWR + The macro name specifies how the argument will be used.  It may be a + pointer to data to be passed into the kernel (_IOW), out of the kernel + (_IOR), or both (_IOWR).  _IO can indicate either commands with no + argument or those passing an integer value instead of a pointer. + It is recommended to only use _IO for commands without arguments, + and use pointers for passing data. + +type + An 8-bit number, often a character literal, specific to a subsystem + or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number` + +nr + An 8-bit number identifying the specific command, unique for a give + value of 'type' + +data_type + The name of the data type pointed to by the argument, the command number + encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer, + leading to a limit of 8191 bytes for the maximum size of the argument. + Note: do not pass sizeof(data_type) type into _IOR/_IOW/IOWR, as that + will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t). + _IO does not have a data_type parameter. + + +Interface versions +================== + +Some subsystems use version numbers in data structures to overload +commands with different interpretations of the argument. + +This is generally a bad idea, since changes to existing commands tend +to break existing applications. + +A better approach is to add a new ioctl command with a new number. The +old command still needs to be implemented in the kernel for compatibility, +but this can be a wrapper around the new implementation. + +Return code +=========== + +ioctl commands can return negative error codes as documented in errno(3); +these get turned into errno values in user space. On success, the return +code should be zero. 
It is also possible but not recommended to return +a positive 'long' value. + +When the ioctl callback is called with an unknown command number, the +handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in +-ENOTTY being returned from the system call. Some subsystems return +-ENOSYS or -EINVAL here for historic reasons, but this is wrong. + +Prior to Linux 5.5, compat_ioctl handlers were required to return +-ENOIOCTLCMD in order to use the fallback conversion into native +commands. As all subsystems are now responsible for handling compat +mode themselves, this is no longer needed, but it may be important to +consider when backporting bug fixes to older kernels. + +Timestamps +========== + +Traditionally, timestamps and timeout values are passed as ``struct +timespec`` or ``struct timeval``, but these are problematic because of +incompatible definitions of these structures in user space after the +move to 64-bit time_t. + +The ``struct __kernel_timespec`` type can be used instead to be embedded +in other data structures when separate second/nanosecond values are +desired, or passed to user space directly. This is still not ideal though, +as the structure matches neither the kernel's timespec64 nor the user +space timespec exactly. The get_timespec64() and put_timespec64() helper +functions can be used to ensure that the layout remains compatible with +user space and the padding is treated correctly. + +As it is cheap to convert seconds to nanoseconds, but the opposite +requires an expensive 64-bit division, a simple __u64 nanosecond value +can be simpler and more efficient. + +Timeout values and timestamps should ideally use CLOCK_MONOTONIC time, +as returned by ktime_get_ns() or ktime_get_ts64(). Unlike +CLOCK_REALTIME, this makes the timestamps immune from jumping backwards +or forwards due to leap second adjustments and clock_settime() calls. 
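The cost asymmetry and the clock choice described above can be illustrated from user space. This is a rough sketch, not kernel code: it uses the POSIX clock_gettime() call in place of ktime_get_ns(), and the helper names (``timespec_to_ns``, ``ns_to_timespec``, ``monotonic_now_ns``) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() and CLOCK_MONOTONIC */

#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Cheap direction: one multiplication and one addition. */
static uint64_t timespec_to_ns(const struct timespec *ts)
{
    /* Expect a normalized value, as clock_gettime() returns. */
    assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);
    return (uint64_t)ts->tv_sec * NSEC_PER_SEC + (uint64_t)ts->tv_nsec;
}

/* Expensive direction: a 64-bit division and a 64-bit remainder. */
static struct timespec ns_to_timespec(uint64_t ns)
{
    struct timespec ts;

    ts.tv_sec = (time_t)(ns / NSEC_PER_SEC);
    ts.tv_nsec = (long)(ns % NSEC_PER_SEC);
    return ts;
}

/* Monotonic time as a single 64-bit nanosecond count; 0 on failure. */
static uint64_t monotonic_now_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0;
    return timespec_to_ns(&ts);
}
```

Two successive calls to ``monotonic_now_ns()`` never go backwards, which is the property that makes CLOCK_MONOTONIC the better base for timeouts.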
+ +ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that +need to be persistent across a reboot or between multiple machines. + +32-bit compat mode +================== + +In order to support 32-bit user space running on a 64-bit machine, each +subsystem or driver that implements an ioctl callback handler must also +implement the corresponding compat_ioctl handler. + +As long as all the rules for data structures are followed, this is as +easy as setting the .compat_ioctl pointer to a helper function such as +compat_ptr_ioctl() or blkdev_compat_ptr_ioctl(). + +compat_ptr() +------------ + +On the s390 architecture, 31-bit user space has ambiguous representations +for data pointers, with the upper bit being ignored. When running such +a process in compat mode, the compat_ptr() helper must be used to +clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit +pointer. On other architectures, this macro only performs a cast to a +``void __user *`` pointer. + +In an compat_ioctl() callback, the last argument is an unsigned long, +which can be interpreted as either a pointer or a scalar depending on +the command. If it is a scalar, then compat_ptr() must not be used, to +ensure that the 64-bit kernel behaves the same way as a 32-bit kernel +for arguments with the upper bit set. + +The compat_ptr_ioctl() helper can be used in place of a custom +compat_ioctl file operation for drivers that only take arguments that +are pointers to compatible data structures. + +Structure layout +---------------- + +Compatible data structures have the same layout on all architectures, +avoiding all problematic members: + +* ``long`` and ``unsigned long`` are the size of a register, so + they can be either 32-bit or 64-bit wide and cannot be used in portable + data structures. Fixed-length replacements are ``__s32``, ``__u32``, + ``__s64`` and ``__u64``. + +* Pointers have the same problem, in addition to requiring the + use of compat_ptr(). 
The best workaround is to use ``__u64`` + in place of pointers, which requires a cast to ``uintptr_t`` in user + space, and the use of u64_to_user_ptr() in the kernel to convert + it back into a user pointer. + +* On the x86-32 (i386) architecture, the alignment of 64-bit variables + is only 32-bit, but they are naturally aligned on most other + architectures including x86-64. This means a structure like:: + + struct foo { + __u32 a; + __u64 b; + __u32 c; + }; + + has four bytes of padding between a and b on x86-64, plus another four + bytes of padding at the end, but no padding on i386, and it needs a + compat_ioctl conversion handler to translate between the two formats. + + To avoid this problem, all structures should have their members + naturally aligned, or explicit reserved fields added in place of the + implicit padding. The ``pahole`` tool can be used for checking the + alignment. + +* On ARM OABI user space, structures are padded to multiples of 32-bit, + making some structs incompatible with modern EABI kernels if they + do not end on a 32-bit boundary. + +* On the m68k architecture, struct members are not guaranteed to have an + alignment greater than 16-bit, which is a problem when relying on + implicit padding. + +* Bitfields and enums generally work as one would expect them to, + but some properties of them are implementation-defined, so it is better + to avoid them completely in ioctl interfaces. + +* ``char`` members can be either signed or unsigned, depending on + the architecture, so the __u8 and __s8 types should be used for 8-bit + integer values, though char arrays are clearer for fixed-length strings. + +Information leaks +================= + +Uninitialized data must not be copied back to user space, as this can +cause an information leak, which can be used to defeat kernel address +space layout randomization (KASLR), helping in an attack. 
+ +For this reason (and for compat support) it is best to avoid any +implicit padding in data structures.  Where there is implicit padding +in an existing structure, kernel drivers must be careful to fully +initialize an instance of the structure before copying it to user +space.  This is usually done by calling memset() before assigning to +individual members. + +Subsystem abstractions +====================== + +While some device drivers implement their own ioctl function, most +subsystems implement the same command for multiple drivers. Ideally the +subsystem has an .ioctl() handler that copies the arguments from and +to user space, passing them into subsystem specific callback functions +through normal kernel pointers. + +This helps in various ways: + +* Applications written for one driver are more likely to work for + another one in the same subsystem if there are no subtle differences + in the user space ABI. + +* The complexity of user space access and data structure layout is done + in one place, reducing the potential for implementation bugs. + +* It is more likely to be reviewed by experienced developers + that can spot problems in the interface when the ioctl is shared + between multiple drivers than when it is only used in a single driver. + +Alternatives to ioctl +===================== + +There are many cases in which ioctl is not the best solution for a +problem. Alternatives include: + +* System calls are a better choice for a system-wide feature that + is not tied to a physical device or constrained by the file system + permissions of a character device node + +* netlink is the preferred way of configuring any network related + objects through sockets. + +* debugfs is used for ad-hoc interfaces for debugging functionality + that does not need to be exposed as a stable interface to applications. + +* sysfs is a good way to expose the state of an in-kernel object + that is not tied to a file descriptor. 
+ +* configfs can be used for more complex configuration than sysfs + +* A custom file system can provide extra flexibility with a simple + user interface but adds a lot of complexity to the implementation. -- cgit From 76136e028d3bc94a84f5404ba0a9afae38db1b8a Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Wed, 4 Mar 2020 11:03:24 -0800 Subject: docs: deprecated.rst: Clean up fall-through details Add example of fall-through, list-ify the case ending statements, and adjust the markup for links and readability. While here, adjust strscpy() details to mention strscpy_pad(). Signed-off-by: Kees Cook Acked-by: Gustavo A. R. Silva Link: https://lore.kernel.org/r/202003041102.47A4E4B62@keescook Signed-off-by: Jonathan Corbet --- Documentation/process/deprecated.rst | 48 ++++++++++++++++++++++-------------- 1 file changed, 29 insertions(+), 19 deletions(-) diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst index 7160a449e6c6..8965446f0b71 100644 --- a/Documentation/process/deprecated.rst +++ b/Documentation/process/deprecated.rst @@ -94,8 +94,8 @@ and other misbehavior due to the missing termination. It also NUL-pads the destination buffer if the source contents are shorter than the destination buffer size, which may be a needless performance penalty for callers using only NUL-terminated strings. The safe replacement is :c:func:`strscpy`. -(Users of :c:func:`strscpy` still needing NUL-padding will need an -explicit :c:func:`memset` added.) +(Users of :c:func:`strscpy` still needing NUL-padding should instead +use strscpy_pad().) 
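The difference between plain termination and NUL-padding can be modelled in user space. The function below is only a sketch of the behaviour described above, assuming the usual contract (always terminate, zero the rest of the buffer, report truncation with -E2BIG); ``sketch_strscpy_pad`` is a hypothetical name and this is not the kernel implementation.

```c
#define _POSIX_C_SOURCE 200809L /* for strnlen() */

#include <assert.h>
#include <errno.h>  /* E2BIG */
#include <stddef.h> /* size_t */
#include <string.h> /* memcpy, memset, strnlen */

/*
 * Userspace model: copy src into dest (count bytes large), always
 * NUL-terminate, and zero-fill whatever is left of the buffer.
 * Returns the number of characters copied, or -E2BIG on truncation.
 */
static long sketch_strscpy_pad(char *dest, const char *src, size_t count)
{
    size_t len;

    assert(dest != NULL && src != NULL);
    if (count == 0)
        return -E2BIG;

    len = strnlen(src, count); /* never looks past count bytes */
    if (len == count) {
        /* Source does not fit: keep count - 1 characters and terminate. */
        memcpy(dest, src, count - 1);
        dest[count - 1] = '\0';
        return -E2BIG;
    }

    memcpy(dest, src, len);
    memset(dest + len, 0, count - len); /* terminator plus padding */
    return (long)len;
}
```

Unlike strncpy(), the model never leaves the destination unterminated, and the caller can tell truncation apart from a complete copy by checking the return value.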
If a caller is using non-NUL-terminated strings, :c:func:`strncpy()` can still be used, but destinations should be marked with the `__nonstring @@ -144,27 +144,37 @@ memory adjacent to the stack (when built without `CONFIG_VMAP_STACK=y`) Implicit switch case fall-through --------------------------------- -The C language allows switch cases to "fall-through" when a "break" statement -is missing at the end of a case. This, however, introduces ambiguity in the -code, as it's not always clear if the missing break is intentional or a bug. +The C language allows switch cases to fall through to the next case +when a "break" statement is missing at the end of a case. This, however, +introduces ambiguity in the code, as it's not always clear if the missing +break is intentional or a bug. For example, it's not obvious just from +looking at the code if `STATE_ONE` is intentionally designed to fall +through into `STATE_TWO`:: + + switch (value) { + case STATE_ONE: + do_something(); + case STATE_TWO: + do_other(); + break; + default: + WARN("unknown state"); + } As there have been a long list of flaws `due to missing "break" statements `_, we no longer allow -"implicit fall-through". - -In order to identify intentional fall-through cases, we have adopted a -pseudo-keyword macro 'fallthrough' which expands to gcc's extension -__attribute__((__fallthrough__)). `Statement Attributes -`_ - -When the C17/C18 [[fallthrough]] syntax is more commonly supported by +implicit fall-through. In order to identify intentional fall-through +cases, we have adopted a pseudo-keyword macro "fallthrough" which +expands to gcc's extension `__attribute__((__fallthrough__)) +`_. +(When the C17/C18 `[[fallthrough]]` syntax is more commonly supported by C compilers, static analyzers, and IDEs, we can switch to using that syntax -for the macro pseudo-keyword. +for the macro pseudo-keyword.) All switch/case blocks must end in one of: - break; - fallthrough; - continue; - goto