Diffstat (limited to 'Documentation/filesystems/vfs.rst')
-rw-r--r-- | Documentation/filesystems/vfs.rst | 105 |
1 files changed, 61 insertions, 44 deletions
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 6e903a903f8f..fd32a9a17bfb 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -495,7 +495,7 @@ As of kernel 2.6.22, the following members are defined:
 		int (*link) (struct dentry *,struct inode *,struct dentry *);
 		int (*unlink) (struct inode *,struct dentry *);
 		int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
-		int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+		struct dentry *(*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
 		int (*rmdir) (struct inode *,struct dentry *);
 		int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
 		int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
@@ -562,7 +562,26 @@ otherwise noted.
 ``mkdir``
 	called by the mkdir(2) system call. Only required if you want
 	to support creating subdirectories. You will probably need to
-	call d_instantiate() just as you would in the create() method
+	call d_instantiate_new() just as you would in the create() method.
+
+	If d_instantiate_new() is not used and if the fh_to_dentry()
+	export operation is provided, or if the storage might be
+	accessible by another path (e.g. with a network filesystem)
+	then more care may be needed. Importantly d_instantate()
+	should not be used with an inode that is no longer I_NEW if there
+	any chance that the inode could already be attached to a dentry.
+	This is because of a hard rule in the VFS that a directory must
+	only ever have one dentry.
+
+	For example, if an NFS filesystem is mounted twice the new directory
+	could be visible on the other mount before it is on the original
+	mount, and a pair of name_to_handle_at(), open_by_handle_at()
+	calls could instantiate the directory inode with an IS_ROOT()
+	dentry before the first mkdir returns.
+
+	If there is any chance this could happen, then the new inode
+	should be d_drop()ed and attached with d_splice_alias(). The
+	returned dentry (if any) should be returned by ->mkdir().
 
 ``rmdir``
 	called by the rmdir(2) system call. Only required if you want
@@ -697,9 +716,8 @@ page lookup by address, and keeping track of pages tagged as Dirty or
 Writeback.
 
 The first can be used independently to the others. The VM can try to
-either write dirty pages in order to clean them, or release clean pages
-in order to reuse them. To do this it can call the ->writepage method
-on dirty pages, and ->release_folio on clean folios with the private
+release clean pages in order to reuse them. To do this it can call
+->release_folio on clean folios with the private
 flag set. Clean pages without PagePrivate and with no external references
 will be released without notice being given to the address_space.
 
@@ -712,8 +730,8 @@ maintains information about the PG_Dirty and PG_Writeback status of each
 page, so that pages with either of these flags can be found quickly.
 
 The Dirty tag is primarily used by mpage_writepages - the default
-->writepages method. It uses the tag to find dirty pages to call
-->writepage on. If mpage_writepages is not used (i.e. the address
+->writepages method. It uses the tag to find dirty pages to
+write back. If mpage_writepages is not used (i.e. the address
 provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is almost
 unused. write_inode_now and sync_inode do use it (through
 __sync_single_inode) to check if ->writepages has been successful in
@@ -737,23 +755,23 @@ pages, however the address_space has finer control of write sizes.
 
 The read process essentially only requires 'read_folio'. The write
 process is more complicated and uses write_begin/write_end or
-dirty_folio to write data into the address_space, and writepage and
+dirty_folio to write data into the address_space, and
 writepages to writeback data to storage.
 
 Adding and removing pages to/from an address_space is protected by the
 inode's i_mutex.
 
 When data is written to a page, the PG_Dirty flag should be set. It
-typically remains set until writepage asks for it to be written. This
+typically remains set until writepages asks for it to be written. This
 should clear PG_Dirty and set PG_Writeback. It can be actually written
 at any point after PG_Dirty is clear. Once it is known to be safe,
 PG_Writeback is cleared.
 
 Writeback makes use of a writeback_control structure to direct the
-operations. This gives the writepage and writepages operations some
+operations. This gives the writepages operation some
 information about the nature of and reason for the writeback request,
 and the constraints under which it is being done. It is also used to
-return information back to the caller about the result of a writepage or
+return information back to the caller about the result of a
 writepages request.
 
 
@@ -800,7 +818,6 @@ cache in your filesystem. The following members are defined:
 .. code-block:: c
 
 	struct address_space_operations {
-		int (*writepage)(struct page *page, struct writeback_control *wbc);
 		int (*read_folio)(struct file *, struct folio *);
 		int (*writepages)(struct address_space *, struct writeback_control *);
 		bool (*dirty_folio)(struct address_space *, struct folio *);
@@ -810,7 +827,7 @@ cache in your filesystem. The following members are defined:
 				struct page **pagep, void **fsdata);
 		int (*write_end)(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
-				struct page *page, void *fsdata);
+				struct folio *folio, void *fsdata);
 		sector_t (*bmap)(struct address_space *, sector_t);
 		void (*invalidate_folio) (struct folio *, size_t start, size_t len);
 		bool (*release_folio)(struct folio *, gfp_t);
@@ -829,25 +846,6 @@ cache in your filesystem. The following members are defined:
 		int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
 	};
 
-``writepage``
-	called by the VM to write a dirty page to backing store. This
-	may happen for data integrity reasons (i.e. 'sync'), or to free
-	up memory (flush). The difference can be seen in
-	wbc->sync_mode. The PG_Dirty flag has been cleared and
-	PageLocked is true. writepage should start writeout, should set
-	PG_Writeback, and should make sure the page is unlocked, either
-	synchronously or asynchronously when the write operation
-	completes.
-
-	If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
-	try too hard if there are problems, and may choose to write out
-	other pages from the mapping if that is easier (e.g. due to
-	internal dependencies). If it chooses not to start writeout, it
-	should return AOP_WRITEPAGE_ACTIVATE so that the VM will not
-	keep calling ->writepage on that page.
-
-	See the file "Locking" for more details.
-
 ``read_folio``
 	Called by the page cache to read a folio from the backing store.
 	The 'file' argument supplies authentication information to network
@@ -890,7 +888,7 @@ cache in your filesystem. The following members are defined:
 	given and that many pages should be written if possible. If no
 	->writepages is given, then mpage_writepages is used instead.
 	This will choose pages from the address space that are tagged as
-	DIRTY and will pass them to ->writepage.
+	DIRTY and will write them back.
 
 ``dirty_folio``
 	called by the VM to mark a folio as dirty. This is particularly
@@ -913,8 +911,7 @@ cache in your filesystem. The following members are defined:
 	stop attempting I/O, it can simply return. The caller will
 	remove the remaining pages from the address space, unlock them
 	and decrement the page refcount. Set PageUptodate if the I/O
-	completes successfully. Setting PageError on any page will be
-	ignored; simply unlock the page if an I/O error occurs.
+	completes successfully.
 
 ``write_begin``
 	Called by the generic buffered write code to ask the filesystem
@@ -926,12 +923,12 @@ cache in your filesystem. The following members are defined:
 	(if they haven't been read already) so that the updated blocks
 	can be written out properly.
 
-	The filesystem must return the locked pagecache page for the
-	specified offset, in ``*pagep``, for the caller to write into.
+	The filesystem must return the locked pagecache folio for the
+	specified offset, in ``*foliop``, for the caller to write into.
 
 	It must be able to cope with short writes (where the length
 	passed to write_begin is greater than the number of bytes copied
-	into the page).
+	into the folio).
 
 	A void * may be returned in fsdata, which then gets passed into
 	write_end.
@@ -944,8 +941,8 @@ cache in your filesystem. The following members are defined:
 	called. len is the original len passed to write_begin, and
 	copied is the amount that was able to be copied.
 
-	The filesystem must take care of unlocking the page and
-	releasing it refcount, and updating i_size.
+	The filesystem must take care of unlocking the folio,
+	decrementing its refcount, and updating i_size.
 
 	Returns < 0 on failure, otherwise the number of bytes (<=
 	'copied') that were able to be copied into pagecache.
@@ -1252,7 +1249,8 @@ defined:
 .. code-block:: c
 
 	struct dentry_operations {
-		int (*d_revalidate)(struct dentry *, unsigned int);
+		int (*d_revalidate)(struct inode *, const struct qstr *,
+				    struct dentry *, unsigned int);
 		int (*d_weak_revalidate)(struct dentry *, unsigned int);
 		int (*d_hash)(const struct dentry *, struct qstr *);
 		int (*d_compare)(const struct dentry *,
@@ -1265,6 +1263,8 @@ defined:
 		struct vfsmount *(*d_automount)(struct path *);
 		int (*d_manage)(const struct path *, bool);
 		struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+		bool (*d_unalias_trylock)(const struct dentry *);
+		void (*d_unalias_unlock)(const struct dentry *);
 	};
 
 ``d_revalidate``
@@ -1390,9 +1390,7 @@ defined:
 
 	If a vfsmount is returned, the caller will attempt to mount it
 	on the mountpoint and will remove the vfsmount from its
-	expiration list in the case of failure. The vfsmount should be
-	returned with 2 refs on it to prevent automatic expiration - the
-	caller will clean up the additional ref.
+	expiration list in the case of failure.
 
 	This function is only used if DCACHE_NEED_AUTOMOUNT is set on
 	the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is
@@ -1428,6 +1426,25 @@ defined:
 
 	For non-regular files, the 'dentry' argument is returned.
 
+``d_unalias_trylock``
+	if present, will be called by d_splice_alias() before moving a
+	preexisting attached alias. Returning false prevents __d_move(),
+	making d_splice_alias() fail with -ESTALE.
+
+	Rationale: setting FS_RENAME_DOES_D_MOVE will prevent d_move()
+	and d_exchange() calls from the outside of filesystem methods;
+	however, it does not guarantee that attached dentries won't
+	be renamed or moved by d_splice_alias() finding a preexisting
+	alias for a directory inode. Normally we would not care;
+	however, something that wants to stabilize the entire path to
+	root over a blocking operation might need that. See 9p for one
+	(and hopefully only) example.
+
+``d_unalias_unlock``
+	should be paired with ``d_unalias_trylock``; that one is called after
+	__d_move() call in __d_unalias().
+
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.