summaryrefslogtreecommitdiff
path: root/Documentation/filesystems
AgeCommit message (Collapse)Author
2023-08-29Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-28Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds
Pull fscrypt update from Eric Biggers: "Just a small documentation improvement" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: improve the "Encryption modes and usage" section
2023-08-28Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap updates from Darrick Wong: "We've got some big changes for this release -- I'm very happy to be landing willy's work to enable large folios for the page cache for general read and write IOs when the fs can make contiguous space allocations, and Ritesh's work to track sub-folio dirty state to eliminate the write amplification problems inherent in using large folios. As a bonus, io_uring can now process write completions in the caller's context instead of bouncing through a workqueue, which should reduce io latency dramatically. IOWs, XFS should see a nice performance bump for both IO paths. Summary: - Make large writes to the page cache fill sparse parts of the cache with large folios, then use large memcpy calls for the large folio. - Track the per-block dirty state of each large folio so that a buffered write to a single byte on a large folio does not result in a (potentially) multi-megabyte writeback IO. - Allow some directio completions to be performed in the initiating task's context instead of punting through a workqueue. This will reduce latency for some io_uring requests" * tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) iomap: support IOCB_DIO_CALLER_COMP io_uring/rw: add write support for IOCB_DIO_CALLER_COMP fs: add IOCB flags related to passing back dio completions iomap: add IOMAP_DIO_INLINE_COMP iomap: only set iocb->private for polled bio iomap: treat a write through cache the same as FUA iomap: use an unsigned type for IOMAP_DIO_* defines iomap: cleanup up iomap_dio_bio_end_io() iomap: Add per-block dirty state tracking to improve performance iomap: Allocate ifs in ->write_begin() early iomap: Refactor iomap_write_delalloc_punch() function out iomap: Use iomap_punch_t typedef iomap: Fix possible overflow condition in iomap_write_delalloc_scan iomap: Add some uptodate state handling helpers for ifs state bitmap iomap: Drop ifs argument from iomap_set_range_uptodate() iomap: Rename iomap_page to iomap_folio_state and others iomap: Copy larger chunks from userspace iomap: Create large folios in the buffered write path filemap: Allow __filemap_get_folio to allocate large folios filemap: Add fgf_t typedef ...
2023-08-28Merge tag 'v6.6-vfs.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock updates from Christian Brauner: "This contains the super rework that was ready for this cycle. The first part changes the order of how we open block devices and allocate superblocks, contains various cleanups, simplifications, and a new mechanism to wait on superblock state changes. This unblocks work to ultimately limit the number of writers to a block device. Jan has already scheduled follow-up work that will be ready for v6.7 and allows us to restrict the number of writers to a given block device. That series builds on this work right here. The second part contains filesystem freezing updates. Overview: The generic superblock changes are rougly organized as follows (ignoring additional minor cleanups): (1) Removal of the bd_super member from struct block_device. This was a very odd back pointer to struct super_block with unclear rules. For all relevant places we have other means to get the same information so just get rid of this. (2) Simplify rules for superblock cleanup. Roughly, everything that is allocated during fs_context initialization and that's stored in fs_context->s_fs_info needs to be cleaned up by the fs_context->free() implementation before the superblock allocation function has been called successfully. After sget_fc() returned fs_context->s_fs_info has been transferred to sb->s_fs_info at which point sb->kill_sb() if fully responsible for cleanup. Adhering to these rules means that cleanup of sb->s_fs_info in fill_super() is to be avoided as it's brittle and inconsistent. Cleanup shouldn't be duplicated between sb->put_super() as sb->put_super() is only called if sb->s_root has been set aka when the filesystem has been successfully born (SB_BORN). That complexity should be avoided. This also means that block devices are to be closed in sb->kill_sb() instead of sb->put_super(). More details in the lower section. (3) Make it possible to lookup or create a superblock before opening block devices There's a subtle dependency on (2) as some filesystems did rely on fill_super() to be called in order to correctly clean up sb->s_fs_info. All these filesystems have been fixed. (4) Switch most filesystem to follow the same logic as the generic mount code now does as outlined in (3). (5) Use the superblock as the holder of the block device. We can now easily go back from block device to owning superblock. (6) Export and extend the generic fs_holder_ops and use them as holder ops everywhere and remove the filesystem specific holder ops. (7) Call from the block layer up into the filesystem layer when the block device is removed, allowing to shut down the filesystem without risk of deadlocks. (8) Get rid of get_super(). We can now easily go back from the block device to owning superblock and can call up from the block layer into the filesystem layer when the device is removed. So no need to wade through all registered superblock to find the owning superblock anymore" Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/ * tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits) super: use higher-level helper for {freeze,thaw} super: wait until we passed kill super super: wait for nascent superblocks super: make locking naming consistent super: use locking helpers fs: simplify invalidate_inodes fs: remove get_super block: call into the file system for ioctl BLKFLSBUF block: call into the file system for bdev_mark_dead block: consolidate __invalidate_device and fsync_bdev block: drop the "busy inodes on changed media" log message dasd: also call __invalidate_device when setting the device offline amiflop: don't call fsync_bdev in FDFMTBEG floppy: call disk_force_media_change when changing the format block: simplify the disk_force_media_change interface nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl xfs use fs_holder_ops for the log and RT devices xfs: drop s_umount over opening the log and RT devices ext4: use fs_holder_ops for the log device ext4: drop s_umount over opening the log device ...
2023-08-28Merge tag 'v6.6-vfs.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual filesystems. Features: - Block mode changes on symlinks and rectify our broken semantics - Report file modifications via fsnotify() for splice - Allow specifying an explicit timeout for the "rootwait" kernel command line option. This allows to timeout and reboot instead of always waiting indefinitely for the root device to show up - Use synchronous fput for the close system call Cleanups: - Get rid of open-coded lockdep workarounds for async io submitters and replace it all with a single consolidated helper - Simplify epoll allocation helper - Convert simple_write_begin and simple_write_end to use a folio - Convert page_cache_pipe_buf_confirm() to use a folio - Simplify __range_close to avoid pointless locking - Disable per-cpu buffer head cache for isolated cpus - Port ecryptfs to kmap_local_page() api - Remove redundant initialization of pointer buf in pipe code - Unexport the d_genocide() function which is only used within core vfs - Replace printk(KERN_ERR) and WARN_ON() with WARN() Fixes: - Fix various kernel-doc issues - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE - Fix a mainly theoretical issue in devpts - Check the return value of __getblk() in reiserfs - Fix a racy assert in i_readcount_dec - Fix integer conversion issues in various functions - Fix LSM security context handling during automounts that prevented NFS superblock sharing" * tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) cachefiles: use kiocb_{start,end}_write() helpers ovl: use kiocb_{start,end}_write() helpers aio: use kiocb_{start,end}_write() helpers io_uring: use kiocb_{start,end}_write() helpers fs: create kiocb_{start,end}_write() helpers fs: add kerneldoc to file_{start,end}_write() helpers io_uring: rename kiocb_end_write() local helper splice: Convert page_cache_pipe_buf_confirm() to use a folio libfs: Convert simple_write_begin and simple_write_end to use a folio fs/dcache: Replace printk and WARN_ON by WARN fs/pipe: remove redundant initialization of pointer buf fs: Fix kernel-doc warnings devpts: Fix kernel-doc warnings doc: idmappings: fix an error and rephrase a paragraph init: Add support for rootwait timeout parameter vfs: fix up the assert in i_readcount_dec fs: Fix one kernel-doc comment docs: filesystems: idmappings: clarify from where idmappings are taken fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing ...
2023-08-28Merge tag 'v6.6-vfs.tmpfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull libfs and tmpfs updates from Christian Brauner: "This cycle saw a lot of work for tmpfs that required changes to the vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this cycle. Things will go back to mm next cycle. Features ======== - By far the biggest work is the quota support for tmpfs. New tmpfs quota infrastructure is added to support it and a new QFMT_SHMEM uapi option is exposed. This offers user and group quotas to tmpfs (project quotas will be added later). Similar to other filesystems tmpfs quota are not supported within user namespaces yet. - Add support for user xattrs. While tmpfs already supports security xattrs (security.*) and POSIX ACLs for a long time it lacked support for user xattrs (user.*). With this pull request tmpfs will be able to support a limited number of user xattrs. This is accompanied by a fix (see below) to limit persistent simple xattr allocations. - Add support for stable directory offsets. Currently tmpfs relies on the libfs provided cursor-based mechanism for readdir. This causes issues when a tmpfs filesystem is exported via NFS. NFS clients do not open directories. Instead, each server-side readdir operation opens the directory, reads it, and then closes it. Since the cursor state for that directory is associated with the opened file it is discarded after each readdir operation. Such directory offsets are not just cached by NFS clients but also various userspace libraries based on these clients. As it stands there is no way to invalidate the caches when directory offsets have changed and the whole application depends on unchanging directory offsets. At LSFMM we discussed how to solve this problem and decided to support stable directory offsets. libfs now allows filesystems like tmpfs to use an xarrary to map a directory offset to a dentry. This mechanism is currently only used by tmpfs but can be supported by others as well. Fixes ===== - Change persistent simple xattrs allocations in libfs from GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory cgroup limits. Since this is a change to libfs it affects both tmpfs and kernfs. - Correctly verify {g,u}id mount options. A new filesystem context is created via fsopen() which records the namespace that becomes the owning namespace of the superblock when fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are mountable in namespaces. However, fsconfig() calls can occur in a namespace different from the namespace where fsopen() has been called. Currently, when fsconfig() is called to set {g,u}id mount options the requested {g,u}id is mapped into a k{g,u}id according to the namespace where fsconfig() was called from. The resulting k{g,u}id is not guaranteed to be resolvable in the namespace of the filesystem (the one that fsopen() was called in). This means it's possible for an unprivileged user to create files owned by any group in a tmpfs mount since it's possible to set the setid bits on the tmpfs directory. The contract for {g,u}id mount options and {g,u}id values in general set from userspace has always been that they are translated according to the caller's idmapping. In so far, tmpfs has been doing the correct thing. But since tmpfs is mountable in unprivileged contexts it is also necessary to verify that the resulting {k,g}uid is representable in the namespace of the superblock to avoid such bugs. The new mount api's cross-namespace delegation abilities are already widely used. Having talked to a bunch of userspace this is the most faithful solution with minimal regression risks" * tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs mm: invalidation check mapping before folio_contains tmpfs: trivial support for direct IO tmpfs,xattr: enable limited user extended attributes tmpfs: track free_ispace instead of free_inodes xattr: simple_xattr_set() return old_xattr to be freed tmpfs: verify {g,u}id mount options correctly shmem: move spinlock into shmem_recalc_inode() to fix quota support libfs: Remove parent dentry locking in offset_iterate_dir() libfs: Add a lock class for the offset map's xa_lock shmem: stable directory offsets shmem: Refactor shmem_symlink() libfs: Add directory operations for stable offsets shmem: fix quota lock nesting in huge hole handling shmem: Add default quota limit mount options shmem: quota support shmem: prepare shmem quota infrastructure quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks shmem: make shmem_get_inode() return ERR_PTR instead of NULL shmem: make shmem_inode_acct_block() return error
2023-08-24mm: allow ->huge_fault() to be called without the mmap_lock heldMatthew Wilcox (Oracle)
Remove the checks for the VMA lock being held, allowing the page fault path to call into the filesystem instead of retrying with the mmap_lock held. This will improve scalability for DAX page faults. Also update the documentation to match (and fix some other changes that have happened recently). Link: https://lkml.kernel.org/r/20230818202335.2739663-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24mm: convert do_set_pte() to set_pte_range()Yin Fengwei
set_pte_range() allows to setup page table entries for a specific range. It takes advantage of batched rmap update for large folio. It now takes care of calling update_mmu_cache_range(). Link: https://lkml.kernel.org/r/20230802151406.3735276-37-willy@infradead.org Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24ceph: update documentation regarding snapshot naming limitationsLuís Henriques
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-18Documentation: Fix typosBjorn Helgaas
Fix typos in Documentation. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20230814212822.193684-4-helgaas@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2023-08-18docs: vfs: clean up after the iterate() removalJonathan Corbet
Commit 3e3271549670 ("vfs: get rid of old '->iterate' directory operation") removed the iterate() file_operations member, but neglected to clean up the associated documentation. Get rid of the leftovers. Link: https://lore.kernel.org/r/874jl945bv.fsf@meer.lwn.net Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2023-08-16doc: idmappings: fix an error and rephrase a paragraphGONG, Ruiqi
If the modified paragraph is referring to the idmapping mentioned in the previous paragraph (i.e. `u0:k10000:r10000`), then it is `u0` that the upper idmapset starts with, not `u1000`. Fix this error and rephrase this paragraph a bit to make this reference more explicit. Reported-by: Wang Lei <wanglei249@huawei.com> Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> Message-Id: <20230816033210.914262-1-gongruiqi@huaweicloud.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-15docs: filesystems: idmappings: clarify from where idmappings are takenAlexander Mikhalitsyn
Let's clarify from where we take idmapping of each type: - caller - filesystem - mount Cc: Jonathan Corbet <corbet@lwn.net> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-doc@vger.kernel.org Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Message-Id: <20230625182047.26854-1-aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-12ovl: auto generate uuid for new overlay filesystemsAmir Goldstein
Add a new mount option uuid=auto, which is the default. If a persistent UUID xattr is found it is used. Otherwise, an existing ovelrayfs with copied up subdirs in upper dir that was never mounted with uuid=on retains the null UUID. A new overlayfs with no copied up subdirs, generates the persistent UUID on first mount. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: store persistent uuid/fsid with uuid=onAmir Goldstein
With uuid=on, store a persistent uuid in xattr on the upper dir to give the overlayfs instance a persistent identifier. This also makes f_fsid persistent and more reliable for reporting fid info in fanotify events. uuid=on is not supported on non-upper overlayfs or with upper fs that does not support xattrs. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: add support for unique fsid per instanceAmir Goldstein
The legacy behavior of ovl_statfs() reports the f_fsid filled by underlying upper fs. This fsid is not unique among overlayfs instances on the same upper fs. With mount option uuid=on, generate a non-persistent uuid per overlayfs instance and use it as the seed for f_fsid, similar to tmpfs. This is useful for reporting fanotify events with fid info from different instances of overlayfs over the same upper fs. The old behavior of null uuid and upper fs fsid is retained with the mount option uuid=null, which is the default. The mount option uuid=off that disables uuid checks in underlying layers also retains the legacy behavior. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-12ovl: Add framework for verity supportAlexander Larsson
This adds the scaffolding (docs, config, mount options) for supporting the new digest field in the metacopy xattr. This contains a fs-verity digest that need to match the fs-verity digest of the lowerdata file. The mount option "verity" specifies how this xattr is handled. If you enable verity ("verity=on") all existing xattrs are validated before use, and during metacopy we generate verity xattr in the upper metacopy file (if the source file has verity enabled). This means later accesses can guarantee that the same data is used. Additionally you can use "verity=require". In this mode all metacopy files must have a valid verity xattr. For this to work metadata copy-up must be able to create a verity xattr (so that later accesses are validated). Therefore, in this mode, if the lower data file doesn't have fs-verity enabled we fall back to a full copy rather than a metacopy. Actual implementation follows in a separate commit. Signed-off-by: Alexander Larsson <alexl@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-10docs: add maintainer entry profile for XFSDarrick J. Wong
Create a new document to list what I think are (within the scope of XFS) our shared goals and community roles. Since I will be stepping down shortly, I feel it's important to write down somewhere all the hats that I have been wearing for the past six years. Also, document important extra details about how to contribute to XFS. Cc: corbet@lwn.net Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2023-08-10tmpfs,xattr: enable limited user extended attributesHugh Dickins
Enable "user." extended attributes on tmpfs, limiting them by tracking the space they occupy, and deducting that space from the limited ispace (unless tmpfs mounted with nr_inodes=0 to leave that ispace unlimited). tmpfs inodes and simple xattrs are both unswappable, and have to be in lowmem on a 32-bit highmem kernel: so the ispace limit is appropriate for xattrs, without any need for a further mount option. Add simple_xattr_space() to give approximate but deterministic estimate of the space taken up by each xattr: with simple_xattrs_free() outputting the space freed if required (but kernfs and even some tmpfs usages do not require that, so don't waste time on strlen'ing if not needed). Security and trusted xattrs were already supported: for consistency and simplicity, account them from the same pool; though there's a small risk that a tmpfs with enough space before would now be considered too small. When extended attributes are used, "df -i" does show more IUsed and less IFree than can be explained by the inodes: document that (manpage later). xfstests tests/generic which were not run on tmpfs before but now pass: 020 037 062 070 077 097 103 117 337 377 454 486 523 533 611 618 728 with no new failures. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Message-Id: <2e63b26e-df46-5baa-c7d6-f9a8dd3282c5@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09libfs: Add directory operations for stable offsetsChuck Lever
Create a vector of directory operations in fs/libfs.c that handles directory seeks and readdir via stable offsets instead of the current cursor-based mechanism. For the moment these are unused. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Message-Id: <168814732984.530310.11190772066786107220.stgit@manet.1015granger.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09shmem: Add default quota limit mount optionsLukas Czerner
Allow system administrator to set default global quota limits at tmpfs mount time. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230725144510.253763-7-cem@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09shmem: quota supportCarlos Maiolino
Now the basic infra-structure is in place, enable quota support for tmpfs. This offers user and group quotas to tmpfs (project quotas will be added later). Also, as other filesystems, the tmpfs quota is not supported within user namespaces yet, so idmapping is not translated. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230725144510.253763-6-cem@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-06vfs: get rid of old '->iterate' directory operationLinus Torvalds
All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-27tmpfs: fix Documentation of noswap and huge mount optionsHugh Dickins
The noswap mount option is surely not one of the three options for sizing: move its description down. The huge= mount option does not accept numeric values: those are just in an internal enum. Delete those numbers, and follow the manpage text more closely (but there's not yet any fadvise() or fcntl() which applies here). /sys/kernel/mm/transparent_hugepage/shmem_enabled is hard to describe, and barely relevant to mounting a tmpfs: just refer to transhuge.rst (while still using the words deny and force, to help as informal reminders). [rdunlap@infradead.org: fixup Docs table for huge mount options] Link: https://lkml.kernel.org/r/20230725052333.26857-1-rdunlap@infradead.org Link: https://lkml.kernel.org/r/986cb0bf-9780-354-9bb-4bf57aadbab@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Fixes: d0f5a85442d1 ("shmem: update documentation") Fixes: 2c6efe9cf2d7 ("shmem: add support to ignore swap") Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-24doc: Correct the description of ->release_folioMatthew Wilcox (Oracle)
The filesystem ->release_folio method is called under more circumstances now than when the documentation was written. The second sentence describing the interpretation of the return value is the wrong polarity (false indicates failure, not success). And the third sentence is also wrong (the kernel calls try_to_free_buffers() instead). So replace the entire paragraph with a detailed description of what the state of the folio may be, the meaning of the gfp parameter, why the method is being called and what the filesystem is expected to do. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-07-21afs: Documentation: correct reference to CONFIG_AFS_FSLukas Bulwahn
Commit 0795e7c031c4 ("[AFS]: Update the AFS fs documentation.") adds a new section listing the build configuration options that need to be enabled for the AFS file system. The documentation refers to CONFIG_AFS, but the option is called CONFIG_AFS_FS, since the beginning of Linux's git history. Refer to the config option with the correct name. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20230720094301.9888-1-lukas.bulwahn@gmail.com
2023-07-17fs: distinguish between user initiated freeze and kernel initiated freezeDarrick J. Wong
Userspace can freeze a filesystem using the FIFREEZE ioctl or by suspending the block device; this state persists until userspace thaws the filesystem with the FITHAW ioctl or resuming the block device. Since commit 18e9e5104fcd ("Introduce freeze_super and thaw_super for the fsfreeze ioctl") we only allow the first freeze command to succeed. The kernel may decide that it is necessary to freeze a filesystem for its own internal purposes, such as suspends in progress, filesystem fsck activities, or quiescing a device prior to removal. Userspace thaw commands must never break a kernel freeze, and kernel thaw commands shouldn't undo userspace's freeze command. Introduce a couple of freeze holder flags and wire it into the sb_writers state. One kernel and one userspace freeze are allowed to coexist at the same time; the filesystem will not thaw until both are lifted. I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing behaviors. Cc: mcgrof@kernel.org Cc: jack@suse.cz Cc: hch@infradead.org Cc: ruansy.fnst@fujitsu.com Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2023-07-11fscrypt: improve the "Encryption modes and usage" sectionEric Biggers
As the number of supported encryption modes has grown, the part of the "Encryption modes and usage" section that describes the supported encryption modes has gotten a bit messy. It presents useful information, but it's a bit lacking in high-level context. Rework the section to hopefully be much more useful. Link: https://lore.kernel.org/r/20230630064811.22569-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-07-11mm: Introduce VM_SHADOW_STACK for shadow stack memoryYu-cheng Yu
New hardware extensions implement support for shadow stack memory, such as x86 Control-flow Enforcement Technology (CET). Add a new VM flag to identify these areas, for example, to be used to properly indicate shadow stack PTEs to the hardware. Shadow stack VMA creation will be tightly controlled and limited to anonymous memory to make the implementation simpler and since that is all that is required. The solution will rely on pte_mkwrite() to create the shadow stack PTEs, so it will not be required for vm_get_page_prot() to learn how to create shadow stack memory. For this reason document that VM_SHADOW_STACK should not be mixed with VM_SHARED. Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Mark Brown <broonie@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-15-rick.p.edgecombe%40intel.com
2023-07-08Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "16 hotfixes. Six are cc:stable and the remainder address post-6.4 issues" The merge undoes the disabling of the CONFIG_PER_VMA_LOCK feature, since it was all hopefully fixed in mainline. * tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib: dhry: fix sleeping allocations inside non-preemptable section kasan, slub: fix HW_TAGS zeroing with slub_debug kasan: fix type cast in memory_is_poisoned_n mailmap: add entries for Heiko Stuebner mailmap: update manpage link bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page MAINTAINERS: add linux-next info mailmap: add Markus Schneider-Pargmann writeback: account the number of pages written back mm: call arch_swap_restore() from do_swap_page() squashfs: fix cache race with migration mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison docs: update ocfs2-devel mailing list address MAINTAINERS: update ocfs2-devel mailing list address mm: disable CONFIG_PER_VMA_LOCK until its fixed fork: lock VMAs of the parent process when forking
2023-07-08docs: update ocfs2-devel mailing list addressAnthony Iliopoulos
The ocfs2-devel mailing list has been migrated to the kernel.org infrastructure, update all related documentation pointers to reflect the change. Link: https://lkml.kernel.org/r/20230628013437.47030-3-ailiop@suse.com Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-05Merge tag 'f2fs-for-6.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this cycle, we've mainly investigated the zoned block device support along with patches such as correcting write pointers between f2fs and storage, adding asynchronous zone reset flow, and managing the number of open zones. Other than them, f2fs adds another mount option, "errors=x" to specify how to handle when it detects an unexpected behavior at runtime. Enhancements: - support 'errors=remount-ro|continue|panic' mount option - enforce some inode flag policies - allow .tmp compression given extensions - add some ioctls to manage the f2fs compression - improve looped node chain flow - avoid issuing small-sized discard commands during checkpoint - implement an asynchronous zone reset Bug fixes: - fix deadlock in xattr and inode page lock - fix and add sanity check in some error paths - fix to avoid NULL pointer dereference f2fs_write_end_io() along with put_super - set proper flags to quota files - fix potential deadlock due to unpaired node_write lock use - fix over-estimating free section during FG GC - fix the wrong condition to determine atomic context As usual, also there are a number of patches with code refactoring and minor clean-ups" * tag 'f2fs-for-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits) f2fs: fix to do sanity check on direct node in truncate_dnode() f2fs: only set release for file that has compressed data f2fs: fix compile warning in f2fs_destroy_node_manager() f2fs: fix error path handling in truncate_dnode() f2fs: fix deadlock in i_xattr_sem and inode page lock f2fs: remove unneeded page uptodate check/set f2fs: update mtime and ctime in move file range method f2fs: compress tmp files given extension f2fs: refactor struct f2fs_attr macro f2fs: convert to use sbi directly f2fs: remove redundant assignment to variable err f2fs: do not issue small discard commands during checkpoint f2fs: check zone write pointer points to the end of zone f2fs: add f2fs_ioc_get_compress_blocks f2fs: cleanup MIN_INLINE_XATTR_SIZE f2fs: add helper to check compression level f2fs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method f2fs: do more sanity check on inode f2fs: compress: fix to check validity of i_compress_flag field f2fs: add sanity compress level check for compressed file ...
2023-06-29Merge tag 'fsnotify_for_v6.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - Support for fanotify events returning file handles for filesystems not exportable via NFS - Improved error handling exportfs functions - Add missing FS_OPEN events when unusual open helpers are used * tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: move fsnotify_open() hook into do_dentry_open() exportfs: check for error return value from exportfs_encode_*() fanotify: support reporting non-decodeable file handles exportfs: allow exporting non-decodeable file handles to userspace exportfs: add explicit flag to request non-decodeable file handles exportfs: change connectable argument to bit flags
2023-06-29Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various cleanups and bug fixes in ext4's extent status tree, journalling, and block allocator subsystems. Also improve performance for parallel DIO overwrites" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (55 commits) ext4: avoid updating the superblock on a r/o mount if not needed jbd2: skip reading super block if it has been verified ext4: fix to check return value of freeze_bdev() in ext4_shutdown() ext4: refactoring to use the unified helper ext4_quotas_off() ext4: turn quotas off if mount failed after enabling quotas ext4: update doc about journal superblock description ext4: add journal cycled recording support jbd2: continue to record log between each mount jbd2: remove j_format_version jbd2: factor out journal initialization from journal_get_superblock() jbd2: switch to check format version in superblock directly jbd2: remove unused feature macros ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev' ext4: Fix reusing stale buffer heads from last failed mounting ext4: allow concurrent unaligned dio overwrites ext4: clean up mballoc criteria comments ext4: make ext4_zeroout_es() return void ext4: make ext4_es_insert_extent() return void ext4: make ext4_es_insert_delayed_block() return void ext4: make ext4_es_remove_extent() return void ...
2023-06-29Merge tag 'ovl-update-6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs update from Amir Goldstein: - fix two NULL pointer deref bugs (Zhihao Cheng) - add support for "data-only" lower layers destined to be used by composefs - port overlayfs to the new mount api (Christian Brauner) * tag 'ovl-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: (26 commits) ovl: add Amir as co-maintainer ovl: reserve ability to reconfigure mount options with new mount api ovl: modify layer parameter parsing ovl: port to new mount api ovl: factor out ovl_parse_options() helper ovl: store enum redirect_mode in config instead of a string ovl: pass ovl_fs to xino helpers ovl: clarify ovl_get_root() semantics ovl: negate the ofs->share_whiteout boolean ovl: check type and offset of struct vfsmount in ovl_entry ovl: implement lazy lookup of lowerdata in data-only layers ovl: prepare for lazy lookup of lowerdata inode ovl: prepare to store lowerdata redirect for lazy lowerdata lookup ovl: implement lookup in data-only layers ovl: introduce data-only lower layers ovl: remove unneeded goto instructions ovl: deduplicate lowerdata and lowerstack[] ovl: deduplicate lowerpath and lowerstack[] ovl: move ovl_entry into ovl_inode ovl: factor out ovl_free_entry() and ovl_stack_*() helpers ...
2023-06-28Merge tag 'net-next-6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
2023-06-27Merge tag 'hardening-v6.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "There are three areas of note: A bunch of strlcpy()->strscpy() conversions ended up living in my tree since they were either Acked by maintainers for me to carry, or got ignored for multiple weeks (and were trivial changes). The compiler option '-fstrict-flex-arrays=3' has been enabled globally, and has been in -next for the entire devel cycle. This changes compiler diagnostics (though mainly just -Warray-bounds which is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_ coverage. In other words, there are no new restrictions, just potentially new warnings. Any new FORTIFY warnings we've seen have been fixed (usually in their respective subsystem trees). For more details, see commit df8fc4e934c12b. The under-development compiler attribute __counted_by has been added so that we can start annotating flexible array members with their associated structure member that tracks the count of flexible array elements at run-time. It is possible (likely?) that the exact syntax of the attribute will change before it is finalized, but GCC and Clang are working together to sort it out. Any changes can be made to the macro while we continue to add annotations. As an example of that last case, I have a treewide commit waiting with such annotations found via Coccinelle: https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b Also see commit dd06e72e68bcb4 for more details. Summary: - Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko) - Convert strreplace() to return string start (Andy Shevchenko) - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook) - Add missing function prototypes seen with W=1 (Arnd Bergmann) - Fix strscpy() kerndoc typo (Arne Welzel) - Replace strlcpy() with strscpy() across many subsystems which were either Acked by respective maintainers or were trivial changes that went ignored for multiple weeks (Azeem Shaikh) - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers) - Add KUnit tests for strcat()-family - Enable KUnit tests of FORTIFY wrappers under UML - Add more complete FORTIFY protections for strlcat() - Add missed disabling of FORTIFY for all arch purgatories. - Enable -fstrict-flex-arrays=3 globally - Tightening UBSAN_BOUNDS when using GCC - Improve checkpatch to check for strcpy, strncpy, and fake flex arrays - Improve use of const variables in FORTIFY - Add requested struct_size_t() helper for types not pointers - Add __counted_by macro for annotating flexible array size members" * tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits) netfilter: ipset: Replace strlcpy with strscpy uml: Replace strlcpy with strscpy um: Use HOST_DIR for mrproper kallsyms: Replace all non-returning strlcpy with strscpy sh: Replace all non-returning strlcpy with strscpy of/flattree: Replace all non-returning strlcpy with strscpy sparc64: Replace all non-returning strlcpy with strscpy Hexagon: Replace all non-returning strlcpy with strscpy kobject: Use return value of strreplace() lib/string_helpers: Change returned value of the strreplace() jbd2: Avoid printing outside the boundary of the buffer checkpatch: Check for 0-length and 1-element arrays riscv/purgatory: Do not use fortified string functions s390/purgatory: Do not use fortified string functions x86/purgatory: Do not use fortified string functions acpi: Replace struct acpi_table_slit 1-element array with flex-array clocksource: Replace all non-returning strlcpy with strscpy string: use __builtin_memcpy() in strlcpy/strlcat staging: most: Replace all non-returning strlcpy with strscpy drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy ...
2023-06-26ext4: update doc about journal superblock descriptionZhang Yi
Update the description in doc about the s_head parameter in the journal superblock. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20230322013353.1843306-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-26Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds
Pull fsverity updates from Eric Biggers: "Several updates for fs/verity/: - Do all hashing with the shash API instead of with the ahash API. This simplifies the code and reduces API overhead. It should also make things slightly easier for XFS's upcoming support for fsverity. It does drop fsverity's support for off-CPU hash accelerators, but that support was incomplete and not known to be used - Update and export fsverity_get_digest() so that it's ready for overlayfs's upcoming support for fsverity checking of lowerdata - Improve the documentation for builtin signature support - Fix a bug in the large folio support" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: improve documentation for builtin signature support fsverity: rework fsverity_get_digest() again fsverity: simplify error handling in verify_data_block() fsverity: don't use bio_first_page_all() in fsverity_verify_bio() fsverity: constify fsverity_hash_alg fsverity: use shash API instead of ahash API
2023-06-26Merge tag 'v6.5/vfs.rename.locking' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs rename locking updates from Christian Brauner: "This contains the work from Jan to fix problems with cross-directory renames originally reported in [1]. To quickly sum it up some filesystems (so far we know at least about ext4, udf, f2fs, ocfs2, likely also reiserfs, gfs2 and others) need to lock the directory when it is being renamed into another directory. This is because we need to update the parent pointer in the directory in that case and if that races with other operations on the directory, in particular a conversion from one directory format into another, bad things can happen. So far we've done the locking in the filesystem code but recently Darrick pointed out in [2] that the RENAME_EXCHANGE case was missing. That one is particularly nasty because RENAME_EXCHANGE can arbitrarily mix regular files and directories and proper lock ordering is not achievable in the filesystems alone. This patch set adds locking into vfs_rename() so that not only parent directories but also moved inodes, regardless of whether they are directories or not, are locked when calling into the filesystem. This means establishing a locking order for unrelated directories. New helpers are added for this purpose and our documentation is updated to cover this in detail. The locking is now actually easier to follow as we now always lock source and target. We've always locked the target independent of whether it was a directory or file and we've always locked source if it was a regular file. The exact details for why this came about can be found in [3] and [4]" Link: https://lore.kernel.org/all/20230117123735.un7wbamlbdihninm@quack3 [1] Link: https://lore.kernel.org/all/20230517045836.GA11594@frogsfrogsfrogs [2] Link: https://lore.kernel.org/all/20230526-schrebergarten-vortag-9cd89694517e@brauner [3] Link: https://lore.kernel.org/all/20230530-seenotrettung-allrad-44f4b00139d4@brauner [4] * tag 'v6.5/vfs.rename.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Restrict lock_two_nondirectories() to non-directory inodes fs: Lock moved directories fs: Establish locking order for unrelated directories Revert "f2fs: fix potential corruption when moving a directory" Revert "udf: Protect rename against modification of moved directory" ext4: Remove ext4 locking of moved directory
2023-06-24sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)David Howells
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20fsverity: improve documentation for builtin signature supportEric Biggers
fsverity builtin signatures (CONFIG_FS_VERITY_BUILTIN_SIGNATURES) aren't the only way to do signatures with fsverity, and they have some major limitations. Yet, more users have tried to use them, e.g. recently by https://github.com/ostreedev/ostree/pull/2640. In most cases this seems to be because users aren't sufficiently familiar with the limitations of this feature and what the alternatives are. Therefore, make some updates to the documentation to try to clarify the properties of this feature and nudge users in the right direction. Note that the Integrity Policy Enforcement (IPE) LSM, which is not yet upstream, is planned to use the builtin signatures. (This differs from IMA, which uses its own signature mechanism.) For that reason, my earlier patch "fsverity: mark builtin signatures as deprecated" (https://lore.kernel.org/r/20221208033548.122704-1-ebiggers@kernel.org), which marked builtin signatures as "deprecated", was controversial. This patch therefore stops short of marking the feature as deprecated. I've also revised the language to focus on better explaining the feature and what its alternatives are. Link: https://lore.kernel.org/r/20230620041937.5809-1-ebiggers@kernel.org Reviewed-by: Colin Walters <walters@verbum.org> Reviewed-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-19ovl: store enum redirect_mode in config instead of a stringAmir Goldstein
Do all the logic to set the mode during mount options parsing and do not keep the option string around. Use a constant_table to translate from enum redirect mode to string in preperation for new mount api option parsing. The mount option "off" is translated to either "follow" or "nofollow", depending on the "redirect_always_follow" build/module config, so in effect, there are only three possible redirect modes. This results in a minor change to the string that is displayed in show_options() - when redirect_dir is enabled by default and the user mounts with the option "redirect_dir=off", instead of displaying the mode "redirect_dir=off" in show_options(), the displayed mode will be either "redirect_dir=follow" or "redirect_dir=nofollow", depending on the value of "redirect_always_follow" build/module config. The displayed mode reflects the effective mode, so mounting overlayfs again with the dispalyed redirect_dir option will result with the same effective and displayed mode. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19ovl: introduce data-only lower layersAmir Goldstein
Introduce the format lowerdir=lower1:lower2::lowerdata1::lowerdata2 where the lower layers on the right of the :: separators are not merged into the overlayfs merge dirs. Data-only lower layers are only allowed at the bottom of the stack. The files in those layers are only meant to be accessible via absolute redirect from metacopy files in lower layers. Following changes will implement lookup in the data layers. This feature was requested for composefs ostree use case, where the lower data layer should only be accessiable via absolute redirects from metacopy inodes. The lower data layers are not required to a have a unique uuid or any uuid at all, because they are never used to compose the overlayfs inode st_ino/st_dev. Reviewed-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-02fs: Lock moved directoriesJan Kara
When a directory is moved to a different directory, some filesystems (udf, ext4, ocfs2, f2fs, and likely gfs2, reiserfs, and others) need to update their pointer to the parent and this must not race with other operations on the directory. Lock the directories when they are moved. Although not all filesystems need this locking, we perform it in vfs_rename() because getting the lock ordering right is really difficult and we don't want to expose these locking details to filesystems. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230601105830.13168-5-jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-30autofs: use flexible array in ioctl structureArnd Bergmann
Commit df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3") introduced a warning for the autofs_dev_ioctl structure: In function 'check_name', inlined from 'validate_dev_ioctl' at fs/autofs/dev-ioctl.c:131:9, inlined from '_autofs_dev_ioctl' at fs/autofs/dev-ioctl.c:624:8: fs/autofs/dev-ioctl.c:33:14: error: 'strchr' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread] 33 | if (!strchr(name, '/')) | ^~~~~~~~~~~~~~~~~ In file included from include/linux/auto_dev-ioctl.h:10, from fs/autofs/autofs_i.h:10, from fs/autofs/dev-ioctl.c:14: include/uapi/linux/auto_dev-ioctl.h: In function '_autofs_dev_ioctl': include/uapi/linux/auto_dev-ioctl.h:112:14: note: source object 'path' of size 0 112 | char path[0]; | ^~~~ This is easily fixed by changing the gnu 0-length array into a c99 flexible array. Since this is a uapi structure, we have to be careful about possible regressions but this one should be fine as they are equivalent here. While it would break building with ancient gcc versions that predate c99, it helps building with --std=c99 and -Wpedantic builds in user space, as well as non-gnu compilers. This means we probably also want it fixed in stable kernels. Cc: stable@vger.kernel.org Cc: Kees Cook <keescook@chromium.org> Cc: Gustavo A. R. Silva" <gustavoars@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230523081944.581710-1-arnd@kernel.org
2023-05-24smb3: move Documentation/filesystems/cifs to Documentation/filesystems/smbSteve French
Documentation/filesystems/cifs contains both server and client information so its pathname is misleading. In addition, the directory fs/smb now contains both server and client, so move Documentation/filesystems/cifs to Documentation/filesystems/smb Suggested-by: Namjae Jeon <linkinjeon@kernel.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-24cifs: correct references in Documentation to old fs/cifs pathSteve French
The fs/cifs directory has moved to fs/smb/client, correct mentions of this in Documentation and comments. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-22exportfs: add explicit flag to request non-decodeable file handlesAmir Goldstein
So far, all callers of exportfs_encode_inode_fh(), except for fsnotify's show_mark_fhandle(), check that filesystem can decode file handles, but we would like to add more callers that do not require a file handle that can be decoded. Introduce a flag to explicitly request a file handle that may not to be decoded later and a wrapper exportfs_encode_fid() that sets this flag and convert show_mark_fhandle() to use the new wrapper. This will be used to allow adding fanotify support to filesystems that do not support NFS export. Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230502124817.3070545-3-amir73il@gmail.com>
2023-05-16Documentation/filesystems: ramfs-rootfs-initramfs: use :Author:Randy Dunlap
Use the :Author: markup instead of making it a chapter heading. This cleans up the table of contents for this file. Fixes: 7f46a240b0a1 ("[PATCH] ramfs, rootfs, and initramfs docs") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Rob Landley <rob@landley.net> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230508055928.3548-1-rdunlap@infradead.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>