summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_super.c
AgeCommit message (Collapse)Author
2025-02-14xfs: do not check NEEDSREPAIR if ro,norecovery mount.Lukas Herbolt
If there is corrutpion on the filesystem andxfs_repair fails to repair it. The last resort of getting the data is to use norecovery,ro mount. But if the NEEDSREPAIR is set the filesystem cannot be mounted. The flag must be cleared out manually using xfs_db, to get access to what left over of the corrupted fs. Signed-off-by: Lukas Herbolt <lukas@herbolt.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-23Merge tag 'fsnotify_hsm_for_v6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify pre-content notification support from Jan Kara: "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets generated before a file contents is accessed. The event is synchronous so if there is listener for this event, the kernel waits for reply. On success the execution continues as usual, on failure we propagate the error to userspace. This allows userspace to fill in file content on demand from slow storage. The context in which the events are generated has been picked so that we don't hold any locks and thus there's no risk of a deadlock for the userspace handler. The new pre-content event is available only for users with global CAP_SYS_ADMIN capability (similarly to other parts of fanotify functionality) and it is an administrator responsibility to make sure the userspace event handler doesn't do stupid stuff that can DoS the system. Based on your feedback from the last submission, fsnotify code has been improved and now file->f_mode encodes whether pre-content event needs to be generated for the file so the fast path when nobody wants pre-content event for the file just grows the additional file->f_mode check. As a bonus this also removes the checks whether the old FS_ACCESS event needs to be generated from the fast path. Also the place where the event is generated during page fault has been moved so now filemap_fault() generates the event if and only if there is no uptodate folio in the page cache. Also we have dropped FS_PRE_MODIFY event as current real-world users of the pre-content functionality don't really use it so let's start with the minimal useful feature set" * tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) fanotify: Fix crash in fanotify_init(2) fs: don't block write during exec on pre-content watched files fs: enable pre-content events on supported file systems ext4: add pre-content fsnotify hook for DAX faults btrfs: disable defrag on pre-content watched files xfs: add pre-content fsnotify hook for DAX faults fsnotify: generate pre-content permission event on page fault mm: don't allow huge faults for files with pre content watches fanotify: disable readahead if we have pre-content watches fanotify: allow to set errno in FAN_DENY permission response fanotify: report file range info with pre-content events fanotify: introduce FAN_PRE_ACCESS permission event fsnotify: generate pre-content permission event on truncate fsnotify: pass optional file access range in pre-content event fsnotify: introduce pre-content permission events fanotify: reserve event bit of deprecated FAN_DIR_MODIFY fanotify: rename a misnamed constant fanotify: don't skip extra event info if no info_mode is set fsnotify: check if file is actually being watched for pre-content events on open fsnotify: opt-in for permission events at file open time ...
2025-01-13xfs: refactor xfs_fs_statfsChristoph Hellwig
Split out helpers for data, rt data and inode related information, and assigning f_bavail once instead of in three places. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-13xfs: don't take m_sb_lock in xfs_fs_statfsChristoph Hellwig
The only non-constant value read under m_sb_lock in xfs_fs_statfs is sb_dblocks, and it could become stale right after dropping the lock anyway. Remove the thus pointless lock section. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-12-23xfs: enable realtime reflinkDarrick J. Wong
Enable reflink for realtime devices, as long as the realtime allocation unit is a single fsblock. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: enable realtime rmap btreeDarrick J. Wong
Permit mounting filesystems with realtime rmap btrees. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-11fs: enable pre-content events on supported file systemsJosef Bacik
Now that all the code has been added for pre-content events, and the various file systems that need the page fault hooks for fsnotify have been updated, add SB_I_ALLOW_HSM to the supported file systems. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/46960dcb2725fa0317895ed66a8409ba1c306a82.1731684329.git.josef@toxicpanda.com
2024-11-21Merge tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: "The bulk of this pull request is a major rework that Darrick and Christoph have been doing on XFS's real-time volume, coupled with a few features to support this rework. It does also includes some bug fixes. - convert perag to use xarrays - create a new generic allocation group structure - add metadata inode dir trees - create in-core rt allocation groups - shard the RT section into allocation groups - persist quota options with the enw metadata dir tree - enable quota for RT volumes - enable metadata directory trees - some bugfixes" * tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (146 commits) xfs: port ondisk structure checks from xfs/122 to the kernel xfs: separate space btree structures in xfs_ondisk.h xfs: convert struct typedefs in xfs_ondisk.h xfs: enable metadata directory feature xfs: enable realtime quota again xfs: update sb field checks when metadir is turned on xfs: reserve quota for realtime files correctly xfs: create quota preallocation watermarks for realtime quota xfs: report realtime block quota limits on realtime directories xfs: persist quota flags with metadir xfs: advertise realtime quota support in the xqm stat files xfs: scrub quota file metapaths xfs: fix chown with rt quota xfs: use metadir for quota inodes xfs: refactor xfs_qm_destroy_quotainos xfs: use rtgroup busy extent list for FITRIM xfs: implement busy extent tracking for rtgroups xfs: port the perag discard code to handle generic groups xfs: move the min and max group block numbers to xfs_group xfs: adjust min_block usage in xfs_verify_agbno ...
2024-11-05xfs: report realtime block quota limits on realtime directoriesDarrick J. Wong
On the data device, calling statvfs on a projinherit directory results in the block and avail counts being curtailed to the project quota block limits, if any are set. Do the same for realtime files or directories, only use the project quota rt block limits. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: persist quota flags with metadirDarrick J. Wong
It's annoying that one has to keep reminding XFS about what quota options it should mount with, since the quota flags recording the previous state are sitting right there in the primary superblock. Even more strangely, there exists a noquota option to disable quotas completely, so it's odder still that providing no options is the same as noquota. Starting with metadir, let's change the behavior so that if the user does not specify any quota-related mount options at all, the ondisk quota flags will be used to bring up quota. In other words, the filesystem will mount in the same state and with the same functionality as it had during the last mount. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: check the realtime superblock at mount timeDarrick J. Wong
Check the realtime superblock at mount time, to ensure that the label and uuids actually match the primary superblock on the data device. If the rt superblock is good, attach it to the xfs_mount so that the log can use ordered buffers to keep this primary in sync with the primary super on the data device. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: make RT extent numbers relative to the rtgroupChristoph Hellwig
To prepare for adding per-rtgroup bitmap files, make the xfs_rtxnum_t type encode the RT extent number relative to the rtgroup. The biggest part of this to clearly distinguish between the relative extent number that gets masked when converting from a global block number and length values that just have a factor applied to them when converting from file system blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05xfs: define the on-disk format for the metadir featureDarrick J. Wong
Define the on-disk layout and feature flags for the metadata inode directory feature. Add a xfs_sb_version_hasmetadir for benefit of xfs_repair, which needs to know where the new end of the superblock lies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: standardize EXPERIMENTAL warning generationDarrick J. Wong
Refactor the open-coded warnings about EXPERIMENTAL feature use into a standard helper before we go adding more experimental features. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: factor out a generic xfs_group structureChristoph Hellwig
Split the lookup and refcount handling of struct xfs_perag into an embedded xfs_group structure that can be reused for the upcoming realtime groups. It will be extended with more features later. Note that he xg_type field will only need a single bit even with realtime group support. For now it fills a hole, but it might be worth to fold it into another field if we can use this space better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-10-10Merge patch series "timekeeping/fs: multigrain timestamp redux"Christian Brauner
Jeff Layton <jlayton@kernel.org> says: The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). * patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org: tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10xfs: switch to multigrain timestampsJeff Layton
Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. Also, anytime the mtime changes, the ctime must also change, and those are now the only two options for xfs_trans_ichgtime. Have that function unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is always set. Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime should give us better semantics now. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-9-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-20Merge tag 'vfs-6.12.blocksize' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs blocksize updates from Christian Brauner: "This contains the vfs infrastructure as well as the xfs bits to enable support for block sizes (bs) larger than page sizes (ps) plus a few fixes to related infrastructure. There has been efforts over the last 16 years to enable enable Large Block Sizes (LBS), that is block sizes in filesystems where bs > page size. Through these efforts we have learned that one of the main blockers to supporting bs > ps in filesystems has been a way to allocate pages that are at least the filesystem block size on the page cache where bs > ps. Thanks to various previous efforts it is possible to support bs > ps in XFS with only a few changes in XFS itself. Most changes are to the page cache to support minimum order folio support for the target block size on the filesystem. A motivation for Large Block Sizes today is to support high-capacity (large amount of Terabytes) QLC SSDs where the internal Indirection Unit (IU) are typically greater than 4k to help reduce DRAM and so in turn cost and space. In practice this then allows different architectures to use a base page size of 4k while still enabling support for block sizes aligned to the larger IUs by relying on high order folios on the page cache when needed. It also allows to take advantage of the drive's support for atomics larger than 4k with buffered IO support in Linux. As described this year at LSFMM, supporting large atomics greater than 4k enables databases to remove the need to rely on their own journaling, so they can disable double buffered writes, which is a feature different cloud providers are already enabling through custom storage solutions" * tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (22 commits) Documentation: iomap: fix a typo iomap: remove the iomap_file_buffered_write_punch_delalloc return value iomap: pass the iomap to the punch callback iomap: pass flags to iomap_file_buffered_write_punch_delalloc iomap: improve shared block detection in iomap_unshare_iter iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release docs:filesystems: fix spelling and grammar mistakes in iomap design page filemap: fix htmldoc warning for mapping_align_index() iomap: make zero range flush conditional on unwritten mappings iomap: fix handling of dirty folios over unwritten extents iomap: add a private argument for iomap_file_buffered_write iomap: remove set_memor_ro() on zero page xfs: enable block size larger than page size support xfs: make the calculation generic in xfs_sb_validate_fsb_count() xfs: expose block size in stat xfs: use kvmalloc for xattr buffers iomap: fix iomap_dio_zero() for fs bs > system page size filemap: cap PTE range to be created to allowed zero fill in folio_map_range() mm: split a folio in minimum folio order chunks readahead: allocate folios with mapping_min_order in readahead ...
2024-09-03xfs: enable block size larger than page size supportPankaj Raghav
Page cache now has the ability to have a minimum order when allocating a folio which is a prerequisite to add support for block size > page size. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20240827-xfs-fix-wformat-bs-gt-ps-v1-1-aec6717609e0@kernel.org # fix folded Link: https://lore.kernel.org/r/20240822135018.1931258-11-kernel@pankajraghav.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-03xfs: convert perag lookup to xarrayChristoph Hellwig
Convert the perag lookup from the legacy radix tree to the xarray, which allows for much nicer iteration and bulk lookup semantics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03xfs: Use xfs set and clear mp state helpersJohn Garry
Use the set and clear mp state helpers instead of open-coding. It is noted that in some instances calls to atomic operation set_bit() and clear_bit() are being replaced with test_and_set_bit() and test_and_clear_bit(), respectively, as there is no specific helpers for set_bit() and clear_bit() only. However should be ok, as we are just ignoring the returned value from those "test" variants. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-23xfs: add a incompat feature bit for parent pointersAllison Henderson
Create an incompat feature bit and a fs geometry flag so that we can enable the feature in the ondisk superblock and advertise its existence to userspace. Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-04-23xfs: parent pointer attribute creationAllison Henderson
Add parent pointer attribute during xfs_create, and subroutines to initialize attributes. Note that the xfs_attr_intent object contains a pointer to the caller's xfs_da_args object, so the latter must persist until transaction commit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: shorten names, adjust to new format, set init_xattrs for parent pointers] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-22xfs: support RT inodes in xfs_mod_delallocChristoph Hellwig
To prepare for re-enabling delalloc on RT devices, track the data blocks (which use the RT device when the inode sits on it) and the indirect blocks (which don't) separately to xfs_mod_delalloc, and add a new percpu counter to also track the RT delalloc blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22xfs: split xfs_mod_freecounterChristoph Hellwig
xfs_mod_freecounter has two entirely separate code paths for adding or subtracting from the free counters. Only the subtract case looks at the rsvd flag and can return an error. Split xfs_mod_freecounter into separate helpers for subtracting or adding the freecounter, and remove all the impossible to reach error handling for the addition case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22xfs: compile out v4 support if disabledChristoph Hellwig
Add a few strategic IS_ENABLED statements to let the compiler eliminate unused code when CONFIG_XFS_SUPPORT_V4 is disabled. This saves multiple kilobytes of .text in my .config: $ size xfs.o.* text data bss dec hex filename 1363633 294836 592 1659061 1950b5 xfs.o.new 1371453 294868 592 1666913 196f61 xfs.o.old Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-15xfs: introduce a file mapping exchange log intent itemDarrick J. Wong
Introduce a new intent log item to handle exchanging mappings between the forks of two files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: create a incompat flag for atomic file mapping exchangesDarrick J. Wong
Create a incompat flag so that we only attempt to process file mapping exchange log items if the filesystem supports it, and a geometry flag to advertise support if it's present. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-03-27fs,block: yield devices earlyChristian Brauner
Currently a device is only really released once the umount returns to userspace due to how file closing works. That ultimately could cause an old umount assumption to be violated that concurrent umount and mount don't fail. So an exclusively held device with a temporary holder should be yielded before the filesystem is gone. Add a helper that allows callers to do that. This also allows us to remove the two holder ops that Linus wasn't excited about. Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-13Merge tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: - Online repair updates: - More ondisk structures being repaired: - Inode's mode field by trying to obtain file type value from the a directory entry - Quota counters - Link counts of inodes - FS summary counters - Support for in-memory btrees has been added to support repair of rmap btrees - Misc changes: - Report corruption of metadata to the health tracking subsystem - Enable indirect health reporting when resources are scarce - Reduce memory usage while repairing refcount btree - Extend "Bmap update" intent item to support atomic extent swapping on the realtime device - Extend "Bmap update" intent item to support extended attribute fork and unwritten extents - Code cleanups: - Bmap log intent - Btree block pointer checking - Btree readahead - Buffer target - Symbolic link code - Remove mrlock wrapper around the rwsem - Convert all the GFP_NOFS flag usages to use the scoped memalloc_nofs_save() API instead of direct calls with the GFP_NOFS - Refactor and simplify xfile abstraction. Lower level APIs in shmem.c are required to be exported in order to achieve this - Skip checking alignment constraints for inode chunk allocations when block size is larger than inode chunk size - Do not submit delwri buffers collected during log recovery when an error has been encountered - Fix SEEK_HOLE/DATA for file regions which have active COW extents - Fix lock order inversion when executing error handling path during shrinking a filesystem - Remove duplicate ifdefs * tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (183 commits) xfs: shrink failure needs to hold AGI buffer mm/shmem.c: Use new form of *@param in kernel-doc kernel-doc: Add unary operator * to $type_param_ref xfs: use kvfree() in xlog_cil_free_logvec() xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAIL xfs: fix scrub stats file permissions xfs: fix log recovery erroring out on refcount recovery failure xfs: move symlink target write function to libxfs xfs: move remote symlink target read function to libxfs xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h xfs: xfs_bmap_finish_one should map unwritten extents properly xfs: support deferred bmap updates on the attr fork xfs: support recovering bmap intent items targetting realtime extents xfs: add a realtime flag to the bmap update log redo items xfs: add a xattr_entry helper xfs: fix xfs_bunmapi to allow unmapping of partial rt extents xfs: move xfs_bmap_defer_add to xfs_bmap_item.c xfs: reuse xfs_bmap_update_cancel_item xfs: add a bi_entry helper xfs: remove xfs_trans_set_bmap_flags ...
2024-03-12mm, slab: remove last vestiges of SLAB_MEM_SPREADLinus Torvalds
Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-02-27xfs: drop experimental warning for FSDAXShiyang Ruan
FSDAX and reflink can work together now, let's drop this warning. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-25xfs: port block device access to filesChristian Brauner
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-7-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25bdev: open block device as filesChristian Brauner
Add two new helpers to allow opening block devices as files. This is not the final infrastructure. This still opens the block device before opening a struct a file. Until we have removed all references to struct bdev_handle we can't switch the order: * Introduce blk_to_file_flags() to translate from block specific to flags usable to pen a new file. * Introduce bdev_file_open_by_{dev,path}(). * Introduce temporary sb_bdev_handle() helper to retrieve a struct bdev_handle from a block device file and update places that directly reference struct bdev_handle to rely on it. * Don't count block device openes against the number of open files. A bdev_file_open_by_{dev,path}() file is never installed into any file descriptor table. One idea that came to mind was to use kernel_tmpfile_open() which would require us to pass a path and it would then call do_dentry_open() going through the regular fops->open::blkdev_open() path. But then we're back to the problem of routing block specific flags such as BLK_OPEN_RESTRICT_WRITES through the open path and would have to waste FMODE_* flags every time we add a new one. With this we can avoid using a flag bit and we have more leeway in how we open block devices from bdev_open_by_{dev,path}(). Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-22xfs: port refcount repair to the new refcount bag structureDarrick J. Wong
Port the refcount record generating code to use the new refcount bag data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: track directory entry updates during live nlinks fsckDarrick J. Wong
Create the necessary hooks in the directory operations (create/link/unlink/rename) code so that our live nlink scrub code can stay up to date with link count updates in the rest of the filesystem. This will be the means to keep our shadow link count information up to date while the scan runs in real time. In online fsck part 2, we'll use these same hooks to handle repairs to directories and parent pointer information. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-19xfs: Remove mrlock wrapperMatthew Wilcox (Oracle)
mrlock was an rwsem wrapper that also recorded whether the lock was held for read or write. Now that we can ask the generic code whether the lock is held for read or write, we can remove this wrapper and use an rwsem directly. As the comment says, we can't use lockdep to assert that the ILOCK is held for write, because we might be in a workqueue, and we aren't able to tell lockdep that we do in fact own the lock. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert remaining kmem_free() to kfree()Dave Chinner
The remaining callers of kmem_free() are freeing heap memory, so we can convert them directly to kfree() and get rid of kmem_free() altogether. This conversion was done with: $ for f in `git grep -l kmem_free fs/xfs`; do > sed -i s/kmem_free/kfree/ $f > done $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert kmem_alloc() to kmalloc()Dave Chinner
kmem_alloc() is just a thin wrapper around kmalloc() these days. Convert everything to use kmalloc() so we can get rid of the wrapper. Note: the transaction region allocation in xlog_add_to_transaction() can be a high order allocation. Converting it to use kmalloc(__GFP_NOFAIL) results in warnings in the page allocation code being triggered because the mm subsystem does not want us to use __GFP_NOFAIL with high order allocations like we've been doing with the kmem_alloc() wrapper for a couple of decades. Hence this specific case gets converted to xlog_kvmalloc() rather than kmalloc() to avoid this issue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-22xfs: read only mounts with fsopen mount API are bustedDave Chinner
Recently xfs/513 started failing on my test machines testing "-o ro,norecovery" mount options. This was being emitted in dmesg: [ 9906.932724] XFS (pmem0): no-recovery mounts must be read-only. Turns out, readonly mounts with the fsopen()/fsconfig() mount API have been busted since day zero. It's only taken 5 years for debian unstable to start using this "new" mount API, and shortly after this I noticed xfs/513 had started to fail as per above. The syscall trace is: fsopen("xfs", FSOPEN_CLOEXEC) = 3 mount_setattr(-1, NULL, 0, NULL, 0) = -1 EINVAL (Invalid argument) ..... fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/pmem0", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "norecovery", NULL, 0) = 0 fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 EINVAL (Invalid argument) close(3) = 0 Showing that the actual mount instantiation (FSCONFIG_CMD_CREATE) is what threw out the error. During mount instantiation, we call xfs_fs_validate_params() which does: /* No recovery flag requires a read-only mount */ if (xfs_has_norecovery(mp) && !xfs_is_readonly(mp)) { xfs_warn(mp, "no-recovery mounts must be read-only."); return -EINVAL; } and xfs_is_readonly() checks internal mount flags for read only state. This state is set in xfs_init_fs_context() from the context superblock flag state: /* * Copy binary VFS mount flags we are interested in. */ if (fc->sb_flags & SB_RDONLY) set_bit(XFS_OPSTATE_READONLY, &mp->m_opstate); With the old mount API, all of the VFS specific superblock flags had already been parsed and set before xfs_init_fs_context() is called, so this all works fine. However, in the brave new fsopen/fsconfig world, xfs_init_fs_context() is called from fsopen() context, before any VFS superblock have been set or parsed. Hence if we use fsopen(), the internal XFS readonly state is *never set*. Hence anything that depends on xfs_is_readonly() actually returning true for read only mounts is broken if fsopen() has been used to mount the filesystem. Fix this by moving this internal state initialisation to xfs_fs_fill_super() before we attempt to validate the parameters that have been set prior to the FSCONFIG_CMD_CREATE call being made. Signed-off-by: Dave Chinner <dchinner@redhat.com> Fixes: 73e5fff98b64 ("xfs: switch to use the new mount-api") cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-10Merge tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: "New features/functionality: - Online repair: - Reserve disk space for online repairs - Fix misinteraction between the AIL and btree bulkloader because of which the bulk load fails to queue a buffer for writeback if it happens to be on the AIL list - Prevent transaction reservation overflows when reaping blocks during online repair - Whenever possible, bulkloader now copies multiple records into a block - Support repairing of 1. Per-AG free space, inode and refcount btrees 2. Ondisk inodes 3. File data and attribute fork mappings - Verify the contents of 1. Inode and data fork of realtime bitmap file 2. Quota files - Introduce MF_MEM_PRE_REMOVE. This will be used to notify tasks about a pmem device being removed Bug fixes: - Fix memory leak of recovered attri intent items - Fix UAF during log intent recovery - Fix realtime geometry integer overflows - Prevent scrub from live locking in xchk_iget - Prevent fs shutdown when removing files during low free disk space - Prevent transaction reservation overflow when extending an RT device - Prevent incorrect warning from being printed when extending a filesystem - Fix an off-by-one error in xreap_agextent_binval - Serialize access to perag radix tree during deletion operation - Fix perag memory leak during growfs - Allow allocation of minlen realtime extent when the maximum sized realtime free extent is minlen in size Cleanups: - Remove duplicate boilerplate code spread across functionality associated with different log items - Cleanup resblks interfaces - Pass defer ops pointer to defer helpers instead of an enum - Initialize di_crc in xfs_log_dinode to prevent KMSAN warnings - Use static_assert() instead of BUILD_BUG_ON_MSG() to validate size of structures and structure member offsets. This is done in order to be able to share the code with userspace - Move XFS documentation under a new directory specific to XFS - Do not invoke deferred ops' ->create_done callback if the deferred operation does not have an intent item associated with it - Remove duplicate inclusion of header files from scrub/health.c - Refactor Realtime code - Cleanup attr code" * tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (123 commits) xfs: use the op name in trace_xlog_intent_recovery_failed xfs: fix a use after free in xfs_defer_finish_recovery xfs: turn the XFS_DA_OP_REPLACE checks in xfs_attr_shortform_addname into asserts xfs: remove xfs_attr_sf_hdr_t xfs: remove struct xfs_attr_shortform xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalue xfs: remove xfs_attr_shortform_lookup xfs: simplify xfs_attr_sf_findname xfs: move the xfs_attr_sf_lookup tracepoint xfs: return if_data from xfs_idata_realloc xfs: make if_data a void pointer xfs: fold xfs_rtallocate_extent into xfs_bmap_rtalloc xfs: simplify and optimize the RT allocation fallback cascade xfs: reorder the minlen and prod calculations in xfs_bmap_rtalloc xfs: remove XFS_RTMIN/XFS_RTMAX xfs: remove rt-wrappers from xfs_format.h xfs: factor out a xfs_rtalloc_sumlevel helper xfs: tidy up xfs_rtallocate_extent_exact xfs: merge the calls to xfs_rtallocate_range in xfs_rtallocate_block xfs: reflow the tail end of xfs_rtallocate_extent_block ...
2023-12-07xfs: clean up the xfs_reserve_blocks interfaceChristoph Hellwig
xfs_reserve_blocks has a very odd interface that can only be explained by it directly deriving from the IRIX fcntl handler back in the day. Split reporting out the reserved blocks out of xfs_reserve_blocks into the only caller that cares. This means that the value reported from XFS_IOC_SET_RESBLKS isn't atomically sampled in the same critical section as when it was set anymore, but as the values could change right after setting them anyway that does not matter. It does provide atomic sampling of both values for XFS_IOC_GET_RESBLKS now, though. Also pass a normal scalar integer value for the requested value instead of the pointless pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-11-18xfs: Block writes to log deviceJan Kara
Ask block layer to not allow other writers to open block devices used for xfs log and realtime devices. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231101174325.10596-6-jack@suse.cz Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18xfs: simplify device handlingChristian Brauner
We removed all codepaths where s_umount is taken beneath open_mutex and bd_holder_lock so don't make things more complicated than they need to be and hold s_umount over block device opening. Link: https://lore.kernel.org/r/20231024-vfs-super-rework-v1-2-37a8aa697148@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-08Merge tag 'xfs-6.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: - Realtime device subsystem: - Cleanup usage of xfs_rtblock_t and xfs_fsblock_t data types - Replace open coded conversions between rt blocks and rt extents with calls to static inline helpers - Replace open coded realtime geometry compuation and macros with helper functions - CPU usage optimizations for realtime allocator - Misc bug fixes associated with Realtime device - Allow read operations to execute while an FICLONE ioctl is being serviced - Misc bug fixes: - Alert user when xfs_droplink() encounters an inode with a link count of zero - Handle the case where the allocator could return zero extents when servicing an fallocate request * tag 'xfs-6.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (40 commits) xfs: allow read IO and FICLONE to run concurrently xfs: handle nimaps=0 from xfs_bmapi_write in xfs_alloc_file_space xfs: introduce protection for drop nlink xfs: don't look for end of extent further than necessary in xfs_rtallocate_extent_near() xfs: don't try redundant allocations in xfs_rtallocate_extent_near() xfs: limit maxlen based on available space in xfs_rtallocate_extent_near() xfs: return maximum free size from xfs_rtany_summary() xfs: invert the realtime summary cache xfs: simplify rt bitmap/summary block accessor functions xfs: simplify xfs_rtbuf_get calling conventions xfs: cache last bitmap block in realtime allocator xfs: use accessor functions for summary info words xfs: consolidate realtime allocation arguments xfs: create helpers for rtsummary block/wordcount computations xfs: use accessor functions for bitmap words xfs: create helpers for rtbitmap block/wordcount computations xfs: create a helper to handle logging parts of rt bitmap/summary blocks xfs: convert rt summary macros to helpers xfs: convert open-coded xfs_rtword_t pointer accesses to helper xfs: remove XFS_BLOCKWSIZE and XFS_BLOCKWMASK macros ...
2023-10-28xfs: Convert to bdev_open_by_path()Jan Kara
Convert xfs to use bdev_open_by_path() and pass the handle around. CC: "Darrick J. Wong" <djwong@kernel.org> CC: linux-xfs@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-28-jack@suse.cz Acked-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-17xfs: create a helper to convert rtextents to rtblocksDarrick J. Wong
Create a helper to convert a realtime extent to a realtime block. Later on we'll change the helper to use bit shifts when possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-09-22Merge tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Chandan Babu: - Fix an integer overflow bug when processing an fsmap call - Fix crash due to CPU hot remove event racing with filesystem mount operation - During read-only mount, XFS does not allow the contents of the log to be recovered when there are one or more unrecognized rcompat features in the primary superblock, since the log might have intent items which the kernel does not know how to process - During recovery of log intent items, XFS now reserves log space sufficient for one cycle of a permanent transaction to execute. Otherwise, this could lead to livelocks due to non-availability of log space - On an fs which has an ondisk unlinked inode list, trying to delete a file or allocating an O_TMPFILE file can cause the fs to the shutdown if the first inode in the ondisk inode list is not present in the inode cache. The bug is solved by explicitly loading the first inode in the ondisk unlinked inode list into the inode cache if it is not already cached A similar problem arises when the uncached inode is present in the middle of the ondisk unlinked inode list. This second bug is triggered when executing operations like quotacheck and bulkstat. In this case, XFS now reads in the entire ondisk unlinked inode list - Enable LARP mode only on recent v5 filesystems - Fix a out of bounds memory access in scrub - Fix a performance bug when locating the tail of the log during mounting a filesystem * tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail xfs: only call xchk_stats_merge after validating scrub inputs xfs: require a relatively recent V5 filesystem for LARP mode xfs: make inode unlinked bucket recovery work with quotacheck xfs: load uncached unlinked inodes into memory on demand xfs: reserve less log space when recovering log intent items xfs: fix log recovery when unknown rocompat bits are set xfs: reload entire unlinked bucket lists xfs: allow inode inactivation during a ro mount log recovery xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list xfs: remove CPU hotplug infrastructure xfs: remove the all-mounts list xfs: use per-mount cpumask to track nonempty percpu inodegc lists xfs: fix an agbno overflow in __xfs_getfsmap_datadev xfs: fix per-cpu CIL structure aggregation racing with dying cpus xfs: fix select in config XFS_ONLINE_SCRUB_STATS
2023-09-20Revert "xfs: switch to multigrain timestamps"Christian Brauner
This reverts commit e44df2664746aed8b6dd5245eb711a0ce33c5cf5. Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away. Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>