summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
4 daysMerge tag 'xfs-merge-6.18' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: "For this merge window, there are really no new features, but there are a few things worth to emphasize: - Deprecated for years already, the (no)attr2 and (no)ikeep mount options have been removed for good - Several cleanups (specially from typedefs) and bug fixes - Improvements made in the online repair reap calculations - online fsck is now enabled by default" * tag 'xfs-merge-6.18' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits) xfs: rework datasync tracking and execution xfs: rearrange code in xfs_inode_item_precommit xfs: scrub: use kstrdup_const() for metapath scan setups xfs: use bt_nr_sectors in xfs_dax_translate_range xfs: track the number of blocks in each buftarg xfs: constify xfs_errortag_random_default xfs: improve default maximum number of open zones xfs: improve zone statistics message xfs: centralize error tag definitions xfs: remove pointless externs in xfs_error.h xfs: remove the expr argument to XFS_TEST_ERROR xfs: remove xfs_errortag_set xfs: remove xfs_errortag_get xfs: move the XLOG_REG_ constants out of xfs_log_format.h xfs: adjust the hint based zone allocation policy xfs: refactor hint based zone allocation fs: add an enum for number of life time hints xfs: fix log CRC mismatches between i386 and other architectures xfs: rename the old_crc variable in xlog_recover_process xfs: remove the unused xfs_log_iovec_t typedef ...
4 daysMerge tag 'vfs-6.18-rc1.workqueue' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
4 daysMerge tag 'vfs-6.18-rc1.inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
4 daysMerge tag 'vfs-6.18-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
10 daysxfs: rework datasync tracking and executionDave Chinner
Jan Kara reported that the shared ILOCK held across the journal flush during fdatasync operations slows down O_DSYNC DIO on unwritten extents significantly. The underlying issue is that unwritten extent conversion needs the ILOCK exclusive, whilst the datasync operation after the extent conversion holds it shared. Hence we cannot be flushing the journal for one IO completion whilst at the same time doing unwritten extent conversion on another IO completion on the same inode. This means that IO completions lock-step, and IO performance is dependent on the journal flush latency. Jan demonstrated that reducing the ifdatasync lock hold time can improve O_DSYNC DIO to unwritten extents performance by 2.5x. Discussion on that patch found issues with the method, and we came to the conclusion that separately tracking datasync flush sequences was the best approach to solving the problem. The fsync code uses the ILOCK to serialise against concurrent modifications in the transaction commit phase. In a transaction commit, there are several disjoint updates to inode log item state that need to be considered atomically by the fsync code. These operations are all done under ILOCK_EXCL context: 1. ili_fsync_flags is updated in ->iop_precommit 2. i_pincount is updated in ->iop_pin before it is added to the CIL 3. ili_commit_seq is updated in ->iop_committing, after it has been added to the CIL In fsync, we need to: 1. check that the inode is dirty in the journal (ipincount) 2. check that ili_fsync_flags is set 3. grab the ili_commit_seq if a journal flush is needed 4. clear the ili_fsync_flags to ensure that new modifications that require fsync are tracked in ->iop_precommit correctly The serialisation of ipincount/ili_commit_seq is needed to ensure that we don't try to unnecessarily flush the journal. The serialisation of ili_fsync_flags being set in ->iop_precommit and cleared in fsync post journal flush is required for correctness. Hence holding the ILOCK_SHARED in xfs_file_fsync() performs all this serialisation for us. Ideally, we want to remove the need to hold the ILOCK_SHARED in xfs_file_fsync() for best performance. We start with the observation that fsync/fdatasync() only need to wait for operations that have been completed. Hence operations that are still being committed have not completed and datasync operations do not need to wait for them. This means we can use a single point in time in the commit process to signal "this modification is complete". This is what ->iop_committing is supposed to provide - it is the point at which the object is unlocked after the modification has been recorded in the CIL. Hence we could use ili_commit_seq to determine if we should flush the journal. In theory, we can already do this. However, in practice this will expose an internal global CIL lock to the IO path. The ipincount() checks optimise away the need to take this lock - if the inode is not pinned, then it is not in the CIL and we don't need to check if a journal flush at ili_commit_seq needs to be performed. The reason this is needed is that the ili_commit_seq is never cleared. Once it is set, it remains set even once the journal has been committed and the object has been unpinned. Hence we have to look that journal internal commit sequence state to determine if ili_commit_seq needs to be acted on or not. We can solve this by clearing ili_commit_seq when the inode is unpinned. If we clear it atomically with the last unpin going away, then we are guaranteed that new modifications will order correctly as they add a new pin counts and we won't clear a sequence number for an active modification in the CIL. Further, we can then allow the per-transaction flag state to propagate into ->iop_committing (instead of clearing it in ->iop_precommit) and that will allow us to determine if the modification needs a full fsync or just a datasync, and so we can record a separate datasync sequence number (Jan's idea!) and then use that in the fdatasync path instead of the full fsync sequence number. With this infrastructure in place, we no longer need the ILOCK_SHARED in the fsync path. All serialisation is done against the commit sequence numbers - if the sequence number is set, then we have to flush the journal. If it is not set, then we have nothing to do. This greatly simplifies the fsync implementation.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
10 daysxfs: rearrange code in xfs_inode_item_precommitDave Chinner
There are similar extsize checks and updates done inside and outside the inode item lock, which could all be done under a single top level logic branch outside the ili_lock. The COW extsize fixup can potentially miss updating the XFS_ILOG_CORE in ili_fsync_fields, so moving this code up above the ili_fsync_fields update could also be considered a fix. Further, to make the next change a bit cleaner, move where we calculate the on-disk flag mask to after we attach the cluster buffer to the the inode log item. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
11 daysxfs: scrub: use kstrdup_const() for metapath scan setupsDmitry Antipov
Except 'xchk_setup_metapath_rtginode()' case, 'path' argument of 'xchk_setup_metapath_scan()' is a compile-time constant. So it may be reasonable to use 'kstrdup_const()' / 'kree_const()' to manage 'path' field of 'struct xchk_metapath' in attempt to reuse .rodata instance rather than making a copy. Compile tested only. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
11 daysxfs: use bt_nr_sectors in xfs_dax_translate_rangeChristoph Hellwig
Only ranges inside the file system can be translated, and the file system can be smaller than the containing device. Fixes: f4ed93037966 ("xfs: don't shut down the filesystem for media failures beyond end of log") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
11 daysxfs: track the number of blocks in each buftargChristoph Hellwig
Add a bt_nr_sectors to track the number of sector in each buftarg, and replace the check that hard codes sb_dblock in xfs_buf_map_verify with this new value so that it is correct for non-ddev buftargs. The RT buftarg only has a superblock in the first block, so it is unlikely to trigger this, or are we likely to ever have enough blocks in the in-memory buftargs, but we might as well get the check right. Fixes: 10616b806d1d ("xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
14 daysxfs: constify xfs_errortag_random_defaultChristoph Hellwig
This table is never modified, so mark it const. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-19fs: WQ_PERCPU added to alloc_workqueue usersMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch adds a new WQ_PERCPU flag to all the fs subsystem users to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/20250916082906.77439-4-marco.crivellari@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-18xfs: improve default maximum number of open zonesDamien Le Moal
For regular block devices using the zoned allocator, the default maximum number of open zones is set to 1/4 of the number of realtime groups. For a large capacity device, this leads to a very large limit. E.g. with a 26 TB HDD: mount /dev/sdb /mnt ... XFS (sdb): 95836 zones of 65536 blocks size (23959 max open) In turn such large limit on the number of open zones can lead, depending on the workload, on a very large number of concurrent write streams which devices generally do not handle well, leading to poor performance. Introduce the default limit XFS_DEFAULT_MAX_OPEN_ZONES, defined as 128 to match the hardware limit of most SMR HDDs available today, and use this limit to set mp->m_max_open_zones in xfs_calc_open_zones() instead of calling xfs_max_open_zones(), when the user did not specify a limit with the max_open_zones mount option. For the 26 TB HDD example, we now get: mount /dev/sdb /mnt ... XFS (sdb): 95836 zones of 65536 blocks (128 max open zones) This change does not prevent the user from specifying a lareger number for the open zones limit. E.g. mount -o max_open_zones=4096 /dev/sdb /mnt ... XFS (sdb): 95836 zones of 65536 blocks (4096 max open zones) Finally, since xfs_calc_open_zones() checks and caps the mp->m_max_open_zones limit against the value calculated by xfs_max_open_zones() for any type of device, this new default limit does not increase m_max_open_zones for small capacity devices. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-18xfs: improve zone statistics messageDamien Le Moal
Reword the information message displayed in xfs_mount_zones() indicating the total zone count and maximum number of open zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-18xfs: centralize error tag definitionsChristoph Hellwig
Right now 5 places in the kernel and one in xfsprogs need to be updated for each new error tag. Add a bit of macro magic so that only the error tag definition and a single table, which reside next to each other, need to be updated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-18xfs: remove pointless externs in xfs_error.hChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-18xfs: remove the expr argument to XFS_TEST_ERRORChristoph Hellwig
Don't pass expr to XFS_TEST_ERROR. Most calls pass a constant false, and the places that do pass an expression become cleaner by moving it out. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-18xfs: remove xfs_errortag_setChristoph Hellwig
xfs_errortag_set is only called by xfs_errortag_attr_store, , which does not need to validate the error tag, because it can only be called on valid error tags that had a sysfs attribute registered. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-18xfs: remove xfs_errortag_getChristoph Hellwig
xfs_errortag_get is only called by xfs_errortag_attr_show, which does not need to validate the error tag, because it can only be called on valid error tags that had a sysfs attribute registered. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: move the XLOG_REG_ constants out of xfs_log_format.hChristoph Hellwig
These are purely in-memory values and not used at all in xfsprogs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: adjust the hint based zone allocation policyHans Holmberg
As we really can't make any general assumptions about files that don't have any life time hint set or are set to "NONE", adjust the allocation policy to avoid co-locating data from those files with files with a set life time. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: refactor hint based zone allocationHans Holmberg
Replace the co-location code with a matrix that makes it more clear on how the decisions are made. The matrix contains scores for zone/file hint combinations. A "GOOD" score for an open zone will result in immediate co-location while "OK" combinations will only be picked if we cannot open a new zone. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: fix log CRC mismatches between i386 and other architecturesChristoph Hellwig
When mounting file systems with a log that was dirtied on i386 on other architectures or vice versa, log recovery is unhappy: [ 11.068052] XFS (vdb): Torn write (CRC failure) detected at log block 0x2. Truncating head block from 0xc. This is because the CRCs generated by i386 and other architectures always diff. The reason for that is that sizeof(struct xlog_rec_header) returns different values for i386 vs the rest (324 vs 328), because the struct is not sizeof(uint64_t) aligned, and i386 has odd struct size alignment rules. This issue goes back to commit 13cdc853c519 ("Add log versioning, and new super block field for the log stripe") in the xfs-import tree, which adds log v2 support and the h_size field that causes the unaligned size. At that time it only mattered for the crude debug only log header checksum, but with commit 0e446be44806 ("xfs: add CRC checks to the log") it became a real issue for v5 file system, because now there is a proper CRC, and regular builds actually expect it match. Fix this by allowing checksums with and without the padding. Fixes: 0e446be44806 ("xfs: add CRC checks to the log") Cc: <stable@vger.kernel.org> # v3.8 Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: rename the old_crc variable in xlog_recover_processChristoph Hellwig
old_crc is a very misleading name. Rename it to expected_crc as that described the usage much better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the unused xfs_log_iovec_t typedefChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the unused xfs_qoff_logformat_t typedefChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the unused xfs_dq_logformat_t typedefChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the unused xfs_buf_log_format_t typedefChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the unused xfs_efd_log_format_64_t typedefChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the unused xfs_efd_log_format_32_t typedefChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xfs_efd_log_format_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xfs_efi_log_format_64_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xfs_efi_log_format_32_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xfs_efi_log_format_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xfs_extent64_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xfs_extent32_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xfs_extent_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Also fix up the comment about the struct xfs_extent definition to be correct and read more easily. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xfs_trans_header_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: remove the xlog_op_header_t typedefChristoph Hellwig
There are almost no users of the typedef left, kill it and switch the remaining users to use the underlying struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-15fs: rename generic_delete_inode() and generic_drop_inode()Mateusz Guzik
generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the suffix. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-05Merge tag 'kconfig-2025-changes_2025-09-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.18-merge xfs: kconfig and feature changes for 2025 LTS [6.18 v2 2/2] Ahead of the 2025 LTS kernel, disable by default the two features that we promised to turn off in September 2025: V4 filesystems, and the long-broken ASCII case insensitive directories. Since online fsck has not had any major issues in the 16 months since it was merged upstream, let's also turn that on by default. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-05Merge tag 'fix-scrub-reap-calculations_2025-09-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.18-merge xfs: improve online repair reap calculations [6.18 v2 1/2] A few months ago, the multi-fsblock untorn writes patchset added a bunch of log intent item helper functions to estimate the number of intent items that could be added to a particular transaction. Those helpers enabled us to compute a safe upper bound on the number of blocks that could be written in an untorn fashion with filesystem-provided out of place writes. Currently, the online fsck code employs static limits on the number of intent items that it's willing to accrue to a single transaction when it's trying to reap what it thinks are the old blocks from a corrupt structure. There have been no problems reported with this approach after years of testing, but static limits are scary and gross because overestimating the intent item limit could result in transaction overflows and dead filesystems; and underestimating causes unnecessary overhead. This series uses the new log intent item size helpers to estimate the limits dynamically based on worst-case per-block repair work vs. the size of the scrub transaction. After several months of testing this, there don't seem to be any problems here either. v2: rearrange patches, add review tags This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-05xfs: enable online fsck by default in KconfigDarrick J. Wong
Online fsck has been a part of upstream for over a year now without any serious problems. Turn it on by default in time for the 2025 LTS kernel, and get rid of the "say N if unsure" messages for the default Y options. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2025-09-05xfs: use deferred reaping for data device cow extentsDarrick J. Wong
Don't roll the whole transaction after every extent, that's rather inefficient. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-09-05xfs: remove deprecated sysctl knobsDarrick J. Wong
These sysctl knobs were scheduled for removal in September 2025. That time has come, so remove them. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2025-09-05xfs: remove static reap limits from repair.hDarrick J. Wong
Delete XREAP_MAX_BINVAL and XREAP_MAX_DEFER_CHAIN because the reap code now calculates those limits dynamically, so they're no longer needed. Move the third limit (XREP_MAX_ITRUNCATE_EFIS) to the one file that uses it. Note that the btree rebuilding code should reserve exactly the number of blocks needed to rebuild a btree, so it is rare that the newbt code will need to add any EFIs to the commit transaction. That's why that static limit remains. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-09-05xfs: remove deprecated mount optionsDarrick J. Wong
These four mount options were scheduled for removal in September 2025, so remove them now. Cc: preichl@redhat.com Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2025-09-05xfs: disable deprecated features by default in KconfigDarrick J. Wong
We promised to turn off these old features by default in September 2025. Do so now. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2025-09-05xfs: compute file mapping reap limits dynamicallyDarrick J. Wong
Reaping file fork mappings is a little different -- log recovery can free the blocks for us, so we only try to process a single mapping at a time. Therefore, we only need to figure out the maximum number of blocks that we can invalidate in a single transaction. The rough calculation here is: nr_extents = (logres - reservation used by any one step) / (space used per binval) Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-09-05xfs: compute realtime device CoW staging extent reap limits dynamicallyDarrick J. Wong
Calculate the maximum number of CoW staging extents that can be reaped in a single transaction chain. The rough calculation here is: nr_extents = (logres - reservation used by any one step) / (space used by intents per extent) Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-09-05xfs: compute data device CoW staging extent reap limits dynamicallyDarrick J. Wong
Calculate the maximum number of CoW staging extents that can be reaped in a single transaction chain. The rough calculation here is: nr_extents = (logres - reservation used by any one step) / (space used by intents per extent + space used for a few buffer invalidations) Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>