summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-04-17fs/ntfs3: remove atomic_openJeff Layton
atomic_open is an optional VFS operation, and is primarily for network filesystems. NFS (for instance) can just send an open call for the last path component rather than doing a lookup and then having to follow that up with an open when it doesn't have a dentry in cache. ntfs3 is a local filesystem however, and its atomic_open just does a typical lookup + open, but in a convoluted way. atomic_open will also make directory leases more difficult to implement on the filesystem. Remove ntfs_atomic_open. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-04-17fs/ntfs3: use kcalloc() instead of kzalloc()Lenko Donchev
We are trying to get rid of all multiplications from allocation functions to prevent integer overflows[1]. Here the multiplication is obviously safe, but using kcalloc() is more appropriate and improves readability. This patch has no effect on runtime behavior. Link: https://github.com/KSPP/linux/issues/162 [1] Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2] Signed-off-by: Lenko Donchev <lenko.donchev@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-04-17Merge patch series 'Fix shmem_rename2 directory offset calculation' of ↵Christian Brauner
https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org Pull shmem_rename2() offset fixes from Chuck Lever: The existing code in shmem_rename2() allocates a fresh directory offset value when renaming over an existing destination entry. User space does not expect this behavior. In particular, applications that rename while walking a directory can loop indefinitely because they never reach the end of the directory. * 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org: (3 commits) shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() fs/libfs.c | 55 +++++++++++++++++++++++++++++++++++++++++----- include/linux/fs.h | 2 ++ mm/shmem.c | 3 +-- 3 files changed, 52 insertions(+), 8 deletions(-) Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17shmem: Fix shmem_rename2()Chuck Lever
When renaming onto an existing directory entry, user space expects the replacement entry to have the same directory offset as the original one. Link: https://gitlab.alpinelinux.org/alpine/aports/-/issues/15966 Fixes: a2e459555c5f ("shmem: stable directory offsets") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-4-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17libfs: Add simple_offset_rename() APIChuck Lever
I'm about to fix a tmpfs rename bug that requires the use of internal simple_offset helpers that are not available in mm/shmem.c Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-3-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17libfs: Fix simple_offset_rename_exchange()Chuck Lever
User space expects the replacement (old) directory entry to have the same directory offset after the rename. Suggested-by: Christian Brauner <brauner@kernel.org> Fixes: a2e459555c5f ("shmem: stable directory offsets") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-2-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-17jffs2: prevent xattr node from overflowing the eraseblockIlya Denisyev
Add a check to make sure that the requested xattr node size is no larger than the eraseblock minus the cleanmarker. Unlike the usual inode nodes, the xattr nodes aren't split into parts and spread across multiple eraseblocks, which means that a xattr node must not occupy more than one eraseblock. If the requested xattr value is too large, the xattr node can spill onto the next eraseblock, overwriting the nodes and causing errors such as: jffs2: argh. node added in wrong place at 0x0000b050(2) jffs2: nextblock 0x0000a000, expected at 0000b00c jffs2: error: (823) do_verify_xattr_datum: node CRC failed at 0x01e050, read=0xfc892c93, calc=0x000000 jffs2: notice: (823) jffs2_get_inode_nodes: Node header CRC failed at 0x01e00c. {848f,2fc4,0fef511f,59a3d171} jffs2: Node at 0x0000000c with length 0x00001044 would run over the end of the erase block jffs2: Perhaps the file system was created with the wrong erase size? jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000010: 0x1044 instead This breaks the filesystem and can lead to KASAN crashes such as: BUG: KASAN: slab-out-of-bounds in jffs2_sum_add_kvec+0x125e/0x15d0 Read of size 4 at addr ffff88802c31e914 by task repro/830 CPU: 0 PID: 830 Comm: repro Not tainted 6.9.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xc6/0x120 print_report+0xc4/0x620 ? __virt_addr_valid+0x308/0x5b0 kasan_report+0xc1/0xf0 ? jffs2_sum_add_kvec+0x125e/0x15d0 ? jffs2_sum_add_kvec+0x125e/0x15d0 jffs2_sum_add_kvec+0x125e/0x15d0 jffs2_flash_direct_writev+0xa8/0xd0 jffs2_flash_writev+0x9c9/0xef0 ? __x64_sys_setxattr+0xc4/0x160 ? do_syscall_64+0x69/0x140 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e [...] Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: aa98d7cf59b5 ("[JFFS2][XATTR] XATTR support on JFFS2 (version. 5)") Signed-off-by: Ilya Denisyev <dev@elkcl.ru> Link: https://lore.kernel.org/r/20240412155357.237803-1-dev@elkcl.ru Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-16bcachefs: make sure to release last journal pin in replayKent Overstreet
This fixes a deadlock when journal replay has many keys to insert that were from fsck, not the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16bcachefs: node scan: ignore multiple nodes with same seq if interiorKent Overstreet
Interior nodes are not really needed, when we have to scan - but if this pops up for leaf nodes we'll need a real heuristic. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16bcachefs: Fix format specifier in validate_bset_keys()Nathan Chancellor
When building for 32-bit platforms, for which size_t is 'unsigned int', there is a warning from a format string in validate_bset_keys(): fs/bcachefs/btree_io.c: In function 'validate_bset_keys': fs/bcachefs/btree_io.c:891:34: error: format '%lu' expects argument of type 'long unsigned int', but argument 12 has type 'unsigned int' [-Werror=format=] 891 | "bad k->u64s %u (min %u max %lu)", k->u64s, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/bcachefs/btree_io.c:603:32: note: in definition of macro 'btree_err' 603 | msg, ##__VA_ARGS__); \ | ^~~ fs/bcachefs/btree_io.c:887:21: note: in expansion of macro 'btree_err_on' 887 | if (btree_err_on(!bkeyp_u64s_valid(&b->format, k), | ^~~~~~~~~~~~ fs/bcachefs/btree_io.c:891:64: note: format string is defined here 891 | "bad k->u64s %u (min %u max %lu)", k->u64s, | ~~^ | | | long unsigned int | %u cc1: all warnings being treated as errors BKEY_U64s is size_t so the entire expression is promoted to size_t. Use the '%zu' specifier so that there is no warning regardless of the width of size_t. Fixes: 031ad9e7dbd1 ("bcachefs: Check for packed bkeys that are too big") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404130747.wH6Dd23p-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202404131536.HdAMBOVc-lkp@intel.com/ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINEKent Overstreet
We need to initialize the stdio redirects before they're used. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16nilfs2: fix OOB in nilfs_set_de_typeJeongjun Park
The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file is defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function, which uses this array, specifies the index to read from the array in the same way as "(mode & S_IFMT) >> S_SHIFT". static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode *inode) { umode_t mode = inode->i_mode; de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob } However, when the index is determined this way, an out-of-bounds (OOB) error occurs by referring to an index that is 1 larger than the array size when the condition "mode & S_IFMT == S_IFMT" is satisfied. Therefore, a patch to resize the nilfs_type_by_mode array should be applied to prevent OOB errors. Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com Reported-by: syzbot+2e22057de05b9f3b30d8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8 Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-16Squashfs: check the inode number is not the invalid value of zeroPhillip Lougher
Syskiller has produced an out of bounds access in fill_meta_index(). That out of bounds access is ultimately caused because the inode has an inode number with the invalid value of zero, which was not checked. The reason this causes the out of bounds access is due to following sequence of events: 1. Fill_meta_index() is called to allocate (via empty_meta_index()) and fill a metadata index. It however suffers a data read error and aborts, invalidating the newly returned empty metadata index. It does this by setting the inode number of the index to zero, which means unused (zero is not a valid inode number). 2. When fill_meta_index() is subsequently called again on another read operation, locate_meta_index() returns the previous index because it matches the inode number of 0. Because this index has been returned it is expected to have been filled, and because it hasn't been, an out of bounds access is performed. This patch adds a sanity check which checks that the inode number is not zero when the inode is created and returns -EINVAL if it is. [phillip@squashfs.org.uk: whitespace fix] Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com> Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport@ubisectech.com/ Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-16dlm: use rwlock for lkbidrAlexander Aring
Convert the lock for lkbidr to an rwlock. Most idr lookups will use the read lock. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: use rwlock for rsb hash tableAlexander Aring
The conversion to rhashtable introduced a hash table lock per lockspace, in place of per bucket locks. To make this more scalable, switch to using a rwlock for hash table access. The common case fast path uses it as a read lock. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: drop dlm_scand kthread and use timersAlexander Aring
Currently the scand kthread acts like a garbage collection for expired rsbs on toss list, to clean them up after a certain timeout. It triggers every couple of seconds and iterates over the toss list while holding ls_rsbtbl_lock for the whole hash bucket iteration. To reduce the amount of time holding ls_rsbtbl_lock, we now handle the disposal of expired rsbs using a per-lockspace timer that expires for the earliest tossed rsb on the lockspace toss queue. This toss queue is ordered according to the rsb res_toss_time with the earliest tossed rsb as the first entry. The toss timer will only trylock() necessary locks, since it is low priority garbage collection, and will rearm the timer if trylock() fails. If the timer function does not find any expired rsb's, it rearms the timer with the next earliest expired rsb. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: do not use ref counts for rsb in the toss stateAlexander Aring
In the past we had problems when an rsb had a reference counter greater than one while in the toss state. An rsb in the toss state is not actively used for locking, and should not have any other references apart from the single ref keeping it on the rsb hash. Shift to freeing rsb's directly rather than using kref_put to free them, since the ref counting is not meant to be used in this state. Add warnings if ref counting is seen while an rsb is in the toss state. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: switch to use rhashtable for rsbsAlexander Aring
Replace our own hash table with the more advanced rhashtable for keeping rsb structs. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: add rsb lists for iterationAlexander Aring
To prepare for using rhashtable, add two rsb lists for iterating through rsb's in two uncommon cases where this is necesssary: - when dumping rsb state from debugfs, now using seq_list. - when looking at all rsb's during recovery. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: merge toss and keep hash table lists into one listAlexander Aring
There are several places where lock processing can perform two hash table lookups, first in the "keep" list, and if not found, in the "toss" list. This patch introduces a new rsb state flag "RSB_TOSS" to represent the difference between the state of being on keep vs toss list, so that the two lists can be combined. This avoids cases of two lookups. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: change to single hashtable lockAlexander Aring
Prepare to replace our own hash table with rhashtable by replacing the per-bucket locks in our own hash table with a single lock. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: increment ls_count for dlm_scandAlexander Aring
Increment the ls_count value while dlm_scand is processing a lockspace so that release_lockspace()/remove_lockspace() will wait for dlm_scand to finish. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16ntfs3: serve as alias for the legacy ntfs driverChristian Brauner
Johan Hovold reported that removing the legacy ntfs driver broke boot for him since his fstab uses the legacy ntfs driver to access firmware from the original Windows partition. Use ntfs3 as an alias for legacy ntfs if CONFIG_NTFS_FS is selected. This is similar to how ext3 is treated. Link: https://lore.kernel.org/r/Zf2zPf5TO5oYt3I3@hovoldconsulting.com Link: https://lore.kernel.org/r/20240325-hinkriegen-zuziehen-d7e2c490427a@brauner Fixes: 7ffa8f3d3023 ("fs: Remove NTFS classic") Tested-by: Johan Hovold <johan+linaro@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johan Hovold <johan@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-15xfs: unlock new repair tempfiles after creationDarrick J. Wong
After creation, drop the ILOCK on temporary files that have been created to stage a repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: don't pick up IOLOCK during rmapbt repair scanDarrick J. Wong
Now that we've fixed the directory operations to hold the ILOCK until they're finished with rmapbt updates for directory shape changes, we no longer need to take this lock when scanning directories for rmapbt records. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: Hold inode locks in xfs_renameAllison Henderson
Modify xfs_rename to hold all inode locks across a rename operation We will need this later when we add parent pointers Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: Hold inode locks in xfs_trans_alloc_dirAllison Henderson
Modify xfs_trans_alloc_dir to hold locks after return. Caller will be responsible for manual unlock. We will need this later to hold locks across parent pointer operations Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: Hold inode locks in xfs_iallocAllison Henderson
Modify xfs_ialloc to hold locks after return. Caller will be responsible for manual unlock. We will need this later to hold locks across parent pointer operations Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> [djwong: hold the parent ilocked across transaction rolls too] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: Increase XFS_QM_TRANS_MAXDQS to 5Allison Henderson
With parent pointers enabled, a rename operation can update up to 5 inodes: src_dp, target_dp, src_ip, target_ip and wip. This causes their dquots to a be attached to the transaction chain, so we need to increase XFS_QM_TRANS_MAXDQS. This patch also add a helper function xfs_dqlockn to lock an arbitrary number of dquots. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: Increase XFS_DEFER_OPS_NR_INODES to 5Allison Henderson
Renames that generate parent pointer updates can join up to 5 inodes locked in sorted order. So we need to increase the number of defer ops inodes and relock them in the same way. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> [djwong: have one sorting function] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: fix performance problems when fstrimming a subset of a fragmented AGDarrick J. Wong
On a 10TB filesystem where the free space in each AG is heavily fragmented, I noticed some very high runtimes on a FITRIM call for the entire filesystem. xfs_scrub likes to report progress information on each phase of the scrub, which means that a strace for the entire filesystem: ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839> shows that scrub is uncommunicative for the entire duration. Reducing the size of the FITRIM requests to a single AG at a time produces lower times for each individual call, but even this isn't quite acceptable, because the time between progress reports are still very high: Strace for the first 4x 1TB AGs looks like (2): ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033> ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323> ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226> ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744> I then had the idea to limit the length parameter of each call to a smallish amount (~11GB) so that we could report progress relatively quickly, but much to my surprise, each FITRIM call still took ~68 seconds! Unfortunately, the by-length fstrim implementation handles this poorly because it walks the entire free space by length index (cntbt), which is a very inefficient way to walk a subset of the blocks of an AG. Therefore, create a second implementation that will walk the bnobt and perform the trims in block number order. This implementation avoids the worst problems of the original code, though it lacks the desirable attribute of freeing the biggest chunks first. On the other hand, this second implementation will be much easier to constrain the system call latency, and makes it much easier to report fstrim progress to anyone who's running xfs_scrub. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com
2024-04-15xfs: create subordinate scrub contexts for xchk_metadata_inode_subtypeDarrick J. Wong
When a file-based metadata structure is being scrubbed in xchk_metadata_inode_subtype, we should create an entirely new scrub context so that each scrubber doesn't trip over another's buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: pin inodes that would otherwise overflow link countDarrick J. Wong
The VFS inc_nlink function does not explicitly check for integer overflows in the i_nlink field. Instead, it checks the link count against s_max_links in the vfs_{link,create,rename} functions. XFS sets the maximum link count to 2.1 billion, so integer overflows should not be a problem. However. It's possible that online repair could find that a file has more than four billion links, particularly if the link count got corrupted while creating hardlinks to the file. The di_nlinkv2 field is not large enough to store a value larger than 2^32, so we ought to define a magic pin value of ~0U which means that the inode never gets deleted. This will prevent a UAF error if the repair finds this situation and users begin deleting links to the file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: try to avoid allocating from sick inode clustersDarrick J. Wong
I noticed that xfs/413 and xfs/375 occasionally failed while fuzzing core.mode of an inode. The root cause of these problems is that the field we fuzzed (core.mode or core.magic, typically) causes the entire inode cluster buffer verification to fail, which affects several inodes at once. The repair process tries to create either a /lost+found or a temporary repair file, but regrettably it picks the same inode cluster that we just corrupted, with the result that repair triggers the demise of the filesystem. Try avoid this by making the inode allocation path detect when the perag health status indicates that someone has found bad inode cluster buffers, and try to read the inode cluster buffer. If the cluster buffer fails the verifiers, try another AG. This isn't foolproof and can result in premature ENOSPC, but that might be better than shutting down. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: check unused nlink fields in the ondisk inodeDarrick J. Wong
v2/v3 inodes use di_nlink and not di_onlink; and v1 inodes use di_onlink and not di_nlink. Whichever field is not in use, make sure its contents are zero, and teach xfs_scrub to fix that if it is. This clears a bunch of missing scrub failure errors in xfs/385 for core.onlink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: repair AGI unlinked inode bucket listsDarrick J. Wong
Teach the AGI repair code to rebuild the unlinked buckets and lists. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: hoist AGI repair context to a heap objectDarrick J. Wong
Save ~460 bytes of stack space by moving all the repair context to a heap object. We're going to add even more context data in the next patch, which is why we really need to do this now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: check AGI unlinked inode bucketsDarrick J. Wong
Look for corruptions in the AGI unlinked bucket chains. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: online repair of symbolic linksDarrick J. Wong
If a symbolic link target looks bad, try to sift through the rubble to find as much of the target buffer that we can, and stage a new target (short or remote format as needed) in a temporary file and use the atomic extent swapping mechanism to commit the results. In the worst case, we replace the target with an overly long filename that cannot possibly resolve. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: pass the owner to xfs_symlink_write_targetDarrick J. Wong
Require callers of xfs_symlink_write_target to pass the owner number explicitly. This sets us up for online repair to be able to write a remote symlink target to sc->tempip with sc->ip's inumber in the block heaader. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: expose xfs_bmap_local_to_extents for online repairDarrick J. Wong
Allow online repair to call xfs_bmap_local_to_extents and add a void * argument at the end so that online repair can pass its own context. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: ensure dentry consistency when the orphanage adopts a fileDarrick J. Wong
When the orphanage adopts a file, that file becomes a child of the orphanage. The dentry cache may have entries for the orphanage directory and the name we've chosen, so (1) make sure we abort if the dcache has a positive entry because something's not right; and (2) invalidate and purge negative dentries if the adoption goes through. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: move files to orphanage instead of letting nlinks drop to zeroDarrick J. Wong
If we encounter an inode with a nonzero link count but zero observed links, move it to the orphanage. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: move orphan files to the orphanageDarrick J. Wong
When we're repairing a directory structure or fixing the dotdot entry of a subdirectory, it's possible that we won't ever find a parent for the subdirectory. When this is the case, move it to the orphanage, aka /lost+found. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: ask the dentry cache if it knows the parent of a directoryDarrick J. Wong
It's possible that the dentry cache can tell us the parent of a directory. Therefore, when repairing directory dot dot entries, query the dcache as a last resort before scanning the entire filesystem. A reviewer asks: "How high is the chance that we actually have a valid dcache entry for a file in a corrupted directory?" There's a decent chance of this actually working. Say you have a 1000-block directory foo, and block 980 gets corrupted. Let's further suppose that block 0 has a correct entry for ".." and "bar". If someone accesses /mnt/foo/bar, that will cause the dcache to create a dentry from /mnt to /mnt/foo whose d_parent points back to /mnt. If you then want to rebuild the directory, XFS can obtain the parent from the dcache without needing to wander into parent pointers or scan the filesystem to find /mnt's connection to foo. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: online repair of parent pointersDarrick J. Wong
Teach the online repair code to fix parent pointers for directories. For now, this means correcting the dotdot entry of an existing directory that is otherwise consistent. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: scan the filesystem to repair a directory dotdot entryDarrick J. Wong
Teach the online directory repair code to scan the filesystem so that we can set the dotdot entry when we're rebuilding a directory. This involves dropping ILOCK on the directory that we're repairing, which means that the VFS can sneak in and tell us to update dotdot at any time. Deal with these races by using a dirent hook to absorb dotdot updates, and be careful not to check the scan results until after we've retaken the ILOCK. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: online repair of directoriesDarrick J. Wong
If a directory looks like it's in bad shape, try to sift through the rubble to find whatever directory entries we can, scan the directory tree for the parent (if needed), stage the new directory contents in a temporary file and use the atomic extent swapping mechanism to commit the results in bulk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: inactivate directory data blocksDarrick J. Wong
Teach inode inactivation to delete all the incore buffers backing a directory. In normal runtime this should never happen because the VFS forbids rmdir on a non-empty directory. In the next patch, online directory repair stands up a new directory, exchanges it with the broken directory, and then drops the private temporary directory. If we cancel the repair just prior to exchanging the directory contents, the new directory will need to be torn down. Note: If we commit the repair, reaping will take care of all the ondisk space allocations and incore buffers for the old corrupt directory. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: update the unlinked list when repairing link countsDarrick J. Wong
When we're repairing the link counts of a file, we must ensure either that the file has zero link count and is on the unlinked list; or that it has nonzero link count and is not on the unlinked list. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>