Age  Commit message  Author
2021-04-09  xfs: drop submit side trans alloc for append ioends  (Brian Foster)
Per-inode ioend completion batching has a log reservation deadlock vector between preallocated append transactions and transactions that are acquired at completion time for other purposes (i.e., unwritten extent conversion or COW fork remaps). For example, if the ioend completion workqueue task executes on a batch of ioends that are sorted such that an append ioend sits at the tail, it's possible for the outstanding append transaction reservation to block allocation of transactions required to process preceding ioends in the list. Append ioend completion is historically the common path for on-disk inode size updates. While file extending writes may have completed sometime earlier, the on-disk inode size is only updated after successful writeback completion. These transactions are preallocated serially from writeback context to mitigate concurrency and associated log reservation pressure across completions processed by multi-threaded workqueue tasks. However, now that delalloc blocks unconditionally map to unwritten extents at physical block allocation time, size updates via append ioends are relatively rare. This means that inode size updates most commonly occur as part of the preexisting completion time transaction to convert unwritten extents. As a result, there is no longer a strong need to preallocate size update transactions. Remove the preallocation of inode size update transactions to avoid the ioend completion processing log reservation deadlock. Instead, continue to send all potential size extending ioends to workqueue context for completion and allocate the transaction from that context. This ensures that no outstanding log reservation is owned by the ioend completion worker task when it begins to process ioends. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
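As an aside for readers unfamiliar with the ordering problem: the deadlock arises because a pre-acquired reservation is held by the tail item of a batch while earlier items in the same batch must acquire reservations from the same bounded pool. A minimal userspace sketch of why completion-time allocation avoids this (all names below are hypothetical stand-ins, not XFS code):

/*
 * Illustrative sketch only; "log_space", "reserve" and "ioend" are
 * hypothetical stand-ins, not XFS structures or functions.
 */
#include <stdbool.h>
#include <stdio.h>

static int log_space = 1;   /* pretend only one reservation fits in the log */

static bool reserve(void)
{
    if (log_space <= 0)
        return false;
    log_space--;
    return true;
}

static void release(void)
{
    log_space++;
}

struct ioend { const char *kind; };

/*
 * Completion-time allocation (the new scheme): the worker owns no
 * reservation when it starts, so every item in the batch can make progress.
 */
static void complete_batch(const struct ioend *batch, int n)
{
    for (int i = 0; i < n; i++) {
        if (!reserve()) {
            /* the kernel would sleep here waiting for log space */
            printf("blocked completing %s\n", batch[i].kind);
            return;
        }
        printf("completed %s\n", batch[i].kind);
        release();
    }
}

int main(void)
{
    const struct ioend batch[] = {
        { "unwritten conversion" },
        { "append (on-disk size update)" },
    };

    /*
     * Old scheme: the append ioend at the tail of the batch would already
     * hold the only reservation here, so the first item could never
     * reserve() and the worker would deadlock against itself.
     */
    complete_batch(batch, 2);
    return 0;
}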
2021-04-09  xfs: fix return of uninitialized value in variable error  (Colin Ian King)
A previous commit removed a call to xfs_attr3_leaf_read that assigned an error return code to variable error. We now have a few early error return paths to label 'out' that return error if error is set; however, error is now uninitialized, so potentially garbage is being returned. Fix this by setting error to zero to restore the original behaviour, where error was zero at the label 'restart'. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
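For readers unfamiliar with the pattern, here is a minimal sketch of the bug and the fix (the function, label, and error values are illustrative, not the actual xfs_attr code paths):

#include <errno.h>
#include <stdio.h>

static int do_attr_op(int fail_early)
{
    int error = 0;   /* the fix: without this initialization, the success
                      * path below would return stack garbage */

    if (fail_early) {
        error = -ENOSPC;
        goto out;
    }
    /* ... work that only sets 'error' on failure ... */
out:
    return error;
}

int main(void)
{
    printf("%d\n", do_attr_op(0));   /* prints 0, not garbage */
    return 0;
}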
2021-04-09  xfs: get rid of the ip parameter to xchk_setup_*  (Darrick J. Wong)
Now that the scrub context stores a pointer to the file that was used to invoke the scrub call, the struct xfs_inode pointer that we passed to all the setup functions is no longer necessary. This is only ever used if the caller wants us to scrub the metadata of the open file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-04-09  xfs: fix scrub and remount-ro protection when running scrub  (Darrick J. Wong)
While running a new fstest that races a readonly remount with scrub running in repair mode, I observed the kernel tripping over debugging assertions in the log quiesce code that were checking that the CIL was empty. When the sysadmin runs scrub in repair mode, the scrub code allocates real transactions (with reservations) to change things, but doesn't increment the superblock writers count to block a readonly remount attempt while it is running. We don't require the userspace caller to have a writable file descriptor to run repairs, so we have to call mnt_want_write_file to obtain freeze protection and increment the writers count. It's ok to remove the call to sb_start_write for the dry-run case because commit 8321ddb2fa29 removed the behavior where scrub and fsfreeze fight over the buffer LRU. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-04-07  xfs: move the check for post-EOF mappings into xfs_can_free_eofblocks  (Darrick J. Wong)
Fix the weird split of responsibilities between xfs_can_free_eofblocks and xfs_free_eofblocks by moving the chunk of code that looks for any actual post-EOF space mappings from the second function into the first. This clears the way for deferred inode inactivation to be able to decide if an inode needs inactivation work before committing the released inode to the inactivation code paths (vs. marking it for reclaim). Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-04-07  xfs: move the xfs_can_free_eofblocks call under the IOLOCK  (Darrick J. Wong)
In xfs_inode_free_eofblocks, move the xfs_can_free_eofblocks call further down in the function to the point where we have taken the IOLOCK. This is preparation for the next patch, where we will need that lock (or equivalent) so that we can check if there are any post-eof blocks to clean out. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-04-07  xfs: precalculate default inode attribute offset  (Dave Chinner)
Default attr fork offset is based on inode size, so is a fixed geometry parameter of the inode. Move it to the xfs_ino_geometry structure and stop calculating it on every call to xfs_default_attroffset(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-04-07  xfs: default attr fork size does not handle device inodes  (Dave Chinner)
Device inodes have a non-default data fork size of 8 bytes as checked/enforced by xfs_repair. xfs_default_attroffset() doesn't handle this, so let's do a minor refactor so it does. Fixes: e6a688c33238 ("xfs: initialise attr fork on inode create") Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-04-07  xfs: inode fork allocation depends on XFS_IFEXTENT flag  (Dave Chinner)
Due to confusion on when the XFS_IFEXTENT flag needs to be set, the changes in e6a688c33238 ("xfs: initialise attr fork on inode create") failed to set the flag when initialising the empty attribute fork at inode creation. Set this flag the same way xfs_bmap_add_attrfork() does after attr fork allocation. Fixes: e6a688c33238 ("xfs: initialise attr fork on inode create") Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-04-07  xfs: eager inode attr fork init needs attr feature awareness  (Dave Chinner)
These are the pitfalls of regression testing on a machine without realising that selinux was disabled. Only set the attr fork during inode allocation if the attr feature bits are already set on the superblock. Fixes: e6a688c33238 ("xfs: initialise attr fork on inode create") Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-04-07  xfs: scrub: Disable check for unoptimized data fork bmbt node  (Chandan Babu R)
xchk_btree_check_minrecs() checks if the contents of the immediate child of a bmbt root block can fit within the root block. This check could fail on inodes with an attr fork since xfs_bmap_add_attrfork_btree() used to demote the current root node of the data fork as the child of a newly allocated root node if it found that the size of "struct xfs_btree_block" along with the space required for records exceeded the space available in the data fork. xfs_bmap_add_attrfork_btree() should have used "struct xfs_bmdr_block" instead of "struct xfs_btree_block" for the above-mentioned space requirement calculation. This commit disables the check for unoptimized (in terms of disk space usage) data fork bmbt trees since there could be filesystems in use that already have such a layout. Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-04-07  xfs: Use struct xfs_bmdr_block instead of struct xfs_btree_block to calculate root node size  (Chandan Babu R)
The incore data fork of an inode stores the bmap btree root node as 'struct xfs_btree_block'. However, the ondisk version of the inode stores the bmap btree root node as a 'struct xfs_bmdr_block'. xfs_bmap_add_attrfork_btree() checks if the btree root node fits inside the data fork of the inode. However, it incorrectly uses 'struct xfs_btree_block' to compute the size of the bmap btree root node. Since the size of 'struct xfs_btree_block' is larger than that of 'struct xfs_bmdr_block', xfs_bmap_add_attrfork_btree() could end up unnecessarily demoting the current root node as the child of a newly allocated root node. This commit optimizes space usage by modifying xfs_bmap_add_attrfork_btree() to use 'struct xfs_bmdr_block' to check if the bmap btree root node fits inside the data fork of the inode. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
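A simplified illustration of why the header choice matters for the fits-in-the-fork check (the struct layouts below are deliberately cut down and are not the real on-disk definitions):

#include <stdint.h>
#include <stdio.h>

struct bmdr_block_hdr {      /* on-disk style root header: level + numrecs */
    uint16_t bb_level;
    uint16_t bb_numrecs;
};

struct btree_block_hdr {     /* in-core style header: adds magic, siblings, ... */
    uint32_t bb_magic;
    uint16_t bb_level;
    uint16_t bb_numrecs;
    uint64_t bb_leftsib;
    uint64_t bb_rightsib;
};

/* space a root with nrecs records would need given a header size */
static size_t root_space(size_t hdr, int nrecs, size_t recsize)
{
    return hdr + (size_t)nrecs * recsize;
}

int main(void)
{
    size_t fork_avail = 110;   /* hypothetical data fork space */
    int nrecs = 6;
    size_t recsize = 16;       /* key/ptr pair size, illustrative */

    /* the smaller on-disk header says the root fits ... */
    printf("on-disk hdr: need %zu (fits: %d)\n",
           root_space(sizeof(struct bmdr_block_hdr), nrecs, recsize),
           root_space(sizeof(struct bmdr_block_hdr), nrecs, recsize) <= fork_avail);

    /* ... while sizing with the larger in-core header would force an
     * unnecessary extra btree level */
    printf("in-core hdr: need %zu (fits: %d)\n",
           root_space(sizeof(struct btree_block_hdr), nrecs, recsize),
           root_space(sizeof(struct btree_block_hdr), nrecs, recsize) <= fork_avail);
    return 0;
}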
2021-04-07  xfs: deprecate BMV_IF_NO_DMAPI_READ flag  (Anthony Iliopoulos)
Use of the flag has had no effect since kernel commit 288699fecaff ("xfs: drop dmapi hooks"), which removed all dmapi related code, so deprecate it. Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: merge _xfs_dic2xflags into xfs_ip2xflags  (Christoph Hellwig)
Merge _xfs_dic2xflags into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_crtime field to struct xfs_inode  (Christoph Hellwig)
Move the crtime field from struct xfs_icdinode into struct xfs_inode and remove the now entirely unused struct xfs_icdinode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_flags2 field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the flags2 field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_flags field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the flags field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_forkoff field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the forkoff field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: use a union for i_cowextsize and i_flushiter  (Christoph Hellwig)
The i_cowextsize field is only used for v3 inodes, and the i_flushiter field is only used for v1/v2 inodes. Use a union to pack the inode a little better after adding a few missing guards around their usage. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
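A small sketch of the packing idea, with made-up field widths (not the real xfs_inode layout): two fields that are valid only for mutually exclusive inode versions can share storage.

#include <stdint.h>
#include <stdio.h>

struct inode_before {
    uint32_t i_cowextsize;   /* v3 only */
    uint16_t i_flushiter;    /* v1/v2 only */
    /* ... other fields ... */
};

struct inode_after {
    union {
        uint32_t i_cowextsize;   /* v3 inodes */
        uint16_t i_flushiter;    /* v1/v2 inodes */
    };
    /* ... other fields ... */
};

int main(void)
{
    /* the union variant drops the space for the unused field */
    printf("before: %zu bytes, after: %zu bytes\n",
           sizeof(struct inode_before), sizeof(struct inode_after));
    return 0;
}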
2021-04-07  xfs: use XFS_B_TO_FSB in xfs_ioctl_setattr  (Christoph Hellwig)
Clean up xfs_ioctl_setattr a bit by using XFS_B_TO_FSB. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
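As a rough sketch of what a bytes-to-filesystem-blocks conversion does (the real XFS_B_TO_FSB works from the superblock's block size log rather than a plain division; the helper below is hypothetical):

#include <stdint.h>
#include <stdio.h>

/* round a byte count up to whole filesystem blocks */
static inline uint64_t b_to_fsb(uint64_t bytes, uint32_t blocksize)
{
    return (bytes + blocksize - 1) / blocksize;
}

int main(void)
{
    /* 4097 bytes on a 4k-block filesystem occupies 2 blocks */
    printf("%llu\n", (unsigned long long)b_to_fsb(4097, 4096));
    return 0;
}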
2021-04-07  xfs: cleanup xfs_fill_fsxattr  (Christoph Hellwig)
Add a local xfs_mount variable, and use the XFS_FSB_TO_B helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_flushiter field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the flushiter field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_cowextsize field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the cowextsize field into the containing xfs_inode structure. Also switch to using xfs_extlen_t instead of a uint32_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_extsize field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the extsize field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_nblocks field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the nblocks field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_size field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the on-disk size field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: move the di_projid field to struct xfs_inode  (Christoph Hellwig)
In preparation of removing the historic icinode struct, move the projid field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: don't clear the "dinode core" in xfs_inode_alloc  (Christoph Hellwig)
The xfs_icdinode structure just contains a random mix of inode fields, which are all read from the on-disk inode and mostly not looked at before reading the inode or initializing a new inode cluster. The only exceptions are the forkoff and blocks fields, which are used in sanity checks for freshly allocated inodes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: remove the di_dmevmask and di_dmstate fields from struct xfs_icdinode  (Christoph Hellwig)
The legacy DMAPI fields were never set by upstream Linux XFS, and have no way to be read using the kernel APIs. So instead of bloating the in-core inode for them, just copy them from the on-disk inode into the log when logging the inode. The only caveat is that we need to make sure to zero the fields for newly read or deleted inodes, which is solved using a new flag in the inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: remove the unused xfs_icdinode_has_bigtime helper  (Christoph Hellwig)
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: handle crtime more carefully in xfs_bulkstat_one_int  (Christoph Hellwig)
The crtime only exists for v5 inodes, so only copy it for those. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: consistently initialize di_flags2  (Christoph Hellwig)
Make sure di_flags2 is always initialized. We currently get this implicitly by clearing the dinode core on allocating the in-core inode, but that is about to go away. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: split xfs_imap_to_bp  (Christoph Hellwig)
Split looking up the dinode from xfs_imap_to_bp, which can be significantly simplified as a result. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: scrub: Remove incorrect check executed on block format directories  (Chandan Babu R)
A directory with one directory block which in turn consists of two or more fs blocks is incorrectly flagged as corrupt by scrub, since it assumes that "Block" format directories have a single data fork extent spanning the file offset range [0, Dir block size - 1]. This commit fixes the bug by removing the incorrect check. Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07  xfs: Initialize xfs_alloc_arg->total correctly when allocating minlen extents  (Chandan Babu R)
xfs/538 can cause the following call trace to be printed when executing on a multi-block directory configuration:

WARNING: CPU: 1 PID: 2578 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0x520/0x5d0
Call Trace:
 ? xfs_buf_rele+0x4f/0x450
 xfs_bmap_add_extent_hole_real+0x747/0x960
 xfs_bmapi_allocate+0x39a/0x440
 xfs_bmapi_write+0x507/0x9e0
 xfs_da_grow_inode_int+0x1cd/0x330
 ? up+0x12/0x60
 xfs_dir2_grow_inode+0x62/0x110
 ? xfs_trans_log_inode+0x234/0x2d0
 xfs_dir2_sf_to_block+0x103/0x940
 ? xfs_dir2_sf_check+0x8c/0x210
 ? xfs_da_compname+0x19/0x30
 ? xfs_dir2_sf_lookup+0xd0/0x3d0
 xfs_dir2_sf_addname+0x10d/0x910
 xfs_dir_createname+0x1ad/0x210
 xfs_create+0x404/0x620
 xfs_generic_create+0x24c/0x320
 path_openat+0xda6/0x1030
 do_filp_open+0x88/0x130
 ? kmem_cache_alloc+0x50/0x210
 ? __cond_resched+0x16/0x40
 ? kmem_cache_alloc+0x50/0x210
 do_sys_openat2+0x97/0x150
 __x64_sys_creat+0x49/0x70
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae

This occurs because xfs_bmap_exact_minlen_extent_alloc() initializes xfs_alloc_arg->total to xfs_bmalloca->minlen. In the context of xfs_bmap_exact_minlen_extent_alloc(), xfs_bmalloca->minlen has a value of 1 and hence the space allocator could choose an AG which has less than xfs_bmalloca->total number of free blocks available. As the transaction proceeds, one of the future space allocation requests could fail due to non-availability of free blocks in the AG that was originally chosen. This commit fixes the bug by assigning xfs_alloc_arg->total the value of xfs_bmalloca->total.

Fixes: 301519674699 ("xfs: Introduce error injection to allocate only minlen size extents for files")
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-04-07  xfs: Fix dax inode extent calculation when direct write is performed on an unwritten extent  (Chandan Babu R)
With dax enabled filesystems, a direct write operation into an existing unwritten extent results in xfs_iomap_write_direct() zero-ing and converting the extent into a normal extent before the actual data is copied from the userspace buffer. The inode extent count can increase by 2 if the extent range being written to maps to the middle of the existing unwritten extent range. Hence this commit uses XFS_IEXT_WRITE_UNWRITTEN_CNT as the extent count delta when such a write operation is being performed. Fixes: 727e1acd297c ("xfs: Check for extent overflow when trivally adding a new extent") Reported-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
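A worked example of the worst case described above (plain illustration, not XFS code): converting a sub-range in the middle of a single unwritten extent splits it into three extents, so the inode extent count can grow by two.

#include <stdio.h>

int main(void)
{
    /* one unwritten extent covering file blocks [0, 99] */
    int extents_before = 1;

    /* a direct write converts blocks [40, 59] to written, leaving
     * [0,39] unwritten | [40,59] written | [60,99] unwritten */
    int extents_after = 3;

    printf("extent count delta: %d\n", extents_after - extents_before);  /* 2 */
    return 0;
}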
2021-03-25  xfs: fix xfs_trans slab cache name  (Anthony Iliopoulos)
Removal of kmem_zone_init wrappers accidentally changed a slab cache name from "xfs_trans" to "xf_trans". Fix this so that userspace consumers of /proc/slabinfo and /sys/kernel/slab can find it again. Fixes: b1231760e443 ("xfs: Remove slab init wrappers") Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: add error injection for per-AG resv failure  (Gao Xiang)
Per-AG reservation failure after fixing up free space is hard to test effectively, so add an error injection path directly in order to observe that such an error handling path works as expected. Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: support shrinking unused space in the last AG  (Gao Xiang)
As the first step of shrinking, this attempts to enable shrinking unused space in the last allocation group by fixing up the freespace btree, agi and agf, adjusting the superblock, and using a helper, xfs_ag_shrink_space(), to fix up the last AG. This can all be done in one transaction for now, so I think no additional protection is needed. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: introduce xfs_ag_shrink_space()  (Gao Xiang)
This patch introduces a helper to shrink unused space in the last AG by fixing up the freespace btree. Also make sure that the per-AG reservation works under the new AG size. If such a per-AG reservation or extent allocation fails, roll the transaction so the new transaction can be cancelled without any side effects. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: hoist out xfs_resizefs_init_new_ags()  (Gao Xiang)
Move the related logic for initializing newly added AGs out to a new helper in preparation for shrinking. No logic changes. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: update lazy sb counters immediately for resizefs  (Gao Xiang)
sb_fdblocks will be updated lazily if lazysbcount is enabled; therefore, when shrinking the filesystem, sb_fdblocks could be larger than sb_dblocks and xfs_validate_sb_write() would fail. Even for the growfs case, it'd be better to update the lazy sb counters immediately to reflect the real sb counters. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: Fix a typo  (Bhaskar Chowdhury)
s/strutures/structures/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Reviewed-by: Pavel Reichl <preichl@redhat.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: Rudimentary spelling fix  (Bhaskar Chowdhury)
s/sytemcall/syscall/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: Rudimentary typo fixes  (Bhaskar Chowdhury)
s/filesytem/filesystem/ s/instrumention/instrumentation/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: __percpu_counter_compare() inode count debug too expensive  (Dave Chinner)
- 21.92% __xfs_trans_commit
    - 21.62% xfs_log_commit_cil
       - 11.69% xfs_trans_unreserve_and_mod_sb
          - 11.58% __percpu_counter_compare
             - 11.45% __percpu_counter_sum
                - 10.29% _raw_spin_lock_irqsave
                   - 10.28% do_raw_spin_lock
                        __pv_queued_spin_lock_slowpath

We debated just getting rid of it last time this came up and there was no real objection to removing it. Now it's the biggest scalability limitation for debug kernels even on smallish machines, so let's just get rid of it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
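For context, an exact read of a per-CPU counter has to walk every CPU's delta, and the kernel's __percpu_counter_sum serialises that walk with a lock, which is why doing it on every transaction commit is so costly on debug kernels. An illustrative userspace sketch with hypothetical names:

#include <stdint.h>
#include <stdio.h>

#define NCPUS 64

struct percpu_counter_sketch {
    int64_t global;          /* batched global value */
    int64_t percpu[NCPUS];   /* per-CPU deltas */
};

static int64_t counter_sum(const struct percpu_counter_sketch *c)
{
    int64_t sum = c->global;

    /* the real __percpu_counter_sum holds a spinlock around this O(ncpus) walk */
    for (int cpu = 0; cpu < NCPUS; cpu++)
        sum += c->percpu[cpu];
    return sum;
}

int main(void)
{
    struct percpu_counter_sketch c = { .global = 100 };

    c.percpu[3] = 5;
    c.percpu[17] = -2;
    printf("exact value: %lld\n", (long long)counter_sum(&c));
    return 0;
}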
2021-03-25  xfs: reduce debug overhead of dir leaf/node checks  (Dave Chinner)
On debug kernels, we call xfs_dir3_leaf_check_int() multiple times on every directory modification. The robust hash ordering checks it does on every entry in the leaf on every call result in a massive CPU overhead which slows down debug kernels by a large amount. We use xfs_dir3_leaf_check_int() for the verifiers as well, so we can't just gut the function to reduce overhead. What we can do, however, is reduce the work it does when it is called from the debug interfaces, just leaving the high level checks in place and leaving the robust validation to the verifiers. This means the debug checks will catch gross errors, but subtle bugs might not be caught until a verifier is run. It is easy enough to restore the existing debug behaviour if the developer needs it (just change a call parameter in the debug code), but otherwise the overhead makes testing large directory block sizes on debug kernels very slow.

Profile at an unlink rate of ~80k files/s on a 64k block size filesystem before the patch:

 40.30%  [kernel]  [k] xfs_dir3_leaf_check_int
 10.98%  [kernel]  [k] __xfs_dir3_data_check
  8.10%  [kernel]  [k] xfs_verify_dir_ino
  4.42%  [kernel]  [k] memcpy
  2.22%  [kernel]  [k] xfs_dir2_data_get_ftype
  1.52%  [kernel]  [k] do_raw_spin_lock

Profile after, at an unlink rate of ~125k files/s (+50% improvement), has largely dropped the leaf verification debug overhead out of the profile:

 16.53%  [kernel]  [k] __xfs_dir3_data_check
 12.53%  [kernel]  [k] xfs_verify_dir_ino
  7.97%  [kernel]  [k] memcpy
  3.36%  [kernel]  [k] xfs_dir2_data_get_ftype
  2.86%  [kernel]  [k] __pv_queued_spin_lock_slowpath

Create shows a similar change in profile and a +25% improvement in performance.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
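A sketch of the shape of the change (hypothetical names, not the real xfs_dir3_leaf_check_int signature): the shared check routine takes a flag so debug callers can skip the expensive per-entry walk while verifiers keep it.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct leaf_entry { unsigned int hashval; };

static bool leaf_check(const struct leaf_entry *ents, size_t n, bool robust)
{
    /* cheap high-level checks always run */
    if (n == 0)
        return true;

    /* expensive per-entry hash ordering walk only for the verifier path */
    if (robust) {
        for (size_t i = 1; i < n; i++)
            if (ents[i].hashval < ents[i - 1].hashval)
                return false;
    }
    return true;
}

int main(void)
{
    struct leaf_entry ents[] = { {1}, {5}, {3} };   /* out of order */

    printf("debug path:    %d\n", leaf_check(ents, 3, false));  /* passes */
    printf("verifier path: %d\n", leaf_check(ents, 3, true));   /* fails  */
    return 0;
}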
2021-03-25  xfs: No need for inode number error injection in __xfs_dir3_data_check  (Dave Chinner)
We call xfs_dir_ino_validate() for every dir entry in a directory when doing validity checking of the directory. It calls xfs_verify_dir_ino(), then emits a corruption report if bad or does error injection if good. It is extremely costly:

 43.27%  [kernel]  [k] xfs_dir3_leaf_check_int
 10.28%  [kernel]  [k] __xfs_dir3_data_check
  6.61%  [kernel]  [k] xfs_verify_dir_ino
  4.16%  [kernel]  [k] xfs_errortag_test
  4.00%  [kernel]  [k] memcpy
  3.48%  [kernel]  [k] xfs_dir_ino_validate

7% of the CPU usage in this directory traversal workload is xfs_dir_ino_validate() doing absolutely nothing. We don't need error injection to simulate bad inode numbers in the directory structure because we can do that by fuzzing the structure on disk. And we don't need a corruption report, because __xfs_dir3_data_check() will emit one if the inode number is bad. So just call xfs_verify_dir_ino() directly here, and get rid of all this unnecessary overhead:

 40.30%  [kernel]  [k] xfs_dir3_leaf_check_int
 10.98%  [kernel]  [k] __xfs_dir3_data_check
  8.10%  [kernel]  [k] xfs_verify_dir_ino
  4.42%  [kernel]  [k] memcpy
  2.22%  [kernel]  [k] xfs_dir2_data_get_ftype
  1.52%  [kernel]  [k] do_raw_spin_lock

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25  xfs: type verification is expensive  (Dave Chinner)
From a concurrent rm -rf workload:

 41.04%  [kernel]  [k] xfs_dir3_leaf_check_int
  9.85%  [kernel]  [k] __xfs_dir3_data_check
  5.60%  [kernel]  [k] xfs_verify_ino
  5.32%  [kernel]  [k] xfs_agino_range
  4.21%  [kernel]  [k] memcpy
  3.06%  [kernel]  [k] xfs_errortag_test
  2.57%  [kernel]  [k] xfs_dir_ino_validate
  1.66%  [kernel]  [k] xfs_dir2_data_get_ftype
  1.17%  [kernel]  [k] do_raw_spin_lock
  1.11%  [kernel]  [k] xfs_verify_dir_ino
  0.84%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
  0.83%  [kernel]  [k] xfs_buf_find
  0.64%  [kernel]  [k] xfs_log_commit_cil

There's an awful lot of overhead in just range checking inode numbers in that, even though each inode number check is not a lot of code. In total, a bit over 14.5% of the CPU time is spent validating inode numbers. The problem is that these are deeply nested global scope functions, so the overhead here is all in function call marshalling.

   text    data     bss     dec     hex  filename
   2077       0       0    2077     81d  fs/xfs/libxfs/xfs_types.o.orig
   2197       0       0    2197     895  fs/xfs/libxfs/xfs_types.o

There's a small increase in binary size from inlining all the local nested calls in the verifier functions, but the same workload now profiles as:

 40.69%  [kernel]  [k] xfs_dir3_leaf_check_int
 10.52%  [kernel]  [k] __xfs_dir3_data_check
  6.68%  [kernel]  [k] xfs_verify_dir_ino
  4.22%  [kernel]  [k] xfs_errortag_test
  4.15%  [kernel]  [k] memcpy
  3.53%  [kernel]  [k] xfs_dir_ino_validate
  1.87%  [kernel]  [k] xfs_dir2_data_get_ftype
  1.37%  [kernel]  [k] do_raw_spin_lock
  0.98%  [kernel]  [k] xfs_buf_find
  0.94%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
  0.73%  [kernel]  [k] xfs_log_commit_cil

Now we only spend just over 10% of the time validating inode numbers for the same workload. Hence a few "inline" keywords are good enough to reduce the validation overhead by 30%...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
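A sketch of what the change amounts to (hypothetical, heavily simplified helpers, not the real xfs_verify_ino/xfs_agino_range code): marking the small nested range-check helpers inline removes the function call marshalling that dominated the profile.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t ino64_t;

/* inlined leaf helper: a plain range check */
static inline bool agino_in_range(uint32_t agino, uint32_t first, uint32_t last)
{
    return agino >= first && agino <= last;
}

/* inlined outer helper calling the leaf helper: no call overhead remains */
static inline bool verify_ino(ino64_t ino, uint32_t agcount, uint32_t agino_max)
{
    uint32_t agno = (uint32_t)(ino >> 32);   /* made-up AG/inode split for the sketch */
    uint32_t agino = (uint32_t)ino;

    return agno < agcount && agino_in_range(agino, 1, agino_max);
}

int main(void)
{
    printf("%d\n", verify_ino(((ino64_t)2 << 32) | 128, 4, 1u << 20));
    return 0;
}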
2021-03-25  xfs: optimise xfs_buf_item_size/format for contiguous regions  (Dave Chinner)
We process the buf_log_item bitmap one set bit at a time with xfs_next_bit() so we can detect if a region crosses a memcpy discontinuity in the buffer data address. This has massive overhead on large buffers (e.g. 64k directory blocks) because we do a lot of unnecessary checks and xfs_buf_offset() calls. For example, a 16-way concurrent create workload on a debug kernel running CPU bound has this at the top of the profile at ~120k creates/s on 64kb directory block size:

 20.66%  [kernel]  [k] xfs_dir3_leaf_check_int
  7.10%  [kernel]  [k] memcpy
  6.22%  [kernel]  [k] xfs_next_bit
  3.55%  [kernel]  [k] xfs_buf_offset
  3.53%  [kernel]  [k] xfs_buf_item_format
  3.34%  [kernel]  [k] __pv_queued_spin_lock_slowpath
  3.04%  [kernel]  [k] do_raw_spin_lock
  2.84%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
  2.31%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
  1.36%  [kernel]  [k] xfs_log_commit_cil

(debug checks hurt large blocks)

The only buffers with discontinuities in the data address are unmapped buffers, and they are only used for inode cluster buffers and only for logging unlinked pointers. IOWs, it is -rare- that we even need to detect a discontinuity in the buffer item formatting code.

Optimise all this by using xfs_contig_bits() to find the size of the contiguous regions, then test for a discontinuity inside each one. If we find one, do the slow "bit at a time" method we do now. If we don't, then just copy the entire contiguous range in one go. Profile now looks like:

 25.26%  [kernel]  [k] xfs_dir3_leaf_check_int
  9.25%  [kernel]  [k] memcpy
  5.01%  [kernel]  [k] __pv_queued_spin_lock_slowpath
  2.84%  [kernel]  [k] do_raw_spin_lock
  2.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
  1.88%  [kernel]  [k] xfs_buf_find
  1.53%  [kernel]  [k] memmove
  1.47%  [kernel]  [k] xfs_log_commit_cil
  ....
  0.34%  [kernel]  [k] xfs_buf_item_format
  ....
  0.21%  [kernel]  [k] xfs_buf_offset
  ....
  0.16%  [kernel]  [k] xfs_contig_bits
  ....
  0.13%  [kernel]  [k] xfs_buf_item_size_segment.isra.0

So the bit scanning over the dirty region tracking for the buffer log items is basically gone. Debug overhead hurts even more now...

Perf comparison:

            dir block   creates          unlink
            size (kb)   time     rate    time
  Original      4       4m08s    220k     5m13s
  Original     64       7m21s    115k    13m25s
  Patched       4       3m59s    230k     5m03s
  Patched      64       6m23s    143k    12m33s

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
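An illustrative userspace sketch of the optimisation (the bitmap helpers below are simplified stand-ins for xfs_next_bit()/xfs_contig_bits(), not the kernel implementations): find each contiguous run of set bits and copy the whole run at once instead of one bit at a time.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBITS 32

/* find the next set bit at or after 'start', or -1 if none */
static int next_bit(uint32_t map, int start)
{
    for (int i = start; i < NBITS; i++)
        if (map & (1u << i))
            return i;
    return -1;
}

/* count the length of the contiguous run of set bits starting at 'start' */
static int contig_bits(uint32_t map, int start)
{
    int n = 0;

    while (start + n < NBITS && (map & (1u << (start + n))))
        n++;
    return n;
}

int main(void)
{
    uint32_t dirty = 0x00f0700f;   /* three dirty runs: bits 0-3, 12-14, 20-23 */
    char src[NBITS * 8], dst[NBITS * 8] = { 0 };
    int bit = 0;

    memset(src, 'x', sizeof(src));
    while ((bit = next_bit(dirty, bit)) >= 0) {
        int run = contig_bits(dirty, bit);

        /* one copy per contiguous run instead of one copy per bit */
        memcpy(dst + bit * 8, src + bit * 8, (size_t)run * 8);
        bit += run;
    }
    printf("copied runs starting at bits 0, 12, 20\n");
    return 0;
}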