summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs/xfs_ialloc.c
AgeCommit message (Collapse)Author
2017-11-01xfs: move error injection tags into their own fileDarrick J. Wong
Move the error injection tag names into a libxfs header so that we can share it between kernel and userspace. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-10-26xfs: create inode pointer verifiersDarrick J. Wong
Create some helper functions to check that inode pointers point to somewhere within the filesystem and not at the static AG metadata. Move xfs_internal_inum and create a directory inode check function. We will use these functions in scrub and elsewhere. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-10-11xfs: Fix bool initialization/comparisonThomas Meyer
Bool initializations should use true and false. Bool tests don't need comparisons. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01xfs: don't log dirty ranges for ordered buffersBrian Foster
Ordered buffers are attached to transactions and pushed through the logging infrastructure just like normal buffers with the exception that they are not actually written to the log. Therefore, we don't need to log dirty ranges of ordered buffers. xfs_trans_log_buf() is called on ordered buffers to set up all of the dirty state on the transaction, buffer and log item and prepare the buffer for I/O. Now that xfs_trans_dirty_buf() is available, call it from xfs_trans_ordered_buf() so the latter is now mutually exclusive with xfs_trans_log_buf(). This reflects the implementation of ordered buffers and helps eliminate confusion over the need to log ranges of ordered buffers just to set up internal log state. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22xfs: stop searching for free slots in an inode chunk when there are noneCarlos Maiolino
In a filesystem without finobt, the Space manager selects an AG to alloc a new inode, where xfs_dialloc_ag_inobt() will search the AG for the free slot chunk. When the new inode is in the same AG as its parent, the btree will be searched starting on the parent's record, and then retried from the top if no slot is available beyond the parent's record. To exit this loop though, xfs_dialloc_ag_inobt() relies on the fact that the btree must have a free slot available, once its callers relied on the agi->freecount when deciding how/where to allocate this new inode. In the case when the agi->freecount is corrupted, showing available inodes in an AG, when in fact there is none, this becomes an infinite loop. Add a way to stop the loop when a free slot is not found in the btree, making the function to fall into the whole AG scan which will then, be able to detect the corruption and shut the filesystem down. As pointed by Brian, this might impact performance, giving the fact we don't reset the search distance anymore when we reach the end of the tree, giving it fewer tries before falling back to the whole AG search, but it will only affect searches that start within 10 records to the end of the tree. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-11xfs: fix inobt inode allocation search optimizationOmar Sandoval
When we try to allocate a free inode by searching the inobt, we try to find the inode nearest the parent inode by searching chunks both left and right of the chunk containing the parent. As an optimization, we cache the leftmost and rightmost records that we previously searched; if we do another allocation with the same parent inode, we'll pick up the search where it last left off. There's a bug in the case where we found a free inode to the left of the parent's chunk: we need to update the cached left and right records, but because we already reassigned the right record to point to the left, we end up assigning the left record to both the cached left and right records. This isn't a correctness problem strictly, but it can result in the next allocation rechecking chunks unnecessarily or allocating inodes further away from the parent than it needs to. Fix it by swapping the record pointer after we update the cached left and right records. Fixes: bd169565993b ("xfs: speed up free inode search") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-27xfs: remove unneeded parameter from XFS_TEST_ERRORDarrick J. Wong
Since we moved the injected error frequency controls to the mountpoint, we can get rid of the last argument to XFS_TEST_ERROR. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-19xfs: export _inobt_btrec_to_irec and _ialloc_cluster_alignment for scrubDarrick J. Wong
Create a function to extract an in-core inobt record from a generic btree_rec union so that scrub will be able to check inobt records and check inode block alignment. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: remove double-underscore integer typesDarrick J. Wong
This is a purely mechanical patch that removes the private __{u,}int{8,16,32,64}_t typedefs in favor of using the system {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform the transformation and fix the resulting whitespace and indentation errors: s/typedef\t__uint8_t/typedef __uint8_t\t/g s/typedef\t__uint/typedef __uint/g s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g s/__uint8_t\t/__uint8_t\t\t/g s/__uint/uint/g s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g s/__int/int/g /^typedef.*int[0-9]*_t;$/d Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-02-16xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignmentChandan Rajendra
On a ppc64 system, executing generic/256 test with 32k block size gives the following call trace, XFS: Assertion failed: args->maxlen > 0, file: /root/repos/linux/fs/xfs/libxfs/xfs_alloc.c, line: 2026 kernel BUG at /root/repos/linux/fs/xfs/xfs_message.c:113! Oops: Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries Modules linked in: CPU: 2 PID: 19361 Comm: mkdir Not tainted 4.10.0-rc5 #58 task: c000000102606d80 task.stack: c0000001026b8000 NIP: c0000000004ef798 LR: c0000000004ef798 CTR: c00000000082b290 REGS: c0000001026bb090 TRAP: 0700 Not tainted (4.10.0-rc5) MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28004428 XER: 00000000 CFAR: c0000000004ef180 SOFTE: 1 GPR00: c0000000004ef798 c0000001026bb310 c000000001157300 ffffffffffffffea GPR04: 000000000000000a c0000001026bb130 0000000000000000 ffffffffffffffc0 GPR08: 00000000000000d1 0000000000000021 00000000ffffffd1 c000000000dd4990 GPR12: 0000000022004444 c00000000fe00800 0000000020000000 0000000000000000 GPR16: 0000000000000000 0000000043a606fc 0000000043a76c08 0000000043a1b3d0 GPR20: 000001002a35cd60 c0000001026bbb80 0000000000000000 0000000000000001 GPR24: 0000000000000240 0000000000000004 c00000062dc55000 0000000000000000 GPR28: 0000000000000004 c00000062ecd9200 0000000000000000 c0000001026bb6c0 NIP [c0000000004ef798] .assfail+0x28/0x30 LR [c0000000004ef798] .assfail+0x28/0x30 Call Trace: [c0000001026bb310] [c0000000004ef798] .assfail+0x28/0x30 (unreliable) [c0000001026bb380] [c000000000455d74] .xfs_alloc_space_available+0x194/0x1b0 [c0000001026bb410] [c00000000045b914] .xfs_alloc_fix_freelist+0x144/0x480 [c0000001026bb580] [c00000000045c368] .xfs_alloc_vextent+0x698/0xa90 [c0000001026bb650] [c0000000004a6200] .xfs_ialloc_ag_alloc+0x170/0x820 [c0000001026bb7c0] [c0000000004a9098] .xfs_dialloc+0x158/0x320 [c0000001026bb8a0] [c0000000004e628c] .xfs_ialloc+0x7c/0x610 [c0000001026bb990] [c0000000004e8138] .xfs_dir_ialloc+0xa8/0x2f0 [c0000001026bbaa0] [c0000000004e8814] .xfs_create+0x494/0x790 [c0000001026bbbf0] [c0000000004e5ebc] .xfs_generic_create+0x2bc/0x410 [c0000001026bbce0] [c0000000002b4a34] .vfs_mkdir+0x154/0x230 [c0000001026bbd70] [c0000000002bc444] .SyS_mkdirat+0x94/0x120 [c0000001026bbe30] [c00000000000b760] system_call+0x38/0xfc Instruction dump: 4e800020 60000000 7c0802a6 7c862378 3c82ffca 7ca72b78 38841c18 7c651b78 38600000 f8010010 f821ff91 4bfff94d <0fe00000> 60000000 7c0802a6 7c892378 When block size is larger than inode cluster size, the call to XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs would have set xfs_sb->sb_inoalignmt to 0. This causes xfs_ialloc_cluster_alignment() to return 0. Due to this args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned equivalent of -1 assigned to it. This later causes alloc_len in xfs_alloc_space_available() to have a value of 0. In such a scenario when args.total is also 0, the assert statement "ASSERT(args->maxlen > 0);" fails. This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb(). Suggested-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-07Merge branch 'xfs-4.10-misc-fixes-3' into for-nextDave Chinner
2016-12-05xfs: forbid AG btrees with level == 0Darrick J. Wong
There is no such thing as a zero-level AG btree since even a single-node zero-records btree has one level. Btree cursor constructors read cur_nlevels straight from disk and then access things like cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero! Therefore, strengthen the verifiers to prevent this possibility. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05xfs: Move AGI buffer type setting to xfs_read_agiEric Sandeen
We've missed properly setting the buffer type for an AGI transaction in 3 spots now, so just move it into xfs_read_agi() and set it if we are in a transaction to avoid the problem in the future. This is similar to how it is done in i.e. the dir3 and attr3 read functions. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08libxfs: convert ushort to unsigned shortDarrick J. Wong
Since xfsprogs dropped ushort in favor of unsigned short, do that here too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add owner field to extent allocation and freeingDarrick J. Wong
For the rmap btree to work, we have to feed the extent owner information to the the allocation and freeing functions. This information is what will end up in the rmap btree that tracks allocated extents. While we technically don't need the owner information when freeing extents, passing it allows us to validate that the extent we are removing from the rmap btree actually belonged to the owner we expected it to belong to. We also define a special set of owner values for internal metadata that would otherwise have no owner. This allows us to tell the difference between metadata owned by different per-ag btrees, as well as static fs metadata (e.g. AG headers) and internal journal blocks. There are also a couple of special cases we need to take care of - during EFI recovery, we don't actually know who the original owner was, so we need to pass a wildcard to indicate that we aren't checking the owner for validity. We also need special handling in growfs, as we "free" the space in the last AG when extending it, but because it's new space it has no actual owner... While touching the xfs_bmap_add_free() function, re-order the parameters to put the struct xfs_mount first. Extend the owner field to include both the owner type and some sort of index within the owner. The index field will be used to support reverse mappings when reflink is enabled. When we're freeing extents from an EFI, we don't have the owner information available (rmap updates have their own redo items). xfs_free_extent therefore doesn't need to do an rmap update. Make sure that the log replay code signals this correctly. This is based upon a patch originally from Dave Chinner. It has been extended to add more owner information with the intent of helping recovery operations when things go wrong (e.g. offset of user data block in a file). [dchinner: de-shout the xfs_rmap_*_owner helpers] [darrick: minor style fixes suggested by Christoph Hellwig] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: rename flist/free_list to dfopsDarrick J. Wong
Mechanical change of flist/free_list to dfops, since they're now deferred ops, not just a freeing list. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: change xfs_bmap_{finish,cancel,init,free} -> xfs_defer_*Darrick J. Wong
Drop the compatibility shims that we were using to integrate the new deferred operation mechanism into the existing code. No new code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: rework xfs_bmap_free callers to use xfs_defer_opsDarrick J. Wong
Restructure everything that used xfs_bmap_free to use xfs_defer_ops instead. For now we'll just remove the old symbols and play some cpp magic to make it work; in the next patch we'll actually rename everything. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21xfs: refactor btree maxlevels computationDarrick J. Wong
Create a common function to calculate the maximum height of a per-AG btree. This will eventually be used by the rmapbt and refcountbt code to calculate appropriate maxlevels values for each. This is important because the verifiers and the transaction block reservations depend on accurate estimates of how many blocks are needed to satisfy a btree split. We were mistakenly using the max bnobt height for all the btrees, which creates a dangerous situation since the larger records and keys in an rmapbt make it very possible that the rmapbt will be taller than the bnobt and so we can run out of transaction block reservation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21xfs: rearrange xfs_bmap_add_free parametersDarrick J. Wong
This is already in xfsprogs' libxfs, so port it to the kernel. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07xfs: fix computation of inode btree maxlevelsDarrick J. Wong
Commit 88740da18[1] introduced a function to compute the maximum height of the inode btree back in 1994. Back then, apparently, the freespace and inode btrees shared the same geometry; however, it has long since been the case that the inode and freespace btrees have different record and key sizes. Therefore, we must use m_inobt_mnr if we want a correct calculation/log reservation/etc. (Yes, this bug has been around for 21 years and ten months.) (Yes, I was in middle school when this bug was committed.) [1] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=88740da18ddd9d7ba3ebaa9502fefc6ef2fd19cd Historical-research-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04xfs: print name of verifier if it failsEric Sandeen
This adds a name to each buf_ops structure, so that if a verifier fails we can print the type of verifier that failed it. Should be a slight debugging aid, I hope. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12xfs: validate metadata LSNs against log on v5 superblocksBrian Foster
Since the onset of v5 superblocks, the LSN of the last modification has been included in a variety of on-disk data structures. This LSN is used to provide log recovery ordering guarantees (e.g., to ensure an older log recovery item is not replayed over a newer target data structure). While this works correctly from the point a filesystem is formatted and mounted, userspace tools have some problematic behaviors that defeat this mechanism. For example, xfs_repair historically zeroes out the log unconditionally (regardless of whether corruption is detected). If this occurs, the LSN of the filesystem is reset and the log is now in a problematic state with respect to on-disk metadata structures that might have a larger LSN. Until either the log catches up to the highest previously used metadata LSN or each affected data structure is modified and written out without incident (which resets the metadata LSN), log recovery is susceptible to filesystem corruption. This problem is ultimately addressed and repaired in the associated userspace tools. The kernel is still responsible to detect the problem and notify the user that something is wrong. Check the superblock LSN at mount time and fail the mount if it is invalid. From that point on, trigger verifier failure on any metadata I/O where an invalid LSN is detected. This results in a filesystem shutdown and guarantees that we do not log metadata changes with invalid LSNs on disk. Since this is a known issue with a known recovery path, present a warning to instruct the user how to recover. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19Merge branch 'xfs-efi-rework' into for-nextDave Chinner
2015-08-19xfs: fix btree cursor error cleanupsBrian Foster
The btree cursor cleanup function takes an error parameter that affects how buffers are released from the cursor. All buffers are released in the event of error. Several callers do not specify the XFS_BTREE_ERROR flag in the event of error, however. This can cause buffers to hang around locked or with an elevated hold count and thus lead to umount hangs in the event of errors. Fix up the xfs_btree_del_cursor() callers to pass XFS_BTREE_ERROR if the cursor is being torn down due to error. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29xfs: create new metadata UUID field and incompat flagEric Sandeen
This adds a new superblock field, sb_meta_uuid. If set, along with a new incompat flag, the code will use that field on a V5 filesystem to compare to metadata UUIDs, which allows us to change the user- visible UUID at will. Userspace handles the setting and clearing of the incompat flag as appropriate, as the UUID gets changed; i.e. setting the user-visible UUID back to the original UUID (as stored in the new field) will remove the incompatible feature flag. If the incompat flag is not set, this copies the user-visible UUID into into the meta_uuid slot in memory when the superblock is read from disk; the meta_uuid field is not written back to disk in this case. The remainder of this patch simply switches verifiers, initializers, etc to use the new sb_meta_uuid field. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04Merge branch 'xfs-misc-fixes-for-4.2-2' into for-nextDave Chinner
2015-06-04xfs: check min blks for random debug mode sparse allocationsBrian Foster
The inode allocator enables random sparse inode chunk allocations in DEBUG mode to facilitate testing. Sparse inode allocations are not always possible, however, depending on the fs geometry. For example, there is no possibility for a sparse inode allocation on filesystems where the block size is large enough to fit one or more inode chunks within a single block. Fix up the DEBUG mode sparse inode allocation logic to trigger random sparse allocations only when the geometry of the fs allows it. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-01Merge branch 'xfs-sparse-inode' into for-nextDave Chinner
2015-05-29xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()Brian Foster
xfs_ifree_cluster() is called to mark all in-memory inodes and inode buffers as stale. This occurs after we've removed the inobt records and dropped any references of inobt data. xfs_ifree_cluster() uses the starting inode number to walk the namespace of inodes expected for a single chunk a cluster buffer at a time. The cluster buffer disk addresses are calculated by decoding the sequential inode numbers expected from the chunk. The problem with this approach is that if the inode chunk being removed is a sparse chunk, not all of the buffer addresses that are calculated as part of this sequence may be inode clusters. Attempting to acquire the buffer based on expected inode characterstics (i.e., cluster length) can lead to errors and is generally incorrect. We already use a couple variables to carry requisite state from xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a new internal structure to carry the existing parameters through these functions. Add an alloc field that represents the physical allocation bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster() to check each inode against the bitmap and skip the clusters that were never allocated as real inodes on disk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: only free allocated regions of inode chunksBrian Foster
An inode chunk is currently added to the transaction free list based on a simple fsb conversion and hardcoded chunk length. The nature of sparse chunks is such that the physical chunk of inodes on disk may consist of one or more discontiguous parts. Blocks that reside in the holes of the inode chunk are not inodes and could be allocated to any other use or not allocated at all. Refactor the existing xfs_bmap_add_free() call into the xfs_difree_inode_chunk() helper. The new helper uses the existing calculation if a chunk is not sparse. Otherwise, use the inobt record holemask to free the contiguous regions of the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: filter out sparse regions from individual inode allocationBrian Foster
Inode allocation from an existing record with free inodes traditionally selects the first inode available according to the ir_free mask. With sparse inode chunks, the ir_free mask could refer to an unallocated region. We must mask the unallocated regions out of ir_free before using it to select a free inode in the chunk. Update the xfs_inobt_first_free_inode() helper to find the first free inode available of the allocated regions of the inode chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: randomly do sparse inode allocations in DEBUG modeBrian Foster
Sparse inode allocations generally only occur when full inode chunk allocation fails. This requires some level of filesystem space usage and fragmentation. For filesystems formatted with sparse inode chunks enabled, do random sparse inode chunk allocs when compiled in DEBUG mode to increase test coverage. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: allocate sparse inode chunks on full chunk allocation failureBrian Foster
xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode chunk. If all else fails, reduce the allocation to the sparse length and alignment and attempt to allocate a sparse inode chunk. If sparse chunk allocation succeeds, check whether an inobt record already exists that can track the chunk. If so, inherit and update the existing record. Otherwise, insert a new record for the sparse chunk. Create helpers to align sparse chunk inode records and insert or update existing records in the inode btrees. The xfs_inobt_insert_sprec() helper implements the merge or update semantics required for sparse inode records with respect to both the inobt and finobt. To update the inobt, either insert a new record or merge with an existing record. To update the finobt, use the updated inobt record to either insert or replace an existing record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: pass inode count through ordered icreate log itemBrian Foster
v5 superblocks use an ordered log item for logging the initialization of inode chunks. The icreate log item is currently hardcoded to an inode count of 64 inodes. The agbno and extent length are used to initialize the inode chunk from log recovery. While an incorrect inode count does not lead to bad inode chunk initialization, we should pass the correct inode count such that log recovery has enough data to perform meaningful validity checks on the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: introduce inode record hole mask for sparse inode chunksBrian Foster
The inode btrees track 64 inodes per record regardless of inode size. Thus, inode chunks on disk vary in size depending on the size of the inodes. This creates a contiguous allocation requirement for new inode chunks that can be difficult to satisfy on an aged and fragmented (free space) filesystems. The inode record freecount currently uses 4 bytes on disk to track the free inode count. With a maximum freecount value of 64, only one byte is required. Convert the freecount field to a single byte and use two of the remaining 3 higher order bytes left for the hole mask field. Use the final leftover byte for the total count field. The hole mask field tracks holes in the chunks of physical space that the inode record refers to. This facilitates the sparse allocation of inode chunks when contiguous chunks are not available and allows the inode btrees to identify what portions of the chunk contain valid inodes. The total count field contains the total number of valid inodes referred to by the record. This can also be deduced from the hole mask. The count field provides clarity and redundancy for internal record verification. Note that neither of the new fields can be written to disk on fs' without sparse inode support. Doing so writes to the high-order bytes of freecount and causes corruption from the perspective of older kernels. The on-disk inobt record data structure is updated with a union to distinguish between the original, "full" format and the new, "sparse" format. The conversion routines to get, insert and update records are updated to translate to and from the on-disk record accordingly such that freecount remains a 4-byte value on non-supported fs, yet the new fields of the in-core record are always valid with respect to the record. This means that higher level code can refer to the current in-core record format unconditionally and lower level code ensures that records are translated to/from disk according to the capabilities of the fs. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: use sparse chunk alignment for min. inode allocation requirementBrian Foster
xfs_ialloc_ag_select() iterates through the allocation groups looking for free inodes or free space to determine whether to allow an inode allocation to proceed. If no free inodes are available, it assumes that an AG must have an extent longer than mp->m_ialloc_blks. Sparse inode chunk support currently allows for allocations smaller than the traditional inode chunk size specified in m_ialloc_blks. The current minimum sparse allocation is set in the superblock sb_spino_align field at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use this to represent the minimum supported allocation size for inode chunks. Initialize m_ialloc_min_blks at mount time based on whether sparse inodes are supported. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: update free inode record logic to support sparse inode recordsBrian Foster
xfs_difree_inobt() uses logic in a couple places that assume inobt records refer to fully allocated chunks. Specifically, the use of mp->m_ialloc_inos can cause problems for inode chunks that are sparsely allocated. Sparse inode chunks can, by definition, define a smaller number of inodes than a full inode chunk. Fix the logic that determines whether an inode record should be removed from the inobt to use the ir_free mask rather than ir_freecount. Fix the agi counters modification to use ir_freecount to add the actual number of inodes freed rather than assuming a full inode chunk. Also make sure that we preserve the behavior to not remove inode chunks if the block size is large enough for multiple inode chunks (e.g., bsize=64k, isize=512). This behavior was previously implicit in that in such configurations, ir.freecount of a single record never matches m_ialloc_inos. Hence, add some comments as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: create individual inode alloc. helperBrian Foster
Inode allocation from sparse inode records must filter the ir_free mask against ir_holemask. In preparation for this requirement, create a helper to allocate an individual inode from an inode record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: use percpu_counter_read_positive for mp->m_icountGeorge Wang
Function percpu_counter_read just return the current counter, which can be negative. This will cause the checking of "allocated inode counts <= m_maxicount" false positive. Use percpu_counter_read_positive can solve this problem, and be consistent with the purpose to introduce percpu mechanism to xfs. Signed-off-by: George Wang <xuw2015@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-24Merge branch 'xfs-generic-sb-counters' into for-nextDave Chinner
Conflicts: fs/xfs/xfs_super.c
2015-02-23xfs: pass mp to XFS_WANT_CORRUPTED_RETURNEric Sandeen
Today, if we hit an XFS_WANT_CORRUPTED_RETURN we don't print any information about which filesystem hit it. Passing in the mp allows us to print the filesystem (device) name, which is a pretty critical piece of information. Tested by running fsfuzzer 'til I hit some. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23xfs: pass mp to XFS_WANT_CORRUPTED_GOTOEric Sandeen
Today, if we hit an XFS_WANT_CORRUPTED_GOTO we don't print any information about which filesystem hit it. Passing in the mp allows us to print the filesystem (device) name, which is a pretty critical piece of information. Tested by running fsfuzzer 'til I hit some. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23xfs: use generic percpu counters for inode counterDave Chinner
XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. There are some warts around the use of them for the inode counter as the hand rolled counter is designed to be accurate at zero, but has no specific accurracy at any other value. This design causes problems for the maximum inode count threshold enforcement, as there is no trigger that balances the counters as they get close tothe maximum threshold. Instead of designing new triggers for balancing, just replace the handrolled per-cpu counter with a generic counter. This enables us to update the counter through the normal superblock modification funtions, but rather than do that we add a xfs_mod_icount() helper function (from Christoph Hellwig) and keep the percpu counter outside the superblock in the struct xfs_mount. This means we still need to initialise the per-cpu counter specifically when we read the superblock, and vice versa when we log/write it, but it does mean that we don't need to change any other code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04Merge branch 'xfs-misc-fixes-for-3.19-2' into for-nextDave Chinner
Conflicts: fs/xfs/xfs_iops.c
2014-12-04xfs: fix premature enospc on inode allocationDave Chinner
After growing a filesystem, XFS can fail to allocate inodes even though there is a large amount of space available in the filesystem for inodes. The issue is caused by a nearly full allocation group having enough free space in it to be considered for inode allocation, but not enough contiguous free space to actually allocation inodes. This situation results in successful selection of the AG for allocation, then failure of the allocation resulting in ENOSPC being reported to the caller. It is caused by two possible issues. Firstly, we only consider the lognest free extent and whether it would fit an inode chunk. If the extent is not correctly aligned, then we can't allocate an inode chunk in it regardless of the fact that it is large enough. This tends to be a permanent error until space in the AG is freed. The second issue is that we don't actually lock the AGI or AGF when we are doing these checks, and so by the time we get to actually allocating the inode chunk the space we thought we had in the AG may have been allocated. This tends to be a spurious error as it requires a race to trigger. Hence this case is ignored in this patch as the reported problem is for permanent errors. The first issue could be addressed by simply taking into account the alignment when checking the longest extent. This, however, would prevent allocation in AGs that have aligned, exact sized extents free. However, this case should be fairly rare compared to the number of allocations that occur near ENOSPC that would trigger this condition. Hence, when selecting the inode AG, take into account the inode cluster alignment when checking the lognest free extent in the AG. If we can't find any AGs with a contiguous free space large enough to be aligned, drop the alignment addition and just try for an AG that has enough contiguous free space available for an inode chunk. This won't prevent issues from occurring, but should avoid situations where other AGs have lots of free space but the selected AG can't allocate due to alignment constraints. Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01Merge branch 'xfs-coccinelle-cleanups' into for-nextDave Chinner
2014-12-01libxfs: fix simple_return.cocci warningskbuild test robot
fs/xfs/libxfs/xfs_ialloc.c:1141:1-6: WARNING: end returns can be simpified Simplify a trivial if-return sequence. Possibly combine with a preceding function call. Generated by: scripts/coccinelle/misc/simple_return.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28xfs: merge xfs_inum.h into xfs_format.hChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28xfs: merge xfs_ag.h into xfs_format.hChristoph Hellwig
More on-disk format consolidation. A few declarations that weren't on-disk format related move into better suitable spots. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>