path: root/fs/gfs2/bmap.c
2018-07-25  gfs2: Special-case rindex for gfs2_grow  (Andreas Gruenbacher)
To speed up the common case of appending to a file, gfs2_write_alloc_required presumes that writing beyond the end of a file will always require additional blocks to be allocated. This assumption is incorrect for preallocated files, but there are no negative consequences as long as *some* space is still left on the filesystem. One special file that always has some space preallocated beyond the end of the file is the rindex: when growing a filesystem, gfs2_grow adds one or more new resource groups and appends records describing those resource groups to the rindex; the preallocated space ensures that this is always possible. However, when a filesystem is completely full, gfs2_write_alloc_required will indicate that an additional allocation is required, and appending the next record to the rindex will fail even though space for that record has already been preallocated. To fix that, skip the incorrect optimization in gfs2_write_alloc_required, but for the rindex only. Other writes to preallocated space beyond the end of the file are still allowed to fail on completely full filesystems. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
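A rough sketch of the shape of this fix (the variable names follow gfs2_write_alloc_required but should be treated as assumptions):

        /* Writes past i_size normally force an allocation; the rindex is the
         * exception, because gfs2_grow appends into space that is already
         * preallocated. (Sketch; names assumed.) */
        if (lblock_stop > end_of_file && ip != GFS2_I(sdp->sd_rindex))
                return 1;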
2018-07-25  Merge branch 'iomap-4.19-merge' into linux-gfs2/for-next  (Andreas Gruenbacher)
Merge xfs branch 'iomap-4.19-merge' into linux-gfs2/for-next. This brings in readpage and direct I/O support for inline data. The IOMAP_F_BUFFER_HEAD flag introduced in commit "iomap: add initial support for writes without buffer heads" needs to be set for gfs2 as well, so do that in the merge. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24  Merge branch 'iomap-write' into linux-gfs2/for-next  (Andreas Gruenbacher)
Pull in the gfs2 iomap-write changes: Tweak the existing code to properly support iomap write and eliminate an unnecessary special case in gfs2_block_map. Implement iomap write support for buffered and direct I/O. Simplify some of the existing code and eliminate code that is no longer used:

  gfs2: Remove gfs2_write_{begin,end}
  gfs2: iomap direct I/O support
  gfs2: gfs2_extent_length cleanup
  gfs2: iomap buffered write support
  gfs2: Further iomap cleanups

This is based on the following changes on the xfs 'iomap-4.19-merge' branch:

  iomap: add private pointer to struct iomap
  iomap: add a page_done callback
  iomap: generic inline data handling
  iomap: complete partial direct I/O writes synchronously
  iomap: mark newly allocated buffer heads as new
  fs: factor out a __generic_write_end helper

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-02  gfs2: iomap direct I/O support  (Andreas Gruenbacher)
The page unmapping previously done in gfs2_direct_IO is now done generically in iomap_dio_rw. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-02  gfs2: gfs2_extent_length cleanup  (Andreas Gruenbacher)
Now that gfs2_extent_length is no longer used for determining the size of a hole and always with an upper size limit, the function can be simplified. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-02  gfs2: iomap buffered write support  (Andreas Gruenbacher)
With the traditional page-based writes, blocks are allocated separately for each page written to. With iomap writes, we can allocate a lot more blocks at once, with a fraction of the allocation overhead for each page. Split calculating the number of blocks that can be allocated at a given position (gfs2_alloc_size) off from gfs2_iomap_alloc: that size determines the number of blocks to allocate and reserve in the journal. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
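A hedged sketch of how such a split can be used on the write path; the gfs2_alloc_size() call shape below is an assumption, while gfs2_write_calc_reserv() is an existing gfs2 helper:

        /* Size the rgrp and journal reservation from how much can actually
         * be allocated at this position instead of reserving page by page. */
        u64 alloc_blks = gfs2_alloc_size(inode, &mp, len >> inode->i_blkbits);
        unsigned int data_blocks, ind_blocks;

        gfs2_write_calc_reserv(ip, alloc_blks << inode->i_blkbits,
                               &data_blocks, &ind_blocks);
        /* ... then reserve data_blocks + ind_blocks before starting the write ... */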
2018-07-02  gfs2: Further iomap cleanups  (Andreas Gruenbacher)
In gfs2_iomap_alloc, set the type of newly allocated extents to IOMAP_MAPPED so that iomap_to_bh will set the bh states correctly: otherwise, the bhs would not be marked as mapped, confusing __mpage_writepage. This means that we need to check for the IOMAP_F_NEW flag in fallocate_chunk now. Further clean up gfs2_iomap_get and implement gfs2_stuffed_iomap here directly. For reads beyond the end of the file, return holes instead of failing with -ENOENT so that we can get rid of that special case in gfs2_block_map. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
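A minimal sketch of the fallocate_chunk side of this change; gfs2_iomap_get_alloc is the helper introduced elsewhere in this series, and the zeroing helper named below is hypothetical:

        /* Newly allocated extents now come back as IOMAP_MAPPED with
         * IOMAP_F_NEW set, so key the zeroing off that flag. (Sketch.) */
        error = gfs2_iomap_get_alloc(inode, offset, len, &iomap);
        if (error)
                return error;
        if (iomap.flags & IOMAP_F_NEW)
                error = gfs2_zero_new_blocks(inode, &iomap); /* hypothetical helper */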
2018-06-21  gfs2: Minor clarification to __gfs2_punch_hole  (Andreas Gruenbacher)
Rename end_off to end_len to make the code less confusing. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-05  Merge tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds)
Pull xfs updates from Darrick Wong: "New features this cycle include the ability to relabel mounted filesystems, support for fallocated swapfiles, and using FUA for pure data O_DSYNC directio writes. With this cycle we begin to integrate online filesystem repair and refactor the growfs code in preparation for eventual subvolume support, though the road ahead for both features is quite long. There are also numerous refactorings of the iomap code to remove unnecessary log overhead, to disentangle some of the quota code, and to prepare for buffer head removal in a future upstream kernel. Metadata validation continues to improve, both in the hot path verifiers and the online filesystem check code. I anticipate sending a second pull request in a few days with more metadata validation improvements. This series has been run through a full xfstests run over the weekend and through a quick xfstests run against this morning's master, with no major failures reported.

Summary:
 - Strengthen inode number and structure validation when allocating inodes
 - Reduce pointless buffer allocations during cache miss
 - Use FUA for pure data O_DSYNC directio writes
 - Various iomap refactorings
 - Strengthen quota metadata verification to avoid unfixable broken quota
 - Make AGFL block freeing a deferred operation to avoid blowing out transaction reservations when running complex operations
 - Get rid of the log item descriptors to reduce log overhead
 - Fix various reflink bugs where inodes were double-joined to transactions
 - Don't issue discards when trimming unwritten extents
 - Refactor incore dquot initialization and retrieval interfaces
 - Fix some locking problems in the quota scrub code
 - Strengthen btree structure checks in scrub code
 - Rewrite swapfile activation to use iomap and support unwritten extents
 - Make scrub exit to userspace sooner when corruptions or cross-referencing problems are found
 - Make scrub invoke the data fork scrubber directly on metadata inodes
 - Don't do background reclamation of post-eof and cow blocks when the fs is suspended
 - Fix secondary superblock buffer lifespan hinting
 - Refactor growfs to use table-dispatched functions instead of long stringy functions
 - Move growfs code to libxfs
 - Implement online fs label getting and setting
 - Introduce online filesystem repair (in a very limited capacity)
 - Fix unit conversion problems in the realtime freemap iteration functions
 - Various refactorings and cleanups in preparation to remove buffer heads in a future release
 - Reimplement the old bmap call with iomap
 - Remove direct buffer head accesses from seek hole/data
 - Various bug fixes"

* tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
  fs: use ->is_partially_uptodate in page_cache_seek_hole_data
  fs: remove the buffer_unwritten check in page_seek_hole_data
  fs: move page_cache_seek_hole_data to iomap.c
  xfs: use iomap_bmap
  iomap: add an iomap-based bmap implementation
  iomap: add a iomap_sector helper
  iomap: use __bio_add_page in iomap_dio_zero
  iomap: move IOMAP_F_BOUNDARY to gfs2
  iomap: fix the comment describing IOMAP_NOWAIT
  iomap: inline data should be an iomap type, not a flag
  mm: split ->readpages calls to avoid non-contiguous pages lists
  mm: return an unsigned int from __do_page_cache_readahead
  mm: give the 'ret' variable a better name __do_page_cache_readahead
  block: add a lower-level bio_add_page interface
  xfs: fix error handling in xfs_refcount_insert()
  xfs: fix xfs_rtalloc_rec units
  xfs: strengthen rtalloc query range checks
  xfs: xfs_rtbuf_get should check the bmapi_read results
  xfs: xfs_rtword_t should be unsigned, not signed
  dax: change bdev_dax_supported() to support boolean returns
  ...
2018-06-04  gfs2: Iomap cleanups and improvements  (Andreas Gruenbacher)
Clean up gfs2_iomap_alloc and gfs2_iomap_get. Document how gfs2_iomap_alloc works: it now needs to be called separately after gfs2_iomap_get where necessary; this will be used later by iomap write. Move gfs2_iomap_ops into bmap.c. Introduce a new gfs2_iomap_get_alloc helper and use it in fallocate_chunk: gfs2_iomap_begin will become unsuitable for fallocate with proper iomap write support. In gfs2_block_map and fallocate_chunk, zero-initialize struct iomap. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
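A sketch of what such a helper can look like (simplified; the metapath handling and exact signatures are assumptions):

int gfs2_iomap_get_alloc(struct inode *inode, loff_t pos, loff_t length,
                         struct iomap *iomap)
{
        struct metapath mp = { .mp_aheight = 1, };
        int ret;

        /* Map the range; if it is still a hole under IOMAP_WRITE, allocate it. */
        ret = gfs2_iomap_get(inode, pos, length, IOMAP_WRITE, iomap, &mp);
        if (!ret && iomap->type == IOMAP_HOLE)
                ret = gfs2_iomap_alloc(inode, iomap, IOMAP_WRITE, &mp);
        release_metapath(&mp);
        return ret;
}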
2018-06-04  gfs2: Remove ordered write mode handling from gfs2_trans_add_data  (Andreas Gruenbacher)
In journaled data mode, we need to add each buffer head to the current transaction. In ordered write mode, we only need to add the inode to the ordered inode list. So far, both cases are handled in gfs2_trans_add_data. This makes the code look misleading and is inefficient for small block sizes as well. Handle both cases separately instead. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
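The resulting caller-side pattern, as a sketch (the function names are the gfs2 ones described above; the exact call sites are an assumption):

        if (gfs2_is_jdata(ip))
                gfs2_trans_add_data(ip->i_gl, bh);  /* journal each buffer */
        else
                gfs2_ordered_add_inode(ip);         /* just list the inode */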
2018-06-04  gfs2: hole_size improvement  (Andreas Gruenbacher)
Reimplement function hole_size based on a generic function for walking the metadata tree and rename hole_size to gfs2_hole_size. While previously, multiple invocations of hole_size were sometimes needed to walk across the entire hole, the new implementation always returns the entire hole at once (provided that the caller is interested in the total size). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-04  gfs2: Update find_metapath comment  (Andreas Gruenbacher)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-01  iomap: move IOMAP_F_BOUNDARY to gfs2  (Christoph Hellwig)
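A sketch of the resulting layout; the numeric value of the private flag range is illustrative, not taken from the commit:

/* include/linux/iomap.h: flags from here up are for filesystem-private use
 * (value illustrative). */
#define IOMAP_F_PRIVATE         0x10

/* fs/gfs2/bmap.c: gfs2 keeps its "metadata boundary" hint to itself. */
#define IOMAP_F_BOUNDARY        IOMAP_F_PRIVATE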
Just define a range of fs specific flags and use that in gfs2 instead of exposing this internal flag globally. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01  iomap: inline data should be an iomap type, not a flag  (Christoph Hellwig)
Inline data is fundamentally different from our normal mapped case in that it doesn't even have a block address. So instead of having a flag for it, it should be an entirely separate iomap range type. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-16  gfs2: Remove sdp->sd_jheightsize  (Andreas Gruenbacher)
GFS2 keeps two arrays in the superblock that define the maximum size of an inode depending on the inode's height: sdp->sd_heightsize defines the heights in units of sb->s_blocksize; sdp->sd_jheightsize defines them in units of sb->s_blocksize - sizeof(struct gfs2_meta_header). These arrays are used to determine when additional layers of indirect blocks are needed. The second array is used for directories which have an additional gfs2_meta_header at the beginning of each block. Distinguishing between these two cases makes no sense: the height required for representing N blocks will come out the same no matter if the calculation is done in gross (sb->s_blocksize) or net (sb->s_blocksize - sizeof(struct gfs2_meta_header)) units. Stuffed directories don't have an additional gfs2_meta_header, but the stuffed case is handled separately for both files and directories, anyway. Remove the unnecessary sdp->sd_jheightsize array. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-04-12  GFS2: Minor improvements to comments and documentation  (Bob Peterson)
This patch simply fixes some comments and the gfs2-glocks.txt file: it corrects places where i_rwsem was still called i_mutex, and it adds i_rw_mutex. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-29  gfs2: Zero out fallocated blocks in fallocate_chunk  (Andreas Gruenbacher)
Instead of zeroing out fallocated blocks in gfs2_iomap_alloc, zero them out in fallocate_chunk, much higher up the call stack. This gets rid of gfs2's abuse of the IOMAP_ZERO flag as well as the gfs2 specific zeronew buffer flag. I can't think of a reason why zeroing out the blocks in gfs2_iomap_alloc would have any benefits: there is no additional locking at that level that would add protection to the newly allocated blocks. While at it, change fallocate over from gfs2_block_map to gfs2_iomap_begin. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de>
2018-03-23  gfs2: Check for the end of metadata in punch_hole  (Andreas Gruenbacher)
When punching a hole or truncating an inode down to a given size, also check if the truncate point / start of the hole is within the range we have metadata for. Otherwise, we can end up freeing blocks that shouldn't be freed, corrupting the inode, or crashing the machine when trying to punch a hole into the void. When growing an inode via truncate, we set the new size but we don't allocate additional levels of indirect blocks and grow the inode height. When shrinking that inode again, the new size may still point beyond the end of the inode's metadata. Fixes xfstest generic/476. Debugged-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
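A hedged sketch of the added check: clamp the region being punched or truncated to what the inode's current height can actually address (the sd_heightsize indexing is an assumption):

        /* No metadata exists past what the current height can address,
         * so there is nothing to free out there. (Sketch; names assumed.) */
        u64 maxsize = sdp->sd_heightsize[ip->i_height];

        if (offset >= maxsize)
                return 0;               /* nothing allocated past this point */
        if (offset + length > maxsize)
                length = maxsize - offset;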
2018-03-08  gfs2: Improve gfs2_block_map comment  (Andreas Gruenbacher)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-07  gfs2: Fixes to "Implement iomap for block_map" (2)  (Andreas Gruenbacher)
It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for block_map"' introduced another bug in gfs2_iomap_begin that can cause gfs2_block_map to set bh->b_size of an actual buffer to 0. This can lead to arbitrary incorrect behavior including crashes or disk corruption. Revert the incorrect part of that commit. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-02-13  gfs2: Fixes to "Implement iomap for block_map"  (Andreas Gruenbacher)
It turns out that commit 3974320ca6 "Implement iomap for block_map" introduced a few bugs that trigger occasional failures with xfstest generic/476: In gfs2_iomap_begin, we jump to do_alloc when we determine that we are beyond the end of the allocated metadata (height > ip->i_height). There, we can end up calling hole_size with a metapath that doesn't match the current metadata tree, which doesn't make sense. After untangling the code at do_alloc, fix this by checking if the block we are looking for is within the range of allocated metadata. In addition, add a BUG() in case gfs2_iomap_begin is accidentally called for reading stuffed files: this is handled separately. Make sure we don't truncate iomap->length for reads beyond the end of the file; in that case, the entire range counts as a hole. Finally, revert to taking a bitmap write lock when doing allocations. It's unclear why that change didn't lead to any failures during testing. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-18  gfs2: Add gfs2_max_stuffed_size  (Andreas Gruenbacher)
Add a small inline function for computing the maximum size of a stuffed inode instead of open coding that in several places throughout the code. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
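The helper is essentially one line; this sketch reflects the definition described above (treat the details as an approximation):

static inline unsigned int gfs2_max_stuffed_size(const struct gfs2_inode *ip)
{
        /* Everything in the inode block after the on-disk dinode header can
         * hold stuffed (inline) file data. */
        return GFS2_SB(&ip->i_inode)->sd_sb.sb_bsize - sizeof(struct gfs2_dinode);
}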
2018-01-18  gfs2: Implement fallocate(FALLOC_FL_PUNCH_HOLE)  (Andreas Gruenbacher)
Implement the top-level bits of punching a hole into a file. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-18  gfs2: Turn trunc_dealloc into punch_hole  (Andreas Gruenbacher)
Add an upper bound to the range of blocks to deallocate to function trunc_dealloc so that it can be used for truncating a file as well as for punching a hole into a file. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-18  gfs2: Generalize truncate code  (Andreas Gruenbacher)
Pull the code for computing the range of metapointers to iterate out of gfs2_metapath_ra (for readahead), sweep_bh_for_rgrps (for deallocating metapointers within a block), and trunc_dealloc (for walking the metadata tree). In sweep_bh_for_rgrps, move the code for looking up the resource group descriptor of the current resource group out of the inner loop. The metatype check moves to trunc_dealloc. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17  Turn gfs2_block_truncate_page into gfs2_block_zero_range  (Andreas Gruenbacher)
Turn gfs2_block_truncate_page into a function that zeroes a range within a block rather than only the end of a block. This will be used for cleaning the end of the first partial block and the start of the last partial block when punching a hole in a file. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
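A sketch of the new call shape for the plain truncate case (the exact signature is an assumption):

        unsigned int blocksize = i_blocksize(inode);
        unsigned int offs = newsize & (blocksize - 1);

        /* Zero from the truncate point to the end of that block; the same
         * helper can now also zero both edges of a punched hole. (Sketch.) */
        if (offs)
                error = gfs2_block_zero_range(inode, newsize, blocksize - offs);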
2018-01-17  gfs2: Improve non-recursive delete algorithm  (Andreas Gruenbacher)
In rare cases, the current non-recursive delete algorithm doesn't deallocate empty intermediary indirect blocks. This should have very little practical effect, but deallocating all blocks correctly should still be preferable as it is cleaner and easier to validate. The fix consists of using the first block to deallocate to compute the start marker of the truncate point instead of the last block that needs to be kept. With that change, computing which indirect blocks are still needed becomes relatively easy. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17  gfs2: Fix metadata read-ahead during truncate  (Andreas Gruenbacher)
The metadata read-ahead algorithm broke when switching from recursive to non-recursive delete: the current algorithm reads ahead blocks at height N - 1 while deallocating the blocks at height N. However, deallocating the blocks at height N requires a complete walk of the metadata tree, not only down to height N - 1. Consequently, all blocks below height N - 1 will be accessed without read-ahead. Fix this by issuing read-aheads as early as possible, after each metapath lookup. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17  gfs2: Clean up {lookup,fillup}_metapath  (Andreas Gruenbacher)
Split out the entire lookup loop from lookup_metapath and fillup_metapath. Make both functions return the actual height in mp->mp_aheight, and return 0 on success. Handle lookup errors properly in trunc_dealloc. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17  gfs2: Remove minor gfs2_journaled_truncate inefficiencies  (Andreas Gruenbacher)
First, this function truncates the file in chunks. When the original file size isn't block aligned, each chunk that is truncated will remain misaligned. This is inefficient. Second, this function doesn't recognize where holes are, so it loops through them. For each chunk of a hole, it creates a new transaction. At least avoid creating another transaction when the current one is still empty. (A better fix would be to skip large holes, of course.) Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17  gfs2: truncate: Remove unnecessary oldsize parameters  (Andreas Gruenbacher)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17  gfs2: Clean up trunc_start error path  (Andreas Gruenbacher)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17  gfs2: Add gfs2_blk2rgrpd comment and fix incorrect use  (Steven Whitehouse)
Document when to use gfs2_blk2rgrpd for "inexact" resource group matching. Based on that, fix an incorrect use of gfs2_blk2rgrpd in sweep_bh_for_rgrps. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-10-31  GFS2: Implement iomap for block_map  (Bob Peterson)
This patch implements iomap for block mapping, and switches the block_map function to use it under the covers. The additional IOMAP_F_BOUNDARY iomap flag indicates when iomap has reached a "metadata boundary" and fetching the next mapping is likely to incur an additional I/O. This flag is used for setting the bh buffer boundary flag. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-10-31  GFS2: Make height info part of metapath  (Bob Peterson)
This patch eliminates height parameters from function gfs2_bmap_alloc. Function find_metapath determines the metapath's "find height", also known as the desired height. Function lookup_metapath determines the metapath's "actual height", previously known as starting height or sheight. Function gfs2_bmap_alloc now gets both height values from the metapath. This simplification was done as a step toward switching the block_map functions to using iomap. The bh_map responsibilities are also removed from function gfs2_bmap_alloc for the same reason. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-09-25  gfs2: Clarify gfs2_block_map  (Andreas Gruenbacher)
Add a comment about the logical block size for directories. Rename "bsize" in gfs2_block_map to "factor". Fix a typo in the description of metaptr1. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-30  GFS2: Fix non-recursive truncate bug  (Bob Peterson)
Before this patch, if you truncated a file to a smaller size, it wasn't freeing all the blocks properly. There are two reasons. First, the metapath comparison was not comparing previous heights. I added a function, mp_eq_to_hgt, which checks the metapath at all heights prior to the target height. Second, in function find_nonnull_ptr, it needed to zero out all pointers for heights following the target height. Translated into decimal integer terms, this is how a number like 299, when incremented, becomes 300, not 399: the 2 gets incremented to 3, and the following digits need to be reset. These two things allow the truncate state machine to properly find the blocks it needs to delete. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
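A sketch of the comparison helper described above (metapath field names follow gfs2; treat the exact signature as an assumption):

static bool mp_eq_to_hgt(struct metapath *mp, __u16 *list, unsigned int h)
{
        unsigned int hgt;

        /* The paths only match if every component before the target height
         * matches, not just the component at the target height. */
        for (hgt = 0; hgt < h; hgt++)
                if (mp->mp_list[hgt] != list[hgt])
                        return false;
        return true;
}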
2017-07-21  gfs2: add flag REQ_PRIO for metadata I/O  (Coly Li)
When gfs2 does metadata I/O, only REQ_META is used as a metadata hint of the bio. But flag REQ_META is just a hint for block trace, not for block layer code to handle a bio as a metadata request. For some of gfs2's metadata I/Os, a REQ_PRIO flag on the metadata bio would be very informative to block layer code. For example, if bcache is used as an I/O cache for gfs2, it will be possible for bcache code to get the hint and cache the pre-fetched metadata blocks on the cache device. This behavior may be helpful to improve metadata I/O performance if the following requests hit the cache.

Here are the locations in gfs2 code where a REQ_PRIO flag should be added:
 - All places where REQ_READAHEAD is used, gfs2 code uses this flag for metadata read ahead.
 - In gfs2_meta_rq() where the first metadata block is read in.
 - In gfs2_write_buf_to_page(), read in quota metadata blocks to have them up to date. These metadata blocks are probably to be accessed again in the future; adding a REQ_PRIO flag may help bcache keep such metadata in the fast cache device.

For systems without a cache layer, REQ_PRIO can still provide a hint to the block layer to handle metadata requests more properly.

Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
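As an illustration of the kind of change involved (the submit_bh() op/op_flags split matches kernels of that era, but treat this as a sketch rather than the exact hunk):

        /* Tag the metadata read so a caching layer such as bcache can
         * prioritize it; REQ_META alone is mostly a tracing hint. */
        bh->b_end_io = end_buffer_read_sync;
        submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);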
2017-07-05  Merge tag 'gfs2-4.13.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2  (Linus Torvalds)
Pull GFS2 updates from Bob Peterson: "We've got eight GFS2 patches for this merge window:

 - Andreas Gruenbacher has four patches related to cleaning up the GFS2 inode evict process. This is about half of his patches designed to fix a long-standing GFS2 hang related to the inode shrinker: Shrinker calls gfs2 evict, evict calls DLM, DLM requires memory and blocks on the shrinker. These four patches have been well tested. His second set of patches are still being tested, so I plan to hold them until the next merge window, after we have more weeks of testing. The first patch eliminates the flush_delayed_work, which can block.
 - Andreas's second patch protects setting of gl_object for rgrps with a spin_lock to prevent proven races.
 - His third patch introduces a centralized mechanism for queueing glock work with better reference counting, to prevent more races.
 - His fourth patch retains a reference to inode glocks when an error occurs while creating an inode. This keeps the subsequent evict from needing to reacquire the glock, which might call into DLM and block in low memory conditions.
 - Arvind Yadav has a patch to add const to attribute_group structures.
 - I have a patch to detect directory entry inconsistencies and withdraw the file system if any are found. Better that than silent corruption.
 - I have a patch to remove a vestigial variable from glock structures, saving some slab space.
 - I have another patch to remove a vestigial variable from the GFS2 in-core superblock structure"

* tag 'gfs2-4.13.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: constify attribute_group structures.
  gfs2: gfs2_create_inode: Keep glock across iput
  gfs2: Clean up glock work enqueuing
  gfs2: Protect gl->gl_object by spin lock
  gfs2: Get rid of flush_delayed_work in gfs2_evict_inode
  GFS2: Eliminate vestigial sd_log_flush_wrapped
  GFS2: Remove gl_list from glock structure
  GFS2: Withdraw when directory entry inconsistencies are detected
2017-07-05  gfs2: Protect gl->gl_object by spin lock  (Andreas Gruenbacher)
Put all remaining accesses to gl->gl_object under the gl->gl_lockref.lock spinlock to prevent races. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-05-08  gfs2: replace CURRENT_TIME with current_time  (Stephen Rothwell)
Link: http://lkml.kernel.org/r/20170420161852.0492bc3f@canb.auug.org.au Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-19  GFS2: Non-recursive delete  (Bob Peterson)
Implement truncate/delete as a non-recursive algorithm. The older algorithm was implemented with recursion to strip off each layer at a time (going by height, starting with the maximum height). This version tries to do the same thing but without recursion, and without needing to allocate new structures or lists in memory.

For example, say you want to truncate a very large file to 1 byte, and its end-of-file metapath is: 0.505.463.428. The starting metapath would be 0.0.0.0. Since it's a truncate to non-zero, it needs to preserve that byte, and all metadata pointing to it. So it would start at 0.0.0.0, look up all its metadata buffers, then free all data blocks pointed to at the highest level. After that buffer is "swept", it moves on to 0.0.0.1, then 0.0.0.2, etc., reading in buffers and sweeping them clean. When it gets to the end of the 0.0.0 metadata buffer (for 4K blocks the last valid one is 0.0.0.508), it backs up to the previous height and starts working on 0.0.1.0, then 0.0.1.1, and so forth. After it reaches the end and sweeps 0.0.1.508, it continues with 0.0.2.0, and so on. When that height is exhausted, and it reaches 0.0.508.508 it backs up another level, to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep marching backwards and forwards through the metadata until it's all swept clean.

Once it has all the data blocks freed, it lowers the strip height, and begins the process all over again, but with one less height. This time it sweeps 0.0.0 through 0.505.463. When that's clean, it lowers the strip height again and works to free 0.505. Eventually it strips the lowest height, 0. For a delete or truncate to 0, all metadata for all heights of 0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would be preserved.

This isn't much different from normal integer incrementing, where an integer gets incremented from 0000 (0.0.0.0) to 3021 (3.0.2.1). So 0000 gets incremented to 0001, 0002, up to 0009, then on to 0010, 0011 up to 0099, then 0100 and so forth. It's just that each "digit" goes from 0 to 508 (for a total of 509 pointers) rather than from 0 to 9. Note that the dinode will only have 483 pointers due to the dinode structure itself. Also note: this is just an example. These numbers (509 and 483) are based on a standard 4K block size. Smaller block sizes will yield smaller numbers of indirect pointers accordingly.

The truncation process is accomplished with the help of two major functions and a few helper functions. Functions do_strip and recursive_scan are obsolete, so removed.

New function sweep_bh_for_rgrps cleans a buffer_head pointed to by the given metapath and height. By cleaning, I mean it frees all blocks starting at the offset passed in metapath. It starts at the first block in the buffer pointed to by the metapath and identifies its resource group (rgrp). From there it frees all subsequent block pointers that lie within that rgrp. If it's already inside a transaction, it stays within it as long as it can. In other words, it doesn't close a transaction until it knows it's freed what it can from the resource group. In this way, multiple buffers may be cleaned in a single transaction, as long as those blocks in the buffer all lie within the same rgrp. If it's not in a transaction, it starts one. If the buffer_head has references to blocks within multiple rgrps, it frees all the blocks inside the first rgrp it finds, then closes the transaction. Then it repeats the cycle: identifies the next unfreed block, uses it to find its rgrp, then starts a new transaction for that set. It repeats this process until the buffer_head contains no more references to any blocks past the given metapath.

Function trunc_dealloc has been reworked into a finite state automaton. It has basically 3 active states: DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP.

The DEALLOC_MP_FULL state implies the metapath has a full set of buffers out to the "shrink height", and therefore, it can call function sweep_bh_for_rgrps to free the blocks within the highest height of the metapath. If it's just swept the lowest level (or an error has occurred) the state machine is ended. Otherwise it proceeds to the DEALLOC_MP_LOWER state.

The DEALLOC_MP_LOWER state implies we are finished with a given buffer_head, which may now be released, and therefore we are then missing some buffer information from the metapath. So we need to find more buffers to read in. In most cases, this is just a matter of releasing the buffer_head and moving to the next pointer from the previous height, so it may be read in and swept as well. If it can't find another non-null pointer to process, it checks whether it's reached the end of a height and needs to lower the strip height, or whether it still needs to move forward through the previous height's metadata. In this state, all zero-pointers are skipped. From this state, it can only loop around (once more backing up another height) or, once a valid metapath is found (one that has non-zero pointers), proceed to state DEALLOC_FILL_MP.

The DEALLOC_FILL_MP state implies that we have a metapath but not all its buffers are read in. So we must proceed to read in buffer_heads until the metapath has a valid buffer for every height. If the previous state backed us up 3 heights, we may need to read in a buffer, increment the height, then repeat the process until buffers have been read in for all required heights. If it's successful reading a buffer, and it's at the highest height we need, it proceeds back to the DEALLOC_MP_FULL state. If it's unable to fill in a buffer (encounters a hole, etc.), it tries to find another non-zero block pointer. If they're all zero, it lowers the height and returns to the DEALLOC_MP_LOWER state. If it finds a good non-null pointer, it loops around and reads it in, while keeping the metapath in lock-step with the pointers it examines.

The state machine runs until the truncation request is satisfied. Then any transactions are ended, the quota and statfs data are updated, and the function is complete.

Helper function metaptr1 was introduced to be an easy way to determine the start of a buffer_head's indirect pointers.

Helper function lookup_mp_height was introduced to find a metapath index and read in the buffer that corresponds to it. In this way, function lookup_metapath becomes a simple loop to call it for every height.

Helper function fillup_metapath is similar to lookup_metapath except it can do partial lookups. If the state machine backed up multiple levels (like 2999 wrapping to 3000) it needs to find out the next starting point and start issuing metadata reads at that point.

Helper function hptrs is a shortcut to determine how many pointers should be expected in a buffer. Height 0 is the dinode which has fewer pointers than the others.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
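The three active states named above, as a sketch of the state enum (the comments paraphrase the commit message; the real trunc_dealloc drives a loop that switches on this state until the work is done):

        enum dealloc_states {
                DEALLOC_MP_FULL,        /* metapath has buffers out to the strip
                                           height: sweep_bh_for_rgrps() can free */
                DEALLOC_MP_LOWER,       /* finished a buffer: move to the next
                                           pointer or lower the strip height */
                DEALLOC_FILL_MP,        /* read buffers until the metapath is
                                           complete again */
                DEALLOC_DONE,
        };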
2017-02-21  Merge tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2  (Linus Torvalds)
Pull GFS2 updates from Robert Peterson: "We've got eight GFS2 patches for this merge window:

 - Andy Price submitted a patch to make gfs2_write_full_page a static function.
 - Dan Carpenter submitted a patch to fix an ERR_PTR thinko.

Three patches fix bugs related to deleting very large files, which cause GFS2 to run out of journal space:

 - The first one prevents GFS2 delete operation from requesting too much journal space.
 - The second one fixes a problem whereby GFS2 can hang because it wasn't taking journal space demand into its calculations.
 - The third one wakes up IO waiters when a flush is done to restart processes stuck waiting for journal space to become available.

The final three patches are a performance improvement related to spin_lock contention between multiple writers:

 - The "tr_touched" variable was switched to a flag to be more atomic and eliminate the possibility of some races.
 - Function meta_lo_add was moved inline with its only caller to make the code more readable and efficient.
 - Contention on the gfs2_log_lock spinlock was greatly reduced by avoiding the lock altogether in cases where we don't really need it: buffers that already appear in the appropriate metadata list for the journal.

Many thanks to Steve Whitehouse for the ideas and principles behind these patches"

* tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Make gfs2_write_full_page static
  GFS2: Reduce contention on gfs2_log_lock
  GFS2: Inline function meta_lo_add
  GFS2: Switch tr_touched to flag in transaction
  GFS2: Wake up io waiters whenever a flush is done
  GFS2: Made logd daemon take into account log demand
  GFS2: Limit number of transaction blocks requested for truncates
  GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next
2017-01-05  GFS2: Limit number of transaction blocks requested for truncates  (Bob Peterson)
This patch limits the number of transaction blocks requested during file truncates. If we have very large multi-terabyte files, and want to delete or truncate them, they might span so many resource groups that we overflow the journal blocks, and cause an assert failure. By limiting the number of blocks in the transaction, we prevent this overflow and give other running processes time to do transactions. The limiting factor I chose is sd_log_thresh2 which is currently set to 4/5ths of the journal. This same ratio is used in function gfs2_ail_flush_reqd to determine when a log flush is required. If we make the maximum value less than this, we can get into an infinite hang whereby the log stops moving because the number of used blocks is less than the threshold and the iterative loop needs more, but since we're under the threshold, the log daemon never starts any IO on the log. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
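A hedged sketch of the idea (variable names assumed): cap each truncate transaction at the same threshold logd uses to decide when a flush is required:

        /* Never ask for more journal blocks in one truncate transaction than
         * the flush threshold, so logd can reclaim space between chunks. */
        unsigned int max_jblocks = atomic_read(&sdp->sd_log_thresh2);

        if (jblocks_rqsted > max_jblocks)
                jblocks_rqsted = max_jblocks;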
2016-10-10  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)
Pull more vfs updates from Al Viro: ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-09-27  fs: Replace CURRENT_TIME with current_time() for inode timestamps  (Deepa Dinamani)
CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_time() instead. CURRENT_TIME is also not y2038 safe. This is also in preparation for the patch that transitions vfs timestamps to use 64 bit time and hence make them y2038 safe. As part of the effort current_time() will be extended to do range checks. Hence, it is necessary for all file system timestamps to use current_time(). Also, current_time() will be transitioned along with vfs to be y2038 safe. Note that whenever a single call to current_time() is used to change timestamps in different inodes, it is because they share the same time granularity. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Felipe Balbi <balbi@kernel.org> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
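The typical conversion is mechanical; for example (sketch):

        /* Before: inode->i_mtime = inode->i_ctime = CURRENT_TIME; */
        inode->i_mtime = inode->i_ctime = current_time(inode);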
2016-08-02  GFS2: use BIT() macro  (Fabian Frederick)
Replace 1 << value shifts with the more explicit BIT() macro.

Also fixes two bare unsigned definitions:

  WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
  + unsigned hsize = BIT(ip->i_depth);

Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-07-20  block: get rid of bio_rw and READA  (Christoph Hellwig)
These two are confusing leftovers of the old world order, combining values of the REQ_OP_ and REQ_ namespaces. For callers that don't special-case, we mostly just replace bi_rw with bio_data_dir or op_is_write, except for the few cases where a switch over the REQ_OP_ values makes more sense. Any check for READA is replaced with an explicit check for REQ_RAHEAD. Also remove the READA alias for REQ_RAHEAD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07  fs: have ll_rw_block users pass in op and flags separately  (Mike Christie)
This has ll_rw_block users pass in the operation and flags separately, so ll_rw_block can setup the bio op and bi_rw flags on the bio that is submitted. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
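The resulting call shape, as a sketch (the flags shown are illustrative):

        /* Before, op and flags were mixed into one "rw" argument:
         *      ll_rw_block(READ | REQ_META, 1, &bh);
         * After, the request op and the op flags are passed separately: */
        ll_rw_block(REQ_OP_READ, REQ_META, 1, &bh);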