summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_trace.h
AgeCommit message (Collapse)Author
2021-04-07xfs: move the di_size field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the on-disk size field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-21Merge tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Darrick Wong: "There's a lot going on this time, which seems about right for this drama-filled year. Community developers added some code to speed up freezing when read-only workloads are still running, refactored the logging code, added checks to prevent file extent counter overflow, reduced iolock cycling to speed up fsync and gc scans, and started the slow march towards supporting filesystem shrinking. There's a huge refactoring of the internal speculative preallocation garbage collection code which fixes a bunch of bugs, makes the gc scheduling per-AG and hence multithreaded, and standardizes the retry logic when we try to reserve space or quota, can't, and want to trigger a gc scan. We also enable multithreaded quotacheck to reduce mount times further. This is also preparation for background file gc, which may or may not land for 5.13. We also fixed some deadlocks in the rename code, fixed a quota accounting leak when FSSETXATTR fails, restored the behavior that write faults to an mmap'd region actually cause a SIGBUS, fixed a bug where sgid directory inheritance wasn't quite working properly, and fixed a bug where symlinks weren't working properly in ecryptfs. We also now advertise the inode btree counters feature that was introduced two cycles ago. Summary: - Fix an ABBA deadlock when renaming files on overlayfs. - Make sure that we can't overflow the inode extent counters when adding to or removing extents from a file. - Make directory sgid inheritance work the same way as all the other filesystems. - Don't drain the buffer cache on freeze and ro remount, which should reduce the amount of time if read-only workloads are continuing during the freeze. - Fix a bug where symlink size isn't reported to the vfs in ecryptfs. - Disentangle log cleaning from log covering. This refactoring sets us up for future changes to the log, though for now it simply means that we can use covering for freezes, and cleaning becomes something we only do at unmount. - Speed up file fsyncs by reducing iolock cycling. - Fix delalloc blocks leaking when changing the project id fails because of input validation errors in FSSETXATTR. - Fix oversized quota reservation when converting unwritten extents during a DAX write. - Create a transaction allocation helper function to standardize the idiom of allocating a transaction, reserving blocks, locking inodes, and reserving quota. Replace all the open-coded logic for file creation, file ownership changes, and file modifications to use them. - Actually shut down the fs if the incore quota reservations get corrupted. - Fix background block garbage collection scans to not block and to actually clean out CoW staging extents properly. - Run block gc scans when we run low on project quota. - Use the standardized transaction allocation helpers to make it so that ENOSPC and EDQUOT errors during reservation will back out, invoke the block gc scanner, and try again. This is preparation for introducing background inode garbage collection in the next cycle. - Combine speculative post-EOF block garbage collection with speculative copy on write block garbage collection. - Enable multithreaded quotacheck. - Allow sysadmins to tweak the CPU affinities and maximum concurrency levels of quotacheck and background blockgc worker pools. - Expose the inode btree counter feature in the fs geometry ioctl. - Cleanups of the growfs code in preparation for starting work on filesystem shrinking. - Fix all the bloody gcc warnings that the maintainer knows about. :P - Fix a RST syntax error. - Don't trigger bmbt corruption assertions after the fs shuts down. - Restore behavior of forcing SIGBUS on a shut down filesystem when someone triggers a mmap write fault (or really, any buffered write)" * tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (85 commits) xfs: consider shutdown in bmapbt cursor delete assert xfs: fix boolreturn.cocci warnings xfs: restore shutdown check in mapped write fault path xfs: fix rst syntax error in admin guide xfs: fix incorrect root dquot corruption error when switching group/project quota types xfs: get rid of xfs_growfs_{data,log}_t xfs: rename `new' to `delta' in xfs_growfs_data_private() libxfs: expose inobtcount in xfs geometry xfs: don't bounce the iolock between free_{eof,cow}blocks xfs: expose the blockgc workqueue knobs publicly xfs: parallelize block preallocation garbage collection xfs: rename block gc start and stop functions xfs: only walk the incore inode tree once per blockgc scan xfs: consolidate the eofblocks and cowblocks workers xfs: consolidate incore inode radix tree posteof/cowblocks tags xfs: remove trivial eof/cowblocks functions xfs: hide xfs_icache_free_cowblocks xfs: hide xfs_icache_free_eofblocks xfs: relocate the eofb/cowb workqueue functions xfs: set WQ_SYSFS on all workqueues in debug mode ...
2021-02-03xfs: consolidate incore inode radix tree posteof/cowblocks tagsDarrick J. Wong
The clearing of posteof blocks and cowblocks serve the same purpose: removing speculative block preallocations from inactive files. We don't need to burn two radix tree tags on this, so combine them into one. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: refactor xfs_icache_free_{eof,cow}blocks call sitesDarrick J. Wong
In anticipation of more restructuring of the eof/cowblocks gc code, refactor calling of those two functions into a single internal helper function, then present a new standard interface to purge speculative block preallocations and start shifting higher level code to use that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: add a tracepoint for blockgc scansDarrick J. Wong
Add some tracepoints so that we can observe when the speculative preallocation garbage collector runs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-01xfs: improve the reflink_bounce_dio_write tracepointChristoph Hellwig
Use a more suitable event class. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01xfs: simplify the read/write tracepointsChristoph Hellwig
Pass the iocb and iov_iter to the tracepoints and leave decoding of actual arguments to the code only run when tracing is enabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-22xfs: rename xfs_wait_buftarg() to xfs_buftarg_drain()Brian Foster
xfs_wait_buftarg() is vaguely named and somewhat overloaded. Its primary purpose is to reclaim all buffers from the provided buffer target LRU. In preparation to refactor xfs_wait_buftarg() into serialization and LRU draining components, rename the function and associated helpers to something more descriptive. This patch has no functional changes with the minor exception of renaming a tracepoint. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2020-12-09xfs: trace log intent item recovery failuresDarrick J. Wong
Add a trace point so that we can capture when a recovered log intent item fails to recover. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07xfs: periodically relog deferred intent itemsDarrick J. Wong
There's a subtle design flaw in the deferred log item code that can lead to pinning the log tail. Taking up the defer ops chain examples from the previous commit, we can get trapped in sequences like this: Caller hands us a transaction t0 with D0-D3 attached. The defer ops chain will look like the following if the transaction rolls succeed: t1: D0(t0), D1(t0), D2(t0), D3(t0) t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) In transaction 9, we finish d9 and try to roll to t10 while holding onto an intent item for D3 that we logged in t0. The previous commit changed the order in which we place new defer ops in the defer ops processing chain to reduce the maximum chain length. Now make xfs_defer_finish_noroll capable of relogging the entire chain periodically so that we can always move the log tail forward. Most chains will never get relogged, except for operations that generate very long chains (large extents containing many blocks with different sharing levels) or are on filesystems with small logs and a lot of ongoing metadata updates. Callers are now required to ensure that the transaction reservation is large enough to handle logging done items and new intent items for the maximum possible chain length. Most callers are careful to keep the chain lengths low, so the overhead should be minimal. The decision to relog an intent item is made based on whether the intent was logged in a previous checkpoint, since there's no point in relogging an intent into the same checkpoint. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-09-15xfs: trace timestamp limitsDarrick J. Wong
Add a couple of tracepoints so that we can check the timestamp limits being set on inodes and quotas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-15xfs: move the buffer retry logic to xfs_buf.cChristoph Hellwig
Move the buffer retry state machine logic to xfs_buf.c and call it once from xfs_ioend instead of duplicating it three times for the three kinds of buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-06xfs: remove kmem_realloc()Carlos Maiolino
Remove kmem_realloc() function and convert its users to use MM API directly (krealloc()) Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-28xfs: remove xfs_zone_{alloc,zalloc} helpersCarlos Maiolino
All their users have been converted to use MM API directly, no need to keep them around anymore. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-07-28xfs: create xfs_dqtype_t to represent quota typesDarrick J. Wong
Create a new type (xfs_dqtype_t) to represent the type of an incore dquot (user, group, project, or none). Rename the incore dquot's dq_flags field to q_type. This allows us to replace all the "uint type" arguments to the quota functions with "xfs_dqtype_t type", to make it obvious when we're passing a quota type argument into a function. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-07-28xfs: add more dquot tracepointsDarrick J. Wong
Add all the xfs_dquot fields to the tracepoint for that type; add a new tracepoint type for the qtrx structure (dquot transaction deltas); and use our new tracepoints. This makes it easier for the author to trace changes to dquot counters for debugging. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-07-28xfs: stop using q_core counters in the quota codeDarrick J. Wong
Add counter fields to the incore dquot, and use that instead of the ones in qcore. This eliminates a bunch of endian conversions and will eventually allow us to remove qcore entirely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com>
2020-07-28xfs: stop using q_core limits in the quota codeDarrick J. Wong
Add limits fields in the incore dquot, and use that instead of the ones in qcore. This eliminates a bunch of endian conversions and will eventually allow us to remove qcore entirely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com>
2020-07-28xfs: use a per-resource struct for incore dquot dataDarrick J. Wong
Introduce a new struct xfs_dquot_res that we'll use to track all the incore data for a particular resource type (block, inode, rt block). This will help us (once we've eliminated q_core) to declutter quota functions that currently open-code field access or pass around fields around explicitly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com>
2020-07-28xfs: stop using q_core.d_id in the quota codeDarrick J. Wong
Add a dquot id field to the incore dquot, and use that instead of the one in qcore. This eliminates a bunch of endian conversions and will eventually allow us to remove qcore entirely. We also rearrange the start of xfs_dquot to remove padding holes, saving 8 bytes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com>
2020-07-28xfs: rename dquot incore state flagsDarrick J. Wong
Rename the existing incore dquot "dq_flags" field to "q_flags" to match everything else in the structure, then move the two actual dquot state flags to the XFS_DQFLAG_ namespace from XFS_DQ_. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-07-06xfs: redesign the reflink remap loop to fix blkres depletion crashDarrick J. Wong
The existing reflink remapping loop has some structural problems that need addressing: The biggest problem is that we create one transaction for each extent in the source file without accounting for the number of mappings there are for the same range in the destination file. In other words, we don't know the number of remap operations that will be necessary and we therefore cannot guess the block reservation required. On highly fragmented filesystems (e.g. ones with active dedupe) we guess wrong, run out of block reservation, and fail. The second problem is that we don't actually use the bmap intents to their full potential -- instead of calling bunmapi directly and having to deal with its backwards operation, we could call the deferred ops xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This makes the frontend loop much simpler. Solve all of these problems by refactoring the remapping loops so that we only perform one remapping operation per transaction, and each operation only tries to remap a single extent from source to dest. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reported-by: Edwin Török <edwin@etorok.net> Tested-by: Edwin Török <edwin@etorok.net>
2020-05-19xfs: move the fork format fields into struct xfs_iforkChristoph Hellwig
Both the data and attr fork have a format that is stored in the legacy idinode. Move it into the xfs_ifork structure instead, where it uses up padding. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-19xfs: move the per-fork nextents fields into struct xfs_iforkChristoph Hellwig
There are there are three extents counters per inode, one for each of the forks. Two are in the legacy icdinode and one is directly in struct xfs_inode. Switch to a single counter in the xfs_ifork structure where it uses up padding at the end of the structure. This simplifies various bits of code that just wants the number of extents counter and can now directly dereference it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-27xfs: Throttle commits on delayed background CIL pushDave Chinner
In certain situations the background CIL push can be indefinitely delayed. While we have workarounds from the obvious cases now, it doesn't solve the underlying issue. This issue is that there is no upper limit on the CIL where we will either force or wait for a background push to start, hence allowing the CIL to grow without bound until it consumes all log space. To fix this, add a new wait queue to the CIL which allows background pushes to wait for the CIL context to be switched out. This happens when the push starts, so it will allow us to block incoming transaction commit completion until the push has started. This will only affect processes that are running modifications, and only when the CIL threshold has been significantly overrun. This has no apparent impact on performance, and doesn't even trigger until over 45 million inodes had been created in a 16-way fsmark test on a 2GB log. That was limiting at 64MB of log space used, so the active CIL size is only about 3% of the total log in that case. The concurrent removal of those files did not trigger the background sleep at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-27xfs: split xlog_ticket_doneChristoph Hellwig
Remove xlog_ticket_done and just call the renamed low-level helpers for ungranting or regranting log space directly. To make that a little the reference put on the ticket and all tracing is moved into the actual helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-18xfs: support bulk loading of staged btreesDarrick J. Wong
Add a new btree function that enables us to bulk load a btree cursor. This will be used by the upcoming online repair patches to generate new btrees. This avoids the programmatic inefficiency of calling xfs_btree_insert in a loop (which generates a lot of log traffic) in favor of stamping out new btree blocks with ordered buffers, and then committing both the new root and scheduling the removal of the old btree blocks in a single transaction commit. The design of this new generic code is based off the btree rebuilding code in xfs_repair's phase 5 code, with the explicit goal of enabling us to share that code between scrub and repair. It has the additional feature of being able to control btree block loading factors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18xfs: introduce fake roots for inode-rooted btreesDarrick J. Wong
Create an in-core fake root for inode-rooted btree types so that callers can generate a whole new btree using the upcoming btree bulk load function without making the new tree accessible from the rest of the filesystem. It is up to the individual btree type to provide a function to create a staged cursor (presumably with the appropriate callouts to update the fakeroot) and then commit the staged root back into the filesystem. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18xfs: introduce fake roots for ag-rooted btreesDarrick J. Wong
Create an in-core fake root for AG-rooted btree types so that callers can generate a whole new btree using the upcoming btree bulk load function without making the new tree accessible from the rest of the filesystem. It is up to the individual btree type to provide a function to create a staged cursor (presumably with the appropriate callouts to update the fakeroot) and then commit the staged root back into the filesystem. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-02xfs: embedded the attrlist cursor into struct xfs_attr_list_contextChristoph Hellwig
The attrlist cursor only exists as part of an attr list context, so embedd the structure instead of pointing to it. Also give it a proper xfs_ prefix and remove the obsolete typedef. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-02xfs: remove XFS_DA_OP_INCOMPLETEChristoph Hellwig
Now that we use the on-disk flags field also for the interface to the lower level attr routines we can use the XFS_ATTR_INCOMPLETE definition from the on-disk format directly instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-02xfs: clean up the attr flag confusionChristoph Hellwig
The ATTR_* flags have a long IRIX history, where they a userspace interface, the on-disk format and an internal interface. We've split out the on-disk interface to the XFS_ATTR_* values, but despite (or because?) of that the flag have still been a mess. Switch the internal interface to pass the on-disk XFS_ATTR_* flags for the namespace and the Linux XATTR_* flags for the actual flags instead. The ATTR_* values that are actually used are move to xfs_fs.h with a new XFS_IOC_* prefix to not conflict with the userspace version that has the same name and must have the same value. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-02xfs: cleanup struct xfs_attr_list_contextChristoph Hellwig
Replace the alist char pointer with a void buffer given that different callers use it in different ways. Use the chance to remove the typedef and reduce the indentation of the struct definition so that it doesn't overflow 80 char lines all over. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-25Merge branch 'core/kprobes' into perf/core, to pick up a completed branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-19xfs: don't commit sunit/swidth updates to disk if that would cause repair ↵Darrick J. Wong
failures Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit and swidth values could cause xfs_repair to fail loudly. The problem here is that repair calculates the where mkfs should have allocated the root inode, based on the superblock geometry. The allocation decisions depend on sunit, which means that we really can't go updating sunit if it would lead to a subsequent repair failure on an otherwise correct filesystem. Port from xfs_repair some code that computes the location of the root inode and teach mount to skip the ondisk update if it would cause problems for repair. Along the way we'll update the documentation, provide a function for computing the minimum AGFL size instead of open-coding it, and cut down some indenting in the mount code. Note that we allow the mount to proceed (and new allocations will reflect this new geometry) because we've never screened this kind of thing before. We'll have to wait for a new future incompat feature to enforce correct behavior, alas. Note that the geometry reporting always uses the superblock values, not the incore ones, so that is what xfs_info and xfs_growfs will report. [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d Reported-by: Alex Lyakas <alex@zadara.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-10Merge tag 'v5.5-rc1' into core/kprobes, to resolve conflictsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-27ftrace: Rework event_create_dir()Peter Zijlstra
Rework event_create_dir() to use an array of static data instead of function pointers where possible. The problem is that it would call the function pointer on module load before parse_args(), possibly even before jump_labels were initialized. Luckily the generated functions don't use jump_labels but it still seems fragile. It also gets in the way of changing when we make the module map executable. The generated function are basically calling trace_define_field() with a bunch of static arguments. So instead of a function, capture these arguments in a static array, avoiding the function call. Now there are a number of cases where the fields are dynamic (syscall arguments, kprobes and uprobes), in which case a static array does not work, for these we preserve the function call. Luckily all these cases are not related to modules and so we can retain the function call for them. Also fix up all broken tracepoint definitions that now generate a compile error. Tested-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29xfs: rename the m_writeio_* fields in struct xfs_mountChristoph Hellwig
Use the allocsize name to match the mount option and usage instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21xfs: optimize near mode bnobt scans with concurrent cntbt lookupsBrian Foster
The near mode fallback algorithm consists of a left/right scan of the bnobt. This algorithm has very poor breakdown characteristics under worst case free space fragmentation conditions. If a suitable extent is far enough from the locality hint, each allocation may scan most or all of the bnobt before it completes. This causes pathological behavior and extremely high allocation latencies. While locality is important to near mode allocations, it is not so important as to incur pathological allocation latency to provide the asolute best available locality for every allocation. If the allocation is large enough or far enough away, there is a point of diminishing returns. As such, we can bound the overall operation by including an iterative cntbt lookup in the broader search. The cntbt lookup is optimized to immediately find the extent with best locality for the given size on each iteration. Since the cntbt is indexed by extent size, the lookup repeats with a variably aggressive increasing search key size until it runs off the edge of the tree. This approach provides a natural balance between the two algorithms for various situations. For example, the bnobt scan is able to satisfy smaller allocations such as for inode chunks or btree blocks more quickly where the cntbt search may have to search through a large set of extent sizes when the search key starts off small relative to the largest extent in the tree. On the other hand, the cntbt search more deterministically covers the set of suitable extents for larger data extent allocation requests that the bnobt scan may have to search the entire tree to locate. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21xfs: factor out tree fixup logic into helperBrian Foster
Lift the btree fixup path into a helper function. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21xfs: reuse best extent tracking logic for bnobt scanBrian Foster
The near mode bnobt scan searches left and right in the bnobt looking for the closest free extent to the allocation hint that satisfies minlen. Once such an extent is found, the left/right search terminates, we search one more time in the opposite direction and finish the allocation with the best overall extent. The left/right and find best searches are currently controlled via a combination of cursor state and local variables. Clean up this code and prepare for further improvements to the near mode fallback algorithm by reusing the allocation cursor best extent tracking mechanism. Update the tracking logic to deactivate bnobt cursors when out of allocation range and replace open-coded extent checks to calls to the common helper. In doing so, rename some misnamed local variables in the top-level near mode allocation function. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21xfs: refactor cntbt lastblock scan best extent logic into helperBrian Foster
The cntbt lastblock scan checks the size, alignment, locality, etc. of each free extent in the block and compares it with the current best candidate. This logic will be reused by the upcoming optimized cntbt algorithm, so refactor it into a separate helper. Note that acur->diff is now initialized to -1 (unsigned) instead of 0 to support the more granular comparison logic in the new helper. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21iomap: lift the xfs writeback code to iomapChristoph Hellwig
Take the xfs writeback code and move it to fs/iomap. A new structure with three methods is added as the abstraction from the generic writeback code to the file system. These methods are used to map blocks, submit an ioend, and cancel a page that encountered an error before it was added to an ioend. Signed-off-by: Christoph Hellwig <hch@lst.de> [darrick: rename ->submit_ioend to ->prepare_ioend to clarify what it does] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-10-21iomap: lift common tracing code from xfs to iomapChristoph Hellwig
Lift the xfs code for tracing address space operations to the iomap layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26xfs: add kmem_alloc_io()Dave Chinner
Memory we use to submit for IO needs strict alignment to the underlying driver contraints. Worst case, this is 512 bytes. Given that all allocations for IO are always a power of 2 multiple of 512 bytes, the kernel heap provides natural alignment for objects of these sizes and that suffices. Until, of course, memory debugging of some kind is turned on (e.g. red zones, poisoning, KASAN) and then the alignment of the heap objects is thrown out the window. Then we get weird IO errors and data corruption problems because drivers don't validate alignment and do the wrong thing when passed unaligned memory buffers in bios. TO fix this, introduce kmem_alloc_io(), which will guaranteeat least 512 byte alignment of buffers for IO, even if memory debugging options are turned on. It is assumed that the minimum allocation size will be 512 bytes, and that sizes will be power of 2 mulitples of 512 bytes. Use this everywhere we allocate buffers for IO. This no longer fails with log recovery errors when KASAN is enabled due to the brd driver not handling unaligned memory buffers: # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26xfs: add kmem allocation trace pointsDave Chinner
When trying to correlate XFS kernel allocations to memory reclaim behaviour, it is useful to know what allocations XFS is actually attempting. This information is not directly available from tracepoints in the generic memory allocation and reclaim tracepoints, so these new trace points provide a high level indication of what the XFS memory demand actually is. There is no per-filesystem context in this code, so we just trace the type of allocation, the size and the allocation constraints. The kmem code also doesn't include much of the common XFS headers, so there are a few definitions that need to be added to the trace headers and a couple of types that need to be made common to avoid needing to include the whole world in the kmem code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-07-03xfs: multithreaded iwalk implementationDarrick J. Wong
Create a parallel iwalk implementation and switch quotacheck to use it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02xfs: create simplified inode walk functionDarrick J. Wong
Create a new iterator function to simplify walking inodes in an XFS filesystem. This new iterator will replace the existing open-coded walking that goes on in various places. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-06-28xfs: split iop_unlockChristoph Hellwig
The iop_unlock method is called when comitting or cancelling a transaction. In the latter case, the transaction may or may not be aborted. While there is no known problem with the current code in practice, this implementation is limited in that any log item implementation that might want to differentiate between a commit and a cancellation must rely on the aborted state. The aborted bit is only set when the cancelled transaction is dirty, however. This means that there is no way to distinguish between a commit and a clean transaction cancellation. For example, intent log items currently rely on this distinction. The log item is either transferred to the CIL on commit or released on transaction cancel. There is currently no possibility for a clean intent log item in a transaction, but if that state is ever introduced a cancel of such a transaction will immediately result in memory leaks of the associated log item(s). This is an interface deficiency and landmine. To clean this up, replace the iop_unlock method with an iop_release method that is specific to transaction cancel. The existing iop_committing method occurs at the same time as iop_unlock in the commit path and there is no need for two separate callbacks here. Overload the iop_committing method with the current commit time iop_unlock implementations to eliminate the need for the latter and further simplify the interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: don't use xfs_trans_free_items in the commit pathChristoph Hellwig
While commiting items looks very similar to freeing them on error it is a different operation, and they will diverge a bit soon. Split out the commit case from xfs_trans_free_items, inline it into xfs_log_commit_cil and give it a separate trace point. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>