Age | Commit message (Collapse) | Author |
|
Growfs currently manually codes the extension of the last AG in a
filesytem during the growfs process. Factor that out of the growfs
code and move it into libxfs along with teh rest of the AG header
modification code.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
So it can be shared with userspace (e.g. mkfs) easily.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Right now we wait until we've committed changes to the primary
superblock before we initialise any of the new secondary
superblocks. This means that if we have any write errors for new
secondary superblocks we end up with garbage in place rather than
zeros or even an "in progress" superblock to indicate a grow
operation is being done.
To ensure we can write the secondary superblocks, initialise them
earlier in the same loop that initialises the AG headers. We stamp
the new secondary superblocks here with the old geometry, but set
the "sb_inprogress" field to indicate that updates are being done to
the superblock so they cannot be used. This will result in the
secondary superblock fields being updated or triggering errors that
will abort the grow before we commit any permanent changes.
This also means we can change the update mechanism of the secondary
superblocks. We know that we are going to wholly overwrite the
information in the struct xfs_sb in the buffer, so there's no point
reading it from disk. Just allocate an uncached buffer, zero it in
memory, stamp the new superblock structure in it and write it out.
If we fail to write it out, then we'll leave the existing sb (old or
new w/ inprogress) on disk for repair to deal with later.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
This happens after all the transactions to update the superblock
occur, and errors need to be handled slightly differently. Seperate
out the code into it's own function, and clean up the error goto
stack in the core growfs code as it is now much simpler.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
When growfs changes the imaxpct value of the filesystem, it runs
through all the "change size" growfs code, whether it needs to or
not. Separate out changing imaxpct into it's own function and
transaction to simplify the rest of the growfs code.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
There's still more cookie cutter code in setting up each AG header.
Separate all the variables into a simple structure and iterate a
table of header definitions to initialise everything.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Cookie cutter code, easily factored.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
We currently write all new AG headers synchronously, which can be
slow for large grow operations. All we really need to do is ensure
all the headers are on disk before we run the growfs transaction, so
convert this to a buffer list and a delayed write operation. We
block waiting for the delayed write buffer submission to complete,
so this will fulfill the requirement to have all the buffers written
correctly before proceeding.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The intialisation of new AG headers is mostly common with the
userspace mkfs code and growfs in the kernel, so start factoring it
out so we can move it to libxfs and use it in both places.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
For the new growfs work, we want to ensure that we serialise
secondary superblock updates with other operations (e.g. scrub)
correctly, but we don't want to cache the buffers for long term
reuse. We need cached buffers for serialisation, however.
To solve this, introduce a "oneshot" buffer which will be marshalled
through the cache but then released once the last current reference
goes away. If the buffer is already cached, then we ignore the
"one-shot" behaviour and leave the buffer in the state it was prior
to the one-shot command being run. This means we don't perturb
either the working set or existing cached buffer state by a one-shot
operation.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Plumb in the pieces necessary to make the "scrub" subfunction of
the scrub ioctl actually work. This means that we make the IFLAG_REPAIR
flag to the scrub ioctl actually do something, and we add an errortag
knob so that xfstests can force the kernel to rebuild a metadata
structure even if there's nothing wrong with it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
These tracepoints will be used to debug the online repair routines.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Teach xfs_bmapi_remap how to map in unwritten extent and to skip rmap
updates. This enables us to rebuild real and unwritten extents from the
rmapbt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
Add a new flags argument to xfs_bmapi_remapi so that we can pass BMAPI
flags into the function. This enables us to pass in BMAPI_ATTRFORK so
that we can remap things into the attribute fork. Eventually the
online repair code will use this to rebuild attribute forks, so make it
non-static.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
This function is basically a generic AGFL block iterator, so promote it
to libxfs ahead of online repair wanting to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
In normal operation, the XFS convention is to take an inode's iolock
and then allocate a transaction. However, when scrubbing parent inodes
this is inverted -- we allocated the transaction to do the scrub, and
now we're trying to grab the parent's iolock. This can lead to ABBA
deadlocks: some thread grabbed the parent's iolock and is waiting for
space for a transaction while our parent scrubber is sitting on a
transaction trying to get the parent's iolock.
Therefore, convert all iolock attempts to use trylock; if that fails,
they can use the existing mechanisms to back off and try again.
The ABBA deadlock didn't happen with a non-repair scrub because the
transactions don't reserve any space, but repair scrubs require
reservation in order to update metadata. However, any other concurrent
metadata update (e.g. directory create in the parent) could also induce
this deadlock with the parent scrubber.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
The realtime bitmap and summary inodes live on the metadata device, so
we can scrub their data forks with the regular scrubbers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
Replace the quota scrubber's open-coded data fork scrubber with a
redirected call to the bmapbtd scrubber. This strengthens the quota
scrub to include all the cross-referencing that it does.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
If we've already decided that something is corrupt, we might as well
abort all the loops and exit as quickly as possible.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
Replace all the if (!error) weirdness with helper functions that follow
our regular coding practices, and factor out the ternary expression soup.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Secondary superblocks are rarely used, so create a helper to read a
given non-primary AG's superblock and ensure that it won't stick around
hogging memory.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
Don't bother looking for cross-referencing problems if the metadata is
already corrupt or we've already found a cross-referencing problem.
Since we added a helper function for flags testing, convert existing
users to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.
Essentially, the machine was under memory pressure when the mount
was being run, xfs_fs_fill_super() failed after allocating the
xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
freed the xfs_mount, but the sb->s_fs_info field still pointed to
the freed memory. Hence when the superblock shrinker then ran
it fell off the bad pointer.
With the superblock shrinker problem fixed at teh VFS level, this
stale s_fs_info pointer is still a problem - we use it
unconditionally in ->put_super when the superblock is being torn
down, and hence we can still trip over it after a ->fill_super
call failure. Hence we need to clear s_fs_info if
xfs-fs_fill_super() fails, and we need to check if it's valid in
the places it can potentially be dereferenced after a ->fill_super
failure.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Similar to log_recovery_delay, this delay occurs between the VFS
superblock being initialised and the xfs_mount being fully
initialised. It also poisons the per-ag radix tree node so that it
can be used for triggering shrinker races during mount
such as the following:
<run memory pressure workload in background>
$ cat dirty-mount.sh
#! /bin/bash
umount -f /dev/pmem0
mkfs.xfs -f /dev/pmem0
mount /dev/pmem0 /mnt/test
rm -f /mnt/test/foo
xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
umount /dev/pmem0
# let's crash it now!
echo 30 > /sys/fs/xfs/debug/mount_delay
mount /dev/pmem0 /mnt/test
echo 0 > /sys/fs/xfs/debug/mount_delay
umount /dev/pmem0
$ sudo ./dirty-mount.sh
.....
[ 60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G D W 4.16.0-rc5-dgc #440
[ 60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
[ 60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282
[ 60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a
[ 60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000
[ 60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5
[ 60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598
[ 60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458
[ 60.390827] FS: 00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
[ 60.392253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0
[ 60.394519] Call Trace:
[ 60.395252] radix_tree_gang_lookup_tag+0xc4/0x130
[ 60.395948] xfs_perag_get_tag+0x37/0xf0
[ 60.396522] xfs_reclaim_inodes_count+0x32/0x40
[ 60.397178] xfs_fs_nr_cached_objects+0x11/0x20
[ 60.397837] super_cache_count+0x35/0xc0
[ 60.399159] shrink_slab.part.66+0xb1/0x370
[ 60.400194] shrink_node+0x7e/0x1a0
[ 60.401058] try_to_free_pages+0x199/0x470
[ 60.402081] __alloc_pages_slowpath+0x3a1/0xd20
[ 60.403729] __alloc_pages_nodemask+0x1c3/0x200
[ 60.404941] cache_grow_begin+0x20b/0x2e0
[ 60.406164] fallback_alloc+0x160/0x200
[ 60.407088] kmem_cache_alloc+0x111/0x4e0
[ 60.408038] ? xfs_buf_rele+0x61/0x430
[ 60.408925] kmem_zone_alloc+0x61/0xe0
[ 60.409965] xfs_inode_alloc+0x24/0x1d0
.....
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The changes to skip discards of speculative preallocation and
unwritten extents introduced several new wrapper functions through
the bunmapi -> extent free codepath to reduce churn in all of the
associated callers. In several cases, these wrappers simply toggle a
single flag to skip or not skip discards for the resulting blocks.
The explicit _nodiscard() wrappers for such an isolated set of
callers is a bit overkill. Kill off these wrappers and replace with
the calls to the underlying functions in the contexts that need to
control discard behavior. Retain the wrappers that preserve the
original calling conventions to serve the original purpose of
reducing code churn.
This is a refactoring patch and does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Add a new iomap_swapfile_activate function so that filesystems can
activate swap files without having to use the obsolete and slow bmap
function. This enables XFS to support fallocate'd swap files and
swap files on realtime devices.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Rebuilding the reverse-mapping tree requires us to quiesce all inodes in
the filesystem, so we must stop background reclamation of post-EOF and
CoW prealloc blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Add a new flag, XFS_BMAPI_NORMAP, which will perform file block
remapping without updating the rmapbt. This will be used by the repair
code to reconstruct bmbts from the rmapbt, in which case we don't want
the rmapbt update.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Add a couple of functions to the refcount btree and generic btree code
that will be used to repair the refcountbt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Add a couple of functions to the reverse mapping btree that will be used
to repair the rmapbt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Expose various helpers that the repair code will want to use.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Add a bunch of helper functions that calculate the sizes of various
btrees. These will be used to repair btrees and btree headers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Since the transaction allocation helper is about to become more complex,
move it to common.c and remove the redundant parameters.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
Strengthen the btree block header checks to detect the number of records
being less than the btree type's minimum record count. Certain blocks
are allowed to violate this constraint -- specifically any btree block
at the top of the tree can have fewer than minrecs records.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
All scrub code runs in transaction context, which means that memory
allocations are automatically run in PF_MEMALLOC_NOFS context. It's
therefore unnecessary to pass in KM_NOFS to allocation routines, so
clean them all out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
Refactor the quota scrubber to take the quotaofflock and grab the quota
inode in the setup function so that we can treat quota in the same
"scrub in the context of this inode" (i.e. sc->ip) manner as we treat
any other inode. We do have to drop the quota inode's ILOCK_EXCL to use
dqiterate, but since dquots have their own individual locks the ILOCK
wasn't helping us anyway.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
|
Create a helper function to iterate all the dquots of a given type in
the system, and refactor the dquot scrub to use it. This will get more
use in the quota repair code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Here's a set of patches that fix a number of bugs in the in-kernel AFS
client, including:
- Fix directory locking to not use individual page locks for
directory reading/scanning but rather to use a semaphore on the
afs_vnode struct as the directory contents must be read in a single
blob and data from different reads must not be mixed as the entire
contents may be shuffled about between reads.
- Fix address list parsing to handle port specifiers correctly.
- Only give up callback records on a server if we actually talked to
that server (we might not be able to access a server).
- Fix some callback handling bugs, including refcounting,
whole-volume callbacks and when callbacks actually get broken in
response to a CB.CallBack op.
- Fix some server/address rotation bugs, including giving up if we
can't probe a server; giving up if a server says it doesn't have a
volume, but there are more servers to try.
- Fix the decoding of fetched statuses to be OpenAFS compatible.
- Fix the handling of server lookups in Cache Manager ops (such as
CB.InitCallBackState3) to use a UUID if possible and to handle no
server being found.
- Fix a bug in server lookup where not all addresses are compared.
- Fix the non-encryption of calls that prevents some servers from
being accessed (this also requires an AF_RXRPC patch that has
already gone in through the net tree).
There's also a patch that adds tracepoints to log Cache Manager ops
that don't find a matching server, either by UUID or by address"
* tag 'afs-fixes-20180514' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix the non-encryption of calls
afs: Fix CB.CallBack handling
afs: Fix whole-volume callback handling
afs: Fix afs_find_server search loop
afs: Fix the handling of an unfound server in CM operations
afs: Add a tracepoint to record callbacks from unlisted servers
afs: Fix the handling of CB.InitCallBackState3 to find the server by UUID
afs: Fix VNOVOL handling in address rotation
afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility
afs: Fix server rotation's handling of fileserver probe failure
afs: Fix refcounting in callback registration
afs: Fix giving up callbacks on server destruction
afs: Fix address list parsing
afs: Fix directory page locking
|
|
Update read and mmap logic to append device dumps as additional notes
before the other elf notes. We add device dumps before other elf notes
because the other elf notes may not fill the elf notes buffer
completely and we will end up with zero-filled data between the elf
notes and the device dumps. Tools will then try to decode this
zero-filled data as valid notes and we don't want that. Hence, adding
device dumps before the other elf notes ensure that zero-filled data
can be avoided. This also ensures that the device dumps and the
other elf notes can be properly mmaped at page aligned address.
Incorporate device dump size into the total vmcore size. Also update
offsets for other program headers after the device dumps are added.
Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:
1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.
2. vmcore module allocates the buffer with requested size. It adds
an Elf note and invokes the device driver's registered callback
function.
3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.
Ensure that the device dump buffer size is always aligned to page size
so that it can be mmaped.
Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
generic and reserve NT_VMCOREDD note type to indicate vmcore device
dump.
Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Same numerical value (for now at least), but a much better documentation
of intent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Always GFP_KERNEL, and keeping it would cause serious complications for
the next change.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently function debugfs_create_dir() creates a new
directory in the debugfs (usually mounted /sys/kernel/debug)
with permission rwxr-xr-x. This is hard coded.
Change this to use the parent directory permission.
Output before the patch:
root@s8360047 ~]# tree -dp -L 1 /sys/kernel/debug/
/sys/kernel/debug/
├── [drwxr-xr-x] bdi
├── [drwxr-xr-x] block
├── [drwxr-xr-x] dasd
├── [drwxr-xr-x] device_component
├── [drwxr-xr-x] extfrag
├── [drwxr-xr-x] hid
├── [drwxr-xr-x] kprobes
├── [drwxr-xr-x] kvm
├── [drwxr-xr-x] memblock
├── [drwxr-xr-x] pm_qos
├── [drwxr-xr-x] qdio
├── [drwxr-xr-x] s390
├── [drwxr-xr-x] s390dbf
└── [drwx------] tracing
14 directories
[root@s8360047 linux]#
Output after the patch:
[root@s8360047 ~]# tree -dp -L 1 /sys/kernel/debug/
sys/kernel/debug/
├── [drwx------] bdi
├── [drwx------] block
├── [drwx------] dasd
├── [drwx------] device_component
├── [drwx------] extfrag
├── [drwx------] hid
├── [drwx------] kprobes
├── [drwx------] kvm
├── [drwx------] memblock
├── [drwx------] pm_qos
├── [drwx------] qdio
├── [drwx------] s390
├── [drwx------] s390dbf
└── [drwx------] tracing
14 directories
[root@s8360047 linux]#
Here is the full diff output done with:
[root@s8360047 ~]# diff -u treefull.before treefull.after |
sed 's-^- # -' > treefull.diff
# --- treefull.before 2018-04-27 13:22:04.532824564 +0200
# +++ treefull.after 2018-04-27 13:24:12.106182062 +0200
# @@ -1,55 +1,55 @@
# /sys/kernel/debug/
# -├── [drwxr-xr-x] bdi
# -│ ├── [drwxr-xr-x] 1:0
# -│ ├── [drwxr-xr-x] 1:1
# -│ ├── [drwxr-xr-x] 1:10
# -│ ├── [drwxr-xr-x] 1:11
# -│ ├── [drwxr-xr-x] 1:12
# -│ ├── [drwxr-xr-x] 1:13
# -│ ├── [drwxr-xr-x] 1:14
# -│ ├── [drwxr-xr-x] 1:15
# -│ ├── [drwxr-xr-x] 1:2
# -│ ├── [drwxr-xr-x] 1:3
# -│ ├── [drwxr-xr-x] 1:4
# -│ ├── [drwxr-xr-x] 1:5
# -│ ├── [drwxr-xr-x] 1:6
# -│ ├── [drwxr-xr-x] 1:7
# -│ ├── [drwxr-xr-x] 1:8
# -│ ├── [drwxr-xr-x] 1:9
# -│ └── [drwxr-xr-x] 94:0
# -├── [drwxr-xr-x] block
# -├── [drwxr-xr-x] dasd
# -│ ├── [drwxr-xr-x] 0.0.e18a
# -│ ├── [drwxr-xr-x] dasda
# -│ └── [drwxr-xr-x] global
# -├── [drwxr-xr-x] device_component
# -├── [drwxr-xr-x] extfrag
# -├── [drwxr-xr-x] hid
# -├── [drwxr-xr-x] kprobes
# -├── [drwxr-xr-x] kvm
# -├── [drwxr-xr-x] memblock
# -├── [drwxr-xr-x] pm_qos
# -├── [drwxr-xr-x] qdio
# -│ └── [drwxr-xr-x] 0.0.f5f2
# -├── [drwxr-xr-x] s390
# -│ └── [drwxr-xr-x] stsi
# -├── [drwxr-xr-x] s390dbf
# -│ ├── [drwxr-xr-x] 0.0.e18a
# -│ ├── [drwxr-xr-x] cio_crw
# -│ ├── [drwxr-xr-x] cio_msg
# -│ ├── [drwxr-xr-x] cio_trace
# -│ ├── [drwxr-xr-x] dasd
# -│ ├── [drwxr-xr-x] kvm-trace
# -│ ├── [drwxr-xr-x] lgr
# -│ ├── [drwxr-xr-x] qdio_0.0.f5f2
# -│ ├── [drwxr-xr-x] qdio_error
# -│ ├── [drwxr-xr-x] qdio_setup
# -│ ├── [drwxr-xr-x] qeth_card_0.0.f5f0
# -│ ├── [drwxr-xr-x] qeth_control
# -│ ├── [drwxr-xr-x] qeth_msg
# -│ ├── [drwxr-xr-x] qeth_setup
# -│ ├── [drwxr-xr-x] vmcp
# -│ └── [drwxr-xr-x] vmur
# +├── [drwx------] bdi
# +│ ├── [drwx------] 1:0
# +│ ├── [drwx------] 1:1
# +│ ├── [drwx------] 1:10
# +│ ├── [drwx------] 1:11
# +│ ├── [drwx------] 1:12
# +│ ├── [drwx------] 1:13
# +│ ├── [drwx------] 1:14
# +│ ├── [drwx------] 1:15
# +│ ├── [drwx------] 1:2
# +│ ├── [drwx------] 1:3
# +│ ├── [drwx------] 1:4
# +│ ├── [drwx------] 1:5
# +│ ├── [drwx------] 1:6
# +│ ├── [drwx------] 1:7
# +│ ├── [drwx------] 1:8
# +│ ├── [drwx------] 1:9
# +│ └── [drwx------] 94:0
# +├── [drwx------] block
# +├── [drwx------] dasd
# +│ ├── [drwx------] 0.0.e18a
# +│ ├── [drwx------] dasda
# +│ └── [drwx------] global
# +├── [drwx------] device_component
# +├── [drwx------] extfrag
# +├── [drwx------] hid
# +├── [drwx------] kprobes
# +├── [drwx------] kvm
# +├── [drwx------] memblock
# +├── [drwx------] pm_qos
# +├── [drwx------] qdio
# +│ └── [drwx------] 0.0.f5f2
# +├── [drwx------] s390
# +│ └── [drwx------] stsi
# +├── [drwx------] s390dbf
# +│ ├── [drwx------] 0.0.e18a
# +│ ├── [drwx------] cio_crw
# +│ ├── [drwx------] cio_msg
# +│ ├── [drwx------] cio_trace
# +│ ├── [drwx------] dasd
# +│ ├── [drwx------] kvm-trace
# +│ ├── [drwx------] lgr
# +│ ├── [drwx------] qdio_0.0.f5f2
# +│ ├── [drwx------] qdio_error
# +│ ├── [drwx------] qdio_setup
# +│ ├── [drwx------] qeth_card_0.0.f5f0
# +│ ├── [drwx------] qeth_control
# +│ ├── [drwx------] qeth_msg
# +│ ├── [drwx------] qeth_setup
# +│ ├── [drwx------] vmcp
# +│ └── [drwx------] vmur
# └── [drwx------] tracing
# ├── [drwxr-xr-x] events
# │ ├── [drwxr-xr-x] alarmtimer
Fixes: edac65eaf8d5c ("debugfs: take mode-dependent parts of debugfs_get_inode() into callers")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Re-use kstrtobool_from_user() instead of open coded variant.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
If a file has xattrs, we fsync it, to ensure we clear the flags
BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
inode, the current transaction commits and then we fsync it (without
either of those bits being set in its inode), we end up not logging
all its xattrs. This results in deleting all xattrs when replying the
log after a power failure.
Trivial reproducer
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foobar
$ setfattr -n user.xa -v qwerty /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
$ sync
$ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
<power failure>
$ mount /dev/sdb /mnt
$ getfattr --absolute-names --dump /mnt/foobar
<empty output>
$
So fix this by making sure all xattrs are logged if we log a file's inode
item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
BTRFS_INODE_COPY_EVERYTHING were set in the inode.
Fixes: 36283bf777d9 ("Btrfs: fix fsync xattr loss in the fast fsync path")
Cc: <stable@vger.kernel.org> # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
btrfs incremental send BUG happens when creating a snapshot of snapshot
that is being used by send.
[REASON]
The problem can happen if while we are doing a send one of the snapshots
used (parent or send) is snapshotted, because snapshoting implies COWing
the root of the source subvolume/snapshot.
1. When doing an incremental send, the send process will get the commit
roots from the parent and send snapshots, and add references to them
through extent_buffer_get().
2. When a snapshot/subvolume is snapshotted, its root node is COWed
(transaction.c:create_pending_snapshot()).
3. COWing releases the space used by the node immediately, through:
__btrfs_cow_block()
--btrfs_free_tree_block()
----btrfs_add_free_space(bytenr of node)
4. Because send doesn't hold a transaction open, it's possible that
the transaction used to create the snapshot commits, switches the
commit root and the old space used by the previous root node gets
assigned to some other node allocation. Allocation of a new node will
use the existing extent buffer found in memory, which we previously
got a reference through extent_buffer_get(), and allow the extent
buffer's content (pages) to be modified:
btrfs_alloc_tree_block
--btrfs_reserve_extent
----find_free_extent (get bytenr of old node)
--btrfs_init_new_buffer (use bytenr of old node)
----btrfs_find_create_tree_block
------alloc_extent_buffer
--------find_extent_buffer (get old node)
5. So send can access invalid memory content and have unpredictable
behaviour.
[FIX]
So we fix the problem by copying the commit roots of the send and
parent snapshots and use those copies.
CallTrace looks like this:
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:1861!
invalid opcode: 0000 [#1] SMP
CPU: 6 PID: 24235 Comm: btrfs Tainted: P O 3.10.105 #23721
ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs]
RSP: 0018:ffff88041b723b68 EFLAGS: 00010246
RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
FS: 00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
Call Trace:
[<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
[<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs]
[<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
[<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs]
[<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
[<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
[<ffffffff81111133>] ? handle_pte_fault+0x373/0x990
[<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20
[<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0
[<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0
[<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500
[<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90
[<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990
[<ffffffff81034f83>] ? do_fork+0x113/0x3b0
[<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0
[<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b
---[ end trace 29576629ee80b2e1 ]---
Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function")
CC: stable@vger.kernel.org # 3.6+
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Some AFS servers refuse to accept unencrypted traffic, so can't be accessed
with kAFS. Set the AF_RXRPC security level to encrypt client calls to deal
with this.
Note that incoming service calls are set by the remote client and so aren't
affected by this.
This requires an AF_RXRPC patch to pass the value set by setsockopt to calls
begun by the kernel.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
The handling of CB.CallBack messages sent by the fileserver to the client
is broken in that they are currently being processed after the reply has
been transmitted.
This is not what the fileserver expects, however. It holds up change
visibility until the reply comes so as to maintain cache coherency, and so
expects the client to have to refetch the state on the affected files.
Fix CB.CallBack handling to perform the callback break before sending the
reply.
The fileserver is free to hold up status fetches issued by other threads on
the same client that occur in reponse to the callback until any pending
changes have been committed.
Fixes: d001648ec7cf ("rxrpc: Don't expose skbs to in-kernel users [ver #2]")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
It's possible for an AFS file server to issue a whole-volume notification
that callbacks on all the vnodes in the file have been broken. This is
done for R/O and backup volumes (which don't have per-file callbacks) and
for things like a volume being taken offline.
Fix callback handling to detect whole-volume notifications, to track it
across operations and to check it during inode validation.
Fixes: c435ee34551e ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
|