Age | Commit message (Collapse) | Author |
|
If there is a long-standing request of one task locking up execution of
deferred requests, and the defer list contains requests of another task
(all files-less), then a potential execution of __io_uring_task_cancel()
by that another task will sleep until that first long-standing request
completion, and that may take long.
E.g.
tsk1: req1/read(empty_pipe) -> tsk2: req(DRAIN)
Then __io_uring_task_cancel(tsk2) waits for req1 completion.
It seems we even can manufacture a complicated case with many tasks
sharing many rings that can lock them forever.
Cancel deferred requests for __io_uring_task_cancel() as well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Syzbot reports a potential deadlock found by the newly added recursive
read deadlock detection in lockdep:
[...] ========================================================
[...] WARNING: possible irq lock inversion dependency detected
[...] 5.9.0-rc2-syzkaller #0 Not tainted
[...] --------------------------------------------------------
[...] syz-executor.1/10214 just changed the state of lock:
[...] ffff88811f506338 (&f->f_owner.lock){.+..}-{2:2}, at: send_sigurg+0x1d/0x200
[...] but this lock was taken by another, HARDIRQ-safe lock in the past:
[...] (&dev->event_lock){-...}-{2:2}
[...]
[...]
[...] and interrupts could create inverse lock ordering between them.
[...]
[...]
[...] other info that might help us debug this:
[...] Chain exists of:
[...] &dev->event_lock --> &new->fa_lock --> &f->f_owner.lock
[...]
[...] Possible interrupt unsafe locking scenario:
[...]
[...] CPU0 CPU1
[...] ---- ----
[...] lock(&f->f_owner.lock);
[...] local_irq_disable();
[...] lock(&dev->event_lock);
[...] lock(&new->fa_lock);
[...] <Interrupt>
[...] lock(&dev->event_lock);
[...]
[...] *** DEADLOCK ***
The corresponding deadlock case is as followed:
CPU 0 CPU 1 CPU 2
read_lock(&fown->lock);
spin_lock_irqsave(&dev->event_lock, ...)
write_lock_irq(&filp->f_owner.lock); // wait for the lock
read_lock(&fown-lock); // have to wait until the writer release
// due to the fairness
<interrupted>
spin_lock_irqsave(&dev->event_lock); // wait for the lock
The lock dependency on CPU 1 happens if there exists a call sequence:
input_inject_event():
spin_lock_irqsave(&dev->event_lock,...);
input_handle_event():
input_pass_values():
input_to_handler():
handler->event(): // evdev_event()
evdev_pass_values():
spin_lock(&client->buffer_lock);
__pass_event():
kill_fasync():
kill_fasync_rcu():
read_lock(&fa->fa_lock);
send_sigio():
read_lock(&fown->lock);
To fix this, make the reader in send_sigurg() and send_sigio() use
read_lock_irqsave() and read_lock_irqrestore().
Reported-by: syzbot+22e87cdf94021b984aa6@syzkaller.appspotmail.com
Reported-by: syzbot+c5e32344981ad9f33750@syzkaller.appspotmail.com
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
|
|
There is one error handling path that does not free ref, which may cause
a minor memory leak.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
item, then it means the filesystem is inconsistent state. This is either
corruption or a crafted image. Fail the mount as this needs a closer
look what is actually wrong.
As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
item, in __btrfs_free_extra_devids() we determine that there is an
extra device, and free those extra devices but continue to mount the
device.
However, we were wrong in keeping tack of the rw_devices so the syzbot
testcase failed:
WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
panic+0x347/0x7c0 kernel/panic.c:231
__warn.cold+0x20/0x46 kernel/panic.c:600
report_bug+0x1bd/0x210 lib/bug.c:198
handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
btrfs_fill_super fs/btrfs/super.c:1316 [inline]
btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672
The fix here is, when we determine that there isn't a replace item
then fail the mount if there is a replace target device (devid 0).
CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Based on user feedback update the message printed when scrub fails to
start due to write requirements. To make a distinction add a device id
to the messages.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Smatch complains that this code dereferences "entry" before checking
whether it's NULL on the next line. Fortunately, rb_entry() will never
return NULL so it doesn't cause a problem. We can clean up the NULL
checking a bit to silence the warning and make the code more clear.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The minimum reserve size was adjusted to take into account the height of
the tree we are merging, however we can have a root with a level == 0.
What we want is root_level + 1 to get the number of nodes we may have to
cow. This fixes the enospc_debug warning pops with btrfs/101.
Nikolay: this fixes failures on btrfs/060 btrfs/062 btrfs/063 and
btrfs/195 That I was seeing, the call trace was:
[ 3680.515564] ------------[ cut here ]------------
[ 3680.515566] BTRFS: block rsv returned -28
[ 3680.515585] WARNING: CPU: 2 PID: 8339 at fs/btrfs/block-rsv.c:521 btrfs_use_block_rsv+0x162/0x180
[ 3680.515587] Modules linked in:
[ 3680.515591] CPU: 2 PID: 8339 Comm: btrfs Tainted: G W 5.9.0-rc8-default #95
[ 3680.515593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 3680.515595] RIP: 0010:btrfs_use_block_rsv+0x162/0x180
[ 3680.515600] RSP: 0018:ffffa01ac9753910 EFLAGS: 00010282
[ 3680.515602] RAX: 0000000000000000 RBX: ffff984b34200000 RCX: 0000000000000027
[ 3680.515604] RDX: 0000000000000027 RSI: 0000000000000000 RDI: ffff984b3bd19e28
[ 3680.515606] RBP: 0000000000004000 R08: ffff984b3bd19e20 R09: 0000000000000001
[ 3680.515608] R10: 0000000000000004 R11: 0000000000000046 R12: ffff984b264fdc00
[ 3680.515609] R13: ffff984b13149000 R14: 00000000ffffffe4 R15: ffff984b34200000
[ 3680.515613] FS: 00007f4e2912b8c0(0000) GS:ffff984b3bd00000(0000) knlGS:0000000000000000
[ 3680.515615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3680.515617] CR2: 00007fab87122150 CR3: 0000000118e42000 CR4: 00000000000006e0
[ 3680.515620] Call Trace:
[ 3680.515627] btrfs_alloc_tree_block+0x8b/0x340
[ 3680.515633] ? __lock_acquire+0x51a/0xac0
[ 3680.515646] alloc_tree_block_no_bg_flush+0x4f/0x60
[ 3680.515651] __btrfs_cow_block+0x14e/0x7e0
[ 3680.515662] btrfs_cow_block+0x144/0x2c0
[ 3680.515670] merge_reloc_root+0x4d4/0x610
[ 3680.515675] ? btrfs_lookup_fs_root+0x78/0x90
[ 3680.515686] merge_reloc_roots+0xee/0x280
[ 3680.515695] relocate_block_group+0x2ce/0x5e0
[ 3680.515704] btrfs_relocate_block_group+0x16e/0x310
[ 3680.515711] btrfs_relocate_chunk+0x38/0xf0
[ 3680.515716] btrfs_shrink_device+0x200/0x560
[ 3680.515728] btrfs_rm_device+0x1ae/0x6a6
[ 3680.515744] ? _copy_from_user+0x6e/0xb0
[ 3680.515750] btrfs_ioctl+0x1afe/0x28c0
[ 3680.515755] ? find_held_lock+0x2b/0x80
[ 3680.515760] ? do_user_addr_fault+0x1f8/0x418
[ 3680.515773] ? __x64_sys_ioctl+0x77/0xb0
[ 3680.515775] __x64_sys_ioctl+0x77/0xb0
[ 3680.515781] do_syscall_64+0x31/0x70
[ 3680.515785] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 44d354abf33e ("btrfs: relocation: review the call sites which can be interrupted by signal")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
To help with debugging, print the type of the block rsv when we fail to
use our target block rsv in btrfs_use_block_rsv.
This now produces:
[ 544.672035] BTRFS: block rsv 1 returned -28
which is still cryptic without consulting the enum in block-rsv.h but I
guess it's better than nothing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
On 32-bit systems, this shift will overflow for files larger than 4GB as
start_index is unsigned long while the calls to btrfs_delalloc_*_space
expect u64.
CC: stable@vger.kernel.org # 4.4+
Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ define the variable instead of repeating the shift ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no reason to flush an entire file when we're unsharing part of
a file. Therefore, only initiate writeback on the selected range.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
|
|
Some messages sent by the MDS entail a session sequence number
increment, and the MDS will drop certain types of requests on the floor
when the sequence numbers don't match.
In particular, a REQUEST_CLOSE message can cross with one of the
sequence morphing messages from the MDS which can cause the client to
stall, waiting for a response that will never come.
Originally, this meant an up to 5s delay before the recurring workqueue
job kicked in and resent the request, but a recent change made it so
that the client would never resend, causing a 60s stall unmounting and
sometimes a blockisting event.
Add a new helper for incrementing the session sequence and then testing
to see whether a REQUEST_CLOSE needs to be resent, and move the handling
of CEPH_MDS_SESSION_CLOSING into that function. Change all of the
bare sequence counter increments to use the new helper.
Reorganize check_session_state with a switch statement. It should no
longer be called when the session is CLOSING, so throw a warning if it
ever is (but still handle that case sanely).
[ idryomov: whitespace, pr_err() call fixup ]
URL: https://tracker.ceph.com/issues/47563
Fixes: fa9967734227 ("ceph: fix potential mdsc use-after-free crash")
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Current io_match_files() check in io_cqring_overflow_flush() is useless
because requests drop ->files before going to the overflow list, however
linked to it request do not, and we don't check them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We can't bundle this into one operation, as the identity may not have
originated from the tctx to begin with. Drop one ref for each of them
separately, if they don't match the static assignment. If we don't, then
if the identity is a lookup from registered credentials, we could be
freeing that identity as we're dropping a reference assuming it came from
the tctx. syzbot reports this as a use-after-free, as the identity is
still referencable from idr lookup:
==================================================================
BUG: KASAN: use-after-free in instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
BUG: KASAN: use-after-free in atomic_fetch_add_relaxed include/asm-generic/atomic-instrumented.h:142 [inline]
BUG: KASAN: use-after-free in __refcount_add include/linux/refcount.h:193 [inline]
BUG: KASAN: use-after-free in __refcount_inc include/linux/refcount.h:250 [inline]
BUG: KASAN: use-after-free in refcount_inc include/linux/refcount.h:267 [inline]
BUG: KASAN: use-after-free in io_init_req fs/io_uring.c:6700 [inline]
BUG: KASAN: use-after-free in io_submit_sqes+0x15a9/0x25f0 fs/io_uring.c:6774
Write of size 4 at addr ffff888011e08e48 by task syz-executor165/8487
CPU: 1 PID: 8487 Comm: syz-executor165 Not tainted 5.10.0-rc1-next-20201102-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x4c8 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
atomic_fetch_add_relaxed include/asm-generic/atomic-instrumented.h:142 [inline]
__refcount_add include/linux/refcount.h:193 [inline]
__refcount_inc include/linux/refcount.h:250 [inline]
refcount_inc include/linux/refcount.h:267 [inline]
io_init_req fs/io_uring.c:6700 [inline]
io_submit_sqes+0x15a9/0x25f0 fs/io_uring.c:6774
__do_sys_io_uring_enter+0xc8e/0x1b50 fs/io_uring.c:9159
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x440e19
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 0f fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fff644ff178 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000440e19
RDX: 0000000000000000 RSI: 000000000000450c RDI: 0000000000000003
RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000022b4850
R13: 0000000000000010 R14: 0000000000000000 R15: 0000000000000000
Allocated by task 8487:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xc2/0xd0 mm/kasan/common.c:461
kmalloc include/linux/slab.h:552 [inline]
io_register_personality fs/io_uring.c:9638 [inline]
__io_uring_register fs/io_uring.c:9874 [inline]
__do_sys_io_uring_register+0x10f0/0x40a0 fs/io_uring.c:9924
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 8487:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0x102/0x140 mm/kasan/common.c:422
slab_free_hook mm/slub.c:1544 [inline]
slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1577
slab_free mm/slub.c:3140 [inline]
kfree+0xdb/0x360 mm/slub.c:4122
io_identity_cow fs/io_uring.c:1380 [inline]
io_prep_async_work+0x903/0xbc0 fs/io_uring.c:1492
io_prep_async_link fs/io_uring.c:1505 [inline]
io_req_defer fs/io_uring.c:5999 [inline]
io_queue_sqe+0x212/0xed0 fs/io_uring.c:6448
io_submit_sqe fs/io_uring.c:6542 [inline]
io_submit_sqes+0x14f6/0x25f0 fs/io_uring.c:6784
__do_sys_io_uring_enter+0xc8e/0x1b50 fs/io_uring.c:9159
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff888011e08e00
which belongs to the cache kmalloc-96 of size 96
The buggy address is located 72 bytes inside of
96-byte region [ffff888011e08e00, ffff888011e08e60)
The buggy address belongs to the page:
page:00000000a7104751 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e08
flags: 0xfff00000000200(slab)
raw: 00fff00000000200 ffffea00004f8540 0000001f00000002 ffff888010041780
raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888011e08d00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
ffff888011e08d80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
> ffff888011e08e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888011e08e80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
ffff888011e08f00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
==================================================================
Reported-by: syzbot+625ce3bb7835b63f7f3d@syzkaller.appspotmail.com
Fixes: 1e6fa5216a0e ("io_uring: COW io_identity on mismatch")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Ensure we get a valid view of the task mm, by using task_lock() when
attempting to grab the original task mm.
Reported-by: syzbot+b57abf7ee60829090495@syzkaller.appspotmail.com
Fixes: 2aede0e417db ("io_uring: stash ctx task reference for SQPOLL")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Track if a given task io_uring context contains SQPOLL instances, so we
can iterate those for cancelation (and request counts). This ensures that
we properly wait on SQPOLL contexts, and find everything that needs
canceling.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This can't currently happen, but will be possible shortly. Handle missing
files just like we do not being able to grab a needed mm, and mark the
request as needing cancelation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The kernel has always allowed directories to have the rtinherit flag
set, even if there is no rt device, so this check is wrong.
Fixes: 80e4e1268802 ("xfs: scrub inodes")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In commit 7588cbeec6df, we tried to fix a race stemming from the lack of
coordination between higher level code that wants to allocate and remap
CoW fork extents into the data fork. Christoph cites as examples the
always_cow mode, and a directio write completion racing with writeback.
According to the comments before the goto retry, we want to restart the
lookup to catch the extent in the data fork, but we don't actually reset
whichfork or cow_fsb, which means the second try executes using stale
information. Up until now I think we've gotten lucky that either
there's something left in the CoW fork to cause cow_fsb to be reset, or
either data/cow fork sequence numbers have advanced enough to force a
fresh lookup from the data fork. However, if we reach the retry with an
empty stable CoW fork and a stable data fork, neither of those things
happens. The retry foolishly re-calls xfs_convert_blocks on the CoW
fork which fails again. This time, we toss the write.
I've recently been working on extending reflink to the realtime device.
When the realtime extent size is larger than a single block, we have to
force the page cache to CoW the entire rt extent if a write (or
fallocate) are not aligned with the rt extent size. The strategy I've
chosen to deal with this is derived from Dave's blocksize > pagesize
series: dirtying around the write range, and ensuring that writeback
always starts mapping on an rt extent boundary. This has brought this
race front and center, since generic/522 blows up immediately.
However, I'm pretty sure this is a bug outright, independent of that.
Fixes: 7588cbeec6df ("xfs: retry COW fork delalloc conversion when no extent was found")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
The iomap writepage error handling logic is a mash of old and
slightly broken XFS writepage logic. When keepwrite writeback state
tracking was introduced in XFS in commit 0d085a529b42 ("xfs: ensure
WB_SYNC_ALL writeback handles partial pages correctly"), XFS had an
additional cluster writeback context that scanned ahead of
->writepage() to process dirty pages over the current ->writepage()
extent mapping. This context expected a dirty page and required
retention of the TOWRITE tag on partial page processing so the
higher level writeback context would revisit the page (in contrast
to ->writepage(), which passes a page with the dirty bit already
cleared).
The cluster writeback mechanism was eventually removed and some of
the error handling logic folded into the primary writeback path in
commit 150d5be09ce4 ("xfs: remove xfs_cancel_ioend"). This patch
accidentally conflated the two contexts by using the keepwrite logic
in ->writepage() without accounting for the fact that the page is
not dirty. Further, the keepwrite logic has no practical effect on
the core ->writepage() caller (write_cache_pages()) because it never
revisits a page in the current function invocation.
Technically, the page should be redirtied for the keepwrite logic to
have any effect. Otherwise, write_cache_pages() may find the tagged
page but will skip it since it is clean. Even if the page was
redirtied, however, there is still no practical effect to keepwrite
since write_cache_pages() does not wrap around within a single
invocation of the function. Therefore, the dirty page would simply
end up retagged on the next writeback sequence over the associated
range.
All that being said, none of this really matters because redirtying
a partially processed page introduces a potential infinite redirty
-> writeback failure loop that deviates from the current design
principle of clearing the dirty state on writepage failure to avoid
building up too much dirty, unreclaimable memory on the system.
Therefore, drop the spurious keepwrite usage and dirty state
clearing logic from iomap_writepage_map(), treat the partially
processed page the same as a fully processed page, and let the
imminent ioend failure clean up the writeback state.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
iomap writeback mapping failure only calls into ->discard_page() if
the current page has not been added to the ioend. Accordingly, the
XFS callback assumes a full page discard and invalidation. This is
problematic for sub-page block size filesystems where some portion
of a page might have been mapped successfully before a failure to
map a delalloc block occurs. ->discard_page() is not called in that
error scenario and the bio is explicitly failed by iomap via the
error return from ->prepare_ioend(). As a result, the filesystem
leaks delalloc blocks and corrupts the filesystem block counters.
Since XFS is the only user of ->discard_page(), tweak the semantics
to invoke the callback unconditionally on mapping errors and provide
the file offset that failed to map. Update xfs_discard_page() to
discard the corresponding portion of the file and pass the range
along to iomap_invalidatepage(). The latter already properly handles
both full and sub-page scenarios by not changing any iomap or page
state on sub-page invalidations.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
It is possible to expose non-zeroed post-EOF data in XFS if the new
EOF page is dirty, backed by an unwritten block and the truncate
happens to race with writeback. iomap_truncate_page() will not zero
the post-EOF portion of the page if the underlying block is
unwritten. The subsequent call to truncate_setsize() will, but
doesn't dirty the page. Therefore, if writeback happens to complete
after iomap_truncate_page() (so it still sees the unwritten block)
but before truncate_setsize(), the cached page becomes inconsistent
with the on-disk block. A mapped read after the associated page is
reclaimed or invalidated exposes non-zero post-EOF data.
For example, consider the following sequence when run on a kernel
modified to explicitly flush the new EOF page within the race
window:
$ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
$ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
...
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400: 00 00 00 00 00 00 00 00 ........
$ umount /mnt/; mount <dev> /mnt/
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400: cd cd cd cd cd cd cd cd ........
Update xfs_setattr_size() to explicitly flush the new EOF page prior
to the page truncate to ensure iomap has the latest state of the
underlying block.
Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
pcluster should be only set up for all managed pages instead of
temporary pages. Since it currently uses page->mapping to identify,
the impact is minor for now.
[ Update: Vladimir reported the kernel log becomes polluted
because PAGE_FLAGS_CHECK_AT_FREE flag(s) set if the page
allocation debug option is enabled. ]
Link: https://lore.kernel.org/r/20201022145724.27284-1-hsiangkao@aol.com
Fixes: 5ddcee1f3a1c ("erofs: get rid of __stagingpage_alloc helper")
Cc: <stable@vger.kernel.org> # 5.5+
Tested-by: Vladimir Zapolskiy <vladimir@tuxera.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
|
|
EROFS has _only one_ ondisk timestamp (ctime is currently
documented and recorded, we might also record mtime instead
with a new compat feature if needed) for each extended inode
since EROFS isn't mainly for archival purposes so no need to
keep all timestamps on disk especially for Android scenarios
due to security concerns. Also, romfs/cramfs don't have their
own on-disk timestamp, and squashfs only records mtime instead.
Let's also derive access time from ondisk timestamp rather than
leaving it empty, and if mtime/atime for each file are really
needed for specific scenarios as well, we can also use xattrs
to record them then.
Link: https://lore.kernel.org/r/20201031195102.21221-1-hsiangkao@aol.com
[ Gao Xiang: It'd be better to backport for user-friendly concern. ]
Fixes: 431339ba9042 ("staging: erofs: add inode operations")
Cc: stable <stable@vger.kernel.org> # 4.19+
Reported-by: nl6720 <nl6720@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
|
|
The cleanup for the yfs_store_opaque_acl2_operation calls the wrong
function to destroy the ACL content buffer. It's an afs_acl struct, not
a yfs_acl struct - and the free function for latter may pass invalid
pointers to kfree().
Fix this by using the afs_acl_put() function. The yfs_acl_put()
function is then no longer used and can be removed.
general protection fault, probably for non-canonical address 0x7ebde00000000: 0000 [#1] SMP PTI
...
RIP: 0010:compound_head+0x0/0x11
...
Call Trace:
virt_to_cache+0x8/0x51
kfree+0x5d/0x79
yfs_free_opaque_acl+0x16/0x29
afs_put_operation+0x60/0x114
__vfs_setxattr+0x67/0x72
__vfs_setxattr_noperm+0x66/0xe9
vfs_setxattr+0x67/0xce
setxattr+0x14e/0x184
__do_sys_fsetxattr+0x66/0x8f
do_syscall_64+0x2d/0x3a
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When using the afs.yfs.acl xattr to change an AuriStor ACL, a warning
can be generated when the request is marshalled because the buffer
pointer isn't increased after adding the last element, thereby
triggering the check at the end if the ACL wasn't empty. This just
causes something like the following warning, but doesn't stop the call
from happening successfully:
kAFS: YFS.StoreOpaqueACL2: Request buffer underflow (36<108)
Fix this simply by increasing the count prior to the check.
Fixes: f5e4546347bc ("afs: Implement YFS ACL setting")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit fc0e38dae645 ("GFS2: Fix glock deallocation race") fixed a
sd_glock_disposal accounting bug by adding a missing atomic_dec
statement, but it failed to wake up sd_glock_wait when that decrement
causes sd_glock_disposal to reach zero. As a consequence,
gfs2_gl_hash_clear can now run into a 10-minute timeout instead of
being woken up. Add the missing wakeup.
Fixes: fc0e38dae645 ("GFS2: Fix glock deallocation race")
Cc: stable@vger.kernel.org # v2.6.39+
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Perform basic sanity checks of quota headers to avoid kernel crashes on
corrupted quota files.
CC: stable@vger.kernel.org
Reported-by: syzbot+f816042a7ae2225f25ba@syzkaller.appspotmail.com
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
The on-disk quota format supports quota files with upto 2^32 blocks. Be
careful when computing quota file offsets in the quota files from block
numbers as they can overflow 32-bit types. Since quota files larger than
4GB would require ~26 millions of quota users, this is mostly a
theoretical concern now but better be careful, fuzzers would find the
problem sooner or later anyway...
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Remove unnecessary blank when calling kmalloc_array().
Link: https://lore.kernel.org/r/20201010094335.39797-1-tian.xianting@h3c.com
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
'/proc/stat' provides the field 'btime' which states the time stamp of
system boot in seconds. In case of time namespaces, the offset to the
boot time stamp was not applied earlier.
This confuses tasks which are in another time universe, e.g., in a
container of a container runtime which utilize time namespaces to
virtualize boottime.
Therefore, we make procfs to virtualize also the btime field by
subtracting the offset of the timens boottime from 'btime' before
printing the stats.
Since start_boottime of processes are seconds since boottime and the
boottime stamp is now shifted according to the timens offset, the
offset of the time namespace also needs to be applied before the
process stats are given to userspace.
This avoids that processes shown, e.g., by 'ps' appear as time
travelers in the corresponding time namespace.
Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Reviewed-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20201027204258.7869-3-michael.weiss@aisec.fraunhofer.de
|
|
A previous patch fixed the "create-unlink-getattr" idiom: if getattr is
called on an unlinked file, we try to find an open fid attached to the
corresponding inode.
We have a similar issue with file permissions and setattr:
open("./test.txt", O_RDWR|O_CREAT, 0666) = 4
chmod("./test.txt", 0) = 0
truncate("./test.txt", 0) = -1 EACCES (Permission denied)
ftruncate(4, 0) = -1 EACCES (Permission denied)
The failure is expected with truncate() but not with ftruncate().
This happens because the lookup code does find a matching fid in the
dentry list. Unfortunately, this is not an open fid and the server
will be forced to rely on the path name, rather than on an open file
descriptor. This is the case in QEMU: the setattr operation will use
truncate() and fail because of bad write permissions.
This patch changes the logic in the lookup code, so that we consider
open fids first. It gives a chance to the server to match this open
fid to an open file descriptor and use ftruncate() instead of truncate().
This does not change the current behaviour for truncate() and other
path name based syscalls, since file permissions are checked earlier
in the VFS layer.
With this patch, we get:
open("./test.txt", O_RDWR|O_CREAT, 0666) = 4
chmod("./test.txt", 0) = 0
truncate("./test.txt", 0) = -1 EACCES (Permission denied)
ftruncate(4, 0) = 0
Link: http://lkml.kernel.org/r/20200923141146.90046-4-jianyong.wu@arm.com
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
This patch adds accounting of open fids in a list hanging off the i_private
field of the corresponding inode. This allows faster lookups compared to
searching the full 9p client list.
The lookup code is modified accordingly.
Link: http://lkml.kernel.org/r/20200923141146.90046-3-jianyong.wu@arm.com
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Fixes several outstanding bug reports of not being able to getattr from an
open file after an unlink. This patch cleans up transient fids on an unlink
and will search open fids on a client if it detects a dentry that appears to
have been unlinked. This search is necessary because fstat does not pass fd
information through the VFS API to the filesystem, only the dentry which for
9p has an imperfect match to fids.
Inherent in this patch is also a fix for the qid handling on create/open
which apparently wasn't being set correctly and was necessary for the search
to succeed.
A possible optimization over this fix is to include accounting of open fids
with the inode in the private data (in a similar fashion to the way we track
transient fids with dentries). This would allow a much quicker search for
a matching open fid.
(changed v9fs_fid_find_global to v9fs_fid_find_inode in comment)
Link: http://lkml.kernel.org/r/20200923141146.90046-2-jianyong.wu@arm.com
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Added a new F2FS_IOC_GET_COMPRESS_OPTION ioctl to get file compression
option of a file.
struct f2fs_comp_option {
u8 algorithm; => compression algorithm
=> 0:lzo, 1:lz4, 2:zstd, 3:lzorle
u8 log_cluster_size; => log scale cluster size
=> 2 ~ 8
};
struct f2fs_comp_option option;
ioctl(fd, F2FS_IOC_GET_COMPRESS_OPTION, &option);
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Right now, we can end up calling cancel_delayed_work_sync from within
delete_work_func via gfs2_lookup_by_inum -> gfs2_inode_lookup ->
gfs2_cancel_delete_work. When that happens, it will result in a
deadlock. Instead, gfs2_inode_lookup should skip the call to
gfs2_cancel_delete_work when called from delete_work_func (blktype ==
GFS2_BLKST_UNLINKED).
Reported-by: Alexander Ahring Oder Aring <aahringo@redhat.com>
Fixes: a0e3cc65fa29 ("gfs2: Turn gl_delete into a delayed work")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
For oom_score_adj values in the range [942,999], the current
calculations will print 16 for oom_adj. This patch simply limits the
output so output is inline with docs.
Signed-off-by: Charles Haithcock <chaithco@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Link: https://lkml.kernel.org/r/20201020165130.33927-1-chaithco@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Like other filesystem does, we introduce a new file f2fs.h in path of
include/uapi/linux/, and move f2fs-specified ioctl interface definitions
to that file, after then, in order to use those definitions, userspace
developer only need to include the new header file rather than
copy & paste definitions from fs/f2fs/f2fs.h.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
As kitestramuort reported:
F2FS-fs (nvme0n1p4): access invalid blkaddr:1598541474
[ 25.725898] ------------[ cut here ]------------
[ 25.725903] WARNING: CPU: 6 PID: 2018 at f2fs_is_valid_blkaddr+0x23a/0x250
[ 25.725923] Call Trace:
[ 25.725927] ? f2fs_llseek+0x204/0x620
[ 25.725929] ? ovl_copy_up_data+0x14f/0x200
[ 25.725931] ? ovl_copy_up_inode+0x174/0x1e0
[ 25.725933] ? ovl_copy_up_one+0xa22/0xdf0
[ 25.725936] ? ovl_copy_up_flags+0xa6/0xf0
[ 25.725938] ? ovl_aio_cleanup_handler+0xd0/0xd0
[ 25.725939] ? ovl_maybe_copy_up+0x86/0xa0
[ 25.725941] ? ovl_open+0x22/0x80
[ 25.725943] ? do_dentry_open+0x136/0x350
[ 25.725945] ? path_openat+0xb7e/0xf40
[ 25.725947] ? __check_sticky+0x40/0x40
[ 25.725948] ? do_filp_open+0x70/0x100
[ 25.725950] ? __check_sticky+0x40/0x40
[ 25.725951] ? __check_sticky+0x40/0x40
[ 25.725953] ? __x64_sys_openat+0x1db/0x2c0
[ 25.725955] ? do_syscall_64+0x2d/0x40
[ 25.725957] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
llseek() reports invalid block address access, the root cause is if
file has inline data, f2fs_seek_block() will access inline data regard
as block address index in inode block, which should be wrong, fix it.
Reported-by: kitestramuort <kitestramuort@autistici.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When running fault injection test, if we don't stop checkpoint, some stale
NAT entries were flushed which breaks consistency.
Fixes: 86f33603f8c5 ("f2fs: handle errors of f2fs_get_meta_page_nofail")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Change the nfsroot default mount option to ask for NFSv2 only *if* the
kernel was built with NFSv2 support.
If not, default to NFSv3 or as last choice to NFSv4, depending on actual
kernel config.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and documentation fixes from Greg KH:
"Here is one tiny debugfs change to fix up an API where the last user
was successfully fixed up in 5.10-rc1 (so it couldn't be merged
earlier), and a much larger Documentation/ABI/ update to the files so
they can be automatically parsed by our tools.
The Documentation/ABI/ updates are just formatting issues, small ones
to bring the files into parsable format, and have been acked by
numerous subsystem maintainers and the documentation maintainer. I
figured it was good to get this into 5.10-rc2 to help wih the merge
issues that would arise if these were to stick in linux-next until
5.11-rc1.
The debugfs change has been in linux-next for a long time, and the
Documentation updates only for the last linux-next release"
* tag 'driver-core-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (40 commits)
scripts: get_abi.pl: assume ReST format by default
docs: ABI: sysfs-class-led-trigger-pattern: remove hw_pattern duplication
docs: ABI: sysfs-class-backlight: unify ABI documentation
docs: ABI: sysfs-c2port: remove a duplicated entry
docs: ABI: sysfs-class-power: unify duplicated properties
docs: ABI: unify /sys/class/leds/<led>/brightness documentation
docs: ABI: stable: remove a duplicated documentation
docs: ABI: change read/write attributes
docs: ABI: cleanup several ABI documents
docs: ABI: sysfs-bus-nvdimm: use the right format for ABI
docs: ABI: vdso: use the right format for ABI
docs: ABI: fix syntax to be parsed using ReST notation
docs: ABI: convert testing/configfs-acpi to ReST
docs: Kconfig/Makefile: add a check for broken ABI files
docs: abi-testing.rst: enable --rst-sources when building docs
docs: ABI: don't escape ReST-incompatible chars from obsolete and removed
docs: ABI: create a 2-depth index for ABI
docs: ABI: make it parse ABI/stable as ReST-compatible files
docs: ABI: sysfs-uevent: make it compatible with ReST output
docs: ABI: testing: make the files compatible with ReST output
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull more flexible-array member conversions from Gustavo A. R. Silva:
"Replace zero-length arrays with flexible-array members"
* tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
printk: ringbuffer: Replace zero-length array with flexible-array member
net/smc: Replace zero-length array with flexible-array member
net/mlx5: Replace zero-length array with flexible-array member
mei: hw: Replace zero-length array with flexible-array member
gve: Replace zero-length array with flexible-array member
Bluetooth: btintel: Replace zero-length array with flexible-array member
scsi: target: tcmu: Replace zero-length array with flexible-array member
ima: Replace zero-length array with flexible-array member
enetc: Replace zero-length array with flexible-array member
fs: Replace zero-length array with flexible-array member
Bluetooth: Replace zero-length array with flexible-array member
params: Replace zero-length array with flexible-array member
tracepoint: Replace zero-length array with flexible-array member
platform/chrome: cros_ec_proto: Replace zero-length array with flexible-array member
platform/chrome: cros_ec_commands: Replace zero-length array with flexible-array member
mailbox: zynqmp-ipi-message: Replace zero-length array with flexible-array member
dmaengine: ti-cppi5: Replace zero-length array with flexible-array member
|
|
Pull io_uring fixes from Jens Axboe:
- Fixes for linked timeouts (Pavel)
- Set IO_WQ_WORK_CONCURRENT early for async offload (Pavel)
- Two minor simplifications that make the code easier to read and
follow (Pavel)
* tag 'io_uring-5.10-2020-10-30' of git://git.kernel.dk/linux-block:
io_uring: use type appropriate io_kiocb handler for double poll
io_uring: simplify __io_queue_sqe()
io_uring: simplify nxt propagation in io_queue_sqe
io_uring: don't miss setting IO_WQ_WORK_CONCURRENT
io_uring: don't defer put of cancelled ltimeout
io_uring: always clear LINK_TIMEOUT after cancel
io_uring: don't adjust LINK_HEAD in cancel ltimeout
io_uring: remove opcode check on ltimeout kill
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- lockdep fixes:
- drop path locks before manipulating sysfs objects or qgroups
- preliminary fixes before tree locks get switched to rwsem
- use annotated seqlock
- build warning fixes (printk format)
- fix relocation vs fallocate race
- tree checker properly validates number of stripes and parity
- readahead vs device replace fixes
- iomap dio fix for unnecessary buffered io fallback
* tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: convert data_seqcount to seqcount_mutex_t
btrfs: don't fallback to buffered read if we don't need to
btrfs: add a helper to read the tree_root commit root for backref lookup
btrfs: drop the path before adding qgroup items when enabling qgroups
btrfs: fix readahead hang and use-after-free after removing a device
btrfs: fix use-after-free on readahead extent after failure to create it
btrfs: tree-checker: validate number of chunk stripes and parity
btrfs: tree-checker: fix incorrect printk format
btrfs: drop the path before adding block group sysfs files
btrfs: fix relocation failure due to race with fallocate
|
|
No one checks the return value of debugfs_create_devm_seqfile(), as it's
not needed, so make the return value void, so that no one tries to do so
in the future.
Link: https://lore.kernel.org/r/20201023131037.2500765-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
|
Before this patch, gfs2_fitrim was not properly checking for a "live" file
system. If the file system had something to trim and the file system
was read-only (or spectator) it would start the trim, but when it starts
the transaction, gfs2_trans_begin returns -EROFS (read-only file system)
and it errors out. However, if the file system was already trimmed so
there's no work to do, it never called gfs2_trans_begin. That code is
bypassed so it never returns the error. Instead, it returns a good
return code with 0 work. All this makes for inconsistent behavior:
The same fstrim command can return -EROFS in one case and 0 in another.
This tripped up xfstests generic/537 which reports the error as:
+fstrim with unrecovered metadata just ate your filesystem
This patch adds a check for a "live" (iow, active journal, iow, RW)
file system, and if not, returns the error properly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Before commit 97fd734ba17e, the local statfs_changeX inode was never
initialized for spectator mounts. However, it still checks for
spectator mounts when unmounting everything. There's no good reason to
lookup the statfs_changeX files because spectators cannot perform recovery.
It still, however, needs the master statfs file for statfs calls.
This patch adds the check for spectator mounts to init_statfs.
Fixes: 97fd734ba17e ("gfs2: lookup local statfs inodes prior to journal recovery")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Before this patch, function gfs2_meta_sync called filemap_fdatawrite to write
the address space for the metadata being synced. That's great for inodes, but
resource groups all point to the same superblock-address space, sdp->sd_aspace.
Each rgrp has its own range of blocks on which it should operate. That meant
every time an rgrp's metadata was synced, it would write all of them instead
of just the range.
This patch eliminates function gfs2_meta_sync and tailors specific metasync
functions for inodes and rgrps.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Hi,
Before this patch, function init_journal's "undo" directive jumped to label
fail_jinode_gh. But now that it does statfs initialization, it needs to
jump to fail_statfs instead. Failure to do so means that mount failures
after init_journal is successful will neglect to let go of the proper
statfs information, stranding the statfs_changeX inodes. This makes it
impossible to free its glocks, and results in:
gfs2: fsid=sda.s: G: s:EX n:2/805f f:Dqob t:EX d:UN/603701000 a:0 v:0 r:4 m:200 p:1
gfs2: fsid=sda.s: H: s:EX f:H e:0 p:1397947 [(ended)] init_journal+0x548/0x890 [gfs2]
gfs2: fsid=sda.s: I: n:6/32863 t:8 f:0x00 d:0x00000201 s:24 p:0
gfs2: fsid=sda.s: G: s:SH n:5/805f f:Dqob t:SH d:UN/603712000 a:0 v:0 r:3 m:200 p:0
gfs2: fsid=sda.s: H: s:SH f:EH e:0 p:1397947 [(ended)] gfs2_inode_lookup+0x1fb/0x410 [gfs2]
VFS: Busy inodes after unmount of sda. Self-destruct in 5 seconds. Have a nice day...
The next time the file system is mounted, it then reuses the same glocks,
which ends in a kernel NULL pointer dereference when trying to dump the
reused glock.
This patch makes the "undo" function of init_journal jump to fail_statfs
so the statfs files are properly deconstructed upon failure.
Fixes: 97fd734ba17e ("gfs2: lookup local statfs inodes prior to journal recovery")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|