Age | Commit message (Collapse) | Author |
|
We always know the correct state of the rmap record flags (attr, bmbt,
unwritten) so check them by direct comparison.
Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
The comment and logic in xchk_btree_check_minrecs for dealing with
inode-rooted btrees isn't quite correct. While the direct children of
the inode root are allowed to have fewer records than what would
normally be allowed for a regular ondisk btree block, this is only true
if there is only one child block and the number of records don't fit in
the inode root.
Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Patch 541656d3a513 ("gfs2: freeze should work on read-only mounts") changed
the check for glock state in function freeze_go_sync() from "gl->gl_state
== LM_ST_SHARED" to "gl->gl_req == LM_ST_EXCLUSIVE". That's wrong and it
regressed gfs2's freeze/thaw mechanism because it caused only the freezing
node (which requests the glock in EX) to queue freeze work.
All nodes go through this go_sync code path during the freeze to drop their
SHared hold on the freeze glock, allowing the freezing node to acquire it
in EXclusive mode. But all the nodes must freeze access to the file system
locally, so they ALL must queue freeze work. The freeze_work calls
freeze_func, which makes a request to reacquire the freeze glock in SH,
effectively blocking until the thaw from the EX holder. Once thawed, the
freezing node drops its EX hold on the freeze glock, then the (blocked)
freeze_func reacquires the freeze glock in SH again (on all nodes, including
the freezer) so all nodes go back to a thawed state.
This patch changes the check back to gl_state == LM_ST_SHARED like it was
prior to 541656d3a513.
Fixes: 541656d3a513 ("gfs2: freeze should work on read-only mounts")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Don't recycle a refnode until we're done with all requests of nodes
ejected before.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
An active ref_node always can be found in ctx->files_data, it's much
safer to get it this way instead of poking into files_data->ref_list.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Zorro reports that an xfstest test case is failing, and it turns out that
for the reissue path we can potentially issue a double completion on the
request for the failure path. There's an issue around the retry as well,
but for now, at least just make sure that we handle the error path
correctly.
Cc: stable@vger.kernel.org
Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Prepare for the merging of the syscall_work series which conflicts with the
TIF bits overhaul in X86.
|
|
There isn't really any valid reason to use __FSCRYPT_MODE_MAX or
FSCRYPT_POLICY_FLAGS_VALID in a userspace program. These constants are
only meant to be used by the kernel internally, and they are defined in
the UAPI header next to the mode numbers and flags only so that kernel
developers don't forget to update them when adding new modes or flags.
In https://lkml.kernel.org/r/20201005074133.1958633-2-satyat@google.com
there was an example of someone wanting to use __FSCRYPT_MODE_MAX in a
user program, and it was wrong because the program would have broken if
__FSCRYPT_MODE_MAX were ever increased. So having this definition
available is harmful. FSCRYPT_POLICY_FLAGS_VALID has the same problem.
So, remove these definitions from the UAPI header. Replace
FSCRYPT_POLICY_FLAGS_VALID with just listing the valid flags explicitly
in the one kernel function that needs it. Move __FSCRYPT_MODE_MAX to
fscrypt_private.h, remove the double underscores (which were only
present to discourage use by userspace), and add a BUILD_BUG_ON() and
comments to (hopefully) ensure it is kept in sync.
Keep the old name FS_POLICY_FLAGS_VALID, since it's been around for
longer and there's a greater chance that removing it would break source
compatibility with some program. Indeed, mtd-utils is using it in
an #ifdef, and removing it would introduce compiler warnings (about
FS_POLICY_FLAGS_PAD_* being redefined) into the mtd-utils build.
However, reduce its value to 0x07 so that it only includes the flags
with old names (the ones present before Linux 5.4), and try to make it
clear that it's now "frozen" and no new flags should be added to it.
Fixes: 2336d0deb2d4 ("fscrypt: use FSCRYPT_ prefix for uapi constants")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20201024005132.495952-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
I originally chose the name "file measurement" to refer to the fs-verity
file digest to avoid confusion with traditional full-file digests or
with the bare root hash of the Merkle tree.
But the name "file measurement" hasn't caught on, and usually people are
calling it something else, usually the "file digest". E.g. see
"struct fsverity_digest" and "struct fsverity_formatted_digest", the
libfsverity_compute_digest() and libfsverity_sign_digest() functions in
libfsverity, and the "fsverity digest" command.
Having multiple names for the same thing is always confusing.
So to hopefully avoid confusion in the future, rename
"fs-verity file measurement" to "fs-verity file digest".
This leaves FS_IOC_MEASURE_VERITY as the only reference to "measure" in
the kernel, which makes some amount of sense since the ioctl is actively
"measuring" the file.
I'll be renaming this in fsverity-utils too (though similarly the
'fsverity measure' command, which is a wrapper for
FS_IOC_MEASURE_VERITY, will stay).
Acked-by: Luca Boccassi <luca.boccassi@microsoft.com>
Link: https://lore.kernel.org/r/20201113211918.71883-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
The name "struct fsverity_signed_digest" is causing confusion because it
isn't actually a signed digest, but rather it's the way that the digest
is formatted in order to be signed. Rename it to
"struct fsverity_formatted_digest" to prevent this confusion.
Also update the struct's comment to clarify that it's specific to the
built-in signature verification support and isn't a requirement for all
fs-verity users.
I'll be renaming this struct in fsverity-utils too.
Acked-by: Luca Boccassi <luca.boccassi@microsoft.com>
Link: https://lore.kernel.org/r/20201113211918.71883-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Embedding the file path inside kernel source code files isn't
particularly useful as often files are moved around and the paths become
incorrect. checkpatch.pl warns about this since v5.10-rc1.
Acked-by: Luca Boccassi <luca.boccassi@microsoft.com>
Link: https://lore.kernel.org/r/20201113211918.71883-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
revalidate_disk_size is now only called from set_capacity_and_notify,
so drop the export.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__invalidate_device without the kill_dirty parameter just invalidates
various clean entries in caches, which doesn't really help us with
anything, but can cause all kinds of horrible lock orders due to how
it calls into the file system. The only reason this hasn't been a
major issue is because so many people use partitions, for which no
invalidation was performed anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Mid callback needs to be called only when valid data is
read into pages.
These patches address a problem found during decryption offload:
CIFS: VFS: trying to dequeue a deleted mid
that could cause a refcount use after free:
Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> #5.4+
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When reconnect happens Mid queue can be corrupted when both
demultiplex and offload thread try to dequeue the MID from the
pending list.
These patches address a problem found during decryption offload:
CIFS: VFS: trying to dequeue a deleted mid
that could cause a refcount use after free:
Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> #5.4+
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
cifs_reconnect needs to be called only from demultiplex thread.
skip cifs_reconnect in offload thread. So, cifs_reconnect will be
called by demultiplex thread in subsequent request.
These patches address a problem found during decryption offload:
CIFS: VFS: trying to dequeue a deleted mid
that can cause a refcount use after free:
[ 1271.389453] Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
[ 1271.389456] RIP: 0010:refcount_warn_saturate+0xae/0xf0
[ 1271.389457] Code: fa 1d 6a 01 01 e8 c7 44 b1 ff 0f 0b 5d c3 80 3d e7 1d 6a 01 00 75 91 48 c7 c7 d8 be 1d a2 c6 05 d7 1d 6a 01 01 e8 a7 44 b1 ff <0f> 0b 5d c3 80 3d c5 1d 6a 01 00 0f 85 6d ff ff ff 48 c7 c7 30 bf
[ 1271.389458] RSP: 0018:ffffa4cdc1f87e30 EFLAGS: 00010286
[ 1271.389458] RAX: 0000000000000000 RBX: ffff9974d2809f00 RCX: ffff9974df898cc8
[ 1271.389459] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9974df898cc0
[ 1271.389460] RBP: ffffa4cdc1f87e30 R08: 0000000000000004 R09: 00000000000002c0
[ 1271.389460] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9974b7fdb5c0
[ 1271.389461] R13: ffff9974d2809f00 R14: ffff9974ccea0a80 R15: ffff99748e60db80
[ 1271.389462] FS: 0000000000000000(0000) GS:ffff9974df880000(0000) knlGS:0000000000000000
[ 1271.389462] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1271.389463] CR2: 000055c60f344fe4 CR3: 0000001031a3c002 CR4: 00000000003706e0
[ 1271.389465] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1271.389465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1271.389466] Call Trace:
[ 1271.389483] cifs_mid_q_entry_release+0xce/0x110 [cifs]
[ 1271.389499] smb2_decrypt_offload+0xa9/0x1c0 [cifs]
[ 1271.389501] process_one_work+0x1e8/0x3b0
[ 1271.389503] worker_thread+0x50/0x370
[ 1271.389504] kthread+0x12f/0x150
[ 1271.389506] ? process_one_work+0x3b0/0x3b0
[ 1271.389507] ? __kthread_bind_mask+0x70/0x70
[ 1271.389509] ret_from_fork+0x22/0x30
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> #5.4+
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
kmemleak reported a memory leak allocated in query_info() when cifs is
working with modefromsid.
backtrace:
[<00000000aeef6a1e>] slab_post_alloc_hook+0x58/0x510
[<00000000b2f7a440>] __kmalloc+0x1a0/0x390
[<000000006d470ebc>] query_info+0x5b5/0x700 [cifs]
[<00000000bad76ce0>] SMB2_query_acl+0x2b/0x30 [cifs]
[<000000001fa09606>] get_smb2_acl_by_path+0x2f3/0x720 [cifs]
[<000000001b6ebab7>] get_smb2_acl+0x75/0x90 [cifs]
[<00000000abf43904>] cifs_acl_to_fattr+0x13b/0x1d0 [cifs]
[<00000000a5372ec3>] cifs_get_inode_info+0x4cd/0x9a0 [cifs]
[<00000000388e0a04>] cifs_revalidate_dentry_attr+0x1cd/0x510 [cifs]
[<0000000046b6b352>] cifs_getattr+0x8a/0x260 [cifs]
[<000000007692c95e>] vfs_getattr_nosec+0xa1/0xc0
[<00000000cbc7d742>] vfs_getattr+0x36/0x40
[<00000000de8acf67>] vfs_statx_fd+0x4a/0x80
[<00000000a58c6adb>] __do_sys_newfstat+0x31/0x70
[<00000000300b3b4e>] __x64_sys_newfstat+0x16/0x20
[<000000006d8e9c48>] do_syscall_64+0x37/0x80
This patch add missing kfree for pntsd when mounting modefromsid option.
Cc: Stable <stable@vger.kernel.org> # v5.4+
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Unlike ->read(), ->read_iter() instances *must* return the amount
of data they'd left in iterator. For ->read() returning less than
it has actually copied is a QoI issue; read(fd, unmapped_page - 5, 8)
is allowed to fill all 5 bytes of destination and return 4; it's
not nice to caller, but POSIX allows pretty much anything in such
situation, up to and including a SIGSEGV.
generic_file_splice_read() uses pipe-backed iterator as destination;
there a short copy comes from pipe being full, not from running into
an un{mapped,writable} page in the middle of destination as we
have for iovec-backed iterators read(2) uses. And there we rely
upon the ->read_iter() reporting the actual amount it has left
in destination.
Conversion of a ->read() instance into ->read_iter() has to watch
out for that. If you really need an "all or nothing" kind of
behaviour somewhere, you need to do iov_iter_revert() to prune
the partial copy.
In case of seq_read_iter() we can handle short copy just fine;
the data is in m->buf and next call will fetch it from there.
Fixes: d4d50710a8b4 (seq_file: add seq_read_iter)
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Where events are consumed in the kernel, for example by KVM's
irqfd_wakeup() and VFIO's virqfd_wakeup(), they currently lack a
mechanism to drain the eventfd's counter.
Since the wait queue is already locked while the wakeup functions are
invoked, all they really need to do is call eventfd_ctx_do_read().
Add a check for the lock, and export it for them.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027135523.646811-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Merge fixes from Andrew Morton:
"14 patches.
Subsystems affected by this patch series: mm (migration, vmscan, slub,
gup, memcg, hugetlbfs), mailmap, kbuild, reboot, watchdog, panic, and
ocfs2"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
ocfs2: initialize ip_next_orphan
panic: don't dump stack twice on warn
hugetlbfs: fix anon huge page migration race
mm: memcontrol: fix missing wakeup polling thread
kernel/watchdog: fix watchdog_allowed_mask not used warning
reboot: fix overflow parsing reboot cpu number
Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
compiler.h: fix barrier_data() on clang
mm/gup: use unpin_user_pages() in __gup_longterm_locked()
mm/slub: fix panic in slab_alloc_node()
mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov
mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit
mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
mm/compaction: count pages and stop correctly during page isolation
|
|
When afs_write_end() is called with copied == 0, it tries to set the
dirty region, but there's no way to actually encode a 0-length region in
the encoding in page->private.
"0,0", for example, indicates a 1-byte region at offset 0. The maths
miscalculates this and sets it incorrectly.
Fix it to just do nothing but unlock and put the page in this case. We
don't actually need to mark the page dirty as nothing presumably
changed.
Fixes: 65dd2d6072d3 ("afs: Alter dirty range encoding in page->private")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Though problem if found on a lower 4.1.12 kernel, I think upstream has
same issue.
In one node in the cluster, there is the following callback trace:
# cat /proc/21473/stack
__ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
ocfs2_evict_inode+0x152/0x820 [ocfs2]
evict+0xae/0x1a0
iput+0x1c6/0x230
ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
ocfs2_dir_foreach+0x29/0x30 [ocfs2]
ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
process_one_work+0x169/0x4a0
worker_thread+0x5b/0x560
kthread+0xcb/0xf0
ret_from_fork+0x61/0x90
The above stack is not reasonable, the final iput shouldn't happen in
ocfs2_orphan_filldir() function. Looking at the code,
2067 /* Skip inodes which are already added to recover list, since dio may
2068 * happen concurrently with unlink/rename */
2069 if (OCFS2_I(iter)->ip_next_orphan) {
2070 iput(iter);
2071 return 0;
2072 }
2073
The logic thinks the inode is already in recover list on seeing
ip_next_orphan is non-NULL, so it skip this inode after dropping a
reference which incremented in ocfs2_iget().
While, if the inode is already in recover list, it should have another
reference and the iput() at line 2070 should not be the final iput
(dropping the last reference). So I don't think the inode is really in
the recover list (no vmcore to confirm).
Note that ocfs2_queue_orphans(), though not shown up in the call back
trace, is holding cluster lock on the orphan directory when looking up
for unlinked inodes. The on disk inode eviction could involve a lot of
IOs which may need long time to finish. That means this node could hold
the cluster lock for very long time, that can lead to the lock requests
(from other nodes) to the orhpan directory hang for long time.
Looking at more on ip_next_orphan, I found it's not initialized when
allocating a new ocfs2_inode_info structure.
This causes te reflink operations from some nodes hang for very long
time waiting for the cluster lock on the orphan directory.
Fix: initialize ip_next_orphan as NULL.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Any attempt to do path resolution on /proc/self from an async worker will
yield -EOPNOTSUPP. We can safely do that resolution from the task itself,
and without blocking, so retry it from there.
Ideally io_uring would know this upfront and not have to go through the
worker thread to find out, but that doesn't currently seem feasible.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add explanation for 'frag' parameter to avoid kernel-doc issue:
fs/configfs/dir.c:277: warning: Function parameter or member 'frag' not
described in 'configfs_create_dir'
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Pull fs freeze fix and cleanups from Darrick Wong:
"A single vfs fix for 5.10, along with two subsequent cleanups.
A very long time ago, a hack was added to the vfs fs freeze protection
code to work around lockdep complaints about XFS, which would try to
run a transaction (which requires intwrite protection) to finalize an
xfs freeze (by which time the vfs had already taken intwrite).
Fast forward a few years, and XFS fixed the recursive intwrite problem
on its own, and the hack became unnecessary. Fast forward almost a
decade, and latent bugs in the code converting this hack from freeze
flags to freeze locks combine with lockdep bugs to make this reproduce
frequently enough to notice page faults racing with freeze.
Since the hack is unnecessary and causes thread race errors, just get
rid of it completely. Making this kind of vfs change midway through a
cycle makes me nervous, but a large enough number of the usual
VFS/ext4/XFS/btrfs suspects have said this looks good and solves a
real problem vector.
And once that removal is done, __sb_start_write is now simple enough
that it becomes possible to refactor the function into smaller,
simpler static inline helpers in linux/fs.h. The cleanup is
straightforward.
Summary:
- Finally remove the "convert to trylock" weirdness in the fs freezer
code. It was necessary 10 years ago to deal with nested
transactions in XFS, but we've long since removed that; and now
this is causing subtle race conditions when lockdep goes offline
and sb_start_* aren't prepared to retry a trylock failure.
- Minor cleanups of the sb_start_* fs freeze helpers"
* tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
vfs: move __sb_{start,end}_write* to fs.h
vfs: separate __sb_start_write into blocking and non-blocking helpers
vfs: remove lockdep bogosity in __sb_start_write
|
|
Pull xfs fixes from Darrick Wong:
- Fix a fairly serious problem where the reverse mapping btree key
comparison functions were silently ignoring parts of the keyspace
when doing comparisons
- Fix a thinko in the online refcount scrubber
- Fix a missing unlock in the pnfs code
* tag 'xfs-5.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix a missing unlock on error in xfs_fs_map_blocks
xfs: fix brainos in the refcount scrubber's rmap fragment processor
xfs: fix rmap key and record comparison functions
xfs: set the unwritten bit in rmap lookup flags in xchk_bmap_get_rmapextents
xfs: fix flags argument to rmap lookup when converting shared file rmaps
|
|
If this is attempted by a kthread, then return -EOPNOTSUPP as we don't
currently support that. Once we can get task_pid_ptr() doing the right
thing, then this can go away again.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull io_uring fix from Jens Axboe:
"A single fix in here, for a missed rounding case at setup time, which
caused an otherwise legitimate setup case to return -EINVAL if used
with unaligned ring size values"
* tag 'io_uring-5.10-2020-11-13' of git://git.kernel.dk/linux-block:
io_uring: round-up cq size before comparing with rounded sq size
|
|
Bounds checking tools can flag a bug in dbAdjTree() for an array index
out of bounds in dmt_stree. Since dmt_stree can refer to the stree in
both structures dmaptree and dmapctl, use the larger array to eliminate
the false positive.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
|
|
There's a missing return statement after an error is found in the
root_item, this can cause further problems when a crafted image triggers
the error.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When running the following script, btrfs will trigger an ASSERT():
#/bin/bash
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite 0 1G" $mnt/file
sync
btrfs quota enable $mnt
btrfs quota rescan -w $mnt
# Manually set the limit below current usage
btrfs qgroup limit 512M $mnt $mnt
# Crash happens
touch $mnt/file
The dmesg looks like this:
assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3230!
invalid opcode: 0000 [#1] SMP PTI
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
try_flush_qgroup+0x67/0x100 [btrfs]
__btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
btrfs_update_inode+0x9d/0x110 [btrfs]
btrfs_dirty_inode+0x5d/0xd0 [btrfs]
touch_atime+0xb5/0x100
iterate_dir+0xf1/0x1b0
__x64_sys_getdents64+0x78/0x110
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb5afe588db
[CAUSE]
In try_flush_qgroup(), we assume we don't hold a transaction handle at
all. This is true for data reservation and mostly true for metadata.
Since data space reservation always happens before we start a
transaction, and for most metadata operation we reserve space in
start_transaction().
But there is an exception, btrfs_delayed_inode_reserve_metadata().
It holds a transaction handle, while still trying to reserve extra
metadata space.
When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we
will join current transaction and commit, while we still have
transaction handle from qgroup code.
[FIX]
Let's check current->journal before we join the transaction.
If current->journal is unset or BTRFS_SEND_TRANS_STUB, it means
we are not holding a transaction, thus are able to join and then commit
transaction.
If current->journal is a valid transaction handle, we avoid committing
transaction and just end it
This is less effective than committing current transaction, as it won't
free metadata reserved space, but we may still free some data space
before new data writes.
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When doing a buffered write, through one of the write family syscalls, we
look for ranges which currently don't have allocated extents and set the
'delalloc new' bit on them, so that we can report a correct number of used
blocks to the stat(2) syscall until delalloc is flushed and ordered extents
complete.
However there are a few other places where we can do a buffered write
against a range that is mapped to a hole (no extent allocated) and where
we do not set the 'new delalloc' bit. Those places are:
- Doing a memory mapped write against a hole;
- Cloning an inline extent into a hole starting at file offset 0;
- Calling btrfs_cont_expand() when the i_size of the file is not aligned
to the sector size and is located in a hole. For example when cloning
to a destination offset beyond EOF.
So after such cases, until the corresponding delalloc range is flushed and
the respective ordered extents complete, we can report an incorrect number
of blocks used through the stat(2) syscall.
In some cases we can end up reporting 0 used blocks to stat(2), which is a
particular bad value to report as it may mislead tools to think a file is
completely sparse when its i_size is not zero, making them skip reading
any data, an undesired consequence for tools such as archivers and other
backup tools, as reported a long time ago in the following thread (and
other past threads):
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
Example reproducer:
$ cat reproducer.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -c "truncate 64K" \
-c "mmap -w 0 64K" \
-c "mwrite -S 0xab 0 64K" \
-c "munmap" \
$MNT/foo
blocks_used=$(stat -c %b $MNT/foo)
echo "blocks used: $blocks_used"
if [ $blocks_used -eq 0 ]; then
echo "ERROR: blocks used is 0"
fi
umount $DEV
$ ./reproducer.sh
blocks used: 0
ERROR: blocks used is 0
So move the logic that decides to set the 'delalloc bit' bit into the
function btrfs_set_extent_delalloc(), since that is what we use for all
those missing cases as well as for the cases that currently work well.
This change is also preparatory work for an upcoming patch that fixes
other problems related to tracking and reporting the number of bytes used
by an inode.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When dbBackSplit() fails, mp should be released to
prevent memleak. It's the same when dbJoin() fails.
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
|
|
Delete repeated words in fs/jfs/.
{for, allocation, if, the}
Insert "is" in one place to correct the grammar.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
To: linux-fsdevel@vger.kernel.org
Cc: jfs-discussion@lists.sourceforge.net
|
|
In preparation to have arguments of a function passed to callbacks attached
to functions as default, change the default callback prototype to receive a
struct ftrace_regs as the forth parameter instead of a pt_regs.
For callbacks that set the FL_SAVE_REGS flag in their ftrace_ops flags, they
will now need to get the pt_regs via a ftrace_get_regs() helper call. If
this is called by a callback that their ftrace_ops did not have a
FL_SAVE_REGS flag set, it that helper function will return NULL.
This will allow the ftrace_regs to hold enough just to get the parameters
and stack pointer, but without the worry that callbacks may have a pt_regs
that is not completely filled.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Two ext4 bug fixes, one being a revert of a commit sent during the
merge window"
* tag 'ext4_for_linus_bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Revert "ext4: fix superblock checksum calculation race"
ext4: handle dax mount option collision
|
|
There are 3 places in namei.c where the equivalent of ext2_put_page() is
open coded on a page which was returned from the ext2_get_page() call
[through the use of ext2_find_entry() and ext2_dotdot()].
Move ext2_put_page() to ext2.h and use it in namei.c
Also add a comment regarding the proper way to release the page returned
from ext2_find_entry() and ext2_dotdot().
Link: https://lore.kernel.org/r/20201112174244.701325-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Pull fscrypt fix from Eric Biggers:
"Fix a regression where new files weren't using inline encryption when
they should be"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: fix inline encryption not used on new files
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Fix jdata data corruption and glock reference leak"
* tag 'gfs2-v5.10-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix case in which ail writes are done to jdata holes
Revert "gfs2: Ignore journal log writes for jdata holes"
gfs2: fix possible reference leak in gfs2_check_blk_type
|
|
Pull NFS client bugfixes from Anna Schumaker:
"Stable fixes:
- Fix failure to unregister shrinker
Other fixes:
- Fix unnecessary locking to clear up some contention
- Fix listxattr receive buffer size
- Fix default mount options for nfsroot"
* tag 'nfs-for-5.10-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Remove unnecessary inode lock in nfs_fsync_dir()
NFS: Remove unnecessary inode locking in nfs_llseek_dir()
NFS: Fix listxattr receive buffer size
NFSv4.2: fix failure to unregister shrinker
nfsroot: Default mount option should ask for built-in NFS version
|
|
Patch b2a846dbef4e ("gfs2: Ignore journal log writes for jdata holes")
tried (unsuccessfully) to fix a case in which writes were done to jdata
blocks, the blocks are sent to the ail list, then a punch_hole or truncate
operation caused the blocks to be freed. In other words, the ail items
are for jdata holes. Before b2a846dbef4e, the jdata hole caused function
gfs2_block_map to return -EIO, which was eventually interpreted as an
IO error to the journal, and then withdraw.
This patch changes function gfs2_get_block_noalloc, which is only used
for jdata writes, so it returns -ENODATA rather than -EIO, and when
-ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
We can safely ignore it because gfs2_ail1_start_one is only called
when the jdata pages have already been written and truncated, so the
ail1 content no longer applies.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
This reverts commit b2a846dbef4ef54ef032f0f5ee188c609a0278a7.
That commit changed the behavior of function gfs2_block_map to return
-ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
false. While that fixed the intended problem for jdata, it also broke
other callers of gfs2_block_map such as some jdata block reads. Before
the patch, an encountered hole would be skipped and the buffer seen as
unmapped by the caller. The patch changed the behavior to return
-ENODATA, which is interpreted as an error by the caller.
The -ENODATA return code should be restricted to the specific case where
jdata holes are encountered during ail1 writes. That will be done in a
later patch.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
nfs_inc_stats() is already thread-safe, and there are no other reasons
to hold the inode lock here.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Remove the contentious inode lock, and instead provide thread safety
using the file->f_lock spinlock.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Certain NFSv4.2/RDMA tests fail with v5.9-rc1.
rpcrdma_convert_kvec() runs off the end of the rl_segments array
because rq_rcv_buf.tail[0].iov_len holds a very large positive
value. The resultant kernel memory corruption is enough to crash
the client system.
Callers of rpc_prepare_reply_pages() must reserve an extra XDR_UNIT
in the maximum decode size for a possible XDR pad of the contents
of the xdr_buf's pages. That guarantees the allocated receive buffer
will be large enough to accommodate the usual contents plus that XDR
pad word.
encode_op_hdr() cannot add that extra word. If it does,
xdr_inline_pages() underruns the length of the tail iovec.
Fixes: 3e1f02123fba ("NFSv4.2: add client side XDR handling for extended attributes")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
We forgot to unregister the nfs4_xattr_large_entry_shrinker.
That leaves the global list of shrinkers corrupted after unload of the
nfs module, after which possibly unrelated code that calls
register_shrinker() or unregister_shrinker() gets a BUG() with
"supervisor write access in kernel mode".
And similarly for the nfs4_xattr_large_entry_lru.
Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
Tested-By: Kris Karas <bugs-a17@moonlit-rail.com>
Fixes: 95ad37f90c33 "NFSv4.2: add client side xattr caching."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
In the fail path of gfs2_check_blk_type, forgetting to call
gfs2_glock_dq_uninit will result in rgd_gh reference leak.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In metacopy case, we should use ovl_inode_realdata() instead of
ovl_inode_real() to get real inode which has data, so that
we can get correct information of extentes in ->fiemap operation.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
There was a syzbot report with this warning but insufficient information...
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When the lower file of a metacopy is inaccessible, -EIO is returned. For
users not familiar with overlayfs internals, such as myself, the meaning of
this error may not be apparent or easy to determine, since the (metacopy)
file is present and open/stat succeed when accessed outside of the overlay.
Add a rate-limited warning for orphan metacopy to give users a hint when
investigating such errors.
Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxi23Zsmfb4rCed1n=On0NNA5KZD74jjjeyz+et32sk-gg@mail.gmail.com/
Signed-off-by: Kevin Locke <kevin@kevinlocke.name>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|