Age | Commit message (Collapse) | Author |
|
This patch addresses an issue observed when resize_inode is disabled
and an online extension of a filesysyem is performed. When a filesystem
is expanded to a size that does not require a addition of a new
descriptor block, the meta_bg feature is being enabled even though no
part of the filesystem uses this layout.
This patch ensures that the meta_bg feature is only enabled if
any of the added block groups utilize meta_bg layout.
Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
Link: https://lore.kernel.org/r/20240227131329.2608466-1-srivathsa.d.dara@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
all caller of ext4_issue_discard() would set biop to NULL since
'commit 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread
contex")', it's unnecessary to keep this parameter any more.
Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240226081731.3224470-1-haowenchao2@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20240224134822.829456-1-chengming.zhou@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently we ignore s_clusters_per_group field in the on-disk superblock
if bigalloc feature is not enabled. However e2fsprogs don't even open
the filesystem if s_clusters_per_group is invalid. This results in an
odd state where kernel happily works with the filesystem while even
e2fsck refuses to touch it. Verify that s_clusters_per_group is valid
even if bigalloc feature is not enabled to make things consistent. Due
to current e2fsprogs behavior it is unlikely there are filesystems out
in the wild (except for intentionally fuzzed ones) with invalid
s_clusters_per_group counts.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240219171033.22882-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We observed a corruption during on-line resize of a file system that is
larger than 16 TiB with 4k block size. With having more then 2^32 blocks
resize_inode is turned off by default by mke2fs. The issue can be
reproduced on a smaller file system for convenience by explicitly
turning off resize_inode. An on-line resize across an 8 GiB boundary (the
size of a meta block group in this setup) then leads to a corruption:
dev=/dev/<some_dev> # should be >= 16 GiB
mkdir -p /corruption
/sbin/mke2fs -t ext4 -b 4096 -O ^resize_inode $dev $((2 * 2**21 - 2**15))
mount -t ext4 $dev /corruption
dd if=/dev/zero bs=4096 of=/corruption/test count=$((2*2**21 - 4*2**15))
sha1sum /corruption/test
# 79d2658b39dcfd77274e435b0934028adafaab11 /corruption/test
/sbin/resize2fs $dev $((2*2**21))
# drop page cache to force reload the block from disk
echo 1 > /proc/sys/vm/drop_caches
sha1sum /corruption/test
# 3c2abc63cbf1a94c9e6977e0fbd72cd832c4d5c3 /corruption/test
2^21 = 2^15*2^6 equals 8 GiB whereof 2^15 is the number of blocks per
block group and 2^6 are the number of block groups that make a meta
block group.
The last checksum might be different depending on how the file is laid
out across the physical blocks. The actual corruption occurs at physical
block 63*2^15 = 2064384 which would be the location of the backup of the
meta block group's block descriptor. During the on-line resize the file
system will be converted to meta_bg starting at s_first_meta_bg which is
2 in the example - meaning all block groups after 16 GiB. However, in
ext4_flex_group_add we might add block groups that are not part of the
first meta block group yet. In the reproducer we achieved this by
substracting the size of a whole block group from the point where the
meta block group would start. This must be considered when updating the
backup block group descriptors to follow the non-meta_bg layout. The fix
is to add a test whether the group to add is already part of the meta
block group or not.
Fixes: 01f795f9e0d67 ("ext4: add online resizing support for meta_bg and 64-bit file systems")
Cc: <stable@vger.kernel.org>
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Tested-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
Reviewed-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
Link: https://lore.kernel.org/r/20240215155009.94493-1-mheyne@amazon.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When ext4 is mounted without journal, with discard mount option, and on
a device not supporting trim, we print error for each and every freed
extent. This is not only useless but actively harmful. Instead ignore
the EOPNOTSUPP error. Trim is only advisory anyway and when the
filesystem has journal we silently ignore trim error as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240213101601.17463-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_xattr_block_set() drops ea_inode reference in two places. Handling
it just under the 'cleanup' label is enough so drop the second
occurence.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240209112107.10585-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
net/core/page_pool_user.c
0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors")
429679dcf7d9 ("page_pool: fix netlink dump stop/resume")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"The main one is a KMSAN fix which addresses an issue introduced in
this cycle so it'd be much better to fix before releasing, and the
remaining one fixes VMA alignment for THP.
Summary:
- Fix a KMSAN uninit-value issue triggered by a crafted image
- Fix VMA alignment for memory mapped files on THP"
* tag 'erofs-for-6.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: apply proper VMA alignment for memory mapped files on THP
erofs: fix uninitialized page cache reported by KMSAN
|
|
This is a long-standing issue that uninterruptible sleep in fanotify
could make system hibernation fail if the usperspace server gets frozen
before the process waiting for the response (as reported e.g. [1][2]).
A few years ago, there was an attempt to switch to interruptible sleep
while waiting [3], but that would lead to EINTR returns from open(2)
and break userspace [4], so it's been changed to only killable [5].
And the core freezer logic had been rewritten [6][7] in v6.1, allowing
freezing in uninterrupted sleep, so we can solve this problem now.
[1] https://lore.kernel.org/lkml/1518774280-38090-1-git-send-email-t.vivek@samsung.com/
[2] https://lore.kernel.org/lkml/c1bb16b7-9eee-9cea-2c96-a512d8b3b9c7@nwra.com/
[3] https://lore.kernel.org/linux-fsdevel/20190213145443.26836-1-jack@suse.cz/
[4] https://lore.kernel.org/linux-fsdevel/d0031e3a-f050-0832-fa59-928a80ffd44b@nwra.com/
[5] https://lore.kernel.org/linux-fsdevel/20190221105558.GA20921@quack2.suse.cz/
[6] https://lore.kernel.org/lkml/20220822114649.055452969@infradead.org/
[7] https://lore.kernel.org/lkml/20230908-avoid-spurious-freezer-wakeups-v4-0-6155aa3dafae@quicinc.com/
Signed-off-by: Winston Wen <wentao@uniontech.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <BD33543C483B89AB+20240305061804.1186796-1-wentao@uniontech.com>
|
|
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/krisman/unicode into vfs.misc
Merge case-insensitive updates from Gabriel Krisman Bertazi:
- Patch case-insensitive lookup by trying the case-exact comparison
first, before falling back to costly utf8 casefolded comparison.
- Fix to forbid using a case-insensitive directory as part of an
overlayfs mount.
- Patchset to ensure d_op are set at d_alloc time for fscrypt and
casefold volumes, ensuring filesystem dentries will all have the
correct ops, whether they come from a lookup or not.
* tag 'for-next-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
libfs: Drop generic_set_encrypted_ci_d_ops
ubifs: Configure dentry operations at dentry-creation time
f2fs: Configure dentry operations at dentry-creation time
ext4: Configure dentry operations at dentry-creation time
libfs: Add helper to choose dentry operations at mount-time
libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
fscrypt: Drop d_revalidate once the key is added
fscrypt: Drop d_revalidate for valid dentries during lookup
fscrypt: Factor out a helper to configure the lookup dentry
ovl: Always reject mounting over case-insensitive directories
libfs: Attempt exact-match comparison first during casefolded lookup
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Chandan reported a AGI/AGF lock order hang on xfs/168 during recent
testing. The cause of the problem was the task running xfs_growfs
to shrink the filesystem. A failure occurred trying to remove the
free space from the btrees that the shrink would make disappear,
and that meant it ran the error handling for a partial failure.
This error path involves restoring the per-ag block reservations,
and that requires calculating the amount of space needed to be
reserved for the free inode btree. The growfs operation hung here:
[18679.536829] down+0x71/0xa0
[18679.537657] xfs_buf_lock+0xa4/0x290 [xfs]
[18679.538731] xfs_buf_find_lock+0xf7/0x4d0 [xfs]
[18679.539920] xfs_buf_lookup.constprop.0+0x289/0x500 [xfs]
[18679.542628] xfs_buf_get_map+0x2b3/0xe40 [xfs]
[18679.547076] xfs_buf_read_map+0xbb/0x900 [xfs]
[18679.562616] xfs_trans_read_buf_map+0x449/0xb10 [xfs]
[18679.569778] xfs_read_agi+0x1cd/0x500 [xfs]
[18679.573126] xfs_ialloc_read_agi+0xc2/0x5b0 [xfs]
[18679.578708] xfs_finobt_calc_reserves+0xe7/0x4d0 [xfs]
[18679.582480] xfs_ag_resv_init+0x2c5/0x490 [xfs]
[18679.586023] xfs_ag_shrink_space+0x736/0xd30 [xfs]
[18679.590730] xfs_growfs_data_private.isra.0+0x55e/0x990 [xfs]
[18679.599764] xfs_growfs_data+0x2f1/0x410 [xfs]
[18679.602212] xfs_file_ioctl+0xd1e/0x1370 [xfs]
trying to get the AGI lock. The AGI lock was held by a fstress task
trying to do an inode allocation, and it was waiting on the AGF
lock to allocate a new inode chunk on disk. Hence deadlock.
The fix for this is for the growfs code to hold the AGI over the
transaction roll it does in the error path. It already holds the AGF
locked across this, and that is what causes the lock order inversion
in the xfs_ag_resv_init() call.
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Fixes: 46141dc891f7 ("xfs: introduce xfs_ag_shrink_space()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
There are mainly two reasons that thp_get_unmapped_area() should be
used for EROFS as other filesystems:
- It's needed to enable PMD mappings as a FSDAX filesystem, see
commit 74d2fad1334d ("thp, dax: add thp_get_unmapped_area for pmd
mappings");
- It's useful together with large folios and
CONFIG_READ_ONLY_THP_FOR_FS which enable THPs for mmapped files
(e.g. shared libraries) even without FSDAX. See commit 1854bc6e2420
("mm/readahead: Align file mappings for non-DAX").
Fixes: 06252e9ce05b ("erofs: dax support for non-tailpacking regular file")
Fixes: ce529cc25b18 ("erofs: enable large folios for iomap mode")
Fixes: e6687b89225e ("erofs: enable large folios for fscache mode")
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240306053138.2240206-1-hsiangkao@linux.alibaba.com
|
|
syzbot reports a KMSAN reproducer [1] which generates a crafted
filesystem image and causes IMA to read uninitialized page cache.
Later, (rq->outputsize > rq->inputsize) will be formally supported
after either large uncompressed pclusters (> block size) or big
lclusters are landed. However, currently there is no way to generate
such filesystems by using mkfs.erofs.
Thus, let's mark this condition as unsupported for now.
[1] https://lore.kernel.org/r/0000000000002be12a0611ca7ff8@google.com
Reported-and-tested-by: syzbot+7bc44a489f0ef0670bd5@syzkaller.appspotmail.com
Fixes: 1ca01520148a ("erofs: refine z_erofs_transform_plain() for sub-page block support")
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240304035339.425857-1-hsiangkao@linux.alibaba.com
|
|
Allow hugetlb use padata_do_multithreaded for parallel initialization.
Select CONFIG_PADATA in this case.
Link: https://lkml.kernel.org/r/20240222140422.393911-7-gang.li@linux.dev
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch adds the disk map of block address ranges configured by multiple
partitions.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The same protection is provided by file->f_pos_lock.
Note, this relies on the fact that file->f_mode has FMODE_ATOMIC_POS.
This flag is cleared by stream_open(), which would prevent locking of
f_pos_lock.
Prior to commit 7de64d521bf9 ("fuse: break up fuse_open_common()")
FOPEN_STREAM on a directory would cause stream_open() to be called.
After this commit this is not done anymore, so f_pos_lock will always
be locked.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Commit 670d21c6e17f6 ("fuse: remove reliance on bdi congestion") change how
congestion_threshold is used and lock in
fuse_conn_congestion_threshold_write is not needed anymore.
1. Access to supe_block is removed along with removing of bdi congestion.
Then down_read(&fc->killsb) which protecting access to super_block is no
needed.
2. Compare num_background and congestion_threshold without holding
bg_lock. Then there is no need to hold bg_lock to update
congestion_threshold.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Our user space filesystem relies on fuse to provide POSIX interface.
In our test, a known string is written into a file and the content
is read back later to verify correct data returned. We observed wrong
data returned in read buffer in rare cases although correct data are
stored in our filesystem.
Fuse kernel module calls iov_iter_get_pages2() to get the physical
pages of the user-space read buffer passed in read(). The pages are
not pinned to avoid page migration. When page migration occurs, the
consequence are two-folds.
1) Applications do not receive correct data in read buffer.
2) fuse kernel writes data into a wrong place.
Using iov_iter_extract_pages() to pin pages fixes the issue in our
test.
An auxiliary variable "struct page **pt_pages" is used in the patch
to prepare the 2nd parameter for iov_iter_extract_pages() since
iov_iter_get_pages2() uses a different type for the 2nd parameter.
[SzM] add iov_iter_extract_will_pin(ii) and unpin only if true.
Signed-off-by: Lei Huang <lei.huang@linux.intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This flag is only set by one single user: the magical core dumping code
that looks up user pages one by one, and then writes them out using
their kernel addresses (by using a BVEC_ITER).
That actually ends up being a huge problem, because while we do use
copy_mc_to_kernel() for this case and it is able to handle the possible
machine checks involved, nothing else is really ready to handle the
failures caused by the machine check.
In particular, as reported by Tong Tiangen, we don't actually support
fault_in_iov_iter_readable() on a machine check area.
As a result, the usual logic for writing things to a file under a
filesystem lock, which involves doing a copy with page faults disabled
and then if that fails trying to fault pages in without holding the
locks with fault_in_iov_iter_readable() does not work at all.
We could decide to always just make the MC copy "succeed" (and filling
the destination with zeroes), and that would then create a core dump
file that just ignores any machine checks.
But honestly, this single special case has been problematic before, and
means that all the normal iov_iter code ends up slightly more complex
and slower.
See for example commit c9eec08bac96 ("iov_iter: Don't deal with
iter->copy_mc in memcpy_from_iter_mc()") where David Howells
re-organized the code just to avoid having to check the 'copy_mc' flags
inside the inner iov_iter loops.
So considering that we have exactly one user, and that one user is a
non-critical special case that doesn't actually ever trigger in real
life (Tong found this with manual error injection), the sane solution is
to just decide that the onus on handling the machine check lines on that
user instead.
Ergo, do the copy_mc_to_kernel() in the core dump logic itself, copying
the user data to a stable kernel page before writing it out.
Fixes: f1982740f5e7 ("iov_iter: Convert iterate*() to inline funcs")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Link: https://lore.kernel.org/r/20240305133336.3804360-1-tongtiangen@huawei.com
Link: https://lore.kernel.org/all/4e80924d-9c85-f13a-722a-6a5d2b1c225a@huawei.com/
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
FUSE remote locking code paths never add any locking state to
inode->i_flctx, so the locks_remove_posix() function called on
file close will return without calling fuse_setlk().
Therefore, as the if statement to be removed in this commit will
always be false, remove it for clearness.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Due to the fact that fuse does not count the write IO of processes in the
direct and writethrough write modes, user processes cannot track
write_bytes through the “/proc/[pid]/io” path. For example, the system
tool iotop cannot count the write operations of the corresponding process.
Signed-off-by: Zhou Jifeng <zhoujifeng@kylinos.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Some FUSE daemons want to know if the received request is a resend
request. The high bit of the fuse request ID is utilized for indicating
this, enabling the receiver to perform appropriate handling.
The init flag "FUSE_HAS_RESEND" is added to indicate this feature.
Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When a FUSE daemon panics and failover, we aim to minimize the impact on
applications by reusing the existing FUSE connection. During this process,
another daemon is employed to preserve the FUSE connection's file
descriptor. The new started FUSE Daemon will takeover the fd and continue
to provide service.
However, it is possible for some inflight requests to be lost and never
returned. As a result, applications awaiting replies would become stuck
forever. To address this, we can resend these pending requests to the
new started FUSE daemon.
This patch introduces a new notification type "FUSE_NOTIFY_RESEND", which
can trigger resending of the pending requests, ensuring they are properly
processed again.
Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
open_by_handle_at(2) can fail with -ESTALE with a valid handle returned
by a previous name_to_handle_at(2) for evicted fuse inodes, which is
especially common when entry_valid_timeout is 0, e.g. when the fuse
daemon is in "cache=none" mode.
The time sequence is like:
name_to_handle_at(2) # succeed
evict fuse inode
open_by_handle_at(2) # fail
The root cause is that, with 0 entry_valid_timeout, the dput() called in
name_to_handle_at(2) will trigger iput -> evict(), which will send
FUSE_FORGET to the daemon. The following open_by_handle_at(2) will send
a new FUSE_LOOKUP request upon inode cache miss since the previous inode
eviction. Then the fuse daemon may fail the FUSE_LOOKUP request with
-ENOENT as the cached metadata of the requested inode has already been
cleaned up during the previous FUSE_FORGET. The returned -ENOENT is
treated as -ESTALE when open_by_handle_at(2) returns.
This confuses the application somehow, as open_by_handle_at(2) fails
when the previous name_to_handle_at(2) succeeds. The returned errno is
also confusing as the requested file is not deleted and already there.
It is reasonable to fail name_to_handle_at(2) early in this case, after
which the application can fallback to open(2) to access files.
Since this issue typically appears when entry_valid_timeout is 0 which
is configured by the fuse daemon, the fuse daemon is the right person to
explicitly disable the export when required.
Also considering FUSE_EXPORT_SUPPORT actually indicates the support for
lookups of "." and "..", and there are existing fuse daemons supporting
export without FUSE_EXPORT_SUPPORT set, for compatibility, we add a new
INIT flag for such purpose.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
For the sake of consistency, let's use these helpers to extract
{u,g}id_t values from k{u,g}id_t ones.
There are no functional changes, just to make code cleaner.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Found by chance while working on support for idmapped mounts in fuse.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
During fiemap we may have to visit multiple leaves of the subvolume's
inode tree, and each time we are freeing and allocating an extent buffer
to use as a clone of each visited leaf. Optimize this by reusing cloned
extent buffers, to avoid the freeing and re-allocation both of the extent
buffer structure itself and more importantly of the pages attached to the
extent buffer.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).
This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.
The race happens like this:
1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
has delalloc in the file range [64M, 65M[, which is currently a hole;
2) Fiemap locks the inode in shared mode, then starts iterating the
inode's subvolume tree searching for file extent items, without having
the whole fiemap target range locked in the inode's io tree - the
change introduced recently by commit b0ad381fa769 ("btrfs: fix
deadlock with fiemap and extent locking"). It only locks ranges in
the io tree when it finds a hole or prealloc extent since that
commit;
3) Note that fiemap clones each leaf before using it, and this is to
avoid deadlocks when locking a file range in the inode's io tree and
the fiemap buffer is memory mapped to some file, because writing
to the page with btrfs_page_mkwrite() will wait on any ordered extent
for the page's range and the ordered extent needs to lock the range
and may need to modify the same leaf, therefore leading to a deadlock
on the leaf;
4) While iterating the file extent items in the cloned leaf before
finding the hole in the range [64M, 65M[, the delalloc in that range
is flushed and its ordered extent completes - meaning the corresponding
file extent item is in the inode's subvolume tree, but not present in
the cloned leaf that fiemap is iterating over;
5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
the cloned leaf (or a file extent item with disk_bytenr == 0 in case
the NO_HOLES feature is not enabled), it will lock that file range in
the inode's io tree and then search for delalloc by checking for the
EXTENT_DELALLOC bit in the io tree for that range and ordered extents
(with btrfs_find_delalloc_in_range()). But it finds nothing since the
delalloc in that range was already flushed and the ordered extent
completed and is gone - as a result fiemap will not report that there's
delalloc or an extent for the range [64M, 65M[, so user space will be
mislead into thinking that there's a hole in that range.
This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:
generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
--- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
@@ -1,3 +1,9 @@
QA output created by 094
fiemap run with sync
fiemap run without sync
+ERROR: couldn't find extent at 7
+map is 'HHDDHPPDPHPH'
+logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
+logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)
So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:
1) Always lock the whole range in the inode's io tree before starting to
iterate the inode's subvolume tree searching for file extent items,
just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with
fiemap and extent locking");
2) Now instead of writing to the fiemap buffer every time we have an extent
to report, write instead to a temporary buffer (1 page), and when that
buffer becomes full, stop iterating the file extent items, unlock the
range in the io tree, release the search path, submit all the entries
kept in that buffer to the fiemap buffer, and then resume the search
for file extent items after locking again the remainder of the range in
the io tree.
The buffer having a size of a page, allows for 146 entries in a system
with 4K pages. This is a large enough value to have a good performance
by avoiding too many restarts of the search for file extent items.
In other words this preserves the huge performance gains made in the
last two years to fiemap, while avoiding the deadlocks in case the
fiemap buffer is memory mapped to the same file (useless in practice,
but possible and exercised by fuzz testing and syzbot).
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At contains_pending_extent() the value of the end offset of a chunk we
found in the device's allocation state io tree is inclusive, so when
we calculate the length we pass to the in_range() macro, we must sum
1 to the expression "physical_end - physical_offset".
In practice the wrong calculation should be harmless as chunks sizes
are never 1 byte and we should never have 1 byte ranges of unallocated
space. Nevertheless fix the wrong calculation.
Reported-by: Alex Lyakas <alex.lyakas@zadara.com>
Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/
Fixes: 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
same parent
Currently "btrfs subvolume snapshot -i <qgroupid>" would always mark the
qgroup inconsistent.
This can be annoying if the fs has a lot of snapshots, and needs qgroup
to get the accounting for the amount of bytes it can free for each
snapshot.
Although we have the new simple quote as a solution, there is also a
case where we can skip the full scan, if all the following conditions
are met:
- The source subvolume belongs to a higher level parent qgroup
- The parent qgroup already owns all its bytes exclusively
- The new snapshot is also added to the same parent qgroup
In that case, we only need to add nodesize to the parent qgroup and
avoid a full rescan.
This patch would add the extra quick accounting update for such inherit.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Currently btrfs can create subvolume with an invalid qgroup inherit
without triggering any error:
# mkfs.btrfs -O quota -f $dev
# mount $dev $mnt
# btrfs subvolume create -i 2/0 $mnt/subv1
# btrfs qgroup show -prce --sync $mnt
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 16.00KiB 16.00KiB subv1
[CAUSE]
We only do a very basic size check for btrfs_qgroup_inherit structure,
but never really verify if the values are correct.
Thus in btrfs_qgroup_inherit() function, we have to skip non-existing
qgroups, and never return any error.
[FIX]
Fix the behavior and introduce extra checks:
- Introduce early check for btrfs_qgroup_inherit structure
Not only the size, but also all the qgroup ids would be verified.
And the timing is very early, so we can return error early.
This early check is very important for snapshot creation, as snapshot
is delayed to transaction commit.
- Drop support for btrfs_qgroup_inherit::num_ref_copies and
num_excl_copies
Those two members are used to specify to copy refr/excl numbers from
other qgroups.
This would definitely mark qgroup inconsistent, and btrfs-progs has
dropped the support for them for a long time.
It's time to drop the support for kernel.
- Verify the supported btrfs_qgroup_inherit::flags
Just in case we want to add extra flags for btrfs_qgroup_inherit.
Now above subvolume creation would fail with -ENOENT other than silently
ignore the non-existing qgroup.
CC: stable@vger.kernel.org # 6.7+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
To better debug issues surrounding device scans, include the device's
major and minor numbers in the device scan notice for btrfs.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_put_caching_control() is only used in block-group.c, so mark it
static.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
If qgroup is marked inconsistent (e.g. caused by operations needing full
subtree rescan, like creating a snapshot and assign to a higher level
qgroup), btrfs would immediately start leaking its data reserved space.
The following script can easily reproduce it:
mkfs.btrfs -O quota -f $dev
mount $dev $mnt
btrfs subvolume create $mnt/subv1
btrfs qgroup create 1/0 $mnt
# This snapshot creation would mark qgroup inconsistent,
# as the ownership involves different higher level qgroup, thus
# we have to rescan both source and snapshot, which can be very
# time consuming, thus here btrfs just choose to mark qgroup
# inconsistent, and let users to determine when to do the rescan.
btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1
# Now this write would lead to qgroup rsv leak.
xfs_io -f -c "pwrite 0 64k" $mnt/file1
# And at unmount time, btrfs would report 64K DATA rsv space leaked.
umount $mnt
And we would have the following dmesg output for the unmount:
BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3
BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536
[CAUSE]
Since commit e15e9f43c7ca ("btrfs: introduce
BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"),
we introduce a mode for btrfs qgroup to skip the timing consuming
backref walk, if the qgroup is already inconsistent.
But this skip also covered the data reserved freeing, thus the qgroup
reserved space for each newly created data extent would not be freed,
thus cause the leakage.
[FIX]
Make the data extent reserved space freeing mandatory.
The qgroup reserved space handling is way cheaper compared to the
backref walking part, and we always have the super sensitive leak
detector, thus it's definitely worth to always free the qgroup
reserved data space.
Reported-by: Fabian Vogt <fvogt@suse.com>
Fixes: e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting")
CC: stable@vger.kernel.org # 6.1+
Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a bug report about very suspicious tree-checker got triggered:
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0
ino=5737268
[ANALYZE]
The root cause is still unclear, but there are some clues already:
- Unaligned eb bytenr
The block bytenr is 8550954455682405139, which is not even aligned to
2.
This bytenr is fetched from extent buffer header, not from eb->start.
This means, at the initial time of read, eb header bytenr is still
correct (the very basis check to continue read), but later something
wrong happened, got at least the first page corrupted.
Thus we got such obviously incorrect value.
- Invalid extent buffer header owner
The read itself is triggered for subvolume 256, but the eb header
owner is 11858205567642294356, which is not really possible.
The problem here is, subvolume id is limited to (1 << 48 - 1),
and this one definitely goes beyond that limit.
So this value is another garbage.
We already got two garbage from an extent buffer, which passed the
initial bytenr and csum checks, but later the contents become garbage at
some point.
This looks like a page lifespan problem (e.g. we didn't properly hold the
page).
[ENHANCEMENT]
The current tree-checker only outputs things from the extent buffer,
nothing with the page status.
So this patch would enhance the tree-checker output by also dumping the
first page, which would look like this:
page:00000000aa9f3ce8 refcount:4 mapcount:0 mapping:00000000169aa6b6 index:0x1d0c pfn:0x1022e5
memcg:ffff888103456000
aops:btree_aops [btrfs] ino:1
flags: 0x2ffff0000008000(private|node=0|zone=2|lastcpupid=0xffff)
page_type: 0xffffffff()
raw: 02ffff0000008000 0000000000000000 dead000000000122 ffff88811e06e220
raw: 0000000000001d0c ffff888102fdb1d8 00000004ffffffff ffff888103456000
page dumped because: eb page dump
BTRFS critical (device dm-3): corrupt leaf: root=5 block=30457856 slot=6 ino=257 file_offset=0, invalid disk_bytenr for file extent, have 10617606235235216665, should be aligned to 4096
BTRFS error (device dm-3): read time tree block corruption detected on logical 30457856 mirror 1
From the dump we can see some extra info, something can help us to do
extra cross-checks:
- Page refcount
if it's too low, it definitely means something bad.
- Page aops
Any mapped eb page should have btree_aops with inode number 1.
- Page index
Since a mapped eb page should has its bytenr matching the page
position, (index << PAGE_SHIFT) should match the bytenr of the
bytenr from the critical line.
- Page Private flags
A mapped eb page should have Private flag set to indicate it's managed
by btrfs.
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since commit a440d48c7f93 ("Btrfs: heuristic: implement sampling
logic"), btrfs_compress_heuristic() is no longer a simple "return true",
but more complex to determine if we should compress.
Thus the comment is dead and can be confusing, just remove it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For the writer counter, it's pretty much the same as the reader counter,
and they are exclusive. So move them to the new locked bitmap.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs_subpage utilizes its atomic member @reader to manage the
reader counter. However it is only utilized to prevent the page to be
released/unlocked when we still have reads underway.
In that use case, we don't really allow multiple readers on the same
subpage sector. So here we can introduce a new locked bitmap to
represent exactly which subpage range is locked for read.
In theory we can remove btrfs_subpage::reader as it's just the set bits
of the new locked bitmap. But unfortunately bitmap doesn't provide such
handy API yet, so we still keep the reader counter.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_subpage_end_and_test_writer()
Both functions were introduced in commit 1e1de38792e0 ("btrfs: make
process_one_page() to handle subpage locking"), but they have never
been utilized out of subpage code. So just unexport them.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can pass a valid em cache pointer down to __get_extent_map() and
drop the validity check. This avoids the special case, the call stacks
are simple:
btrfs_read_folio
btrfs_do_readpage
__get_extent_map
extent_readahead
contiguous_readpages
btrfs_do_readpage
__get_extent_map
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
its limit
The NFS server should ask clients to voluntarily return unused
delegations when the number of granted delegations reaches the
max_delegations. This is so that the server can continue to
grant delegations for new requests.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add an explanation to prevent the future removal of the fill-
attribute call sites in nfsd_setattr(). Some NFSv3 client
implementations don't behave correctly if wcc data is not present in
an NFSv3 SETATTR reply.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The SLAB_MEM_SPREAD flag is already a no-op after removal of SLAB
allocator and in [1] it was fully deprecated. Remove its usage so we can
delete it from slab. No functional change.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Message-Id: <20240224135229.830356-1-chengming.zhou@linux.dev>
|
|
The SLAB_MEM_SPREAD flag is already a no-op after removal of SLAB
allocator and in [1] it was fully deprecated. Remove its usage so we can
delete it from slab. No functional change.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Message-Id: <20240224135118.830073-1-chengming.zhou@linux.dev>
|
|
The SLAB_MEM_SPREAD flag is already a no-op after removal of SLAB
allocator and in [1] it was fully deprecated. Remove its usage so we can
delete it from slab. No functional change.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Message-Id: <20240224134901.829591-1-chengming.zhou@linux.dev>
|
|
The SLAB_MEM_SPREAD flag is already a no-op after removal of SLAB
allocator and in [1] it was fully deprecated. Remove its usage so we can
delete it from slab. No functional change.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Message-Id: <20240224134816.829424-1-chengming.zhou@linux.dev>
|
|
The one remaining caller of fuse_writepage_locked() already has a folio,
so convert this function entirely. Saves a few calls to compound_head()
but no attempt is made to support large folios in this patch.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The writepage operation is deprecated as it leads to worse performance
under high memory pressure due to folios being written out in LRU order
rather than sequentially within a file. Use filemap_migrate_folio() to
support dirty folio migration instead of writepage.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|