Age | Commit message (Collapse) | Author |
|
Unused now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
->writepage is only used for single page writeback from memory reclaim,
and not called at all for cgroup writeback. Follow the lead of XFS
and remove ->writepage and rely entirely on ->writepages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
->writepage is only used for single page writeback from memory reclaim,
and not called at all for cgroup writeback. Follow the lead of XFS
and remove ->writepage and rely entirely on ->writepages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use filemap_fdatawrite_wbc instead of generic_writepages in
gfs2_ail1_start_one so that the functin can also cope with address_space
operations that only implement ->writepages and to properly account
for cgroup writeback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Replace 'the the' with 'the' in the comment.
Signed-off-by: Slark Xiao <slark_xiao@163.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The double `the' is duplicated in line 552, remove one.
Signed-off-by: Xin Gao <gaoxin@cdjrlc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Convert the last hardcoded PAGE_SIZEs of uncompressed cases.
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220619150940.121005-1-hsiangkao@linux.alibaba.com
|
|
Fold in erofs_prepare_dio() in order to simplify the code.
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220720082229.12172-1-hsiangkao@linux.alibaba.com
|
|
Let's introduce multi-reference pclusters at runtime. In details,
if one pcluster is requested by multiple extents at almost the same
time (even belong to different files), the longest extent will be
decompressed as representative and the other extents are actually
copied from the longest one in one round.
After this patch, fully-referenced extents can be correctly handled
and the full decoding check needs to be bypassed for
partial-referenced extents.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-17-hsiangkao@linux.alibaba.com
|
|
Currently, `pcl->length' records the longest decompressed length
as long as the pcluster itself isn't reclaimed. However, such
number is unneeded for the general cases since it doesn't indicate
the exact decompressed size in this round.
Instead, let's record the decompressed size for this round instead,
thus `pcl->nr_pages' can be completely dropped and pageofs_out is
also designed to be kept in sync with `pcl->length'.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-16-hsiangkao@linux.alibaba.com
|
|
Both out_bvecs and in_bvecs share the common logic for decompressed
buffers. So let's make a helper for this.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-15-hsiangkao@linux.alibaba.com
|
|
For the most cases, small pclusters can be decompressed with page
arrays on stack.
Try to leave both (de)compressed_pages on stack if possible as before.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-14-hsiangkao@linux.alibaba.com
|
|
Let's introduce struct z_erofs_decompress_backend in order to pass
on the decompression backend context between helper functions more
easier and avoid too many arguments.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-13-hsiangkao@linux.alibaba.com
|
|
In order to introduce multi-reference pclusters for compressed data
deduplication, let's get rid of the global page array for now since
it needs to be re-designed then at least.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-12-hsiangkao@linux.alibaba.com
|
|
`enum z_erofs_collectmode' is really ambiguous, but I'm not quite
sure if there are better naming, basically it's used to judge whether
inplace I/O can be used due to the current status of pclusters in
the chain.
Rename it as `enum z_erofs_pclustermode' instead.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-11-hsiangkao@linux.alibaba.com
|
|
Remove it since pagevec[] is no longer used.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-10-hsiangkao@linux.alibaba.com
|
|
Since all decompressed offsets have been integrated to bvecs[], this
patch avoids all sub-indexes so that page->private only includes a
part count and an eio flag, thus in the future folio->private can have
the same meaning.
In addition, PG_error will not be used anymore after this patch and
we're heading to use page->private (later folio->private) and
page->mapping (later folio->mapping) only.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-9-hsiangkao@linux.alibaba.com
|
|
Convert compressed_pages[] to bufvec in order to avoid using
page->private to keep onlinepage_index (decompressed offset)
for inplace I/O pages.
In the future, we only rely on folio->private to keep a countdown
to unlock folios and set folio_uptodate.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-8-hsiangkao@linux.alibaba.com
|
|
`z_erofs_decompress_pcluster()' is too long therefore it'd be better
to introduce another helper to parse compressed pages (or laterly,
compressed bvecs.)
BTW, since `compressed_bvecs' is too long as a part of the function
name, `in_bvecs' is used here instead.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-7-hsiangkao@linux.alibaba.com
|
|
Remove the old pagevec approach but keep z_erofs_page_type for now.
It will be reworked in the following commits as well.
Also rename Z_EROFS_NR_INLINE_PAGEVECS as Z_EROFS_INLINE_BVECS with
the new value 2 since it's actually enough to bootstrap.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-6-hsiangkao@linux.alibaba.com
|
|
For each pcluster, the total compressed buffers are determined in
advance, yet the number of decompressed buffers actually vary. Too
many decompressed pages can be recorded if one pcluster is highly
compressed or its pcluster size is large. That takes extra memory
footprints compared to uncompressed filesystems, especially a lot of
I/O in flight on low-ended devices.
Therefore, similar to inplace I/O, pagevec was introduced to reuse
page cache to store these pointers in the time-sharing way since
these pages are actually unused before decompressing.
In order to make it more flexable, a cleaner bufvec is used to
replace the old pagevec stuffs so that
- Decompressed offsets can be stored inline, thus it can be used
for the upcoming feature like compressed data deduplication.
It's calculated by `page_offset(page) - map->m_la';
- Towards supporting large folios for compressed inodes since
our final goal is to completely avoid page->private but use
folio->private only for all page cache pages.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-5-hsiangkao@linux.alibaba.com
|
|
`z_erofs_decompress_pcluster()' is too long therefore it'd be better
to introduce another helper to parse decompressed pages (or laterly,
decompressed bvecs.)
BTW, since `decompressed_bvecs' is too long as a part of the function
name, `out_bvecs' is used instead.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-4-hsiangkao@linux.alibaba.com
|
|
Rearrange the code and get rid of all gotos.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-3-hsiangkao@linux.alibaba.com
|
|
Since commit 5c6dcc57e2e5 ("erofs: get rid of
`struct z_erofs_collector'"), these arguments can be dropped as well.
No logic changes.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-2-hsiangkao@linux.alibaba.com
|
|
READV cannot recycle buffers as it would lose some of the data required to
reimport that buffer.
Reported-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Fixes: b66e65f41426 ("io_uring: never call io_buffer_select() for a buffer re-select")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220721131325.624788-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
in the error path of io_register_pbuf_ring, only free bl if it was
allocated.
Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/all/CANX2M5bXKw1NaHdHNVqssUUaBCs8aBpmzRNVEYEvV0n44P7ioA@mail.gmail.com/
Link: https://lore.kernel.org/all/CANX2M5YiZBXU3L6iwnaLs-HHJXRvrxM8mhPDiMDF9Y9sAvOHUA@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
fs parameter not used. So, it needs to be deleted.
Signed-off-by: Deming Wang <wangdeming@inspur.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Since commit 73f03c2b4b52 ("fuse: Restrict allow_other to the superblock's
namespace or a descendant"), access to allow_other FUSE filesystems has
been limited to users in the mounting user namespace or descendants. This
prevents a process that is privileged in its userns - but not its parent
namespaces - from mounting a FUSE fs w/ allow_other that is accessible to
processes in parent namespaces.
While this restriction makes sense overall it breaks a legitimate usecase:
I have a tracing daemon which needs to peek into process' open files in
order to symbolicate - similar to 'perf'. The daemon is a privileged
process in the root userns, but is unable to peek into FUSE filesystems
mounted by processes in child namespaces.
This patch adds a module param, allow_sys_admin_access, to act as an escape
hatch for this descendant userns logic and for the allow_other mount option
in general. Setting allow_sys_admin_access allows processes with
CAP_SYS_ADMIN in the initial userns to access FUSE filesystems irrespective
of the mounting userns or whether allow_other was set. A sysadmin setting
this param must trust FUSEs on the host to not DoS processes as described
in 73f03c2b4b52.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The commit 15c8e72e88e0 ("fuse: allow skipping control interface and forced
unmount") tries to remove the control interface for virtio-fs since it does
not support aborting requests which are being processed. But it doesn't
work now.
This patch fixes it by skipping creating the control interface if
fuse_conn->no_control is set.
Fixes: 15c8e72e88e0 ("fuse: allow skipping control interface and forced unmount")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Overlayfs may fail to complete updates when a filesystem lacks
fileattr/xattr syscall support and responds with an ENOSYS error code,
resulting in an unexpected "Function not implemented" error.
This bug may occur with FUSE filesystems, such as davfs2.
Steps to reproduce:
# install davfs2, e.g., apk add davfs2
mkdir /test mkdir /test/lower /test/upper /test/work /test/mnt
yes '' | mount -t davfs -o ro http://some-web-dav-server/path \
/test/lower
mount -t overlay -o upperdir=/test/upper,lowerdir=/test/lower \
-o workdir=/test/work overlay /test/mnt
# when "some-file" exists in the lowerdir, this fails with "Function
# not implemented", with dmesg showing "overlayfs: failed to retrieve
# lower fileattr (/some-file, err=-38)"
touch /test/mnt/some-file
The underlying cause of this regresion is actually in FUSE, which fails to
translate the ENOSYS error code returned by userspace filesystem (which
means that the ioctl operation is not supported) to ENOTTY.
Reported-by: Christian Kohlschütter <christian@kohlschutter.com>
Fixes: 72db82115d2b ("ovl: copy up sync/noatime fileattr flags")
Fixes: 59efec7b9039 ("fuse: implement ioctl support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Limit nanoseconds to 0..999999999.
Fixes: d8a5ba45457e ("[PATCH] FUSE - core")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Move dmap free worker kicker inside the critical region, so that extra
spinlock lock/unlock could be avoided.
Suggested-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic
O_TRUNC open(), so commit 76224355db75 ("fuse: truncate pagecache on
atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache()
in such a case to avoid the A-A deadlock. However, we found another A-B-B-A
deadlock related to the case above, which will cause the xfstests
generic/464 testcase hung in our virtio-fs test environment.
For example, consider two processes concurrently open one same file, one
with O_TRUNC and another without O_TRUNC. The deadlock case is described
below, if open(O_TRUNC) is already set_nowrite(acquired A), and is trying
to lock a page (acquiring B), open() could have held the page lock
(acquired B), and waiting on the page writeback (acquiring A). This would
lead to deadlocks.
open(O_TRUNC)
----------------------------------------------------------------
fuse_open_common
inode_lock [C acquire]
fuse_set_nowrite [A acquire]
fuse_finish_open
truncate_pagecache
lock_page [B acquire]
truncate_inode_page
unlock_page [B release]
fuse_release_nowrite [A release]
inode_unlock [C release]
----------------------------------------------------------------
open()
----------------------------------------------------------------
fuse_open_common
fuse_finish_open
invalidate_inode_pages2
lock_page [B acquire]
fuse_launder_page
fuse_wait_on_page_writeback [A acquire & release]
unlock_page [B release]
----------------------------------------------------------------
Besides this case, all calls of invalidate_inode_pages2() and
invalidate_inode_pages2_range() in fuse code also can deadlock with
open(O_TRUNC).
Fix by moving the truncate_pagecache() call outside the nowrite protected
region. The nowrite protection is only for delayed writeback
(writeback_cache) case, where inode lock does not protect against
truncation racing with writes on the server. Write syscalls racing with
page cache truncation still get the inode lock protection.
This patch also changes the order of filemap_invalidate_lock()
vs. fuse_set_nowrite() in fuse_open_common(). This new order matches the
order found in fuse_file_fallocate() and fuse_do_setattr().
Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
A race between write(2) and close(2) allows pages to be dirtied after
fuse_flush -> write_inode_now(). If these pages are not flushed from
fuse_release(), then there might not be a writable open file later. So any
remaining dirty pages must be written back before the file is released.
This is a partial revert of the blamed commit.
Reported-by: syzbot+6e1efbd8efaaa6860e91@syzkaller.appspotmail.com
Fixes: 36ea23374d1f ("fuse: write inode in fuse_vma_close() instead of fuse_release()")
Cc: <stable@vger.kernel.org> # v5.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Now that we finished moving setgid stripping for regular files in setgid
directories into the vfs, individual filesystem don't need to manually
strip the setgid bit anymore. Drop the now unneeded code from ceph.
Link: https://lore.kernel.org/r/1657779088-2242-4-git-send-email-xuyang2018.jy@fujitsu.com
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft)<brauner@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
* S_ISGID stripping logic is entangled with umask stripping.
If a filesystem doesn't support or enable POSIX ACLs then umask
stripping is done directly in the vfs before calling into the
filesystem.
If the filesystem does support POSIX ACLs then unmask stripping may be
done in the filesystem itself when calling posix_acl_create().
Since umask stripping has an effect on S_ISGID inheritance, e.g., by
stripping the S_IXGRP bit from the file to be created and all relevant
filesystems have to call posix_acl_create() before inode_init_owner()
where we currently take care of S_ISGID handling S_ISGID handling is
order dependent. IOW, whether or not you get a setgid bit depends on
POSIX ACLs and umask and in what order they are called.
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner() meaning
that there's additional ordering issues that influence S_SIGID
inheritance.
* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security issue.
This is not just ugly it's unsustainably messy especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.
So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.
We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This has S_ISGID handling is
unaffected unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).
The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs hat have callchains like:
sys_mknod()
-> do_mknodat(mode)
-> .mknod = ovl_mknod(mode)
-> ovl_create(mode)
-> vfs_mknod(mode)
get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is VFS security requirement.
Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.
The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.
All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.
Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.
In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:
* btrfs allows the creation of filesystem objects through various
ioctls(). Snapshot creation literally takes a snapshot and so the mode
is fully preserved and S_ISGID stripping doesn't apply.
Creating a new subvolum relies on inode_init_owner() in
btrfs_new_subvol_inode() but only creates directories and doesn't
raise S_ISGID.
* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
the actual extents ocfs2 uses a separate ioctl() that also creates the
target file.
Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
inode_init_owner() to strip the S_ISGID bit. This is the only place
where a filesystem needs to call mode_strip_sgid() directly but this
is self-inflicted pain.
* spufs doesn't go through the vfs at all and doesn't use ioctl()s
either. Instead it has a dedicated system call spufs_create() which
allows the creation of filesystem objects. But spufs only creates
directories and doesn't allo S_SIGID bits, i.e. it specifically only
allows 0777 bits.
* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directories group
unconditionally. In these cases non of the filesystems call
inode_init_owner() and thus did never strip the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.
Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.
Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
I observed the following evidence of a memory leak while running xfs/399
from the xfs fsck test suite (edited for brevity):
XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork
XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315
------------[ cut here ]------------
WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G W 5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:assfail+0x46/0x4a [xfs]
Call Trace:
<TASK>
xfs_ifork_zap_attr+0x7c/0xb0
xfs_iformat_attr_fork+0x86/0x110
xfs_inode_from_disk+0x41d/0x480
xfs_iget+0x389/0xd70
xfs_bulkstat_one_int+0x5b/0x540
xfs_bulkstat_iwalk+0x1e/0x30
xfs_iwalk_ag_recs+0xd1/0x160
xfs_iwalk_run_callbacks+0xb9/0x180
xfs_iwalk_ag+0x1d8/0x2e0
xfs_iwalk+0x141/0x220
xfs_bulkstat+0x105/0x180
xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130
xfs_file_ioctl+0xa5f/0xef0
__x64_sys_ioctl+0x82/0xa0
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This newly-added assertion checks that there aren't any incore data
structures hanging off the incore fork when we're trying to reset its
contents. From the call trace, it is evident that iget was trying to
construct an incore inode from the ondisk inode, but the attr fork
verifier failed and we were trying to undo all the memory allocations
that we had done earlier.
The three assertions in xfs_ifork_zap_attr check that the caller has
already called xfs_idestroy_fork, which clearly has not been done here.
As the zap function then zeroes the pointers, we've effectively leaked
the memory.
The shortest change would have been to insert an extra call to
xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork
call into _zap_attr, since all other callsites call _idestroy_fork
immediately prior to calling _zap_attr. IOWs, it eliminates one way to
fail.
Note: This change only applies cleanly to 2ed5b09b3e8f, since we just
reworked the attr fork lifetime. However, I think this memory leak has
existed since 0f45a1b20cd8, since the chain xfs_iformat_attr_fork ->
xfs_iformat_local -> xfs_init_local_fork will allocate
ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails,
xfs_iformat_attr_fork will free i_afp without freeing any of the stuff
hanging off i_afp. The solution for older kernels I think is to add the
missing call to xfs_idestroy_fork just prior to calling kmem_cache_free.
Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399.
Fixes: 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode")
Probably-Fixes: 0f45a1b20cd8 ("xfs: improve local fork verification")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Fix below kernel warning:
fs/xfs/scrub/repair.c:539:19: warning: variable 'agno' set but not used [-Wunused-but-set-variable]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: sunliming <sunliming@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Darrick and Sachin Sant reported that xfs/435 and xfs/436 would
report an non-empty xfs_buf slab on module remove. This isn't easily
to reproduce, but is clearly a side effect of converting the buffer
caceh to RUC freeing and lockless lookups. Sachin bisected and
Darrick hit it when testing the patchset directly.
Turns out that the xfs_buf slab is not destroyed when all the other
XFS slab caches are destroyed. Instead, it's got it's own little
wrapper function that gets called separately, and so it doesn't have
an rcu_barrier() call in it that is needed to drain all the rcu
callbacks before the slab is destroyed.
Fix it by removing the xfs_buf_init/terminate wrappers that just
allocate and destroy the xfs_buf slab, and move them to the same
place that all the other slab caches are set up and destroyed.
Reported-and-tested-by: Sachin Sant <sachinp@linux.ibm.com>
Fixes: 298f34224506 ("xfs: lockless buffer lookup")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
These NULL check are no long needed after commit 2ed5b09b3e8f ("xfs:
make inode attribute forks a permanent part of struct xfs_inode").
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The 'ctime', 'mtime', and 'atime' for inode is the type of
'xfs_timestamp_t', which is a 64-bit type:
/* fs/xfs/libxfs/xfs_format.h begin */
typedef __be64 xfs_timestamp_t;
/* fs/xfs/libxfs/xfs_format.h end */
When the 'bigtime' feature is disabled, this 64-bit type is splitted
into two parts of 32-bit, one part is encoded for seconds since
1970-01-01 00:00:00 UTC, the other part is encoded for nanoseconds
above the seconds, this two parts are the type of
'xfs_legacy_timestamp' and the min and max time value of this type are
defined as macros 'XFS_LEGACY_TIME_MIN' and 'XFS_LEGACY_TIME_MAX':
/* fs/xfs/libxfs/xfs_format.h begin */
struct xfs_legacy_timestamp {
__be32 t_sec; /* timestamp seconds */
__be32 t_nsec; /* timestamp nanoseconds */
};
#define XFS_LEGACY_TIME_MIN ((int64_t)S32_MIN)
#define XFS_LEGACY_TIME_MAX ((int64_t)S32_MAX)
/* fs/xfs/libxfs/xfs_format.h end */
/* include/linux/limits.h begin */
#define U32_MAX ((u32)~0U)
#define S32_MAX ((s32)(U32_MAX >> 1))
#define S32_MIN ((s32)(-S32_MAX - 1))
/* include/linux/limits.h end */
'XFS_LEGACY_TIME_MIN' is the min time value of the
'xfs_legacy_timestamp', that is -(2^31) seconds relative to the
1970-01-01 00:00:00 UTC, it can be converted to human-friendly time
value by 'date' command:
/* command begin */
[root@~]# date --utc -d '@0' +'%Y-%m-%d %H:%M:%S'
1970-01-01 00:00:00
[root@~]# date --utc -d "@`echo '-(2^31)'|bc`" +'%Y-%m-%d %H:%M:%S'
1901-12-13 20:45:52
[root@~]#
/* command end */
When 'bigtime' feature is enabled, this 64-bit type becomes a 64-bit
nanoseconds counter, with the start time value is the min time value of
'xfs_legacy_timestamp'(start time means the value of 64-bit nanoseconds
counter is 0). We have already caculated the min time value of
'xfs_legacy_timestamp', that is 1901-12-13 20:45:52 UTC, but the comment
for the start time value of inode with 'bigtime' feature enabled writes
the value is 1901-12-31 20:45:52 UTC:
/* fs/xfs/libxfs/xfs_format.h begin */
/*
* XFS Timestamps
* ==============
* When the bigtime feature is enabled, ondisk inode timestamps become an
* unsigned 64-bit nanoseconds counter. This means that the bigtime inode
* timestamp epoch is the start of the classic timestamp range, which is
* Dec 31 20:45:52 UTC 1901. ...
...
*/
/* fs/xfs/libxfs/xfs_format.h end */
That is a typo, and this patch corrects the typo, from 'Dec 31' to
'Dec 13'.
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Xiaole He <hexiaole@kylinos.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.
Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com
Fixes: 60545d0d4610 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: <stable@vger.kernel.org> # 4.19+
Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Add a dedicated helper to handle the setgid bit when creating a new file
in a setgid directory. This is a preparatory patch for moving setgid
stripping into the vfs. The patch contains no functional changes.
Currently the setgid stripping logic is open-coded directly in
inode_init_owner() and the individual filesystems are responsible for
handling setgid inheritance. Since this has proven to be brittle as
evidenced by old issues we uncovered over the last months (see [1] to
[3] below) we will try to move this logic into the vfs.
Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
Link: https://lore.kernel.org/r/1657779088-2242-1-git-send-email-xuyang2018.jy@fujitsu.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.
This commit introduced a regression that can cause mount hung. The
changes in __ocfs2_find_empty_slot causes that any node with none-zero
node number can grab the slot that was already taken by node 0, so node 1
will access the same journal with node 0, when it try to grab journal
cluster lock, it will hung because it was already acquired by node 0.
It's very easy to reproduce this, in one cluster, mount node 0 first, then
node 1, you will see the following call trace from node 1.
[13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
[13148.739691] Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
[13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13148.745846] task:mount.ocfs2 state:D stack: 0 pid:53045 ppid: 53044 flags:0x00004000
[13148.749354] Call Trace:
[13148.750718] <TASK>
[13148.752019] ? usleep_range+0x90/0x89
[13148.753882] __schedule+0x210/0x567
[13148.755684] schedule+0x44/0xa8
[13148.757270] schedule_timeout+0x106/0x13c
[13148.759273] ? __prepare_to_swait+0x53/0x78
[13148.761218] __wait_for_common+0xae/0x163
[13148.763144] __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2]
[13148.765780] ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
[13148.768312] ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
[13148.770968] ocfs2_journal_init+0x91/0x340 [ocfs2]
[13148.773202] ocfs2_check_volume+0x39/0x461 [ocfs2]
[13148.775401] ? iput+0x69/0xba
[13148.777047] ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2]
[13148.779646] ocfs2_fill_super+0x54b/0x853 [ocfs2]
[13148.781756] mount_bdev+0x190/0x1b7
[13148.783443] ? ocfs2_remount+0x440/0x440 [ocfs2]
[13148.785634] legacy_get_tree+0x27/0x48
[13148.787466] vfs_get_tree+0x25/0xd0
[13148.789270] do_new_mount+0x18c/0x2d9
[13148.791046] __x64_sys_mount+0x10e/0x142
[13148.792911] do_syscall_64+0x3b/0x89
[13148.794667] entry_SYSCALL_64_after_hwframe+0x170/0x0
[13148.797051] RIP: 0033:0x7f2309f6e26e
[13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e
[13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0
[13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820
[13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420
[13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000
[13148.816564] </TASK>
To fix it, we can just fix __ocfs2_find_empty_slot. But original commit
introduced the feature to mount ocfs2 locally even it is cluster based,
that is a very dangerous, it can easily cause serious data corruption,
there is no way to stop other nodes mounting the fs and corrupting it.
Setup ha or other cluster-aware stack is just the cost that we have to
take for avoiding corruption, otherwise we have to do it in kernel.
Link: https://lkml.kernel.org/r/20220603222801.42488-1-junxiao.bi@oracle.com
Fixes: 912f655d78c5("ocfs2: mount shared volume without ha stack")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
sendfile has to return EAGAIN if out_fd is nonblocking and the write into
it would block.
Here is a small reproducer for the problem:
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/sendfile.h>
#define FILE_SIZE (1UL << 30)
int main(int argc, char **argv) {
int p[2], fd;
if (pipe2(p, O_NONBLOCK))
return 1;
fd = open(argv[1], O_RDWR | O_TMPFILE, 0666);
if (fd < 0)
return 1;
ftruncate(fd, FILE_SIZE);
if (sendfile(p[1], fd, 0, FILE_SIZE) == -1) {
fprintf(stderr, "FAIL\n");
}
if (sendfile(p[1], fd, 0, FILE_SIZE) != -1 || errno != EAGAIN) {
fprintf(stderr, "FAIL\n");
}
return 0;
}
It worked before b964bf53e540, it is stuck after b964bf53e540, and it
works again with this fix.
This regression occurred because do_splice_direct() calls pipe_write
that handles O_NONBLOCK. Here is a trace log from the reproducer:
1) | __x64_sys_sendfile64() {
1) | do_sendfile() {
1) | __fdget()
1) | rw_verify_area()
1) | __fdget()
1) | rw_verify_area()
1) | do_splice_direct() {
1) | rw_verify_area()
1) | splice_direct_to_actor() {
1) | do_splice_to() {
1) | rw_verify_area()
1) | generic_file_splice_read()
1) + 74.153 us | }
1) | direct_splice_actor() {
1) | iter_file_splice_write() {
1) | __kmalloc()
1) 0.148 us | pipe_lock();
1) 0.153 us | splice_from_pipe_next.part.0();
1) 0.162 us | page_cache_pipe_buf_confirm();
... 16 times
1) 0.159 us | page_cache_pipe_buf_confirm();
1) | vfs_iter_write() {
1) | do_iter_write() {
1) | rw_verify_area()
1) | do_iter_readv_writev() {
1) | pipe_write() {
1) | mutex_lock()
1) 0.153 us | mutex_unlock();
1) 1.368 us | }
1) 1.686 us | }
1) 5.798 us | }
1) 6.084 us | }
1) 0.174 us | kfree();
1) 0.152 us | pipe_unlock();
1) + 14.461 us | }
1) + 14.783 us | }
1) 0.164 us | page_cache_pipe_buf_release();
... 16 times
1) 0.161 us | page_cache_pipe_buf_release();
1) | touch_atime()
1) + 95.854 us | }
1) + 99.784 us | }
1) ! 107.393 us | }
1) ! 107.699 us | }
Link: https://lkml.kernel.org/r/20220415005015.525191-1-avagin@gmail.com
Fixes: b964bf53e540 ("teach sendfile(2) to handle send-to-pipe directly")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Syzkaller reported use-after-free bug as follows:
==================================================================
BUG: KASAN: use-after-free in ntfs_ucsncmp+0x123/0x130
Read of size 2 at addr ffff8880751acee8 by task a.out/879
CPU: 7 PID: 879 Comm: a.out Not tainted 5.19.0-rc4-next-20220630-00001-gcc5218c8bd2c-dirty #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x1c0/0x2b0
print_address_description.constprop.0.cold+0xd4/0x484
print_report.cold+0x55/0x232
kasan_report+0xbf/0xf0
ntfs_ucsncmp+0x123/0x130
ntfs_are_names_equal.cold+0x2b/0x41
ntfs_attr_find+0x43b/0xb90
ntfs_attr_lookup+0x16d/0x1e0
ntfs_read_locked_attr_inode+0x4aa/0x2360
ntfs_attr_iget+0x1af/0x220
ntfs_read_locked_inode+0x246c/0x5120
ntfs_iget+0x132/0x180
load_system_files+0x1cc6/0x3480
ntfs_fill_super+0xa66/0x1cf0
mount_bdev+0x38d/0x460
legacy_get_tree+0x10d/0x220
vfs_get_tree+0x93/0x300
do_new_mount+0x2da/0x6d0
path_mount+0x496/0x19d0
__x64_sys_mount+0x284/0x300
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f2118d9ea
Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc269deac8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3f2118d9ea
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffc269dec00
RBP: 00007ffc269dec80 R08: 00007ffc269deb00 R09: 00007ffc269dec44
R10: 0000000000000000 R11: 0000000000000202 R12: 000055f81ab1d220
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The buggy address belongs to the physical page:
page:0000000085430378 refcount:1 mapcount:1 mapping:0000000000000000 index:0x555c6a81d pfn:0x751ac
memcg:ffff888101f7e180
anon flags: 0xfffffc00a0014(uptodate|lru|mappedtodisk|swapbacked|node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc00a0014 ffffea0001bf2988 ffffea0001de2448 ffff88801712e201
raw: 0000000555c6a81d 0000000000000000 0000000100000000 ffff888101f7e180
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880751acd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8880751ace00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8880751ace80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
ffff8880751acf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8880751acf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
The reason is that struct ATTR_RECORD->name_offset is 6485, end address of
name string is out of bounds.
Fix this by adding sanity check on end address of attribute name string.
[akpm@linux-foundation.org: coding-style cleanups]
[chenxiaosong2@huawei.com: cleanup suggested by Hawkins Jiawei]
Link: https://lkml.kernel.org/r/20220709064511.3304299-1-chenxiaosong2@huawei.com
Link: https://lkml.kernel.org/r/20220707105329.4020708-1-chenxiaosong2@huawei.com
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: ChenXiaoSong <chenxiaosong2@huawei.com>
Cc: Yongqiang Liu <liuyongqiang13@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The previous patch added flock_translate_cmd() in flock syscall.
The test and the other one for LOCK_MAND do not depend on struct
fd and are cheaper, so we can put them at the top and defer
fdget() after that.
Also, we can remove the unlock variable and use type instead.
While at it, we fix this checkpatch error.
CHECK: spaces preferred around that '|' (ctx:VxV)
#45: FILE: fs/locks.c:2099:
+ if (type != F_UNLCK && !(f.file->f_mode & (FMODE_READ|FMODE_WRITE)))
^
Finally, we can move the can_sleep part just before we use it.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
|
|
Two functions, flock syscall and locks_remove_flock(), call
flock_make_lock(). It allocates struct file_lock from slab
cache if its argument fl is NULL.
When we call flock syscall, we pass NULL to allocate memory
for struct file_lock. However, we always free it at the end
by locks_free_lock(). We need not allocate it and instead
should use a local variable as locks_remove_flock() does.
Also, the validation for flock_translate_cmd() is not necessary
for locks_remove_flock(). So we move the part to flock syscall
and make flock_make_lock() return nothing.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
|
|
Fix spelling typo in comment.
Link: https://lkml.kernel.org/r/20220715060035.632903-1-13667453960@163.com
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Reported-by: k2ci <kernel-bot@kylinos.cn>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc")
moved proc_flush_task() behind __exit_signal(). Then, process systemd can
take long period high cpu usage during releasing task in following
concurrent processes:
systemd ps
kernel_waitid stat(/proc/tgid)
do_wait filename_lookup
wait_consider_task lookup_fast
release_task
__exit_signal
__unhash_process
detach_pid
__change_pid // remove task->pid_links
d_revalidate -> pid_revalidate // 0
d_invalidate(/proc/tgid)
shrink_dcache_parent(/proc/tgid)
d_walk(/proc/tgid)
spin_lock_nested(/proc/tgid/fd)
// iterating opened fd
proc_flush_pid |
d_invalidate (/proc/tgid/fd) |
shrink_dcache_parent(/proc/tgid/fd) |
shrink_dentry_list(subdirs) ↓
shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock
Function d_invalidate() will remove dentry from hash firstly, but why does
proc_flush_pid() process dentry '/proc/tgid/fd' before dentry
'/proc/tgid'? That's because proc_pid_make_inode() adds proc inode in
reverse order by invoking hlist_add_head_rcu(). But proc should not add
any inodes under '/proc/tgid' except '/proc/tgid/task/pid', fix it by
adding inode into 'pid->inodes' only if the inode is /proc/tgid or
/proc/tgid/task/pid.
Performance regression:
Create 200 tasks, each task open one file for 50,000 times. Kill all
tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr).
Before fix:
$ time killall -wq aa
real 4m40.946s # During this period, we can see 'ps' and 'systemd'
taking high cpu usage.
After fix:
$ time killall -wq aa
real 1m20.732s # During this period, we can see 'systemd' taking
high cpu usage.
Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com
Fixes: 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|