Age | Commit message (Collapse) | Author |
|
Pull vfs fix from Al Viro:
"Amir's copy_file_range() fix"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: fix copy_file_range() averts filesystem freeze protection
|
|
Pull cifs fixes from Steve French:
"Two small cifs/smb3 client fixes:
- an unlock missing in an error path in copychunk_range found by
xfstest 476
- a fix for a use after free in a debug code path"
* tag '6.1-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix missing unlock in cifs_file_copychunk_range()
cifs: Use after free in debug code
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Fix rare data corruption on READ operations
* tag 'nfsd-6.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix reads with a non-zero offset that don't end on a page boundary
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fixes from Damien Le Moal:
- Fix a race between zonefs module initialization of sysfs attribute
directory and mounting a drive (from Xiaoxu).
- Fix active zone accounting in the rare case of an IO error due to a
zone transition to offline or read-only state (from me).
* tag 'zonefs-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Fix active zone accounting
zonefs: Fix race between modprobe and mount
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix a regression in nowait + buffered write
- in zoned mode fix endianness when comparing super block generation
- locking and lockdep fixes:
- fix potential sleeping under spinlock when setting qgroup limit
- lockdep warning fixes when btrfs_path is freed after copy_to_user
- do not modify log tree while holding a leaf from fs tree locked
- fix freeing of sysfs files of static features on error
- use kv.alloc for zone map allocation as a fallback to avoid warnings
due to high order allocation
- send, avoid unaligned encoded writes when attempting to clone range
* tag 'for-6.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs: normalize the error handling branch in btrfs_init_sysfs()
btrfs: do not modify log tree while holding a leaf from fs tree locked
btrfs: use kvcalloc in btrfs_get_dev_zone_info
btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()
btrfs: send: avoid unaligned encoded writes when attempting to clone range
btrfs: zoned: fix missing endianness conversion in sb_write_pointer
btrfs: free btrfs_path before copying subvol info to userspace
btrfs: free btrfs_path before copying fspath to userspace
btrfs: free btrfs_path before copying inodes to userspace
btrfs: free btrfs_path before copying root refs to userspace
btrfs: fix assertion failure and blocking during nowait buffered write
|
|
OFFSET_MAX is self-annotated and more readable.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Make it clear that nothing new should be added to this function.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221125192047.18916-1-ebiggers@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0
issues.
There have been a lot of hotfixes this cycle, and this is quite a
large batch given how far we are into the -rc cycle. Presumably a
reflection of the unusually large amount of MM material which went
into 6.1-rc1"
* tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
test_kprobes: fix implicit declaration error of test_kprobes
nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
swapfile: fix soft lockup in scan_swap_map_slots
hugetlb: fix __prep_compound_gigantic_page page flag setting
kfence: fix stack trace pruning
proc/meminfo: fix spacing in SecPageTables
mm: multi-gen LRU: retry folios written back while isolated
mailmap: update email address for Satya Priya
mm/migrate_device: return number of migrating pages in args->cpages
kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
MAINTAINERS: update Alex Hung's email address
mailmap: update Alex Hung's email address
mm: mmap: fix documentation for vma_mas_szero
mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
mm/memory: return vm_fault_t result from migrate_to_ram() callback
mm: correctly charge compressed memory to its memcg
ipc/shm: call underlying open/close vm_ops
gcov: clang: fix the buffer overflow issue
...
|
|
Pull vfs fixes from Al Viro:
"A couple of fixes, one of them for this cycle regression..."
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: vfs_tmpfile: ensure O_EXCL flag is enforced
fs: use acquire ordering in __fget_light()
|
|
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
If a file zone transitions to the offline or readonly state from an
active state, we must clear the zone active flag and decrement the
active seq file counter. Do so in zonefs_account_active() using the new
zonefs inode flags ZONEFS_ZONE_OFFLINE and ZONEFS_ZONE_READONLY. These
flags are set if necessary in zonefs_check_zone_condition() based on the
result of report zones operation after an IO error.
Fixes: 87c9ce3ffec9 ("zonefs: Add active seq file accounting")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
|
|
Commit 868f9f2f8e00 ("vfs: fix copy_file_range() regression in cross-fs
copies") removed fallback to generic_copy_file_range() for cross-fs
cases inside vfs_copy_file_range().
To preserve behavior of nfsd and ksmbd server-side-copy, the fallback to
generic_copy_file_range() was added in nfsd and ksmbd code, but that
call is missing sb_start_write(), fsnotify hooks and more.
Ideally, nfsd and ksmbd would pass a flag to vfs_copy_file_range() that
will take care of the fallback, but that code would be subtle and we got
vfs_copy_file_range() logic wrong too many times already.
Instead, add a flag to explicitly request vfs_copy_file_range() to
perform only generic_copy_file_range() and let nfsd and ksmbd use this
flag only in the fallback path.
This choise keeps the logic changes to minimum in the non-nfsd/ksmbd code
paths to reduce the risk of further regressions.
Fixes: 868f9f2f8e00 ("vfs: fix copy_file_range() regression in cross-fs copies")
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Remove the pointless keying argument and associated enum and pass the
fill_super callback and a "bool reconf" instead. Also mark the function
static given that there are no users outside of super.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
it's trivial now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The only real difference is in filling per-thread notes - getting
the values of registers. And this is the only part that is worth
an ifdef - we don't need to duplicate the logics regarding gathering
threads, filling other notes, etc.
It would've been hard to do back when regset-based variant had been
introduced, mostly due to sharing bits and pieces of helpers with
aout coredumps. As the result, too much had been duplicated and
the copies had drifted away since then. Now it can be done cleanly...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
elf_core_copy_regs() is equivalent to elf_core_copy_task_regs() of
current on all architectures.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
argument)
Don't bother with pointless macros - we are not sharing it with aout coredumps
anymore. Just convert the underlying functions to the same arguments (nobody
uses regs, actually) and call them elf_core_copy_task_fpregs(). And unexport
the entire bunch, while we are at it.
[added missing includes in arch/{csky,m68k,um}/kernel/process.c to avoid extra
warnings about the lack of externs getting added to huge piles for those
files. Pointless, but...]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
copy_mnt_ns() has the old tree copied, with mntns binding *and* anything
bound on top of them skipped. Then it proceeds to walk both trees in
parallel. Unfortunately, it doesn't get the "skip the stuff we'd skipped
when copying" quite right. Consequences are minor (the ->mnt_root
comparison will return the situation to sanity pretty soon and the worst
we get is the unexpected subset of opened non-directories being switched
to new namespace), but it's confusing enough and it's not hard to get
the expected behaviour...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix a regression in the lazytime code that was introduced in v6.1-rc1,
and a use-after-free that can be triggered by a maliciously corrupted
file system"
* tag 'ext4_for_linus_stable2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fs: do not update freeing inode i_io_list
ext4: fix use-after-free in ext4_ext_shift_extents
|
|
The devnode() in struct class should not be modifying the device that is
passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Justin Sanders <justin@coraid.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Cc: Liam Mark <lmark@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: John Stultz <jstultz@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sean Young <sean@mess.org>
Cc: Frank Haverkamp <haver@linux.ibm.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Xie Yongji <xieyongji@bytedance.com>
Cc: Gautam Dawar <gautam.dawar@xilinx.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Eli Cohen <elic@nvidia.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: alsa-devel@alsa-project.org
Cc: dri-devel@lists.freedesktop.org
Cc: kvm@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-block@vger.kernel.org
Cc: linux-input@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Link: https://lore.kernel.org/r/20221123122523.1332370-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This was found when virtual machines with nfs-mounted qcow2 disks
failed to boot properly.
Reported-by: Anders Blomdell <anders.blomdell@control.lth.se>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2142132
Fixes: bfbfb6182ad1 ("nfsd_splice_actor(): handle compound pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The type of a->key[0] is char in fscache_volume_same(). If the length
of cache volume key is greater than 127, the value of a->key[0] is less
than 0. In this case, klen becomes much larger than 255 after type
conversion, because the type of klen is size_t. As a result, memcmp()
is read out of bounds.
This causes a slab-out-of-bounds Read in __fscache_acquire_volume(), as
reported by Syzbot.
Fix this by changing the type of the stored key to "u8 *" rather than
"char *" (it isn't a simple string anyway). Also put in a check that
the volume name doesn't exceed NAME_MAX.
BUG: KASAN: slab-out-of-bounds in memcmp+0x16f/0x1c0 lib/string.c:757
Read of size 8 at addr ffff888016f3aa90 by task syz-executor344/3613
Call Trace:
memcmp+0x16f/0x1c0 lib/string.c:757
memcmp include/linux/fortify-string.h:420 [inline]
fscache_volume_same fs/fscache/volume.c:133 [inline]
fscache_hash_volume fs/fscache/volume.c:171 [inline]
__fscache_acquire_volume+0x76c/0x1080 fs/fscache/volume.c:328
fscache_acquire_volume include/linux/fscache.h:204 [inline]
v9fs_cache_session_get_cookie+0x143/0x240 fs/9p/cache.c:34
v9fs_session_init+0x1166/0x1810 fs/9p/v9fs.c:473
v9fs_mount+0xba/0xc90 fs/9p/vfs_super.c:126
legacy_get_tree+0x105/0x220 fs/fs_context.c:610
vfs_get_tree+0x89/0x2f0 fs/super.c:1530
do_new_mount fs/namespace.c:3040 [inline]
path_mount+0x1326/0x1e20 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__x64_sys_mount+0x27f/0x300 fs/namespace.c:3568
Fixes: 62ab63352350 ("fscache: Implement volume registration")
Reported-by: syzbot+a76f6a6e524cf2080aa3@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Zhang Peng <zhangpeng362@huawei.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/Y3OH+Dmi0QIOK18n@codewreck.org/ # Zhang Peng's v1 fix
Link: https://lore.kernel.org/r/20221115140447.2971680-1-zhangpeng362@huawei.com/ # Zhang Peng's v2 fix
Link: https://lore.kernel.org/r/166869954095.3793579.8500020902371015443.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix kernel-doc warnings. Many of these are about a function's
return value, so use the kernel-doc Return: format to fix those
Use % prefix on numeric constant values.
dir.c: fix typos/spellos
file.c fix typo: s/taret/target/
Fix all of these kernel-doc warnings:
dir.c:305: warning: missing initial short description on line:
* kernfs_name_hash
dir.c:137: warning: No description found for return value of 'kernfs_path_from_node_locked'
dir.c:196: warning: No description found for return value of 'kernfs_name'
dir.c:224: warning: No description found for return value of 'kernfs_path_from_node'
dir.c:292: warning: No description found for return value of 'kernfs_get_parent'
dir.c:312: warning: No description found for return value of 'kernfs_name_hash'
dir.c:404: warning: No description found for return value of 'kernfs_unlink_sibling'
dir.c:588: warning: No description found for return value of 'kernfs_node_from_dentry'
dir.c:806: warning: No description found for return value of 'kernfs_find_ns'
dir.c:879: warning: No description found for return value of 'kernfs_find_and_get_ns'
dir.c:904: warning: No description found for return value of 'kernfs_walk_and_get_ns'
dir.c:927: warning: No description found for return value of 'kernfs_create_root'
dir.c:996: warning: No description found for return value of 'kernfs_root_to_node'
dir.c:1016: warning: No description found for return value of 'kernfs_create_dir_ns'
dir.c:1048: warning: No description found for return value of 'kernfs_create_empty_dir'
dir.c:1306: warning: No description found for return value of 'kernfs_next_descendant_post'
dir.c:1568: warning: No description found for return value of 'kernfs_remove_self'
dir.c:1630: warning: No description found for return value of 'kernfs_remove_by_name_ns'
dir.c:1667: warning: No description found for return value of 'kernfs_rename_ns'
file.c:66: warning: No description found for return value of 'of_on'
file.c:88: warning: No description found for return value of 'kernfs_deref_open_node_locked'
file.c:1036: warning: No description found for return value of '__kernfs_create_file'
inode.c:100: warning: No description found for return value of 'kernfs_setattr'
mount.c:160: warning: No description found for return value of 'kernfs_root_from_sb'
mount.c:198: warning: No description found for return value of 'kernfs_node_dentry'
mount.c:302: warning: No description found for return value of 'kernfs_super_ns'
mount.c:318: warning: No description found for return value of 'kernfs_get_tree'
symlink.c:28: warning: No description found for return value of 'kernfs_create_link'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221112031456.22980-1-rdunlap@infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Although kset_unregister() can eventually remove all attribute files,
explicitly rolling back with the matching function makes the code logic
look clearer.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When logging an inode in full mode, or when logging xattrs or when logging
the dir index items of a directory, we are modifying the log tree while
holding a read lock on a leaf from the fs/subvolume tree. This can lead to
a deadlock in rare circumstances, but it is a real possibility, and it was
recently reported by syzbot with the following trace from lockdep:
WARNING: possible circular locking dependency detected
6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.1/16154 is trying to acquire lock:
ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
but task is already holding lock:
ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-log-00){++++}-{3:3}:
down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
__btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #1 (btrfs-tree-00){++++}-{3:3}:
__lock_release kernel/locking/lockdep.c:5382 [inline]
lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
search_leaf fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
__btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
__btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x669/0x1090 kernel/workqueue.c:2436
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3097 [inline]
check_prevs_add kernel/locking/lockdep.c:3216 [inline]
validate_chain kernel/locking/lockdep.c:3831 [inline]
__lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
lock_acquire kernel/locking/lockdep.c:5668 [inline]
lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
__mutex_lock_common kernel/locking/mutex.c:603 [inline]
__mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
__btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
__btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
evict+0x2ed/0x6b0 fs/inode.c:664
dispose_list+0x117/0x1e0 fs/inode.c:697
prune_icache_sb+0xeb/0x150 fs/inode.c:896
super_cache_scan+0x391/0x590 fs/super.c:106
do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
shrink_slab_memcg mm/vmscan.c:912 [inline]
shrink_slab+0x388/0x660 mm/vmscan.c:991
shrink_node_memcgs mm/vmscan.c:6088 [inline]
shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
shrink_zones mm/vmscan.c:6355 [inline]
do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
try_charge mm/memcontrol.c:2827 [inline]
charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
__mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
__filemap_add_folio+0x615/0xf80 mm/filemap.c:852
filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
__filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
find_or_create_page include/linux/pagemap.h:612 [inline]
alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
__btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
update_log_root fs/btrfs/tree-log.c:2841 [inline]
btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-log-00);
lock(btrfs-tree-00);
lock(btrfs-log-00);
lock(&delayed_node->mutex);
Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
lock dependency when we are COWing extent buffers for the log tree and we
have two tasks modifying the log tree, with each one in one of the
following 2 scenarios:
1) Modifying the log tree triggers an extent buffer allocation while
holding a write lock on a parent extent buffer from the log tree.
Allocating the pages for an extent buffer, or the extent buffer
struct, can trigger inode eviction and finally the inode eviction
will trigger a release/remove of a delayed node, which requires
taking the delayed node's mutex;
2) Allocating a metadata extent for a log tree can trigger the async
reclaim thread and make us wait for it to release enough space and
unblock our reservation ticket. The reclaim thread can start flushing
delayed items, and that in turn results in the need to lock delayed
node mutexes and in the need to write lock extent buffers of a
subvolume tree - all this while holding a write lock on the parent
extent buffer in the log tree.
So one task in scenario 1) running in parallel with another task in
scenario 2) could lead to a deadlock, one wanting to lock a delayed node
mutex while having a read lock on a leaf from the subvolume, while the
other is holding the delayed node's mutex and wants to write lock the same
subvolume leaf for flushing delayed items.
Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
fs/subvolume leaf and use the clone leaf instead.
Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Otherwise the kernel memory allocator seems to be unhappy about failing
order 6 allocations for the zones array, that cause 100% reproducible
mount failures in my qemu setup:
[26.078981] mount: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
[26.079741] CPU: 0 PID: 2965 Comm: mount Not tainted 6.1.0-rc5+ #185
[26.080181] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[26.080950] Call Trace:
[26.081132] <TASK>
[26.081291] dump_stack_lvl+0x56/0x6f
[26.081554] warn_alloc+0x117/0x140
[26.081808] ? __alloc_pages_direct_compact+0x1b5/0x300
[26.082174] __alloc_pages_slowpath.constprop.0+0xd0e/0xde0
[26.082569] __alloc_pages+0x32a/0x340
[26.082836] __kmalloc_large_node+0x4d/0xa0
[26.083133] ? trace_kmalloc+0x29/0xd0
[26.083399] kmalloc_large+0x14/0x60
[26.083654] btrfs_get_dev_zone_info+0x1b9/0xc00
[26.083980] ? _raw_spin_unlock_irqrestore+0x28/0x50
[26.084328] btrfs_get_dev_zone_info_all_devices+0x54/0x80
[26.084708] open_ctree+0xed4/0x1654
[26.084974] btrfs_mount_root.cold+0x12/0xde
[26.085288] ? lock_is_held_type+0xe2/0x140
[26.085603] legacy_get_tree+0x28/0x50
[26.085876] vfs_get_tree+0x1d/0xb0
[26.086139] vfs_kern_mount.part.0+0x6c/0xb0
[26.086456] btrfs_mount+0x118/0x3a0
[26.086728] ? lock_is_held_type+0xe2/0x140
[26.087043] legacy_get_tree+0x28/0x50
[26.087323] vfs_get_tree+0x1d/0xb0
[26.087587] path_mount+0x2ba/0xbe0
[26.087850] ? _raw_spin_unlock_irqrestore+0x38/0x50
[26.088217] __x64_sys_mount+0xfe/0x140
[26.088506] do_syscall_64+0x35/0x80
[26.088776] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is a followup to a previous commit of mine [0], which added the
allow_sys_admin_access && capable(CAP_SYS_ADMIN) check. This patch
rearranges the order of checks in fuse_allow_current_process without
changing functionality.
Commit 9ccf47b26b73 ("fuse: Add module param for CAP_SYS_ADMIN access
bypassing allow_other") added allow_sys_admin_access &&
capable(CAP_SYS_ADMIN) check to the beginning of the function, with the
reasoning that allow_sys_admin_access should be an 'escape hatch' for users
with CAP_SYS_ADMIN, allowing them to skip any subsequent checks.
However, placing this new check first results in many capable() calls when
allow_sys_admin_access is set, where another check would've also returned
1. This can be problematic when a BPF program is tracing capable() calls.
At Meta we ran into such a scenario recently. On a host where
allow_sys_admin_access is set but most of the FUSE access is from processes
which would pass other checks - i.e. they don't need CAP_SYS_ADMIN 'escape
hatch' - this results in an unnecessary capable() call for each fs op. We
also have a daemon tracing capable() with BPF and doing some data
collection, so tracing these extraneous capable() calls has the potential
to regress performance for an application doing many FUSE ops.
So rearrange the order of these checks such that CAP_SYS_ADMIN 'escape
hatch' is checked last. Add a small helper, fuse_permissible_uidgid, to
make the logic easier to understand. Previously, if allow_other is set on
the fuse_conn, uid/git checking doesn't happen as current_in_userns result
is returned. These semantics are maintained here: fuse_permissible_uidgid
check only happens if allow_other is not set.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
In general, as of now, in FUSE, direct writes on the same file are
serialized over inode lock i.e we hold inode lock for the full duration of
the write request. I could not find in fuse code and git history a comment
which clearly explains why this exclusive lock is taken for direct writes.
Following might be the reasons for acquiring an exclusive lock but not be
limited to
1) Our guess is some USER space fuse implementations might be relying on
this lock for serialization.
2) The lock protects against file read/write size races.
3) Ruling out any issues arising from partial write failures.
This patch relaxes the exclusive lock for direct non-extending writes only.
File size extending writes might not need the lock either, but we are not
entirely sure if there is a risk to introduce any kind of regression.
Furthermore, benchmarking with fio does not show a difference between patch
versions that take on file size extension a) an exclusive lock and b) a
shared lock.
A possible example of an issue with i_size extending writes are write error
cases. Some writes might succeed and others might fail for file system
internal reasons - for example ENOSPACE. With parallel file size extending
writes it _might_ be difficult to revert the action of the failing write,
especially to restore the right i_size.
With these changes, we allow non-extending parallel direct writes on the
same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If
this flag is set on the file (flag is passed from libfuse to fuse kernel as
part of file open/create), we do not take exclusive lock anymore, but
instead use a shared lock that allows non-extending writes to run in
parallel. FUSE implementations which rely on this inode lock for
serialization can continue to do so and serialized direct writes are still
the default. Implementations that do not do write serialization need to be
updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file
open/create reply.
On patch review there were concerns that network file systems (or vfs
multiple mounts of the same file system) might have issues with parallel
writes. We believe this is not the case, as this is just a local lock,
which network file systems could not rely on anyway. I.e. this lock is
just for local consistency.
Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Return the value fuse_dev_release() directly instead of storing it in
another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
A while ago we introduced a dedicated vfs{g,u}id_t type in commit
1e5267cd0895 ("mnt_idmapping: add vfs{g,u}id_t"). We already switched over
a good part of the VFS. Ultimately we will remove all legacy idmapped
mount helpers that operate only on k{g,u}id_t in favor of the new type safe
helpers that operate on vfs{g,u}id_t.
Cc: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Commit 8ed1f0e22f49e ("fs/fuse: fix ioctl type confusion") fixed a type
confusion bug by adding an ->f_op comparison.
Based on some off-list discussion back then, another check was added to
compare the f_cred->user_ns. This is not for security reasons, but was
based on the idea that a FUSE device FD should be using the UID/GID
mappings of its f_cred->user_ns, and those translations are done using
fc->user_ns, which matches the f_cred->user_ns of the initial FUSE device
FD thanks to the check in fuse_fill_super(). See also commit 8cb08329b0809
("fuse: Support fuse filesystems outside of init_user_ns").
But FUSE_DEV_IOC_CLONE is, at a higher level, a *cloning* operation that
copies an existing context (with a weird API that involves first opening
/dev/fuse, then tying the resulting new FUSE device FD to an existing FUSE
instance). So if an application is already passing FUSE FDs across userns
boundaries and dealing with the resulting ID mapping complications somehow,
it doesn't make much sense to block this cloning operation.
I've heard that this check is an obstacle for some folks, and I don't see a
good reason to keep it, so remove it.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The previous commit df8629af2934 ("fuse: always revalidate if exclusive
create") ensures that the dentries are revalidated on O_EXCL creates. This
commit complements it by also performing revalidation for rename target
dentries. Otherwise, a rename target file that only exists in kernel
dentry cache but not in the filesystem will result in EEXIST if
RENAME_NOREPLACE flag is used.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add a flag to entry expiration that lets the filesystem expire a dentry
without kicking it out from the cache immediately.
This makes a difference for overmounted dentries, where plain invalidation
would detach all submounts before dropping the dentry from the cache. If
only expiry is set on the dentry, then any overmounts are left alone and
until ->d_revalidate() is called.
Note: ->d_revalidate() is not called for the case of following a submount,
so invalidation will only be triggered for the non-overmounted case. The
dentry could also be mounted in a different mount instance, in which case
any submounts will still be detached.
Suggested-by: Jakob Blomer <jblomer@cern.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The use of kmap() is being deprecated in favor of kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
the mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and still valid.
Therefore, replace kmap() with kmap_local_page() in fuse_readdir_cached(),
it being the only call site of kmap() currently left in fs/fuse.
Cc: "Venkataramanan, Anirudh" <anirudh.venkataramanan@intel.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
file_modified() must be called with inode lock held. fuse_fallocate()
didn't lock the inode in case of just FALLOC_KEEP_SIZE flags value, which
resulted in a kernel Warning in notify_change().
Lock the inode unconditionally, like all other fallocate implementations
do.
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Reported-and-tested-by: syzbot+462da39f0667b357c4b6@syzkaller.appspotmail.com
Fixes: 4a6f278d4827 ("fuse: add file_modified() to fallocate")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When extending segments, nilfs_sufile_alloc() is called to get an
unassigned segment, then mark it as dirty to avoid accidentally allocating
the same segment in the future.
But for some special cases such as a corrupted image it can be unreliable.
If such corruption of the dirty state of the segment occurs, nilfs2 may
reallocate a segment that is in use and pick the same segment for writing
twice at the same time.
This will cause the problem reported by syzkaller:
https://syzkaller.appspot.com/bug?id=c7c4748e11ffcc367cef04f76e02e931833cbd24
This case started with segbuf1.segnum = 3, nextnum = 4 when constructed.
It supposed segment 4 has already been allocated and marked as dirty.
However the dirty state was corrupted and segment 4 usage was not dirty.
For the first time nilfs_segctor_extend_segments() segment 4 was allocated
again, which made segbuf2 and next segbuf3 had same segment 4.
sb_getblk() will get same bh for segbuf2 and segbuf3, and this bh is added
to both buffer lists of two segbuf. It makes the lists broken which
causes NULL pointer dereference.
Fix the problem by setting usage as dirty every time in
nilfs_sufile_mark_dirty(), which is called during constructing current
segment to be written out and before allocating next segment.
[chenzhongjin@huawei.com: add lock protection per Ryusuke]
Link: https://lkml.kernel.org/r/20221121091141.214703-1-chenzhongjin@huawei.com
Link: https://lkml.kernel.org/r/20221118063304.140187-1-chenzhongjin@huawei.com
Fixes: 9ff05123e3bf ("nilfs2: segment constructor")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: <syzbot+77e4f0...@syzkaller.appspotmail.com>
Reported-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
SecPageTables has a tab after it instead of a space, this can break
fragile parsers that depend on spaces after the stat names.
Link: https://lkml.kernel.org/r/20221117043247.133294-1-yosryahmed@google.com
Fixes: ebc97a52b5d6cd5f ("mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Because that's what Christoph wants for this error handling path
only XFS uses.
It requires a new iomap export for handling errors over delalloc
ranges. This is basically the XFS code as is stands, but even though
Christoph wants this as iomap funcitonality, we still have
to call it from the filesystem specific ->iomap_end callback, and
call into the iomap code with yet another filesystem specific
callback to punch the delalloc extent within the defined ranges.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
xfs_buffered_write_iomap_end() currently converts the byte ranges
passed to it to filesystem blocks to pass them to the bmap code to
punch out delalloc blocks, but then has to convert filesytem
blocks back to byte ranges for page cache truncate.
We're about to make the page cache truncate go away and replace it
with a page cache walk, so having to convert everything to/from/to
filesystem blocks is messy and error-prone. It is much easier to
pass around byte ranges and convert to page indexes and/or
filesystem blocks only where those units are needed.
In preparation for the page cache walk being added, add a helper
that converts byte ranges to filesystem blocks and calls
xfs_bmap_punch_delalloc_range() and convert
xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
xfs_buffered_write_iomap_end() has a comment about the safety of
punching delalloc extents based holding the IOLOCK_EXCL. This
comment is wrong, and punching delalloc extents is not race free.
When we punch out a delalloc extent after a write failure in
xfs_buffered_write_iomap_end(), we punch out the page cache with
truncate_pagecache_range() before we punch out the delalloc extents.
At this point, we only hold the IOLOCK_EXCL, so there is nothing
stopping mmap() write faults racing with this cleanup operation,
reinstantiating a folio over the range we are about to punch and
hence requiring the delalloc extent to be kept.
If this race condition is hit, we can end up with a dirty page in
the page cache that has no delalloc extent or space reservation
backing it. This leads to bad things happening at writeback time.
To avoid this race condition, we need the page cache truncation to
be atomic w.r.t. the extent manipulation. We can do this by holding
the mapping->invalidate_lock exclusively across this operation -
this will prevent new pages from being inserted into the page cache
whilst we are removing the pages and the backing extent and space
reservation.
Taking the mapping->invalidate_lock exclusively in the buffered
write IO path is safe - it naturally nests inside the IOLOCK (see
truncate and fallocate paths). iomap_zero_range() can be called from
under the mapping->invalidate_lock (from the truncate path via
either xfs_zero_eof() or xfs_truncate_page(), but iomap_zero_iter()
will not instantiate new delalloc pages (because it skips holes) and
hence will not ever need to punch out delalloc extents on failure.
Fix the locking issue, and clean up the code logic a little to avoid
unnecessary work if we didn't allocate the delalloc extent or wrote
the entire region we allocated.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
Commit 57fe60df6241 ("reiserfs: add atomic addition of selinux attributes
during inode creation") defined reiserfs_security_free() to free the name
and value of a security xattr allocated by the active LSM through
security_old_inode_init_security(). However, this function is not called
in the reiserfs code.
Thus, add a call to reiserfs_security_free() whenever
reiserfs_security_init() is called, and initialize value to NULL, to avoid
to call kfree() on an uninitialized pointer.
Finally, remove the kfree() for the xattr name, as it is not allocated
anymore.
Fixes: 57fe60df6241 ("reiserfs: add atomic addition of selinux attributes during inode creation")
Cc: stable@vger.kernel.org
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Mimi Zohar <zohar@linux.ibm.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
After commit cbfecb927f42 ("fs: record I_DIRTY_TIME even if inode
already has I_DIRTY_INODE") writeback_single_inode can push inode with
I_DIRTY_TIME set to b_dirty_time list. In case of freeing inode with
I_DIRTY_TIME set this can happen after deletion of inode from i_io_list
at evict. Stack trace is following.
evict
fat_evict_inode
fat_truncate_blocks
fat_flush_inodes
writeback_inode
sync_inode_metadata(inode, sync=0)
writeback_single_inode(inode, wbc) <- wbc->sync_mode == WB_SYNC_NONE
This will lead to use after free in flusher thread.
Similar issue can be triggered if writeback_single_inode in the
stack trace update inode->i_io_list. Add explicit check to avoid it.
Fixes: cbfecb927f42 ("fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE")
Reported-by: syzbot+6ba92bd00d5093f7e371@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Svyatoslav Feldsherov <feldsherov@google.com>
Link: https://lore.kernel.org/r/20221115202001.324188-1-feldsherov@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The call, kobject_get_ownership(), does not modify the kobject passed
into it, so make it const. This propagates down into the kobj_type
function callbacks so make the kobject passed into them also const,
ensuring that nothing in the kobject is being changed here.
This helps make it more obvious what calls and callbacks do, and do not,
modify structures passed to them.
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20221121094649.1556002-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This patch uses assert_spin_locked() instead of lockdep_is_held()
where it's available to use because lockdep_is_held() is only available
if CONFIG_LOCKDEP is set.
In other cases like lockdep_sock_is_held() we surround it by a
CONFIG_LOCKDEP idef.
Fixes: dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This is identical to eventfd_signal(), but it allows the caller to pass
in a mask to be used for the poll wakeup key. The use case is avoiding
repeated multishot triggers if we have a dependency between eventfd and
io_uring.
If we setup an eventfd context and register that as the io_uring eventfd,
and at the same time queue a multishot poll request for the eventfd
context, then any CQE posted will repeatedly trigger the multishot request
until it terminates when the CQ ring overflows.
In preparation for io_uring detecting this circular dependency, add the
mentioned helper so that io_uring can pass in EPOLL_URING as part of the
poll wakeup key.
Cc: stable@vger.kernel.org # 6.0
[axboe: fold in !CONFIG_EVENTFD fix from Zhang Qilong]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a race between modprobe and mount as below:
modprobe zonefs | mount -t zonefs
--------------------------------|-------------------------
zonefs_init |
register_filesystem [1] |
| zonefs_fill_super [2]
zonefs_sysfs_init [3] |
1. register zonefs suceess, then
2. user can mount the zonefs
3. if sysfs initialize failed, the module initialize failed.
Then the mount process maybe some error happened since the module
initialize failed.
Let's register zonefs after all dependency resource ready. And
reorder the dependency resource release in module exit.
Fixes: 9277a6d4fbd4 ("zonefs: Export open zone resource information through sysfs")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
|
|
Add a blk_crypto_config_supported_natively helper that wraps
__blk_crypto_cfg_supported to retrieve the crypto_profile from the
request queue. With this fscrypt can stop including
blk-crypto-profile.h and rely on the public consumer interface in
blk-crypto.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221114042944.1009870-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Switch all public blk-crypto interfaces to use struct block_device
arguments to specify the device they operate on instead of th
request_queue, which is a block layer implementation detail.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221114042944.1009870-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The following error occurred during the fsstress test:
XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
The problem was that inode race condition causes incorrect i_nlink to be
written to disk, and then it is read into memory. Consider the following
call graph, inodes that are marked as both XFS_IFLUSHING and
XFS_IRECLAIMABLE, i_nlink will be reset to 1 and then restored to original
value in xfs_reinit_inode(). Therefore, the i_nlink of directory on disk
may be set to 1.
xfsaild
xfs_inode_item_push
xfs_iflush_cluster
xfs_iflush
xfs_inode_to_disk
xfs_iget
xfs_iget_cache_hit
xfs_iget_recycle
xfs_reinit_inode
inode_init_always
xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
inode state and can race with other RCU protected inode lookups. On the
read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
racing inode updates (during transactions) by that lock.
Fixes: ff7bebeb91f8 ("xfs: refactor the inode recycling code") # goes further back than this
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
xfstests generic/013 and generic/476 reported WARNING as follows:
WARNING: lock held when returning to user space!
6.1.0-rc5+ #4 Not tainted
------------------------------------------------
fsstress/504233 is leaving the kernel with locks still held!
2 locks held by fsstress/504233:
#0: ffff888054c38850 (&sb->s_type->i_mutex_key#21){+.+.}-{3:3}, at:
lock_two_nondirectories+0xcf/0xf0
#1: ffff8880b8fec750 (&sb->s_type->i_mutex_key#21/4){+.+.}-{3:3}, at:
lock_two_nondirectories+0xb7/0xf0
This will lead to deadlock and hungtask.
Fix this by releasing locks when failed to write out on a file range in
cifs_file_copychunk_range().
Fixes: 3e3761f1ec7d ("smb3: use filemap_write_and_wait_range instead of filemap_write_and_wait")
Cc: stable@vger.kernel.org # 6.0
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|