summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-03-10io_uring: remove trace for eventfdUsama Arif
The information on whether eventfd is registered is not very useful and would result in the tracepoint being enclosed in an rcu_readlock in a later patch that tries to avoid ring quiesce for registering eventfd. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-09f2fs: remove unnecessary read for F2FS_FITS_IN_INODEJia Yang
F2FS_FITS_IN_INODE only cares the type of f2fs inode, so there is no need to read node page of f2fs inode. Signed-off-by: Jia Yang <jiayang5@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-09xfs: don't generate selinux audit messages for capability testingDarrick J. Wong
There are a few places where we test the current process' capability set to decide if we're going to be more or less generous with resource acquisition for a system call. If the process doesn't have the capability, we can continue the call, albeit in a degraded mode. These are /not/ the actual security decisions, so it's not proper to use capable(), which (in certain selinux setups) causes audit messages to get logged. Switch them to has_capability_noaudit. Fixes: 7317a03df703f ("xfs: refactor inode ownership change transaction/inode/quota allocation idiom") Fixes: ea9a46e1c4925 ("xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2022-03-09xfs: add missing cmap->br_state = XFS_EXT_NORM updateGao Xiang
COW extents are already converted into written real extents after xfs_reflink_convert_cow_locked(), therefore cmap->br_state should reflect it. Otherwise, there is another necessary unwritten convertion triggered in xfs_dio_write_end_io() for direct I/O cases. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-09coredump: Don't compile flat_core_dump when coredumps are disabledEric W. Biederman
Recently the kernel test robot reported: > In file included from include/linux/kernel.h:29, > from fs/binfmt_flat.c:21: > fs/binfmt_flat.c: In function 'flat_core_dump': > >> fs/binfmt_flat.c:121:50: error: invalid use of undefined type 'struct coredump_params' > 121 | current->comm, current->pid, cprm->siginfo->si_signo); > | ^~ > include/linux/printk.h:418:33: note: in definition of macro 'printk_index_wrap' > 418 | _p_func(_fmt, ##__VA_ARGS__); \ > | ^~~~~~~~~~~ > include/linux/printk.h:499:9: note: in expansion of macro 'printk' > 499 | printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__) > | ^~~~~~ > fs/binfmt_flat.c:120:9: note: in expansion of macro 'pr_warn' > 120 | pr_warn("Process %s:%d received signr %d and should have core dumped\n", > | ^~~~~~~ > At top level: > fs/binfmt_flat.c:118:12: warning: 'flat_core_dump' defined but not used [-Wunused-function] > 118 | static int flat_core_dump(struct coredump_params *cprm) > | ^~~~~~~~~~~~~~ The little dinky do nothing function flat_core_dump has always been compiled unconditionally. With my change to move coredump_params into coredump.h coredump_params reasonably becomes unavailable when coredump support is not compiled in. Fix this old issue by simply not compiling flat_core_dump when coredump support is not supported. Fixes: a99a3e2efaf1 ("coredump: Move definition of struct coredump_params into coredump.h") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08f2fs: pass the bio operation to bio_alloc_biosetChristoph Hellwig
Refactor block I/O code so that the bio operation and known flags are set at bio allocation time. Only the later updated flags are updated on the fly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220228124123.856027-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08f2fs: don't pass a bio to f2fs_target_deviceChristoph Hellwig
Set the bdev at bio allocation time by changing the f2fs_target_device calling conventions, so that no bio needs to be passed in. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220228124123.856027-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08fs: remove fs.f_write_hintChristoph Hellwig
The value is now completely unused except for reporting it back through the F_GET_FILE_RW_HINT ioctl, so remove the value and the two ioctls for it. Trying to use the F_SET_FILE_RW_HINT and F_GET_FILE_RW_HINT fcntls will now return EINVAL, just like it would on a kernel that never supported this functionality in the first place. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220308060529.736277-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08fs: remove kiocb.ki_hintChristoph Hellwig
This field is entirely unused now except for a tracepoint in f2fs, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220308060529.736277-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08coredump: Use the vma snapshot in fill_files_noteEric W. Biederman
Matthew Wilcox reported that there is a missing mmap_lock in file_files_note that could possibly lead to a user after free. Solve this by using the existing vma snapshot for consistency and to avoid the need to take the mmap_lock anywhere in the coredump code except for dump_vma_snapshot. Update the dump_vma_snapshot to capture vm_pgoff and vm_file that are neeeded by fill_files_note. Add free_vma_snapshot to free the captured values of vm_file. Reported-by: Matthew Wilcox <willy@infradead.org> Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org Cc: stable@vger.kernel.org Fixes: a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot") Fixes: 2aa362c49c31 ("coredump: extend core dump note section to contain file names of mapped files") Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08coredump/elf: Pass coredump_params into fill_note_infoEric W. Biederman
Instead of individually passing cprm->siginfo and cprm->regs into fill_note_info pass all of struct coredump_params. This is preparation to allow fill_files_note to use the existing vma snapshot. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08coredump: Remove the WARN_ON in dump_vma_snapshotEric W. Biederman
The condition is impossible and to the best of my knowledge has never triggered. We are in deep trouble if that conditions happens and we walk past the end of our allocated array. So delete the WARN_ON and the code that makes it look like the kernel can handle the case of walking past the end of it's vma_meta array. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08coredump: Snapshot the vmas in do_coredumpEric W. Biederman
Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the individual coredump routines into do_coredump itself. This makes the code less error prone and easier to maintain. Make the vma snapshot available to the coredump routines in struct coredump_params. This makes it easier to change and update what is captures in the vma snapshot and will be needed for fixing fill_file_notes. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08coredump: Move definition of struct coredump_params into coredump.hEric W. Biederman
Move the definition of struct coredump_params into coredump.h where it belongs. Remove the slightly errorneous comment explaining why struct coredump_params was declared in binfmts.h. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08Merge tag 'fuse-fixes-5.17-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: - Fix an issue with splice on the fuse device - Fix a regression in the fileattr API conversion - Add a small userspace API improvement * tag 'fuse-fixes-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix pipe buffer lifetime for direct_io fuse: move FUSE_SUPER_MAGIC definition to magic.h fuse: fix fileattr op failure
2022-03-08udf: remove redundant assignment of variable etypeColin Ian King
Variable etype is being assigned a value that is never read. The variable and assignment are redundant and can be removed. Cleans up clang scan build warning: fs/udf/super.c:2485:10: warning: Although the value stored to 'etype' is used in the enclosing expression, the value is never actually read from 'etype' [deadcode.DeadStores] Link: https://lore.kernel.org/r/20220307152149.139045-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-03-07block: remove the per-bio/request write hintChristoph Hellwig
With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07Merge branch 'for-5.18/alloc-cleanups' into for-5.18/write-streamsJens Axboe
* for-5.18/alloc-cleanups: nilfs2: pass the operation to bio_alloc ext4: pass the operation to bio_alloc mpage: pass the operation to bio_alloc
2022-03-07Merge branch 'for-5.18/block' into for-5.18/write-streamsJens Axboe
* for-5.18/block: (96 commits) block: remove bio_devname ext4: stop using bio_devname raid5-ppl: stop using bio_devname raid1: stop using bio_devname md-multipath: stop using bio_devname dm-integrity: stop using bio_devname dm-crypt: stop using bio_devname pktcdvd: remove a pointless debug check in pkt_submit_bio block: remove handle_bad_sector block: fix and cleanup bio_check_ro bfq: fix use-after-free in bfq_dispatch_request blk-crypto: show crypto capabilities in sysfs block: don't delete queue kobject before its children block: simplify calling convention of elv_unregister_queue() block: remove redundant semicolon block: default BLOCK_LEGACY_AUTOLOAD to y block: update io_ticks when io hang block, bfq: don't move oom_bfqq block, bfq: avoid moving bfqq to it's parent bfqg block, bfq: cleanup bfq_bfqq_to_bfqg() ...
2022-03-07fuse: fix pipe buffer lifetime for direct_ioMiklos Szeredi
In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then imports the write buffer with fuse_get_user_pages(), which uses iov_iter_get_pages() to grab references to userspace pages instead of actually copying memory. On the filesystem device side, these pages can then either be read to userspace (via fuse_dev_read()), or splice()d over into a pipe using fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops. This is wrong because after fuse_dev_do_read() unlocks the FUSE request, the userspace filesystem can mark the request as completed, causing write() to return. At that point, the userspace filesystem should no longer have access to the pipe buffer. Fix by copying pages coming from the user address space to new pipe buffers. Reported-by: Jann Horn <jannh@google.com> Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-03-07ext4: stop using bio_devnameChristoph Hellwig
Use the %pg format specifier to save on stack consuption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-06Merge tag 'for-5.17-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes for various problems that have user visible effects or seem to be urgent: - fix corruption when combining DIO and non-blocking io_uring over multiple extents (seen on MariaDB) - fix relocation crash due to premature return from commit - fix quota deadlock between rescan and qgroup removal - fix item data bounds checks in tree-checker (found on a fuzzed image) - fix fsync of prealloc extents after EOF - add missing run of delayed items after unlink during log replay - don't start relocation until snapshot drop is finished - fix reversed condition for subpage writers locking - fix warning on page error" * tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fallback to blocking mode when doing async dio over multiple extents btrfs: add missing run of delayed items after unlink during log replay btrfs: qgroup: fix deadlock between rescan worker and remove qgroup btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() btrfs: do not start relocation until in progress drops are done btrfs: tree-checker: use u64 for item data end to avoid overflow btrfs: do not WARN_ON() if we have PageError set btrfs: fix lost prealloc extents beyond eof after full fsync btrfs: subpage: fix a wrong check on subpage->writers
2022-03-05proc: fix documentation and description of pagemapYun Zhou
Since bit 57 was exported for uffd-wp write-protected (commit fb8e37f35a2f: "mm/pagemap: export uffd-wp protection information"), fixing it can reduce some unnecessary confusion. Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com Fixes: fb8e37f35a2fe1 ("mm/pagemap: export uffd-wp protection information") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com> Cc: Florian Schmidt <florian.schmidt@nutanix.com> Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com> Cc: SeongJae Park <sj@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Colin Cross <ccross@google.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-05mm: refactor vm_area_struct::anon_vma_name usage codeSuren Baghdasaryan
Avoid mixing strings and their anon_vma_name referenced pointers by using struct anon_vma_name whenever possible. This simplifies the code and allows easier sharing of anon_vma_name structures when they represent the same name. [surenb@google.com: fix comment] Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Colin Cross <ccross@google.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Xiaofeng Cao <caoxiaofeng@yulong.com> Cc: David Hildenbrand <david@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-04f2fs: introduce F2FS_UNFAIR_RWSEM to support unfair rwsemJaegeuk Kim
Unfair rwsem should be used when blk-cg is on. Otherwise, there is regression. FYI, we noticed a -26.7% regression of aim7.jobs-per-min due to commit: commit: e4544b63a7ee49e7fbebf35ece0a6acd3b9617ae ("f2fs: move f2fs to use reader-unfair rwsems") https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master in testcase: aim7 on test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory with following parameters: disk: 4BRD_12G md: RAID0 fs: f2fs test: sync_disk_rw load: 100 cpufreq_governor: performance ucode: 0x500320a test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system. test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/ Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-04f2fs: avoid an infinite loop in f2fs_sync_dirty_inodesJaegeuk Kim
If one read IO is always failing, we can fall into an infinite loop in f2fs_sync_dirty_inodes. This happens during xfstests/generic/475. [ 142.803335] Buffer I/O error on dev dm-1, logical block 8388592, async page read ... [ 382.887210] submit_bio_noacct+0xdd/0x2a0 [ 382.887213] submit_bio+0x80/0x110 [ 382.887223] __submit_bio+0x4d/0x300 [f2fs] [ 382.887282] f2fs_submit_page_bio+0x125/0x200 [f2fs] [ 382.887299] __get_meta_page+0xc9/0x280 [f2fs] [ 382.887315] f2fs_get_meta_page+0x13/0x20 [f2fs] [ 382.887331] f2fs_get_node_info+0x317/0x3c0 [f2fs] [ 382.887350] f2fs_do_write_data_page+0x327/0x6f0 [f2fs] [ 382.887367] f2fs_write_single_data_page+0x5b7/0x960 [f2fs] [ 382.887386] f2fs_write_cache_pages+0x302/0x890 [f2fs] [ 382.887405] ? preempt_count_add+0x7a/0xc0 [ 382.887408] f2fs_write_data_pages+0xfd/0x320 [f2fs] [ 382.887425] ? _raw_spin_unlock+0x1a/0x30 [ 382.887428] do_writepages+0xd3/0x1d0 [ 382.887432] filemap_fdatawrite_wbc+0x69/0x90 [ 382.887434] filemap_fdatawrite+0x50/0x70 [ 382.887437] f2fs_sync_dirty_inodes+0xa4/0x270 [f2fs] [ 382.887453] f2fs_write_checkpoint+0x189/0x1640 [f2fs] [ 382.887469] ? schedule_timeout+0x114/0x150 [ 382.887471] ? ttwu_do_activate+0x6d/0xb0 [ 382.887473] ? preempt_count_add+0x7a/0xc0 [ 382.887476] kill_f2fs_super+0xca/0x100 [f2fs] [ 382.887491] deactivate_locked_super+0x35/0xa0 [ 382.887494] deactivate_super+0x40/0x50 [ 382.887497] cleanup_mnt+0x139/0x190 [ 382.887499] __cleanup_mnt+0x12/0x20 [ 382.887501] task_work_run+0x64/0xa0 [ 382.887505] exit_to_user_mode_prepare+0x1b7/0x1c0 [ 382.887508] syscall_exit_to_user_mode+0x27/0x50 [ 382.887510] do_syscall_64+0x48/0xc0 [ 382.887513] entry_SYSCALL_64_after_hwframe+0x44/0xae Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-04btrfs: fallback to blocking mode when doing async dio over multiple extentsFilipe Manana
Some users recently reported that MariaDB was getting a read corruption when using io_uring on top of btrfs. This started to happen in 5.16, after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults during direct IO reads and writes"). That changed btrfs to use the new iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling iomap_dio_rw(). This was necessary to fix deadlocks when the iovector corresponds to a memory mapped file region. That type of scenario is exercised by test case generic/647 from fstests. For this MariaDB scenario, we attempt to read 16K from file offset X using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each with a size of 4K, and what happens is the following: 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw(); 2) iomap creates a struct iomap_dio object, its reference count is initialized to 1 and its ->size field is initialized to 0; 3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds the first 4K extent, and setups an iomap for this extent consisting of a single page; 4) At iomap_dio_bio_iter(), we are able to access the first page of the buffer (struct iov_iter) with bio_iov_iter_get_pages() without triggering a page fault; 5) iomap submits a bio for this 4K extent (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments the refcount on the struct iomap_dio object to 2; The ->size field of the struct iomap_dio object is incremented to 4K; 6) iomap calls btrfs_iomap_begin() again, this time with a file offset of X + 4K. There we setup an iomap for the next extent that also has a size of 4K; 7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(), which tries to access the next page (2nd page) of the buffer. This triggers a page fault and returns -EFAULT; 8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and the struct iomap_dio object has a ->size value of 4K (we submitted a bio for an extent already). The 'wait_for_completion' variable is not set to true, because our iocb has IOCB_NOWAIT set; 9) At the bottom of __iomap_dio_rw(), we decrement the reference count of the struct iomap_dio object from 2 to 1. Because we were not the only ones holding a reference on it and 'wait_for_completion' is set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which just returns it up the callchain, up to io_uring; 10) The bio submitted for the first extent (step 5) completes and its bio endio function, iomap_dio_bio_end_io(), decrements the last reference on the struct iomap_dio object, resulting in calling iomap_dio_complete_work() -> iomap_dio_complete(). 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K and return 4K (the amount of io done) to iomap_dio_complete_work(); 12) iomap_dio_complete_work() calls the iocb completion callback, iocb->ki_complete() with a second argument value of 4K (total io done) and the iocb with the adjust ki_pos of X + 4K. This results in completing the read request for io_uring, leaving it with a result of 4K bytes read, and only the first page of the buffer filled in, while the remaining 3 pages, corresponding to the other 3 extents, were not filled; 13) For the application, the result is unexpected because if we ask to read N bytes, it expects to get N bytes read as long as those N bytes don't cross the EOF (i_size). MariaDB reports this as an error, as it's not expecting a short read, since it knows it's asking for read operations fully within the i_size boundary. This is typical in many applications, but it may also be questionable if they should react to such short reads by issuing more read calls to get the remaining data. Nevertheless, the short read happened due to a change in btrfs regarding how it deals with page faults while in the middle of a read operation, and there's no reason why btrfs can't have the previous behaviour of returning the whole data that was requested by the application. The problem can also be triggered with the following simple program: /* Get O_DIRECT */ #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <errno.h> #include <string.h> #include <liburing.h> int main(int argc, char *argv[]) { char *foo_path; struct io_uring ring; struct io_uring_sqe *sqe; struct io_uring_cqe *cqe; struct iovec iovec; int fd; long pagesize; void *write_buf; void *read_buf; ssize_t ret; int i; if (argc != 2) { fprintf(stderr, "Use: %s <directory>\n", argv[0]); return 1; } foo_path = malloc(strlen(argv[1]) + 5); if (!foo_path) { fprintf(stderr, "Failed to allocate memory for file path\n"); return 1; } strcpy(foo_path, argv[1]); strcat(foo_path, "/foo"); /* * Create file foo with 2 extents, each with a size matching * the page size. Then allocate a buffer to read both extents * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing * the read with io_uring, access the first page of the buffer * to fault it in, so that during the read we only trigger a * page fault when accessing the second page of the buffer. */ fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0666); if (fd == -1) { fprintf(stderr, "Failed to create file 'foo': %s (errno %d)", strerror(errno), errno); return 1; } pagesize = sysconf(_SC_PAGE_SIZE); ret = posix_memalign(&write_buf, pagesize, 2 * pagesize); if (ret) { fprintf(stderr, "Failed to allocate write buffer\n"); return 1; } memset(write_buf, 0xab, pagesize); memset(write_buf + pagesize, 0xcd, pagesize); /* Create 2 extents, each with a size matching page size. */ for (i = 0; i < 2; i++) { ret = pwrite(fd, write_buf + i * pagesize, pagesize, i * pagesize); if (ret != pagesize) { fprintf(stderr, "Failed to write to file, ret = %ld errno %d (%s)\n", ret, errno, strerror(errno)); return 1; } ret = fsync(fd); if (ret != 0) { fprintf(stderr, "Failed to fsync file\n"); return 1; } } close(fd); fd = open(foo_path, O_RDONLY | O_DIRECT); if (fd == -1) { fprintf(stderr, "Failed to open file 'foo': %s (errno %d)", strerror(errno), errno); return 1; } ret = posix_memalign(&read_buf, pagesize, 2 * pagesize); if (ret) { fprintf(stderr, "Failed to allocate read buffer\n"); return 1; } /* * Fault in only the first page of the read buffer. * We want to trigger a page fault for the 2nd page of the * read buffer during the read operation with io_uring * (O_DIRECT and IOCB_NOWAIT). */ memset(read_buf, 0, 1); ret = io_uring_queue_init(1, &ring, 0); if (ret != 0) { fprintf(stderr, "Failed to create io_uring queue\n"); return 1; } sqe = io_uring_get_sqe(&ring); if (!sqe) { fprintf(stderr, "Failed to get io_uring sqe\n"); return 1; } iovec.iov_base = read_buf; iovec.iov_len = 2 * pagesize; io_uring_prep_readv(sqe, fd, &iovec, 1, 0); ret = io_uring_submit_and_wait(&ring, 1); if (ret != 1) { fprintf(stderr, "Failed at io_uring_submit_and_wait()\n"); return 1; } ret = io_uring_wait_cqe(&ring, &cqe); if (ret < 0) { fprintf(stderr, "Failed at io_uring_wait_cqe()\n"); return 1; } printf("io_uring read result for file foo:\n\n"); printf(" cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize); printf(" memcmp(read_buf, write_buf) == %d (expected 0)\n", memcmp(read_buf, write_buf, 2 * pagesize)); io_uring_cqe_seen(&ring, cqe); io_uring_queue_exit(&ring); return 0; } When running it on an unpatched kernel: $ gcc io_uring_test.c -luring $ mkfs.btrfs -f /dev/sda $ mount /dev/sda /mnt/sda $ ./a.out /mnt/sda io_uring read result for file foo: cqe->res == 4096 (expected 8192) memcmp(read_buf, write_buf) == -205 (expected 0) After this patch, the read always returns 8192 bytes, with the buffer filled with the correct data. Although that reproducer always triggers the bug in my test vms, it's possible that it will not be so reliable on other environments, as that can happen if the bio for the first extent completes and decrements the reference on the struct iomap_dio object before we do the atomic_dec_and_test() on the reference at __iomap_dio_rw(). Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag set) over a range that spans multiple extents (or a mix of extents and holes). This avoids returning success to the caller when we only did partial IO, which is not optimal for writes and for reads it's actually incorrect, as the caller doesn't expect to get less bytes read than it has requested (unless EOF is crossed), as previously mentioned. This is also the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()), even though it doesn't use IOMAP_DIO_PARTIAL. A test case for fstests will follow soon. Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/ CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-03binfmt_elf: Introduce KUnit testKees Cook
Adds simple KUnit test for some binfmt_elf internals: specifically a regression test for the problem fixed by commit 8904d9cd90ee ("ELF: fix overflow in total mapping size calculation"). $ ./tools/testing/kunit/kunit.py run --arch x86_64 \ --kconfig_add CONFIG_IA32_EMULATION=y '*binfmt_elf' ... [19:41:08] ================== binfmt_elf (1 subtest) ================== [19:41:08] [PASSED] total_mapping_size_test [19:41:08] =================== [PASSED] binfmt_elf ==================== [19:41:08] ============== compat_binfmt_elf (1 subtest) =============== [19:41:08] [PASSED] total_mapping_size_test [19:41:08] ================ [PASSED] compat_binfmt_elf ================ [19:41:08] ============================================================ [19:41:08] Testing complete. Passed: 2, Failed: 0, Crashed: 0, Skipped: 0, Errors: 0 Cc: Eric Biederman <ebiederm@xmission.com> Cc: David Gow <davidgow@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Magnus Groß" <magnus.gross@rwth-aachen.de> Cc: kunit-dev@googlegroups.com Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> --- v1: https://lore.kernel.org/lkml/20220224054332.1852813-1-keescook@chromium.org v2: - improve commit log - fix comment URL (Daniel) - drop redundant KUnit Kconfig help info (Daniel) - note in Kconfig help that COMPAT builds add a compat test (David)
2022-03-03f2fs: fix to do sanity check on curseg->alloc_typeChao Yu
As Wenqing Liu reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215657 - Overview UBSAN: array-index-out-of-bounds in fs/f2fs/segment.c:3460:2 when mount and operate a corrupted image - Reproduce tested on kernel 5.17-rc4, 5.17-rc6 1. mkdir test_crash 2. cd test_crash 3. unzip tmp2.zip 4. mkdir mnt 5. ./single_test.sh f2fs 2 - Kernel dump [ 46.434454] loop0: detected capacity change from 0 to 131072 [ 46.529839] F2FS-fs (loop0): Mounted with checkpoint version = 7548c2d9 [ 46.738319] ================================================================================ [ 46.738412] UBSAN: array-index-out-of-bounds in fs/f2fs/segment.c:3460:2 [ 46.738475] index 231 is out of range for type 'unsigned int [2]' [ 46.738539] CPU: 2 PID: 939 Comm: umount Not tainted 5.17.0-rc6 #1 [ 46.738547] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 46.738551] Call Trace: [ 46.738556] <TASK> [ 46.738563] dump_stack_lvl+0x47/0x5c [ 46.738581] ubsan_epilogue+0x5/0x50 [ 46.738592] __ubsan_handle_out_of_bounds+0x68/0x80 [ 46.738604] f2fs_allocate_data_block+0xdff/0xe60 [f2fs] [ 46.738819] do_write_page+0xef/0x210 [f2fs] [ 46.738934] f2fs_do_write_node_page+0x3f/0x80 [f2fs] [ 46.739038] __write_node_page+0x2b7/0x920 [f2fs] [ 46.739162] f2fs_sync_node_pages+0x943/0xb00 [f2fs] [ 46.739293] f2fs_write_checkpoint+0x7bb/0x1030 [f2fs] [ 46.739405] kill_f2fs_super+0x125/0x150 [f2fs] [ 46.739507] deactivate_locked_super+0x60/0xc0 [ 46.739517] deactivate_super+0x70/0xb0 [ 46.739524] cleanup_mnt+0x11a/0x200 [ 46.739532] __cleanup_mnt+0x16/0x20 [ 46.739538] task_work_run+0x67/0xa0 [ 46.739547] exit_to_user_mode_prepare+0x18c/0x1a0 [ 46.739559] syscall_exit_to_user_mode+0x26/0x40 [ 46.739568] do_syscall_64+0x46/0xb0 [ 46.739584] entry_SYSCALL_64_after_hwframe+0x44/0xae The root cause is we missed to do sanity check on curseg->alloc_type, result in out-of-bound accessing on sbi->block_count[] array, fix it. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-03f2fs: fix to avoid potential deadlockChao Yu
Quoted from Jing Xia's report, there is a potential deadlock may happen between kworker and checkpoint as below: [T:writeback] [T:checkpoint] - wb_writeback - blk_start_plug bio contains NodeA was plugged in writeback threads - do_writepages -- sync write inodeB, inc wb_sync_req[DATA] - f2fs_write_data_pages - f2fs_write_single_data_page -- write last dirty page - f2fs_do_write_data_page - set_page_writeback -- clear page dirty flag and PAGECACHE_TAG_DIRTY tag in radix tree - f2fs_outplace_write_data - f2fs_update_data_blkaddr - f2fs_wait_on_page_writeback -- wait NodeA to writeback here - inode_dec_dirty_pages - writeback_sb_inodes - writeback_single_inode - do_writepages - f2fs_write_data_pages -- skip writepages due to wb_sync_req[DATA] - wbc->pages_skipped += get_dirty_pages() -- PAGECACHE_TAG_DIRTY is not set but get_dirty_pages() returns one - requeue_inode -- requeue inode to wb->b_dirty queue due to non-zero.pages_skipped - blk_finish_plug Let's try to avoid deadlock condition by forcing unplugging previous bio via blk_finish_plug(current->plug) once we'v skipped writeback in writepages() due to valid sbi->wb_sync_req[DATA/NODE]. Fixes: 687de7f1010c ("f2fs: avoid IO split due to mixed WB_SYNC_ALL and WB_SYNC_NONE") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Jing Xia <jing.xia@unisoc.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-03cachefiles: Fix incorrect length to fallocate()David Howells
When cachefiles_shorten_object() calls fallocate() to shape the cache file to match the DIO size, it passes the total file size it wants to achieve, not the amount of zeros that should be inserted. Since this is meant to preallocate that amount of storage for the file, it can cause the cache to fill up the disk and hit ENOSPC. Fix this by passing the length actually required to go from the current EOF to the desired EOF. Fixes: 7623ed6772de ("cachefiles: Implement cookie resize for truncate") Reported-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/164630854858.3665356.17419701804248490708.stgit@warthog.procyon.org.uk # v1 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-03fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITEDChristoph Hellwig
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX support, as one of the two is required at runtime for fsdax code to actually work. Link: https://lkml.kernel.org/r/20220210072828.2930359-9-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: remove the extra ZONE_DEVICE struct page refcountChristoph Hellwig
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>. Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: don't include <linux/memremap.h> in <linux/mm.h>Christoph Hellwig
Move the check for the actual pgmap types that need the free at refcount one behavior into the out of line helper, and thus avoid the need to pull memremap.h into mm.h. Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03ext4: don't BUG if someone dirty pages without asking ext4 firstTheodore Ts'o
[un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance. A related race was noted by Jan Kara in 2018[1]; however, more recently instead of it being a very hard-to-hit race, it could be reliably triggered by process_vm_writev(2) which was discovered by Syzbot[2]. This is technically a bug in mm/gup.c, but arguably ext4 is fragile in that if some other kernel subsystem dirty pages without properly notifying the file system using page_mkwrite(), ext4 will BUG, while other file systems will not BUG (although data will still be lost). So instead of crashing with a BUG, issue a warning (since there may be potential data loss) and just mark the page as clean to avoid unprivileged denial of service attacks until the problem can be properly fixed. More discussion and background can be found in the thread starting at [2]. [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com Reported-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu
2022-03-03gfs2: Remove return value for gfs2_indirect_initBob Peterson
The return value from function gfs2_indirect_init is never used, so remove it. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-03-03ext4: remove redundant assignment to variable split_flag1Colin Ian King
Variable split_flag1 is being assigned a value that is never read, it is being re-assigned a new value in the following code block. The assignment is redundant and can be removed. Cleans up clang scan build warning: fs/ext4/extents.c:3371:2: warning: Value stored to 'split_flag1' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20220301121644.997833-1-colin.i.king@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-02ext4: fix underflow in ext4_max_bitmap_size()Zhang Yi
when ext4 filesystem is created with 64k block size, ^extent and ^huge_file features. the upper_limit would underflow during the computations in ext4_max_bitmap_size(). The problem is the size of block index tree for such large block size is more than i_blocks can carry. So fix the computation to count with this possibility. After this fix, the 'res' cannot overflow loff_t on the extreme case of filesystem with huge_files and 64K block size, so this patch also revert commit 75ca6ad408f4 ("ext4: fix loff_t overflow in ext4_max_bitmap_size()"). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220301111704.2153829-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-02ext4: fix ext4_mb_clear_bb() kernel-doc commentYang Li
Remove the excess description of @bh in ext4_mb_clear_bb() kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/ext4/mballoc.c:5895: warning: Excess function parameter 'bh' description in 'ext4_mb_clear_bb' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220301092136.34764-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-02ext4: fix fs corruption when tring to remove a non-empty directory with IO errorYe Bin
We inject IO error when rmdir non empty direcory, then got issue as follows: step1: mkfs.ext4 -F /dev/sda step2: mount /dev/sda test step3: cd test step4: mkdir -p 1/2 step5: rmdir 1 [ 110.920551] ext4_empty_dir: inject fault [ 110.921926] EXT4-fs warning (device sda): ext4_rmdir:3113: inode #12: comm rmdir: empty directory '1' has too many links (3) step6: cd .. step7: umount test step8: fsck.ext4 -f /dev/sda e2fsck 1.42.9 (28-Dec-2013) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Entry '..' in .../??? (13) has deleted/unused inode 12. Clear<y>? yes Pass 3: Checking directory connectivity Unconnected directory inode 13 (...) Connect to /lost+found<y>? yes Pass 4: Checking reference counts Inode 13 ref count is 3, should be 2. Fix<y>? yes Pass 5: Checking group summary information /dev/sda: ***** FILE SYSTEM WAS MODIFIED ***** /dev/sda: 12/131072 files (0.0% non-contiguous), 26157/524288 blocks ext4_rmdir if (!ext4_empty_dir(inode)) goto end_rmdir; ext4_empty_dir bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE); if (IS_ERR(bh)) return true; Now if read directory block failed, 'ext4_empty_dir' will return true, assume directory is empty. Obviously, it will lead to above issue. To solve this issue, if read directory block failed 'ext4_empty_dir' just return false. To avoid making things worse when file system is already corrupted, 'ext4_empty_dir' also return false. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20220228024815.3952506-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-02ext4: use time_is_before_jiffies() instead of open coding itWang Qing
Use the helper function time_is_{before,after}_jiffies() to improve code readability. Signed-off-by: Wang Qing <wangqing@vivo.com> Link: https://lore.kernel.org/r/1646018120-61462-1-git-send-email-wangqing@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-02ext4: improve fast_commit performance and scalabilityRitesh Harjani
Currently ext4_fc_commit_dentry_updates() is of quadratic time complexity, which is causing performance bottlenecks with high threads/file/dir count with fs_mark. This patch makes commit dentry updates (and hence ext4_fc_commit()) path to linear time complexity. Hence improves the performance of workloads which does fsync on multiple threads/open files one-by-one. Absolute numbers in avg file creates per sec (from fs_mark in 1K order) ======================================================================= no. Order without-patch(K) with-patch(K) Diff(%) 1 1 16.90 17.51 +3.60 2 2,2 32.08 31.80 -0.87 3 3,3 53.97 55.01 +1.92 4 4,4 78.94 76.90 -2.58 5 5,5 95.82 95.37 -0.46 6 6,6 87.92 103.38 +17.58 7 6,10 0.73 126.13 +17178.08 8 6,14 2.33 143.19 +6045.49 workload type ============== For e.g. 7th row order of 6,10 (2^6 == 64 && 2^10 == 1024) echo /run/riteshh/mnt/{1..64} |sed -E 's/[[:space:]]+/ -d /g' \ | xargs -I {} bash -c "sudo fs_mark -L 100 -D 1024 -n 1024 -s0 -S5 -d {}" Perf profile (w/o patches) ============================= 87.15% [kernel] [k] ext4_fc_commit --> Heavy contention/bottleneck 1.98% [kernel] [k] perf_event_interrupt 0.96% [kernel] [k] power_pmu_enable 0.91% [kernel] [k] update_sd_lb_stats.constprop.0 0.67% [kernel] [k] ktime_get Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/930f35d4fd5f83e2673c868781d9ebf15e91bf4e.1645426817.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-02Merge tag 'erofs-for-5.17-rc7-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fix from Gao Xiang: "A one-line patch to fix the new ztailpacking feature on > 4GiB filesystems because z_idataoff can get trimmed improperly. ztailpacking is still a brand new EXPERIMENTAL feature, but it'd be better to fix the issue as soon as possible to avoid unnecessary backporting. Summary: - Fix ztailpacking z_idataoff getting trimmed on > 4GiB filesystems" * tag 'erofs-for-5.17-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix ztailpacking on > 4GiB filesystems
2022-03-02reiserfs: Deprecate reiserfsJan Kara
Reiserfs is relatively old filesystem and its development has ceased quite some years ago. Linux distributions moved away from it towards other filesystems such as btrfs, xfs, or ext4. To reduce maintenance burden on cross filesystem changes (such as new mount API, iomap, folios ...) let's add a deprecation notice when the filesystem is mounted and schedule its removal to 2025. Link: https://lore.kernel.org/r/20220225125445.29942-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz>
2022-03-02btrfs: add missing run of delayed items after unlink during log replayFilipe Manana
During log replay, whenever we need to check if a name (dentry) exists in a directory we do searches on the subvolume tree for inode references or or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY keys as well, before kernel 5.17). However when during log replay we unlink a name, through btrfs_unlink_inode(), we may not delete inode references and dir index keys from a subvolume tree and instead just add the deletions to the delayed inode's delayed items, which will only be run when we commit the transaction used for log replay. This means that after an unlink operation during log replay, if we attempt to search for the same name during log replay, we will not see that the name was already deleted, since the deletion is recorded only on the delayed items. We run delayed items after every unlink operation during log replay, except at unlink_old_inode_refs() and at add_inode_ref(). This was due to an overlook, as delayed items should be run after evert unlink, for the reasons stated above. So fix those two cases. Fixes: 0d836392cadd5 ("Btrfs: fix mount failure after fsync due to hard link recreation") Fixes: 1f250e929a9c9 ("Btrfs: fix log replay failure after unlink and link combination") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02btrfs: qgroup: fix deadlock between rescan worker and remove qgroupSidong Yang
The commit e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker") by Kawasaki resolves deadlock between quota disable and qgroup rescan worker. But also there is a deadlock case like it. It's about enabling or disabling quota and creating or removing qgroup. It can be reproduced in simple script below. for i in {1..100} do btrfs quota enable /mnt & btrfs qgroup create 1/0 /mnt & btrfs qgroup destroy 1/0 /mnt & btrfs quota disable /mnt & done Here's why the deadlock happens: 1) The quota rescan task is running. 2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for the quota rescan task to complete. 3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock the qgroup_ioctl_lock mutex, because it's being held by task A. At that point task B is holding a transaction handle for the current transaction. 4) The quota rescan task calls btrfs_commit_transaction(). This results in it waiting for all other tasks to release their handles on the transaction, but task B is blocked on the qgroup_ioctl_lock mutex while holding a handle on the transaction, and that mutex is being held by task A, which is waiting for the quota rescan task to complete, resulting in a deadlock between these 3 tasks. To resolve this issue, the thread disabling quota should unlock qgroup_ioctl_lock before waiting rescan completion. Move btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock. Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Sidong Yang <realwakka@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02btrfs: fix relocation crash due to premature return from ↵Omar Sandoval
btrfs_commit_transaction() We are seeing crashes similar to the following trace: [38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs] [38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54 [38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs] [38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206 [38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14 [38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360 [38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000 [38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800 [38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360 [38.987146] FS: 00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000 [38.988662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0 [38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [38.992528] Call Trace: [38.992854] <TASK> [38.993148] btrfs_relocate_chunk+0x27/0xe0 [btrfs] [38.993941] btrfs_balance+0x78e/0xea0 [btrfs] [38.994801] ? vsnprintf+0x33c/0x520 [38.995368] ? __kmalloc_track_caller+0x351/0x440 [38.996198] btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs] [38.997084] btrfs_ioctl+0x11b0/0x2da0 [btrfs] [38.997867] ? mod_objcg_state+0xee/0x340 [38.998552] ? seq_release+0x24/0x30 [38.999184] ? proc_nr_files+0x30/0x30 [38.999654] ? call_rcu+0xc8/0x2f0 [39.000228] ? __x64_sys_ioctl+0x84/0xc0 [39.000872] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [39.001973] __x64_sys_ioctl+0x84/0xc0 [39.002566] do_syscall_64+0x3a/0x80 [39.003011] entry_SYSCALL_64_after_hwframe+0x44/0xae [39.003735] RIP: 0033:0x7f11c166959b [39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b [39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003 [39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0 [39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3 [39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001 [39.015040] </TASK> [39.015418] ---[ end trace 0000000000000000 ]--- [43.131559] ------------[ cut here ]------------ [43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717! [43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI [43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G W 5.17.0-rc4 #54 [43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] [43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 [43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 [43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff [43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 [43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 [43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 [43.145686] FS: 00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000 [43.146808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0 [43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [43.150559] Call Trace: [43.150904] <TASK> [43.151253] btrfs_finish_extent_commit+0x88/0x290 [btrfs] [43.152127] btrfs_commit_transaction+0x74f/0xaa0 [btrfs] [43.152932] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] [43.153786] btrfs_ioctl+0x1edc/0x2da0 [btrfs] [43.154475] ? __check_object_size+0x150/0x170 [43.155170] ? preempt_count_add+0x49/0xa0 [43.155753] ? __x64_sys_ioctl+0x84/0xc0 [43.156437] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [43.157456] __x64_sys_ioctl+0x84/0xc0 [43.157980] do_syscall_64+0x3a/0x80 [43.158543] entry_SYSCALL_64_after_hwframe+0x44/0xae [43.159231] RIP: 0033:0x7f7657f1e59b [43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b [43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003 [43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 [43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800 [43.166863] </TASK> [43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common [43.169552] ---[ end trace 0000000000000000 ]--- [43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] [43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 [43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 [43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff [43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 [43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 [43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 [43.180071] FS: 00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000 [43.181073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0 [43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 We first hit the WARN_ON(rc->block_group->pinned > 0) in btrfs_relocate_block_group() and then the BUG_ON(!cache) in unpin_extent_range(). This tells us that we are exiting relocation and removing the block group with bytes still pinned for that block group. This is supposed to be impossible: the last thing relocate_block_group() does is commit the transaction to get rid of pinned extents. Commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") introduced an optimization so that commits from fsync don't have to wait for the previous commit to unpin extents. This was only intended to affect fsync, but it inadvertently made it possible for any commit to skip waiting for the previous commit to unpin. This is because if a call to btrfs_commit_transaction() finds that another thread is already committing the transaction, it waits for the other thread to complete the commit and then returns. If that other thread was in fsync, then it completes the commit without completing the previous commit. This makes the following sequence of events possible: Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________ btrfs_commit_transaction(N) | | btrfs_run_delayed_refs | | pin extents | | ... | | state = UNBLOCKED |btrfs_sync_file | | btrfs_start_transaction(N + 1) |relocate_block_group | | btrfs_join_transaction(N + 1) | btrfs_commit_transaction(N + 1) | ... | trans->state = COMMIT_START | | | btrfs_commit_transaction(N + 1) | | wait_for_commit(N + 1, COMPLETED) | wait_for_commit(N, SUPER_COMMITTED)| state = SUPER_COMMITTED | ... | btrfs_finish_extent_commit| | unpin_extent_range() | trans->state = COMPLETED | | | return | | ... | |Thread 1 isn't done, so pinned > 0 | |and we WARN | | | |btrfs_remove_block_group unpin_extent_range() | | Thread 3 removed the | | block group, so we BUG| | There are other sequences involving SUPER_COMMITTED transactions that can cause a similar outcome. We could fix this by making relocation explicitly wait for unpinning, but there may be other cases that need it. Josef mentioned ENOSPC flushing and the free space cache inode as other potential victims. Rather than playing whack-a-mole, this fix is conservative and makes all commits not in fsync wait for all previous transactions, which is what the optimization intended. Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02btrfs: do not start relocation until in progress drops are doneJosef Bacik
We hit a bug with a recovering relocation on mount for one of our file systems in production. I reproduced this locally by injecting errors into snapshot delete with balance running at the same time. This presented as an error while looking up an extent item WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8 RIP: 0010:lookup_inline_extent_backref+0x647/0x680 RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 Call Trace: <TASK> insert_inline_extent_backref+0x46/0xd0 __btrfs_inc_extent_ref.isra.0+0x5f/0x200 ? btrfs_merge_delayed_refs+0x164/0x190 __btrfs_run_delayed_refs+0x561/0xfa0 ? btrfs_search_slot+0x7b4/0xb30 ? btrfs_update_root+0x1a9/0x2c0 btrfs_run_delayed_refs+0x73/0x1f0 ? btrfs_update_root+0x1a9/0x2c0 btrfs_commit_transaction+0x50/0xa50 ? btrfs_update_reloc_root+0x122/0x220 prepare_to_merge+0x29f/0x320 relocate_block_group+0x2b8/0x550 btrfs_relocate_block_group+0x1a6/0x350 btrfs_relocate_chunk+0x27/0xe0 btrfs_balance+0x777/0xe60 balance_kthread+0x35/0x50 ? btrfs_balance+0xe60/0xe60 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Normally snapshot deletion and relocation are excluded from running at the same time by the fs_info->cleaner_mutex. However if we had a pending balance waiting to get the ->cleaner_mutex, and a snapshot deletion was running, and then the box crashed, we would come up in a state where we have a half deleted snapshot. Again, in the normal case the snapshot deletion needs to complete before relocation can start, but in this case relocation could very well start before the snapshot deletion completes, as we simply add the root to the dead roots list and wait for the next time the cleaner runs to clean up the snapshot. Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that had a pending drop_progress key. If they do then we know we were in the middle of the drop operation and set a flag on the fs_info. Then balance can wait until this flag is cleared to start up again. If there are DEAD_ROOT's that don't have a drop_progress set then we're safe to start balance right away as we'll be properly protected by the cleaner_mutex. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02btrfs: tree-checker: use u64 for item data end to avoid overflowSu Yue
User reported there is an array-index-out-of-bounds access while mounting the crafted image: [350.411942 ] loop0: detected capacity change from 0 to 262144 [350.427058 ] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (1044) [350.428564 ] BTRFS info (device loop0): disk space caching is enabled [350.428568 ] BTRFS info (device loop0): has skinny extents [350.429589 ] [350.429619 ] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1 [350.429636 ] index 1048096 is out of range for type 'page *[16]' [350.429650 ] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4 [350.429652 ] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [350.429653 ] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs] [350.429772 ] Call Trace: [350.429774 ] <TASK> [350.429776 ] dump_stack_lvl+0x47/0x5c [350.429780 ] ubsan_epilogue+0x5/0x50 [350.429786 ] __ubsan_handle_out_of_bounds+0x66/0x70 [350.429791 ] btrfs_get_16+0xfd/0x120 [btrfs] [350.429832 ] check_leaf+0x754/0x1a40 [btrfs] [350.429874 ] ? filemap_read+0x34a/0x390 [350.429878 ] ? load_balance+0x175/0xfc0 [350.429881 ] validate_extent_buffer+0x244/0x310 [btrfs] [350.429911 ] btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs] [350.429935 ] end_bio_extent_readpage+0x3af/0x850 [btrfs] [350.429969 ] ? newidle_balance+0x259/0x480 [350.429972 ] end_workqueue_fn+0x29/0x40 [btrfs] [350.429995 ] btrfs_work_helper+0x71/0x330 [btrfs] [350.430030 ] ? __schedule+0x2fb/0xa40 [350.430033 ] process_one_work+0x1f6/0x400 [350.430035 ] ? process_one_work+0x400/0x400 [350.430036 ] worker_thread+0x2d/0x3d0 [350.430037 ] ? process_one_work+0x400/0x400 [350.430038 ] kthread+0x165/0x190 [350.430041 ] ? set_kthread_struct+0x40/0x40 [350.430043 ] ret_from_fork+0x1f/0x30 [350.430047 ] </TASK> [350.430047 ] [350.430077 ] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2 btrfs check reports: corrupt leaf: root=3 block=20975616 physical=20975616 slot=1, unexpected item end, have 4294971193 expect 3897 The first slot item offset is 4293005033 and the size is 1966160. In check_leaf, we use btrfs_item_end() to check item boundary versus extent_buffer data size. However, return type of btrfs_item_end() is u32. (u32)(4293005033 + 1966160) == 3897, overflow happens and the result 3897 equals to leaf data size reasonably. Fix it by use u64 variable to store item data end in check_leaf() to avoid u32 overflow. This commit does solve the invalid memory access showed by the stack trace. However, its metadata profile is DUP and another copy of the leaf is fine. So the image can be mounted successfully. But when umount is called, the ASSERT btrfs_mark_buffer_dirty() will be triggered because the only node in extent tree has 0 item and invalid owner. It's solved by another commit "btrfs: check extent buffer owner against the owner rootid". Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299 Reported-by: Wenqing Liu <wenqingliu0120@gmail.com> CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02btrfs: do not WARN_ON() if we have PageError setJosef Bacik
Whenever we do any extent buffer operations we call assert_eb_page_uptodate() to complain loudly if we're operating on an non-uptodate page. Our overnight tests caught this warning earlier this week WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50 CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000 RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0 RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000 R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1 R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000 FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1f6/0x470 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x26d/0x580 ? process_one_work+0x580/0x580 worker_thread+0x55/0x3b0 ? process_one_work+0x580/0x580 kthread+0xf0/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it"), however all that fix did was keep us from finding extent buffers after a failed writeout. It didn't keep us from continuing to use a buffer that we already had found. In this case we're searching the commit root to cache the block group, so we can start committing the transaction and switch the commit root and then start writing. After the switch we can look up an extent buffer that hasn't been written yet and start processing that block group. Then we fail to write that block out and clear Uptodate on the page, and then we start spewing these errors. Normally we're protected by the tree lock to a certain degree here. If we read a block we have that block read locked, and we block the writer from locking the block before we submit it for the write. However this isn't necessarily fool proof because the read could happen before we do the submit_bio and after we locked and unlocked the extent buffer. Also in this particular case we have path->skip_locking set, so that won't save us here. We'll simply get a block that was valid when we read it, but became invalid while we were using it. What we really want is to catch the case where we've "read" a block but it's not marked Uptodate. On read we ClearPageError(), so if we're !Uptodate and !Error we know we didn't do the right thing for reading the page. Fix this by checking !Uptodate && !Error, this way we will not complain if our buffer gets invalidated while we're using it, and we'll maintain the spirit of the check which is to make sure we have a fully in-cache block while we're messing with it. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>