summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-07-29NFSD: Trace filecache LRU activityChuck Lever
Observe the operation of garbage collection and the lifetime of filecache items. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: WARN when freeing an item still linked via nf_lruChuck Lever
Add a guardrail to prevent freeing memory that is still on a list. This includes either a dispose list or the LRU list. This is the sign of a bug, but this class of bugs can be detected so that they don't endanger system stability, especially while debugging. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Hook up the filecache stat fileChuck Lever
There has always been the capability of exporting filecache metrics via /proc, but it was never hooked up. Let's surface these metrics to enable better observability of the filecache. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Zero counters when the filecache is re-initializedChuck Lever
If nfsd_file_cache_init() is called after a shutdown, be sure the stat counters are reset. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Record number of flush callsChuck Lever
Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Report the number of items evicted by the LRU walkChuck Lever
Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Refactor nfsd_file_lru_scan()Chuck Lever
Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Refactor nfsd_file_gc()Chuck Lever
Refactor nfsd_file_gc() to use the new list_lru helper. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Add nfsd_file_lru_dispose_list() helperChuck Lever
Refactor the invariant part of nfsd_file_lru_walk_list() into a separate helper function. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Report average age of filecache itemsChuck Lever
This is a measure of how long items stay in the filecache, to help assess how efficient the cache is. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Report count of freed filecache itemsChuck Lever
Surface the count of freed nfsd_file items. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Report count of calls to nfsd_file_acquire()Chuck Lever
Count the number of successful acquisitions that did not create a file (ie, acquisitions that do not result in a compulsory cache miss). This count can be compared directly with the reported hit count to compute a hit ratio. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Report filecache LRU sizeChuck Lever
Surface the NFSD filecache's LRU list length to help field troubleshooters monitor filecache issues. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Demote a WARN to a pr_warn()Chuck Lever
The call trace doesn't add much value, but it sure is noisy. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29nfsd: remove redundant assignment to variable lenColin Ian King
Variable len is being assigned a value zero and this is never read, it is being re-assigned later. The assignment is redundant and can be removed. Cleans up clang scan-build warning: fs/nfsd/nfsctl.c:636:2: warning: Value stored to 'len' is never read Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Fix space and spelling mistakeZhang Jiaming
Add a blank space after ','. Change 'succesful' to 'successful'. Signed-off-by: Zhang Jiaming <jiaming@nfschina.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NFSD: Instrument fh_verify()Chuck Lever
Capture file handles and how they map to local inodes. In particular, NFSv4 PUTFH uses fh_verify() so we can now observe which file handles are the target of OPEN, LOOKUP, RENAME, and so on. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29NLM: Defend against file_lock changes after vfs_test_lock()Benjamin Coddington
Instead of trusting that struct file_lock returns completely unchanged after vfs_test_lock() when there's no conflicting lock, stash away our nlm_lockowner reference so we can properly release it for all cases. This defends against another file_lock implementation overwriting fl_owner when the return type is F_UNLCK. Reported-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Tested-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29nfsd: eliminate the NFSD_FILE_BREAK_* flagsJeff Layton
We had a report from the spring Bake-a-thon of data corruption in some nfstest_interop tests. Looking at the traces showed the NFS server allowing a v3 WRITE to proceed while a read delegation was still outstanding. Currently, we only set NFSD_FILE_BREAK_* flags if NFSD_MAY_NOT_BREAK_LEASE was set when we call nfsd_file_alloc. NFSD_MAY_NOT_BREAK_LEASE was intended to be set when finding files for COMMIT ops, where we need a writeable filehandle but don't need to break read leases. It doesn't make any sense to consult that flag when allocating a file since the file may be used on subsequent calls where we do want to break the lease (and the usage of it here seems to be reverse from what it should be anyway). Also, after calling nfsd_open_break_lease, we don't want to clear the BREAK_* bits. A lease could end up being set on it later (more than once) and we need to be able to break those leases as well. This means that the NFSD_FILE_BREAK_* flags now just mirror NFSD_MAY_{READ,WRITE} flags, so there's no need for them at all. Just drop those flags and unconditionally call nfsd_open_break_lease every time. Reported-by: Olga Kornieskaia <kolga@netapp.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2107360 Fixes: 65294c1f2c5e (nfsd: add a new struct file caching facility to nfsd) Cc: <stable@vger.kernel.org> # 5.4.x : bb283ca18d1e NFSD: Clean up the show_nf_flags() macro Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-28ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh()Jiachen Zhang
Some code paths cannot guarantee the inode have any dentry alias. So WARN_ON() all !dentry may flood the kernel logs. For example, when an overlayfs inode is watched by inotifywait (1), and someone is trying to read the /proc/$(pidof inotifywait)/fdinfo/INOTIFY_FD, at that time if the dentry has been reclaimed by kernel (such as echo 2 > /proc/sys/vm/drop_caches), there will be a WARN_ON(). The printed call stack would be like: ? show_mark_fhandle+0xf0/0xf0 show_mark_fhandle+0x4a/0xf0 ? show_mark_fhandle+0xf0/0xf0 ? seq_vprintf+0x30/0x50 ? seq_printf+0x53/0x70 ? show_mark_fhandle+0xf0/0xf0 inotify_fdinfo+0x70/0x90 show_fdinfo.isra.4+0x53/0x70 seq_show+0x130/0x170 seq_read+0x153/0x440 vfs_read+0x94/0x150 ksys_read+0x5f/0xe0 do_syscall_64+0x59/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 So let's drop WARN_ON() to avoid kernel log flooding. Reported-by: Hongbo Yin <yinhongbo@bytedance.com> Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Tianci Zhang <zhangtianci.1997@bytedance.com> Fixes: 8ed5eec9d6c4 ("ovl: encode pure upper file handles") Cc: <stable@vger.kernel.org> # v4.16 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-28ovl: improve ovl_get_acl() if POSIX ACL support is offYang Xu
Provide a proper stub for the !CONFIG_FS_POSIX_ACL case. Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-28kernfs: Fix typo 'the the' in commentSlark Xiao
Replace 'the the' with 'the' in the comment. Signed-off-by: Slark Xiao <slark_xiao@163.com> Link: https://lore.kernel.org/r/20220722100518.79741-1-slark_xiao@163.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-27exec: Call kmap_local_page() in copy_string_kernel()Fabio M. De Francesco
The use of kmap_atomic() is being deprecated in favor of kmap_local_page(). With kmap_local_page(), the mappings are per thread, CPU local and not globally visible. Furthermore, the mappings can be acquired from any context (including interrupts). Therefore, replace kmap_atomic() with kmap_local_page() in copy_string_kernel(). Instead of open-coding local mapping + memcpy(), use memcpy_to_page(). Delete a redundant call to flush_dcache_page(). Tested with xfstests on a QEMU/ KVM x86_32 VM, 6GB RAM, booting a kernel with HIGHMEM64GB enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220724212523.13317-1-fmdefrancesco@gmail.com
2022-07-27btrfs: reset RO counter on block group if we fail to relocateJosef Bacik
With the automatic block group reclaim code we will preemptively try to mark the block group RO before we start the relocation. We do this to make sure we should actually try to relocate the block group. However if we hit an error during the actual relocation we won't clean up our RO counter and the block group will remain RO. This was observed internally with file systems reporting less space available from df when we had failed background relocations. Fix this by doing the dec_ro in the error case. Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-27ovl: fix some kernel-doc commentsYang Li
Remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/overlayfs/super.c:311: warning: Function parameter or member 'dentry' not described in 'ovl_statfs' fs/overlayfs/super.c:311: warning: Excess function parameter 'sb' description in 'ovl_statfs' fs/overlayfs/super.c:357: warning: Function parameter or member 'm' not described in 'ovl_show_options' fs/overlayfs/super.c:357: warning: Function parameter or member 'dentry' not described in 'ovl_show_options' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-27ovl: warn if trusted xattr creation failsMiklos Szeredi
When mounting overlayfs in an unprivileged user namespace, trusted xattr creation will fail. This will lead to failures in some file operations, e.g. in the following situation: mkdir lower upper work merged mkdir lower/directory mount -toverlay -olowerdir=lower,upperdir=upper,workdir=work none merged rmdir merged/directory mkdir merged/directory The last mkdir will fail: mkdir: cannot create directory 'merged/directory': Input/output error The cause for these failures is currently extremely non-obvious and hard to debug. Hence, warn the user and suggest using the userxattr mount option, if it is not already supplied and xattr creation fails during the self-check. Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-27NFSv4.1: RECLAIM_COMPLETE must handle EACCESZhang Xianwei
A client should be able to handle getting an EACCES error while doing a mount operation to reclaim state due to NFS4CLNT_RECLAIM_REBOOT being set. If the server returns RPC_AUTH_BADCRED because authentication failed when we execute "exportfs -au", then RECLAIM_COMPLETE will go a wrong way. After mount succeeds, all OPEN call will fail due to an NFS4ERR_GRACE error being returned. This patch is to fix it by resending a RPC request. Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Fixes: aa5190d0ed7d ("NFSv4: Kill nfs4_async_handle_error() abuses by NFSv4.1") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-27fuse: retire block-device-based superblock on force unmountDaniil Lunev
Force unmount of FUSE severes the connection with the user space, even if there are still open files. Subsequent remount tries to re-use the superblock held by the open files, which is meaningless in the FUSE case after disconnect - reused super block doesn't have userspace counterpart attached to it and is incapable of doing any IO. This patch adds the functionality only for the block-device-based supers, since the primary use case of the feature is to gracefully handle force unmount of external devices, mounted with FUSE. This can be further extended to cover all superblocks, if the need arises. Signed-off-by: Daniil Lunev <dlunev@chromium.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-27vfs: function to prevent re-use of block-device-based superblocksDaniil Lunev
The function is to be called from filesystem-specific code to mark a superblock to be ignored by superblock test and thus never re-used. The function also unregisters bdi if the bdi is per-superblock to avoid collision if a new superblock is created to represent the filesystem. generic_shutdown_super() skips unregistering bdi for a retired superlock as it assumes retire function has already done it. This patch adds the functionality only for the block-device-based supers, since the primary use case of the feature is to gracefully handle force unmount of external devices, mounted with FUSE. This can be further extended to cover all superblocks, if the need arises. Signed-off-by: Daniil Lunev <dlunev@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-26ksmbd: fix kernel oops from idr_remove()Namjae Jeon
There is a report that kernel oops happen from idr_remove(). kernel: BUG: kernel NULL pointer dereference, address: 0000000000000010 kernel: RIP: 0010:idr_remove+0x1/0x20 kernel: __ksmbd_close_fd+0xb2/0x2d0 [ksmbd] kernel: ksmbd_vfs_read+0x91/0x190 [ksmbd] kernel: ksmbd_fd_put+0x29/0x40 [ksmbd] kernel: smb2_read+0x210/0x390 [ksmbd] kernel: __process_request+0xa4/0x180 [ksmbd] kernel: __handle_ksmbd_work+0xf0/0x290 [ksmbd] kernel: handle_ksmbd_work+0x2d/0x50 [ksmbd] kernel: process_one_work+0x21d/0x3f0 kernel: worker_thread+0x50/0x3d0 kernel: rescuer_thread+0x390/0x390 kernel: kthread+0xee/0x120 kthread_complete_and_exit+0x20/0x20 kernel: ret_from_fork+0x22/0x30 While accessing files, If connection is disconnected, windows send session setup request included previous session destroy. But while still processing requests on previous session, this request destroy file table, which mean file table idr will be freed and set to NULL. So kernel oops happen from ft->idr in __ksmbd_close_fd(). This patch don't directly destroy session in destroy_previous_session(). It just set to KSMBD_SESS_EXITING so that connection will be terminated after finishing the rest of requests. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-26ksmbd: add channel rwlockNamjae Jeon
Add missing rwlock for channel list in session. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-26ksmbd: replace sessions list in connection with xarrayNamjae Jeon
Replace sessions list in connection with xarray. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-26Merge tag 'mm-hotfixes-stable-2022-07-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Thirteen hotfixes. Eight are cc:stable and the remainder are for post-5.18 issues or are too minor to warrant backporting" * tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: update Gao Xiang's email addresses userfaultfd: provide properly masked address for huge-pages Revert "ocfs2: mount shared volume without ha stack" hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte fs: sendfile handles O_NONBLOCK of out_fd ntfs: fix use-after-free in ntfs_ucsncmp() secretmem: fix unhandled fault in truncate mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() mm: fix missing wake-up event for FSDAX pages mm: fix page leak with multiple threads mapping the same page mailmap: update Seth Forshee's email address tmpfs: fix the issue that the mount and remount results are inconsistent. mm: kfence: apply kmemleak_ignore_phys on early allocated pool
2022-07-26userfaultfd: provide properly masked address for huge-pagesNadav Amit
Commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault") was introduced to fix an old bug, in which the offset in the address of a page-fault was masked. Concerns were raised - although were never backed by actual code - that some userspace code might break because the bug has been around for quite a while. To address these concerns a new flag was introduced, and only when this flag is set by the user, userfaultfd provides the exact address of the page-fault. The commit however had a bug, and if the flag is unset, the offset was always masked based on a base-page granularity. Yet, for huge-pages, the behavior prior to the commit was that the address is masked to the huge-page granulrity. While there are no reports on real breakage, fix this issue. If the flag is unset, use the address with the masking that was done before. Link: https://lkml.kernel.org/r/20220711165906.2682-1-namit@vmware.com Fixes: 824ddc601adc ("userfaultfd: provide unmasked address on page-fault") Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: James Houghton <jthoughton@google.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: James Houghton <jthoughton@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-26fsnotify: Fix comment typoXin Gao
The double `if' is duplicated in line 104, remove one. Signed-off-by: Xin Gao <gaoxin@cdjrlc.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220722194639.18545-1-gaoxin@cdjrlc.com
2022-07-26ext2: Add more validity checks for inode countsJan Kara
Add checks verifying number of inodes stored in the superblock matches the number computed from number of inodes per group. Also verify we have at least one block worth of inodes per group. This prevents crashes on corrupted filesystems. Reported-by: syzbot+d273f7d7f58afd93be48@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz>
2022-07-26fs/reiserfs/inode: remove dead code in _get_block_create_0()Zeng Jingxiang
Since commit 27b3a5c51b50 ("kill-the-bkl/reiserfs: drop the fs race watchdog from _get_block_create_0()"), which removed a label that may have the pointer 'p' touched in its control flow, related if statements now eval to constant value now. Just remove them. Assigning value NULL to p here 293 char *p = NULL; In the following conditional expression, the value of p is always NULL, As a result, the kunmap() cannot be executed. 308 if (p) 309 kunmap(bh_result->b_page); 355 if (p) 356 kunmap(bh_result->b_page); 366 if (p) 367 kunmap(bh_result->b_page); Also, the kmap() cannot be executed. 399 if (!p) 400 p = (char *)kmap(bh_result->b_page); [JK: Removed unnecessary initialization of 'p' to NULL] Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com> Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220720083029.1065578-1-zengjx95@gmail.com
2022-07-26virtio_fs: Modify format for virtio_fs_direct_accessDeming Wang
We should isolate operators with spaces. Signed-off-by: Deming Wang <wangdeming@inspur.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-25btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_readChristoph Hellwig
This flag was used to communicate that the low-level compression code already did verify the checksum to the high-level I/O completion code. But it has been unused for a long time as the upper btrfs_bio for the decompressed data had a NULL csum pointer basically since that pointer existed and the code already checks for that a little later. Note that this does not affect the other use of the checked flag, which is only used for the COW fixup worker. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: fix repair of compressed extentsChristoph Hellwig
Currently the checksum of compressed extents is verified based on the compressed data and the lower btrfs_bio, but the actual repair process is driven by end_bio_extent_readpage on the upper btrfs_bio for the decompressed data. This has a bunch of issues, including not being able to properly communicate the failed mirror up in case that the I/O submission got preempted, a general loss of if an error was an I/O error or a checksum verification failure, but most importantly that this design causes btrfs_clean_io_failure to eventually write back the uncompressed good data onto the disk sectors that are supposed to contain compressed data. Fix this by moving the repair to the lower btrfs_bio. To do so, a fair amount of code has to be reshuffled: a) the lower btrfs_bio now needs a valid csum pointer. The easiest way to achieve that is to pass NULL btrfs_lookup_bio_sums and just use the btrfs_bio management of csums. For a compressed_bio that is split into multiple btrfs_bios this means additional memory allocations, but the code becomes a lot more regular. b) checksum verification now runs directly on the lower btrfs_bio instead of the compressed_bio. This actually nicely simplifies the end I/O processing. c) btrfs_repair_one_sector can't just look up the logical address for the file offset any more, as there is no corresponding relative offsets that apply to the file offset and the logic address for compressed extents. Instead require that the saved bvec_iter in the btrfs_bio is filled out for all read bios and use that, which again removes a fair amount of code. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove the start argument to check_data_csum and exportChristoph Hellwig
Derive the value of start from the btrfs_bio now that ->file_offset is always valid. Also export and rename the function so it's available outside of inode.c as we'll need that soon. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: pass a btrfs_bio to btrfs_repair_one_sectorChristoph Hellwig
Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector, and remove the start and failed_mirror arguments in favor of deriving them from the btrfs_bio. For this to work ensure that the file_offset field is also initialized for buffered I/O. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: simplify the pending I/O counting in struct compressed_bioChristoph Hellwig
Instead of counting the sectors just count the bios, with an extra reference held during submission. This significantly simplifies the submission side error handling. This slightly changes completion and error handling of btrfs_submit_compressed_{read,write} because with the old code the compressed_bio could have been completed in submit_compressed_{read,write} only if there was an error during submission for one of the lower bio, whilst with the new code there is a chance for this to happen even for successful submission if the all the lower bios complete before the end of the function is reached. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: repair all known bad mirrorsChristoph Hellwig
When there is more than a single level of redundancy there can also be multiple bad mirrors, and the current read repair code only repairs the last bad one. Restructure btrfs_repair_one_sector so that it records the originally failed mirror and the number of copies, and then repair all known bad copies until we reach the originally failed copy in clean_io_failure. Note that this also means the read repair reads will always start from the next bad mirror and not mirror 0. This fixes btrfs/265 in xfstests. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: merge btrfs_dev_stat_print_on_error with its only callerChristoph Hellwig
Fold it into the only caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: join running log transaction when logging new nameFilipe Manana
When logging a new name, in case of a rename, we pin the log before changing it. We then either delete a directory entry from the log or insert a key range item to mark the old name for deletion on log replay. However when doing one of those log changes we may have another task that started writing out the log (at btrfs_sync_log()) and it started before we pinned the log root. So we may end up changing a log tree while its writeback is being started by another task syncing the log. This can lead to inconsistencies in a log tree and other unexpected results during log replay, because we can get some committed node pointing to a node/leaf that ends up not getting written to disk before the next log commit. The problem, conceptually, started to happen in commit 88d2beec7e53fc ("btrfs: avoid logging all directory changes during renames"), because there we started to update the log without joining its current transaction first. However the problem only became visible with commit 259c4b96d78dda ("btrfs: stop doing unnecessary log updates during a rename"), and that is because we used to pin the log at btrfs_rename() and then before entering btrfs_log_new_name(), when unlinking the old dentry, we ended up at btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both of them join the current log transaction, effectively waiting for any log transaction writeout (due to acquiring the root's log_mutex). This made it safe even after leaving the current log transaction, because we remained with the log pinned when we called btrfs_log_new_name(). Then in commit 259c4b96d78dda ("btrfs: stop doing unnecessary log updates during a rename"), we removed the log pinning from btrfs_rename() and stopped calling btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log() during the rename, and started to do all the needed work at btrfs_log_new_name(), but without joining the current log transaction, only pinning the log, which is racy because another task may have started writeout of the log tree right before we pinned the log. Both commits landed in kernel 5.18, so it doesn't make any practical difference which should be blamed, but I'm blaming the second commit only because with the first one, by chance, the problem did not happen due to the fact we joined the log transaction after pinning the log and unpinned it only after calling btrfs_log_new_name(). So make btrfs_log_new_name() join the current log transaction instead of pinning it, so that we never do log updates if it's writeout is starting. Fixes: 259c4b96d78dda ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: simplify error handling in btrfs_lookup_dentryNikolay Borisov
In btrfs_lookup_dentry releasing the reference of the sub_root and the running orphan cleanup should only happen if the dentry found actually represents a subvolume. This can only be true in the 'else' branch as otherwise either fixup_tree_root_location returned an ENOENT error, in which case sub_root wouldn't have been changed or if we got a different errno this means btrfs_get_fs_root couldn't have executed successfully again meaning sub_root will equal to root. So simplify all the branches by moving the code into the 'else'. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: always use the rbtree based inode ref management infrastructureFilipe Manana
After the patch "btrfs: send: fix sending link commands for existing file paths", we now have two infrastructures to detect and eliminate duplicated inode references (due to names that got removed and re-added between the send and parent snapshots): 1) One that works on a single inode ref/extref item; 2) A new one that works acrosss all ref/extref items for an inode, and it's also more efficient because even in the single ref/extref item case, it does not do a linear search for all the names encoded in the ref/extref item, it uses red black trees to speedup up the search. There's no good reason to keep both infrastructures, we can use the new one everywhere, and it's always more efficient. So remove the old infrastructure and change all sites that are using it to use the new one. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: fix sending link commands for existing file pathsBingJing Chang
There is a bug sending link commands for existing file paths. When we're processing an inode, we go over all references. All the new file paths are added to the "new_refs" list. And all the deleted file paths are added to the "deleted_refs" list. In the end, when we finish processing the inode, we iterate over all the items in the "new_refs" list and send link commands for those file paths. After that, we go over all the items in the "deleted_refs" list and send unlink commands for them. If there are duplicated file paths in both lists, we will try to create them before we remove them. Then the receiver gets an -EEXIST error when trying the link operations. Example for having duplicated file paths in both list: $ btrfs subvolume create vol # create a file and 2000 hard links to the same inode $ touch vol/foo $ for i in {1..2000}; do link vol/foo vol/$i ; done # take a snapshot for a parent snapshot $ btrfs subvolume snapshot -r vol snap1 # remove 2000 hard links and re-create the last 1000 links $ for i in {1..2000}; do rm vol/$i; done; $ for i in {1001..2000}; do link vol/foo vol/$i; done # take another one for a send snapshot $ btrfs subvolume snapshot -r vol snap2 $ mkdir receive_dir $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/ At subvol snap2 link 1238 -> foo ERROR: link 1238 -> foo failed: File exists In this case, we will have the same file paths added to both lists. In the parent snapshot, reference paths {1..1237} are stored in inode references, but reference paths {1238..2000} are stored in inode extended references. In the send snapshot, all reference paths {1001..2000} are stored in inode references. During the incremental send, we process their inode references first. In record_changed_ref(), we iterate all its inode references in the send/parent snapshot. For every inode reference, we also use find_iref() to check whether the same file path also appears in the parent/send snapshot or not. Inode references {1238..2000} which appear in the send snapshot but not in the parent snapshot are added to the "new_refs" list. On the other hand, Inode references {1..1000} which appear in the parent snapshot but not in the send snapshot are added to the "deleted_refs" list. Next, when we process their inode extended references, reference paths {1238..2000} are added to the "deleted_refs" list because all of them only appear in the parent snapshot. Now two lists contain items as below: "new_refs" list: {1238..2000} "deleted_refs" list: {1..1000}, {1238..2000} Reference paths {1238..2000} appear in both lists. And as the processing order mentioned about before, the receiver gets an -EEXIST error when trying the link operations. To fix the bug, the idea is to process the "deleted_refs" list before the "new_refs" list. However, it's not easy to reshuffle the processing order. For one reason, if we do so, we may unlink all the existing paths first, there's no valid path anymore for links. And it's inefficient because we do a bunch of unlinks followed by links for the same paths. Moreover, it makes less sense to have duplications in both lists. A reference path cannot not only be regarded as new but also has been seen in the past, or we won't call it a new path. However, it's also not a good idea to make find_iref() check a reference against all inode references and all inode extended references because it may result in large disk reads. So we introduce two rbtrees to make the references easier for lookups. And we also introduce record_new_ref_if_needed() and record_deleted_ref_if_needed() for changed_ref() to check and remove duplicated references early. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: introduce recorded_ref_alloc and recorded_ref_freeBingJing Chang
Introduce wrappers to allocate and free recorded_ref structures. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>